Regex Question

Seijin · 24 July 2007 16:50

I recently started studying Ruby (1.8.6) and things are going well.

However, something bothered me when I got to the regular expression
section. This is my first time deal with regular expressions directly
so I'm completely new to things. I understand the idea behind regular
expressions and most of the usage but two of them stuck out for me.

I'm studying the 2nd edition Pragmatic book and using their
show_regexp method...

def show_regexp(a, re)
   if a =~ re
      "#{$`}<<#{$&}>>#{$'}"
   else
      "no match"
   end
end

...hopefully I don't get in trouble posting that method here. So here
are my two problems...

First, why does ... show_regexp("banana", /(an)*/) ... not match
"anan" ? I thought it was a greedy algorithm that tried to match as
much as possible?

Second, how does ... show_regexp("Mississippi", /(\w+)\1/) ... work?
Why in the world does it match "ississ" rather than returning no
match? I think most of my problem with this one is not understanding
the underlying logic used when doing pattern matching with back
references. Does it go through and check "Mississippi", "Mississipp"
down to "M" and then "i", "is", "iss", etc? Like trying all possible
combinations in a lock? I would be extremely grateful if someone
would do a short step-by-step of how it matches "ississ". I
understand what the "\1" does, I just don't understand how the first
part even gets to the first "iss".

Thanks a lot!

Rob_Biedenharn1 · 24 July 2007 17:27

I recently started studying Ruby (1.8.6) and things are going well.

However, something bothered me when I got to the regular expression
section. This is my first time deal with regular expressions directly
so I'm completely new to things. I understand the idea behind regular
expressions and most of the usage but two of them stuck out for me.

I'm studying the 2nd edition Pragmatic book and using their
show_regexp method...

def show_regexp(a, re)
   if a =~ re
      "#{$`}<<#{$&}>>#{$'}"
   else
      "no match"
   end
end

...hopefully I don't get in trouble posting that method here. So here
are my two problems...

First, why does ... show_regexp("banana", /(an)*/) ... not match
"anan" ? I thought it was a greedy algorithm that tried to match as
much as possible?

Well, it would but the * is 'zero or more times' and it can match zero "an"s before it ever needs to get past the 'b'. If you change the '*' to a '+' and force 'one or more times', then it has to find the first "an" and will then greedily consume the next "an" also.

Second, how does ... show_regexp("Mississippi", /(\w+)\1/) ... work?
Why in the world does it match "ississ" rather than returning no
match? I think most of my problem with this one is not understanding
the underlying logic used when doing pattern matching with back
references. Does it go through and check "Mississippi", "Mississipp"
down to "M" and then "i", "is", "iss", etc? Like trying all possible
combinations in a lock? I would be extremely grateful if someone
would do a short step-by-step of how it matches "ississ". I
understand what the "\1" does, I just don't understand how the first
part even gets to the first "iss".

Thanks a lot!

The '+' is greedy so I think it tries \w+ as Mississippi and finds that there's no following copy, backs up to try Mississipp and finds that 'i' isn't a copy, and so on back to 'M' before moving past 'M' to 'ississippi', 'ississipp', 'ississip', etc. until 'iss' has a following copy and a match is declared. Think of it this way: Until the regexp gets to the end, it doesn't know that the string isn't "MississippiMississipp".

-Rob

Rob Biedenharn http://agileconsultingllc.com
Rob@AgileConsultingLLC.com

···

On Jul 24, 2007, at 12:50 PM, seijin@gmail.com wrote:

Bill_Kelly · 24 July 2007 17:30

def show_regexp(a, re)
  if a =~ re
     "#{$`}<<#{$&}>>#{$'}"
  else
     "no match"
  end
end

First, why does ... show_regexp("banana", /(an)*/) ... not match
"anan" ? I thought it was a greedy algorithm that tried to match as
much as possible?

It is greedy, but it's giving you the *first* match it
can find. Since it's allowed to match zero-or-more times
(because of the *), it's finding a match at the beginning
of the string. From that standpoint, as greedy as it can
be is to match zero characters.

Second, how does ... show_regexp("Mississippi", /(\w+)\1/) ... work?
Why in the world does it match "ississ" rather than returning no
match? I think most of my problem with this one is not understanding
the underlying logic used when doing pattern matching with back
references. Does it go through and check "Mississippi", "Mississipp"
down to "M" and then "i", "is", "iss", etc? Like trying all possible
combinations in a lock? I would be extremely grateful if someone
would do a short step-by-step of how it matches "ississ". I
understand what the "\1" does, I just don't understand how the first
part even gets to the first "iss".

Essentially it scans the string for the first match it can find
that satisfies the expression.

Maybe it's like trying all possible locks that fit the key.

I don't know for sure what order it tries the combinations, but
it's probably something like:

  MM
  MiMi
  MisMis
  MissMiss
  ...
  MississippiMississippi
  <no match, so advance to next character in string>
  ii
  isis
  ississ
  <match found>

Hope this helps,

Bill

···

From: <seijin@gmail.com>

Martin_DeMello · 24 July 2007 17:33

First, why does ... show_regexp("banana", /(an)*/) ... not match
"anan" ? I thought it was a greedy algorithm that tried to match as
much as possible?

It does, but it also returns the *first* available match, which in the
case of 'banana' is the empty string. Note that * matches 0 or more
iterations. + will work as expected:

show_regexp("banana", /(an)+/)

=> "b<<anan>>a"

Second, how does ... show_regexp("Mississippi", /(\w+)\1/) ... work?
Why in the world does it match "ississ" rather than returning no
match? I think most of my problem with this one is not understanding
the underlying logic used when doing pattern matching with back
references. Does it go through and check "Mississippi", "Mississipp"
down to "M" and then "i", "is", "iss", etc? Like trying all possible
combinations in a lock? I would be extremely grateful if someone
would do a short step-by-step of how it matches "ississ". I
understand what the "\1" does, I just don't understand how the first
part even gets to the first "iss".

(\w+)\1 matches one or more letters, twice in a row. The empty string
doesn't match the "one or more bit". This is just the opposite case to
your last question - (\w*)\1 matches the empty string. What the engine
does is match (\w+) against M, then \1 against another M which fails.
Likewise for Mi, Mis, Miss etc, till it hits the end and moves on to
i, is, iss... when it suddenly succeeds. Illustrative examples:

show_regexp("Mississippi", /(\w*)\1/)

=> "<<>>Mississippi"

show_regexp("MississippiMississippi", /(\w*)\1/)

=> "<<MississippiMississippi>>"

show_regexp("bbMississippiMississippi", /(\w*)\1/)

=> "<<bb>>MississippiMississippi"

martin

···

On 7/24/07, seijin@gmail.com <seijin@gmail.com> wrote:

Martin_DeMello · 24 July 2007 17:35

your last question - (\w*)\1 matches the empty string. What the engine
does is match (\w+) against M, then \1 against another M which fails.
Likewise for Mi, Mis, Miss etc, till it hits the end and moves on to
i, is, iss... when it suddenly succeeds. Illustrative examples:

Oops - Rob is right, of course - + being greedy, it tries \w+ =
Mississippi first, then shrinks from there:

show_regexp("MiMississippiMiMississippi", /(\w*)\1/)

=> "<<MiMississippiMiMississippi>>"

martin

···

On 7/24/07, Martin DeMello <martindemello@gmail.com> wrote:

Seijin · 24 July 2007 19:05

Oh, I see. So using the "*" with just one ... errr... thing will
never match it? "banana" =~ /a*/ will always be 0 but "banana" =~ /
a*n/ will be 1 (one) ? It's just confusing because I think of
"greedy" as it trying to match as many characters as possible. So
that although "*" tell it to find 0 or more that it would work hard to
return the maximum number of matches. Perhaps the difference between
"Find as many a's as possible" and "Find as many a's or nothing as
possible" ? If you see what I'm trying to say. No, I think I just
figured out what my problem is. I'm taking it out of context. The
"*" was created for use other arguments. I keep getting stuck on
using it with just one argument which is not what it was intended for.

And for Mississippi - is that method of matching only used when using
back references via the \1-9 ? If so, I can set my mind at ease as
that makes sense. The thing that was bothering me was that I didn't
see why it would work so hard to find the match. I thought it was a
simple... (\w+) matches "Mississippi" but there is no
"MississippiMississippi" so I'll return nil... type of thing. I
didn't realize it'd start scanning the string aggresively.

Thanks for the reply!

···

On Jul 24, 10:27 am, Rob Biedenharn <R...@AgileConsultingLLC.com> wrote:

On Jul 24, 2007, at 12:50 PM, sei...@gmail.com wrote:

> I recently started studying Ruby (1.8.6) and things are going well.

> However, something bothered me when I got to the regular expression
> section. This is my first time deal with regular expressions directly
> so I'm completely new to things. I understand the idea behind regular
> expressions and most of the usage but two of them stuck out for me.

> I'm studying the 2nd edition Pragmatic book and using their
> show_regexp method...

> def show_regexp(a, re)
> if a =~ re
> "#{$`}<<#{$&}>>#{$'}"
> else
> "no match"
> end
> end

> ...hopefully I don't get in trouble posting that method here. So here
> are my two problems...

> First, why does ... show_regexp("banana", /(an)*/) ... not match
> "anan" ? I thought it was a greedy algorithm that tried to match as
> much as possible?

Well, it would but the * is 'zero or more times' and it can match
zero "an"s before it ever needs to get past the 'b'. If you change
the '*' to a '+' and force 'one or more times', then it has to find
the first "an" and will then greedily consume the next "an" also.

> Second, how does ... show_regexp("Mississippi", /(\w+)\1/) ... work?
> Why in the world does it match "ississ" rather than returning no
> match? I think most of my problem with this one is not understanding
> the underlying logic used when doing pattern matching with back
> references. Does it go through and check "Mississippi", "Mississipp"
> down to "M" and then "i", "is", "iss", etc? Like trying all possible
> combinations in a lock? I would be extremely grateful if someone
> would do a short step-by-step of how it matches "ississ". I
> understand what the "\1" does, I just don't understand how the first
> part even gets to the first "iss".

> Thanks a lot!

The '+' is greedy so I think it tries \w+ as Mississippi and finds
that there's no following copy, backs up to try Mississipp and finds
that 'i' isn't a copy, and so on back to 'M' before moving past 'M'
to 'ississippi', 'ississipp', 'ississip', etc. until 'iss' has a
following copy and a match is declared. Think of it this way: Until
the regexp gets to the end, it doesn't know that the string isn't
"MississippiMississipp".

-Rob

Rob Biedenharn http://agileconsultingllc.com
R...@AgileConsultingLLC.com- Hide quoted text -

- Show quoted text -

Seebs · 24 July 2007 19:08

Both will match.

The thing to understand is that /X*/ always matches successfully
no matter whether X is anchored or not, is long or a single character,
or what, simply because it can always match the empty string.

If it matches the beginning of a word, it will match greedily, but
it won't go looking for a longer match later instead of taking the first
match it finds.

-s

···

In message <1185303847.407283.279370@x35g2000prf.googlegroups.com>, seijin@gmai l.com writes:

Oh, I see. So using the "*" with just one ... errr... thing will
never match it? "banana" =~ /a*/ will always be 0 but "banana" =~ /
a*n/ will be 1 (one) ?

David_A_Black3 · 24 July 2007 20:15

Hi --

Oh, I see. So using the "*" with just one ... errr... thing will
never match it? "banana" =~ /a*/ will always be 0 but "banana" =~ /
a*n/ will be 1 (one) ? It's just confusing because I think of
"greedy" as it trying to match as many characters as possible. So
that although "*" tell it to find 0 or more that it would work hard to
return the maximum number of matches.

Keep in mind, though, that it's looking for a pattern-match starting
at the left. If it finds it, it's finished; it's not going to keep
scanning the string. In the leftmost position, it finds a match for
a* -- namely, 0 occurrences of a. So it's done.

Here's an example that might help clarify it:

"xxyyxxxyyy" =~ /x*/ # 0

The longest stretch of x's is at position 5. However, the job of the
match engine is not to find the maximum number of x's; it's to find
the first match for /x*/, starting at the left. It finds that match
at position 0. There are only two x's, but that's neither here nor
there.

David

···

On Wed, 25 Jul 2007, seijin@gmail.com wrote:

--
* Books:
   RAILS ROUTING (new! http://www.awprofessional.com/title/0321509242\)
   RUBY FOR RAILS (http://www.manning.com/black\)
* Ruby/Rails training
     & consulting: Ruby Power and Light, LLC (http://www.rubypal.com)

Topic		Replies	Views
Regex query ruby-talk	2	113	10 January 2007
Questions about * + and ? in Regex ruby-talk	3	133	31 December 2007
Problem with getting RegEx to work ruby-talk	3	154	17 September 2008
RegExp Problem ruby-talk	17	94	25 May 2006
Regexp Matching Not Blank And Not A Specific Word? ruby-talk	6	113	14 December 2007

Regex Question

Related topics