Do You Understand Regular Expressions?

Hi all.

I'm pretty new to Ruby and that sort of thing, and I'm having a few
problems understanding regular expressions. I'm hoping one of you can
point me in the right direction.

I want to replace an entire string with another string. I know you
don't need regular expressions for that, but it's part of a more
generic approach. Anyway, the problem I'm having is that my regular
expressions are finding two matches instead of one, and I don't
understand why. I've narrowed down my confusion to the following code,
which shows some output from irb:

irb(main):001:0> "hello".scan(/.*/)
=> ["hello", ""]

I was expecting one match, not two, because .* matches everything,
right? Can someone explain why an empty string is also matched?

The same thing can be seen when substituting - this is closer to how
I'm using regular expressions in my code:

irb(main):001:0> "hello".gsub(/.*/, "P")
=> "PP"

Two substitutions are made and I expected one. So am I right or wrong
to expect one substitution?

Please help - this is driving me nuts!

And in case it helps...

$ ruby --version
ruby 1.8.5 (2006-08-25) [i486-linux]

Thanks in advance.

growlatoe@yahoo.co.uk wrote:

irb(main):001:0> "hello".scan(/.*/)
=> ["hello", ""]

I was expecting one match, not two, because .* matches everything,
right? Can someone explain why an empty string is also matched?
  

Try anchoring the match: /^.*/

···

--
RMagick OS X Installer [http://rubyforge.org/projects/rmagick/]
RMagick Hints & Tips [http://rubyforge.org/forum/forum.php?forum_id=1618]
RMagick Installation FAQ [http://rmagick.rubyforge.org/install-faq.html]

irb(main):001:0> "hello".scan(/.*/)
=> ["hello", ""]

I was expecting one match, not two, because .* matches everything,
right? Can someone explain why an empty string is also matched?

String.scan searches for all occurrences of (any number of any
character) here. So zero occurrences is one match.

You can search for at least one occurrence like this:

"hello".scan(/.+/)

"hello".gsub(/.+/, "P") => 'P'

As an introduction, I find

quite instructive for the use of regexps in Ruby.

Best regards,

Axel

···


Axel Etzold wrote:

irb(main):001:0> "hello".scan(/.*/)
=> ["hello", ""]

I was expecting one match, not two, because .* matches everything,
right? Can someone explain why an empty string is also matched?

String.scan searches for all occurrences of (any number of any character) here. So zero occurrences is one match.

That doesn't really explain why the regexp finds an extra empty string. I know that zero occurrences is one match but after a greedy match that matches everything, there should be (logically?) no other match. I am no stranger to regexps and the result is counter-intuitive to me; I would consider it a bug. Or at least a very very peculiar behavior.

Daniel

"hello".scan(/..*/)
=> ["hello"]

···

On 6/20/07, Daniel DeLorme <dan-ml@dan42.com> wrote:

Axel Etzold wrote:
>> irb(main):001:0> "hello".scan(/.*/)
>> => ["hello", ""]
>>
>> I was expecting one match, not two, because .* matches everything,
>> right? Can someone explain why an empty string is also matched?
>
> String.scan searches for all occurrences of (any number of any
> character) here. So zero occurrences is one match.

That doesn't really explain why the regexp finds an extra empty string.
I know that zero occurrences is one match but after a greedy match that
matches everything, there should be (logically?) no other match. I am no
stranger to regexps and the result is counter-intuitive to me; I would
consider it a bug. Or at least a very very peculiar behavior.

Daniel

Daniel DeLorme wrote:

Axel Etzold wrote:

irb(main):001:0> "hello".scan(/.*/)
=> ["hello", ""]

I was expecting one match, not two, because .* matches everything,
right? Can someone explain why an empty string is also matched?

String.scan searches for all occurrences of (any number of any
character) here. So zero occurrences is one match.

That doesn't really explain why the regexp finds an extra empty string.
I know that zero occurrences is one match but after a greedy match that
matches everything, there should be (logically?) no other match. I am no
stranger to regexps and the result is counter-intuitive to me; I would
consider it a bug. Or at least a very very peculiar behavior.

Daniel

I agree. Can someone explain why gsub, sub or scan matches with * are
different than =~ matches with *

puts "hello".gsub(/[aeiou]/, '<\1>') # h<>ll<>
puts "hello".gsub(/.*/, '<\1>') # <><>
print "before: #{$`}\n" # before: hello
print "match: #{$&}\n" # match:
print "after: #{$'}\n" # after:

puts "hello" =~ (/.*/) # 0
print "before: #{$`}\n" # before:
print "match: #{$&}\n" # match: hello
print "after: #{$'}\n" # after:

thanks!

···


It's because the pattern /.*/ matches everything, including the
absence of everything. Yes, with the proper regexs you can indeed have
tea and no tea at the same time. Certainly peculiar, but occasionally
useful.

So: since * matches "zero or more" characters when it starts the
search for .* it matches the absence (the 'zero') and then matches the
string (the 'or more').

To prevent this you need to indicate to your regular expression that
you only want the subset of 'everything' that is actually something.
Here are a couple ways to do this:

/.+/ will match 1 or more of something, so doesn't return the absence

/^.*/ will anchor the match to the start of the string, in a way
bypassing the match of zero (the pattern /^.*$/ makes this more
clear).

/..*/ will match everything after something, this is a modified form
of the above that isn't tied to the start of the string
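
For reference, here is a small sketch that runs the three suggested patterns, plus the original /.*/, through gsub. The helper name replace_all is made up for this illustration; the expected results in the comments are the ones already reported in this thread for "hello".

```ruby
# replace_all is a hypothetical helper used only for this illustration.
def replace_all(str, pattern)
  str.gsub(pattern, "P")
end

p replace_all("hello", /.*/)    # => "PP"  (the trailing empty match is replaced too)
p replace_all("hello", /.+/)    # => "P"   (needs at least one character)
p replace_all("hello", /^.*/)   # => "P"   (the empty match at the end is not at ^)
p replace_all("hello", /..*/)   # => "P"   (one character, then anything)

# sub replaces only the first match, so the second, empty match never comes up.
p "hello".sub(/.*/, "P")        # => "P"
```

If the goal is only ever to replace the whole string once, sub may be the simplest change, since it stops after the first match.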

-- Stephen

···

On 6/20/07, Daniel DeLorme <dan-ml@dan42.com> wrote:

That doesn't really explain why the regexp finds an extra empty string.
I know that zero occurrences is one match but after a greedy match that
matches everything, there should be (logically?) no other match. I am no
stranger to regexps and the result is counter-intuitive to me; I would
consider it a bug. Or at least a very very peculiar behavior.

Daniel

At face value it is a surprising result, but upon a bit of
consideration not illogical or faulty. The scan pattern first finds,
with greedy matching, the "hello" string. As you've said, after that
there is nothing left to match. But the pattern /.*/ is also a valid
match on nothing, as * means zero or more occurrences. For example:

irb(main):018:0> "hello".scan(/.*/)
=> ["hello", ""]
irb(main):019:0> "".scan(/.*/)
=> [""]

Compare this to using '+', which specifies there must be one or more
occurrences:

irb(main):020:0> "hello".scan(/.+/)
=> ["hello"]
irb(main):021:0> "".scan(/.+/)
=> []

That's also why anchoring to the start of the string removes the
behaviour while anchoring to the end does not.

···

On Jun 21, 2:16 am, Daniel DeLorme <dan...@dan42.com> wrote:

Axel Etzold wrote:
>> irb(main):001:0> "hello".scan(/.*/)
>> => ["hello", ""]

>> I was expecting one match, not two, because .* matches everything,
>> right? Can someone explain why an empty string is also matched?

> String.scan searches for all occurrences of (any number of any
> character) here. So zero occurrences is one match.

That doesn't really explain why the regexp finds an extra empty string.
I know that zero occurrences is one match but after a greedy match that
matches everything, there should be (logically?) no other match. I am no
stranger to regexps and the result is counter-intuitive to me; I would
consider it a bug. Or at least a very very peculiar behavior.

Hello Ryan

I agree. Can someone explain why gsub, sub or scan matches with * are
different than =~ matches with *

puts "hello".gsub(/[aeiou]/, '<\1>') # h<>ll<>

irb(main):024:0> "hello".gsub( /([aeiou])/, "<\\1>" )

Please note the () around the expression.
After that you can refer with \\1 to the found
letters.

puts "hello".gsub(/.*/, '<\1>') # <><>

irb(main):029:0> "hello".gsub(/(.*)/, '<\1>')
=> "<hello><>"
irb(main):030:0> "hello".gsub(/(.+)/, '<\1>')
=> "<hello>"

print "before: #{$`}\n" # before: hello

irb(main):031:0> $`
=> ""

print "match: #{$&}\n" # match:

irb(main):032:0> $&
=> "hello"

print "after: #{$'}\n" # after:

irb(main):033:0> $'
=> ""

hope this helps.

regards.
Karl-Heinz

···

In message "Do You Understand Regular Expressions?" on 21.06.2007, Ryan Mcdonald <ryemcdonald@gmail.com> writes:

That still doesn't really explain why "hello".scan(/.*/) => ["hello", ""]

Why wouldn't it be ["hello", "", "", "", "", "", "", "", "", "", "", "", ... ] since I (or rather the OP) could continue to match zero characters (bytes) at the end of the string forever? It does seem that it might be that a termination condition is checked a bit later than it should be in this case.

-Rob

Rob Biedenharn http://agileconsultingllc.com
Rob@AgileConsultingLLC.com

···

On Jun 21, 2007, at 9:47 AM, Stephen Ball wrote:

On 6/20/07, Daniel DeLorme <dan-ml@dan42.com> wrote:

That doesn't really explain why the regexp finds an extra empty string.
I know that zero occurrences is one match but after a greedy match that
matches everything, there should be (logically?) no other match. I am no
stranger to regexps and the result is counter-intuitive to me; I would
consider it a bug. Or at least a very very peculiar behavior.

Daniel

It's because the pattern /.*/ matches everything, including the
absence of everything. Yes, with the proper regexs you can indeed have
tea and no tea at the same time. Certainly peculiar, but occasionally
useful.
...
-- Stephen

Hi --

That doesn't really explain why the regexp finds an extra empty string.
I know that zero occurrences is one match but after a greedy match that
matches everything, there should be (logically?) no other match. I am no
stranger to regexps and the result is counter-intuitive to me; I would
consider it a bug. Or at least a very very peculiar behavior.

Daniel

It's because the pattern /.*/ matches everything, including the
absence of everything. Yes, with the proper regexs you can indeed have
tea and no tea at the same time. Certainly peculiar, but occasionally
useful.

So: since * matches "zero or more" characters when it starts the
search for .* it matches the absence (the 'zero') and then matches the
string (the 'or more').

It's the other way around, though; it matches "hello" *first*, and
then "". So the zero-matching (which I admit I'm among those who find
unexpected) is happening at the end.

To prevent this you need to indicate to your regular expression that
you only want the subset of 'everything' that is actually something.
Here are a couple ways to do this:

/.+/ will match 1 or more of something, so doesn't return the absence

/^.*/ will anchor the match to the start of the string, in a way
bypassing the match of zero (the pattern /^.*$/ makes this more
clear).

Here, again, "hello" is first, so /^.*/ matches it but doesn't match
the second time ("") because the "" isn't anchored to ^.

David

···

On Thu, 21 Jun 2007, Stephen Ball wrote:

On 6/20/07, Daniel DeLorme <dan-ml@dan42.com> wrote:

--
* Books:
   RAILS ROUTING (new! http://www.awprofessional.com/title/0321509242)
   RUBY FOR RAILS (http://www.manning.com/black)
* Ruby/Rails training
     & consulting: Ruby Power and Light, LLC (http://www.rubypal.com)

Why not simply change the 1 to a 0 ?

irb(main):001:0> puts "hello".gsub(/[aeiou]/, '<\0>')
h<e>ll<o>
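
To put the \0 and \1 variants side by side, here is a short sketch. It uses only String#gsub, and the block form at the end is just one more option, not something anyone above proposed.

```ruby
word = "hello"

# \0 stands for the whole match, so the pattern needs no parentheses.
p word.gsub(/[aeiou]/, '<\0>')      # => "h<e>ll<o>"

# \1 stands for the first capture group, so the pattern needs parentheses.
p word.gsub(/([aeiou])/, '<\1>')    # => "h<e>ll<o>"

# With no group in the pattern, \1 is empty: the "h<>ll<>" from earlier.
p word.gsub(/[aeiou]/, '<\1>')      # => "h<>ll<>"

# The block form receives the matched text and avoids backslashes entirely.
bracketed = word.gsub(/[aeiou]/) { |vowel| "<#{vowel}>" }
p bracketed                         # => "h<e>ll<o>"
```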

···

On Jun 21, 4:43 am, Wild Karl-Heinz <kh.w...@wicom.li> wrote:

Hello Ryan

In message "Do You Understand Regular Expressions?" on 21.06.2007, Ryan Mcdonald <ryemcdon...@gmail.com> writes:

> I agree. Can someone explain why gsub, sub or scan matches with * are
> different than =~ matches with *

> puts "hello".gsub(/[aeiou]/, '<\1>') # h<>ll<>

irb(main):024:0> "hello".gsub( /([aeiou])/, "<\\1>" )

Please note the () around the expression.
After that you can refer with \\1 to the found
letters.

[snip]

> So: since * matches "zero or more" characters when it starts the
> search for .* it matches the absence (the 'zero') and then matches the
> string (the 'or more').

It's the other way around, though; it matches "hello" *first*, and
then "". So the zero-matching (which I admit I'm among those who find
unexpected) is happening at the end.

Ah, but notice:

"hello".scan(/.*$/)
=> ["hello", ""]

"hello".scan(/^.*/)
=> ["hello"]

Strange indeed, but it seems that's how it's working. Although I
suspect I'm not fully grasping the subtleties introduced by the *
character.

Hmm, the more I think on it I think I have an answer:

The /^.*/ pattern specifies that the string must start with anything
(e.g. it must have at least one character) and then zero or more
characters following.

The /.*$/ pattern has no restriction since the anchor is on the side
with the * character. So it's parsed as "zero or more of anything
before the end of the string".

So, if that's correct, you are right that the absence is matched last.
Verified by the fact that the absence follows the string in the
pattern match.

-- Stephen

···

On 6/21/07, dblack@wobblini.net <dblack@wobblini.net> wrote:

I would say the condition is checked at the right time, it's just the
condition is different: it allows checking a match for empty string
at the end of just-matched string, it does not allow checking empty
string after empty string.

The interesting behaviour is:

irb(main):035:0> "hello".scan /.*?/
=> ["", "", "", "", "", ""]

The /.*?/ matches 'zero or more characters, preferring the shortest
match'. One could ask - where have the actual characters gone?
Note that it's not an infinite loop of empty strings.
After matching 'nothing', the start-position for next match is
increased, skipping one character, to prevent infinite loop of matching
nothing again.

*This* behaviour may be considered weird, or buggy, and the results
are probably not what was expected.
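
One way to see where those empty matches sit is to record the offsets of each match. This is only a sketch: match_offsets is a made-up helper name, and it relies on Regexp.last_match being set inside the scan block.

```ruby
# match_offsets is a hypothetical helper written for this illustration.
# It returns [start, end] character offsets for every match scan finds.
def match_offsets(str, pattern)
  offsets = []
  str.scan(pattern) do
    m = Regexp.last_match            # MatchData for the current match
    offsets << [m.begin(0), m.end(0)]
  end
  offsets
end

p match_offsets("hello", /.*?/)
# => [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]
p match_offsets("hello", /.*/)
# => [[0, 5], [5, 5]]

# The characters are skipped by scan but not lost in gsub, which copies them through.
p "hello".gsub(/.*?/, "-")           # => "-h-e-l-l-o-"
```

So the six empty strings are the empty matches before each character and at the end of the string; the characters themselves are stepped over rather than matched.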

But look at:

irb(main):038:0> "hello".scan /h(.*)e/
=> [[""]]
irb(main):039:0> "hello".scan /h(.*)(.*)(.*)(.*)(.*)e/
=> [["", "", "", "", ""]]

Here 'nothing' matches many times, and definitely this *is* the expected
behaviour.

···

On 2007-06-21 23:12:32 +0900 (Thu, Jun), Rob Biedenharn wrote:

On Jun 21, 2007, at 9:47 AM, Stephen Ball wrote:
>It's because the pattern /.*/ matches everything, including the
>absence of everything. Yes, with the proper regexs you can indeed have
>tea and no tea at the same time. Certainly peculiar, but occasionally
>useful.

That still doesn't really explain why "hello".scan(/.*/) => ["hello",
""]

Why wouldn't it be ["hello", "", "", "", "", "", "", "", "", "", "",
"", ... ] since I (or rather the OP) could continue to match zero
characters (bytes) at the end of the string forever? It does seem
that it might be that a termination condition is checked a bit later
than it should be in this case.

--
No virus found in this outgoing message.
Checked by 'grep -i virus $MESSAGE'
Trust me.

> It's because the pattern /.*/ matches everything, including the
> absence of everything.

> So: since * matches "zero or more" characters when it starts the
> search for .* it matches the absence (the 'zero') and then matches the
> string (the 'or more').

It's the other way around, though; it matches "hello" *first*, and
then "". So the zero-matching (which I admit I'm among those who find
unexpected) is happening at the end.

Oh right, I think I get it now. If you try to match anything with *
then a match is guaranteed, because if there's nothing to match, then
you'll just match nothing?

Like this:

irb(main):001:0> "hello".scan(/h*/)
=> ["h", "", "", "", "", ""]

And this:

irb(main):002:0> "hello".scan(/P*/)
=> ["", "", "", "", "", ""]

I've always assumed that .* matches everything, and used it that way,
but I suppose .+ does make more sense. Although I have to say
I still find it a bit odd...

Thanks everyone for your help.

As far as I remember it works like this: first .* matches the whole sequence. Then the "cursor" is placed behind the match, i.e. after the last char of the match and matching starts again. At this place the empty sequence matches because we're at the end of the match. After that match the cursor is advanced one step (to avoid endless repetitions) and - alas! - we're at the end of the string and matching stops.
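
The cursor description above can be written out as a rough sketch. This is not how Ruby implements scan internally, only an imitation of the behaviour described: scan_by_hand is a made-up name, it ignores capture groups, and it assumes Regexp#match accepts a start position (Ruby 1.9 and later, so not the 1.8.5 mentioned at the top of the thread).

```ruby
# scan_by_hand is a hypothetical helper, not part of Ruby. It only handles
# patterns without capture groups and assumes Regexp#match(str, pos).
def scan_by_hand(str, pattern)
  results = []
  pos = 0
  while pos <= str.length
    m = pattern.match(str, pos)    # search from the cursor onwards
    break unless m
    results << m[0]
    if m.begin(0) == m.end(0)
      # Empty match: move the cursor one character on, so the same
      # empty match is not found again forever.
      pos = m.end(0) + 1
    else
      # Non-empty match: the cursor goes right behind the match.
      pos = m.end(0)
    end
  end
  results
end

p scan_by_hand("hello", /.*/)    # => ["hello", ""]
p scan_by_hand("hello", /.+/)    # => ["hello"]
p scan_by_hand("hello", /^.*/)   # => ["hello"]
p scan_by_hand("hello", /.*?/)   # => ["", "", "", "", "", ""]
```

With /.*/ the first pass leaves the cursor at offset 5, where one empty match is still possible; after that the cursor moves to 6, past the end of the string, and the loop stops.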

For learning regular expressions this is a great program: it lets you step through the matching process graphically:
http://weitz.de/regex-coach/

See also this thread: http://groups.google.de/group/comp.lang.ruby/browse_frm/thread/9bf7989dd42374f7/f759612390ff905f?lnk=st&q=&rnum=10#f759612390ff905f

Btw, for replacing the whole string this is much better:

irb(main):001:0> s = "foo"
=> "foo"
irb(main):002:0> s.object_id
=> 1073540760
irb(main):003:0> s.replace "bar"
=> "bar"
irb(main):004:0> s.object_id
=> 1073540760
irb(main):005:0> s
=> "bar"
irb(main):006:0>

Kind regards

  robert

···

On 21.06.2007 16:12, Rob Biedenharn wrote:

On Jun 21, 2007, at 9:47 AM, Stephen Ball wrote:

On 6/20/07, Daniel DeLorme <dan-ml@dan42.com> wrote:

That doesn't really explain why the regexp finds an extra empty string.
I know that zero occurrences is one match but after a greedy match that
matches everything, there should be (logically?) no other match. I am no
stranger to regexps and the result is counter-intuitive to me; I would
consider it a bug. Or at least a very very peculiar behavior.

Daniel

It's because the pattern /.*/ matches everything, including the
absence of everything. Yes, with the proper regexs you can indeed have
tea and no tea at the same time. Certainly peculiar, but occasionally
useful.
...
-- Stephen

That still doesn't really explain why "hello".scan(/.*/) => ["hello", ""]

Why wouldn't it be ["hello", "", "", "", "", "", "", "", "", "", "", "", ... ] since I (or rather the OP) could continue to match zero characters (bytes) at the end of the string forever? It does seem that it might be that a termination condition is checked a bit later than it should be in this case.

Hi --

···

On Fri, 22 Jun 2007, Stephen Ball wrote:

On 6/21/07, dblack@wobblini.net <dblack@wobblini.net> wrote:
[snip]

> So: since * matches "zero or more" characters when it starts the
> search for .* it matches the absence (the 'zero') and then matches the
> string (the 'or more').

It's the other way around, though; it matches "hello" *first*, and
then "". So the zero-matching (which I admit I'm among those who find
unexpected) is happening at the end.

Ah, but notice:

"hello".scan(/.*$/)
=> ["hello", ""]

"hello".scan(/^.*/)
=> ["hello"]

Strange indeed, but it seems that's how it's working. Although I
suspect I'm not fully grasping the subtleties introduced by the *
character.

Hmm, the more I think on it I think I have an answer:

The /^.*/ pattern specifies that the string must start with anything
(e.g. it must have at least one character) and then zero or more
characters following.

The /.*$/ pattern has no restriction since the anchor is on the side
with the * character. So it's parsed as "zero or more of anything
before the end of the string".

So, if that's correct, you are right that the absence is matched last.
Verified by the fact that the absence follows the string in the
pattern match.

Yes, that was what I was mostly going by :)

David


[...]

The /^.*/ pattern specifies that the string must start with anything
(e.g. it must have at least one character) and then zero or more
characters following.

^ anchors the match to beginning of a line or the beginning of the
string. The second match fails because it's starting from the first
point after "hello", where it left off. It says nothing about the
content that follows.

"".scan /^.*/ => [""]

The /.*$/ pattern has no restriction since the anchor is on the side
with the * character. So it's parsed as "zero or more of anything
before the end of the string".

This is correct. First it finds the longest match it can in "hello".
Then it finds nothing, but still anchored at the end of the line. Note
that $ does not anchor the end of the string, but the end of each line
within the string or the very end. \z matches the actual end of
string, while \A does the same for the beginning.
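
A two-line string makes the difference between the line anchors and the string anchors visible. This is a small sketch using only String#scan; the sample string is made up.

```ruby
text = "one\ntwo"

# ^ matches at the start of every line, \A only at the start of the string.
p text.scan(/^.*/)     # => ["one", "two"]
p text.scan(/\A.*/)    # => ["one"]

# $ matches at the end of every line, so each line gets its own trailing
# empty match; \z matches only at the very end of the string.
p text.scan(/.*$/)     # => ["one", "", "two", ""]
p text.scan(/.*\z/)    # => ["two", ""]
```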

Hope this helps.

···

On 6/21/07, Stephen Ball <sdball@gmail.com> wrote:

--
Sami Samhuri

It's because the pattern /.*/ matches everything, including the
absence of everything. Yes, with the proper regexs you can indeed have
tea and no tea at the same time. Certainly peculiar, but occasionally
useful.

That still doesn't really explain why "hello".scan(/.*/) => ["hello",
""]

Why wouldn't it be ["hello", "", "", "", "", "", "", "", "", "", "",
"", ... ] since I (or rather the OP) could continue to match zero
characters (bytes) at the end of the string forever? It does seem
that it might be that a termination condition is checked a bit later
than it should be in this case.

I would say the condition is checked at the right time, it's just the
condition is different: it allows checking a match for empty string
at the end of just-matched string, it does not allow checking empty
string after empty string.

The interesting behaviour is:

irb(main):035:0> "hello".scan /.*?/
=> ["", "", "", "", "", ""]

The /.*?/ matches 'zero or more characters, preferring the shortest
match'. One could ask - where have the actual characters gone?
Note that it's not an infinite loop of empty strings.
After matching 'nothing', the start-position for next match is
increased, skipping one character, to prevent infinite loop of matching
nothing again.

*This* behaviour may be considered weird, or buggy, and the results
are probably not what was expected.

A great example which I *do* consider to be buggy. The similar example from perl is something like:
$ perl -e '$h = "hello"; $h =~ s/.*?/[$&]/g; print "$h\n";'
[h][e][l][l][o]

It matches the empty string at the beginning, between each character, and at the end, but it does consume the actual characters of the string. Even if not what one would anticipate, it's not too hard to justify the result. (Something that can't be said for ruby's ["","","","","",""].)

The other versions from perl are enlightening:
$ perl -e '$h = "hello"; $h =~ s/.?/[$&]/g; print "$h\n";'
[h][e][l][l][o]

$ perl -e '$h = "hello"; $h =~ s/.*/[$&]/g; print "$h\n";'
[hello]

Both succeed in a zero-character match at the end. These are equivalent in ruby (1.8.5):

$ ruby -e 'puts "hello".scan(/.?/).inspect'
["h", "e", "l", "l", "o", ""]

$ ruby -e 'puts "hello".scan(/.*/).inspect'
["hello", ""]

I thought I'd see what Oniguruma (5.8.0; with 1.1.0 gem) had to say:

require 'oniguruma'

=> true

reluctant = Oniguruma::ORegexp.new('.*?')

=> /.*?/

greedy = Oniguruma::ORegexp.new('.*')

=> /.*/

greedyq = Oniguruma::ORegexp.new('.?')

=> /.?/

reluctant.scan("hello")

=> [#<MatchData:0x10b9aa4>, #<MatchData:0x10b9a7c>, #<MatchData:0x10b9a68>, #<MatchData:0x10b9a40>, #<MatchData:0x10b9a18>, #<MatchData:0x10b99f0>]

reluctant.scan("hello").map{|md|md[0]}

=> ["", "", "", "", "", ""]

greedy.scan("hello").map{|md|md[0]}

=> ["hello", ""]

greedyq.scan("hello").map{|md|md[0]}

=> ["h", "e", "l", "l", "o", ""]

OK, the same result as Ruby's built-in Regexp, including that .*? produces [""]*6. Those are the zero-length matches at the "before each character and at the end" locations that Perl also finds, but the individual single-character matches are missing.

I presume that there's some justification for these behaviors, but I can't figure out what it might be.

-Rob

But look at:

irb(main):038:0> "hello".scan /h(.*)e/
=> [[""]]
irb(main):039:0> "hello".scan /h(.*)(.*)(.*)(.*)(.*)e/
=> [["", "", "", "", ""]]

Here 'nothing' matches many times, and definitely this *is* the expected
behaviour.

I agree that those results are exactly what I'd expect.


Rob Biedenharn http://agileconsultingllc.com
Rob@AgileConsultingLLC.com

···

On Jun 22, 2007, at 6:55 AM, Mariusz Pękala wrote:

On 2007-06-21 23:12:32 +0900 (Thu, Jun), Rob Biedenharn wrote:

On Jun 21, 2007, at 9:47 AM, Stephen Ball wrote:

".*" has its use but it's generally overrated, i.e. more often used than needed / wanted. If you show a more concrete example of what you are doing we might be able to come up with better suggestions. If you are really interested to dive into the matter then I suggest "Mastering Regular Expressions" which is an excellent book for the money.

Kind regards

  robert

···

On 22.06.2007 14:15, growlatoe@yahoo.co.uk wrote:

It's because the pattern /.*/ matches everything, including the
absence of everything.
So: since * matches "zero or more" characters when it starts the
search for .* it matches the absence (the 'zero') and then matches the
string (the 'or more').

It's the other way around, though; it matches "hello" *first*, and
then "". So the zero-matching (which I admit I'm among those who find
unexpected) is happening at the end.

Oh right, I think I get it now. If you try to match anything with *
then a match is guaranteed, because if there's nothing to match, then
you'll just match nothing?

Like this:

irb(main):001:0> "hello".scan(/h*/)
=> ["h", "", "", "", "", ""]

And this:

irb(main):002:0> "hello".scan(/P*/)
=> ["", "", "", "", "", ""]

I've always assumed that .* matches everything, and used it that way,
but I suppose .+ does make more sense. Although I have to say
I still find it a bit odd...