It's because the pattern /.*/ matches everything, including the
absence of everything. Yes, with the proper regexes you can indeed have
tea and no tea at the same time. Certainly peculiar, but occasionally
useful.
That still doesn't really explain why "hello".scan(/.*/) => ["hello",
""]
Why wouldn't it be ["hello", "", "", "", "", "", "", "", "", "", "",
"", ... ] since I (or rather the OP) could continue to match zero
characters (bytes) at the end of the string forever? It does seem
that it might be that a termination condition is checked a bit later
than it should be in this case.
I would say the condition is checked at the right time; it's just that
the condition is different: it allows an empty match at the end of the
just-matched string, but it does not allow an empty match directly
after another empty match.
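To make that concrete, here is a small check of where each match of /.*/ starts and ends. It is only an illustration; it assumes a Ruby where Regexp.last_match is set inside the scan block, and the variable name spans is made up for the example.

```ruby
# Record the [start, end] offsets of every match that scan reports.
spans = []
"hello".scan(/.*/) { spans << Regexp.last_match.offset(0) }

p spans  # => [[0, 5], [5, 5]]
```

The second match is the empty string sitting at offset 5, right after "hello"; nothing is attempted beyond that.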
The interesting behaviour is:
irb(main):035:0> "hello".scan /.*?/
=> ["", "", "", "", "", ""]
The /.*?/ matches 'zero or more characters, preferring the shortest
match'. One could ask - where have the actual characters gone?
Note that it's not an infinite loop of empty strings.
After matching 'nothing', the start position for the next match is
increased, skipping one character, to prevent an infinite loop of
matching nothing again.
*This* behaviour may be considered weird, or buggy, and the results are
probably not what was expected.
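The rule described above can be sketched as a hand-written loop. This is not the real String#scan implementation, just a guess at the rule that reproduces the results in this thread; scan_sketch is a hypothetical name, capture groups are ignored, and it assumes a Ruby where Regexp#match takes a start position (1.9 or later).

```ruby
# A rough model of scan: search from pos, record the match, and after an
# empty match step one character forward so we cannot loop forever.
def scan_sketch(str, re)
  results = []
  pos = 0
  while pos <= str.length
    md = re.match(str, pos)
    break unless md
    results << md[0]
    if md.begin(0) == md.end(0)
      pos = md.end(0) + 1   # empty match: skip one character
    else
      pos = md.end(0)       # normal match: continue right after it
    end
  end
  results
end

p scan_sketch("hello", /.*/)   # => ["hello", ""]
p scan_sketch("hello", /.*?/)  # => ["", "", "", "", "", ""]
p scan_sketch("hello", /.?/)   # => ["h", "e", "l", "l", "o", ""]
```

With /.*?/ the skipped character is never offered to the pattern again, which is where the actual characters go.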
A great example, and one which I *do* consider to be buggy. A similar example from perl is something like:
$ perl -e '$h = "hello"; $h =~ s/.*?/[$&]/g; print "$h\n";'
[][h][][e][][l][][l][][o][]
It matches the empty string at the beginning, between each character, and at the end, but it does consume the actual characters of the string. Even if not what one would anticipate, it's not too hard to justify the result. (Something that can't be said for ruby's ["","","","","",""].)
The other versions from perl are enlightening:
$ perl -e '$h = "hello"; $h =~ s/.?/[$&]/g; print "$h\n";'
[h][e][l][l][o][]
$ perl -e '$h = "hello"; $h =~ s/.*/[$&]/g; print "$h\n";'
[hello][]
Both succeed in a zero-character match at the end. Here are the equivalents in ruby (1.8.5):
$ ruby -e 'puts "hello".scan(/.?/).inspect'
["h", "e", "l", "l", "o", ""]
$ ruby -e 'puts "hello".scan(/.*/).inspect'
["hello", ""]
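For a closer comparison with the perl s///g one-liners, the same substitution can be written with gsub. This is only a sketch for comparison; the lambda name bracket is invented for the example.

```ruby
# Wrap every match in brackets, like perl's s/PATTERN/[$&]/g.
bracket = ->(re) { "hello".gsub(re) { |m| "[#{m}]" } }

puts bracket.call(/.*?/)  # => []h[]e[]l[]l[]o[]
puts bracket.call(/.?/)   # => [h][e][l][l][o][]
puts bracket.call(/.*/)   # => [hello][]
```

In the /.*?/ case the characters survive in the output, but they sit outside the brackets: gsub copies each skipped character through unmatched rather than retrying a non-empty match at the same position.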
I thought I'd see what Oniguruma (5.8.0; with 1.1.0 gem) had to say:
require 'oniguruma'
=> true
reluctant = Oniguruma::ORegexp.new('.*?')
=> /.*?/
greedy = Oniguruma::ORegexp.new('.*')
=> /.*/
greedyq = Oniguruma::ORegexp.new('.?')
=> /.?/
reluctant.scan("hello")
=> [#<MatchData:0x10b9aa4>, #<MatchData:0x10b9a7c>, #<MatchData:0x10b9a68>, #<MatchData:0x10b9a40>, #<MatchData:0x10b9a18>, #<MatchData:0x10b99f0>]
reluctant.scan("hello").map{|md|md[0]}
=> ["", "", "", "", "", ""]
greedy.scan("hello").map{|md|md[0]}
=> ["hello", ""]
greedyq.scan("hello").map{|md|md[0]}
=> ["h", "e", "l", "l", "o", ""]
OK, the same result as the ruby Regexp, including that .*? produces [""]*6. Those are the "before each character and at the end" locations of the zero-length matches from perl, but the individual single-byte matches are missing.
I presume that there's some justification for these behaviors, but I can't figure out what it might be.
-Rob
But look at:
irb(main):038:0> "hello".scan /h(.*)e/
=> [[""]]
irb(main):039:0> "hello".scan /h(.*)(.*)(.*)(.*)(.*)e/
=> [["", "", "", "", ""]]
Here 'nothing' matches many times, and definitely this *is* the expected
behaviour.
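A quick look at the MatchData shows where those empty groups sit. This is just an illustration using a two-group variant of the pattern above; the variable name md is arbitrary.

```ruby
md = /h(.*)(.*)e/.match("hello")

p md[0]         # => "he"
p md.captures   # => ["", ""]
p md.offset(1)  # => [1, 1]
p md.offset(2)  # => [1, 1]
```

Both groups match the empty string at the same offset, between the "h" and the "e", within a single overall match, so the skip-ahead rule never comes into play.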
I agree that those results are exactly what I'd expect.
--
No virus found in this outgoing message.
Checked by 'grep -i virus $MESSAGE'
Trust me.
Rob Biedenharn http://agileconsultingllc.com
Rob@AgileConsultingLLC.com
On Jun 22, 2007, at 6:55 AM, Mariusz Pękala wrote:
On 2007-06-21 23:12:32 +0900 (Thu, Jun), Rob Biedenharn wrote:
On Jun 21, 2007, at 9:47 AM, Stephen Ball wrote: