Hi,
I'm used to being able to use the following in PHP. What it basically
does is return all matches, including the captures, ordered by match
set, together with their offsets.
$ php -r 'preg_match_all("/_(\w+)_/", "_foo_ _bar_", $matches,
PREG_SET_ORDER|PREG_OFFSET_CAPTURE); var_dump($matches);'
array(2) {
  [0]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      string(5) "_foo_"
      [1]=>
      int(0)
    }
    [1]=>
    array(2) {
      [0]=>
      string(3) "foo"
      [1]=>
      int(1)
    }
  }
  [1]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      string(5) "_bar_"
      [1]=>
      int(6)
    }
    [1]=>
    array(2) {
      [0]=>
      string(3) "bar"
      [1]=>
      int(7)
    }
  }
}
I've found two ways in Ruby that go in this direction, String#match and
String#scan, but each gives me only part of the information. I guess I
could combine the two, but before attempting that I wanted to check
that I haven't overlooked something. Here are my Ruby attempts:
ruby-1.9.2-p180 :001 > m = "_foo_ _bar_".match(/_(\w+)_/)
=> #<MatchData "_foo_" 1:"foo">
ruby-1.9.2-p180 :002 > [ m[0], m[1] ]
=> ["_foo_", "foo"]
ruby-1.9.2-p180 :003 > [ m.begin(0), m.begin(1) ]
=> [0, 1]
But here I'm missing the further possible matches, "_bar_" and "bar". Or
the #scan approach:
ruby-1.9.2-p180 :004 > m = "_foo_ _bar_".scan(/_(\w+)_/)
=> [["foo"], ["bar"]]
But in this case I have even less information: the full match (_foo_
or _bar_) is not present, and I can't get the offsets either.
I re-read the documentation for Regexp#match and found that you can
pass an offset into the string as the second parameter, so I guess I can
iterate over the string in a loop until there are no further matches?
With that in mind I came up with:
$ cat test_match_all.rb
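require 'pp'

class String
  # Return an array of MatchData objects, one per non-overlapping match.
  def match_all(pattern)
    matches = []
    offset = 0
    while (m = match(pattern, offset))
      matches << m
      # Resume after this match; step one extra character past an empty
      # match so the loop cannot get stuck at the same position.
      offset = m.end(0) == m.begin(0) ? m.end(0) + 1 : m.end(0)
    end
    matches
  end
end

pp "_foo_ _bar_ _baz_".match_all(/_(\w+)_/)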
$ ruby test_match_all.rb
[#<MatchData "_foo_" 1:"foo">,
#<MatchData "_bar_" 1:"bar">,
#<MatchData "_baz_" 1:"baz">]
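Each MatchData carries both the matched text and its offsets, so getting
from here to the nested structure PHP gives me looks like a small step.
A minimal sketch of what I mean, where to_offset_sets is just a name I
made up, and where the offsets are character offsets as reported by
MatchData#begin (PHP reports byte offsets, which is the same thing for
this ASCII input):

```ruby
# Hypothetical helper (my own name): convert MatchData objects into
# PHP-style sets of [text, offset] pairs, one set per match.
def to_offset_sets(matches)
  matches.map do |m|
    (0...m.size).map { |i| [m[i], m.begin(i)] }
  end
end

# Collect the matches with the same offset-stepping loop as above.
str = "_foo_ _bar_"
matches = []
pos = 0
while (m = str.match(/_(\w+)_/, pos))
  matches << m
  pos = m.end(0)
end

sets = to_offset_sets(matches)
p sets
# => [[["_foo_", 0], ["foo", 1]], [["_bar_", 6], ["bar", 7]]]
```

That reproduces the PREG_SET_ORDER|PREG_OFFSET_CAPTURE layout for the
example string, but it still relies on the same loop.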
I have a lot of data to parse, so I can foresee this approach becoming
a bottleneck. Is there a more direct solution?
thanks,
- Markus