Wow, that's pretty cool, actually. I like it. A couple of criticism
(meant as constructive):
Thanks Caleb. Your criticisms and comments were indeed constructive
and very welcome. Much appreciated.
I'm a little concerned about what will happen with this regexp:
/(?:foo){1,10000}/. Currently, Regexp handles this just fine, but with
your lib I fear there might be a performance problem.
One problem I saw with large repetitions under 1.8.x is exceeding the
maximum number of allowed grouped sub-expressions, which is hard set
at >= 254 (or ((1<<8)-1) to quote the code). So the maximum number of
grouped repetitions under 1.8 will always be 253. I initially thought
it was a buffer exhaustion issue, but after checking, it turns out
there's room for up to 64K under 1.8. In the case of
/(?:foo){1,10000)/, the group limit was being reached when the regex
was only at about 1.7K. With 1.9.x there doesn't appear to be a limit,
or at least I didn't hit it. I stopped trying at 250_000. Yet another
great reason to move to 1.9.
As for performance, I haven't done any testing, but expect expressions
with greedy parts that require a lot of backtracking (already poorer
performers with a potential for exponential time) to get worse when
they get expanded. A benchmark is in order for sure. I would like to
also gauge how helpful atomic groups (?>) can be in situations like
this.
Your Regexp parser looks to me to be a little naive. I gave it only
the briefest of scannings, but it looks like you're not handling
special characters escaped by [ and ], nor some of the other types of
ways of using backslashes, like \C-x, for example. I'm not entirely
clear what happens if there are two or more backslashes in a row. (A
character is only escaped by backslashes if preceded by an odd number
of them.)
You are right, the parser is naive and incomplete. It doesn't handle
several cases, including character classes. I didn't even consider
other escape sequences like control/meta characters or even \x hex
values. Thanks for pointing that omission out.
I'm not sure about the backslashes. What would you expect from this:
re = MetaRegexp.new '\d\\d\\\d\\\\d'
=> /\d\d\\d\\d/
The standard Regexp produces the same results:
re = Regexp.new '\d\\d\\\d\\\\d'
=> /\d\d\\d\\d/
Am I missing something? I will double check.
I wrote a more complete (at least I hope so) regexp parser (but also
specialized for my purposes) as part of my lib sequence. See the
group_anchors method in this file:
sequence/lib/sequence/stringlike.rb at master · coatl/sequence · GitHub
It would be nice to have a Regexp parser as a standalone library, actually.
Cool. Thanks for sharing that. Even after a quick read I learned a
couple of interesting things. I like the way you use split and the
handling of the escape sequences.
Also, I really like the idea of a standalone regex parser. Hmmm...
very tempting.
···
On Sun, Oct 3, 2010 at 1:37 AM, Caleb Clausen <vikkous@gmail.com> wrote: