Phew… regexp now supports Unicode
download as TGZ:
http://rubyforge.org/frs/download.php/706/regexp-engine-0.11.tar.gz
download as ZIP:
http://rubyforge.org/frs/download.php/707/regexp-engine-0.11.zip
demo site:
http://neoneye.dk/regexp.rbx
Overview
========
This regexp engine is written entirely in Ruby. It is highly compatible
with Ruby’s built-in regexp engine. Carefully tested (2000+ tests).
Features
There are currently 3 parsers: perl5, perl6, xml.
Encodings supported: ASCII, UTF8.
Not yet supported stuff
Send me a mail in case there is something you want, or if
you are a developer yourself, send me some patches.
- subcaptures inside negative-lookahead/behind.
- grammars.
- UTF16 and other encodings.
- inline-code.
- named captures.
- possessive quantifiers.
- recursive expressions.
Perl5 syntax
a|b|c alternation
[…] [^…] character class… and inverse charclass
[[:alpha:]] posix character class
[[:^alpha:]] inverse posix character class
. dot matches anything except newline, same as [^\n]
\1 … \9 backreference . . . . . . . . . . . . . . . . . . . . . . see [3]
* *? loop 0 or more times greedy/lazy
+ +? loop 1 or more times greedy/lazy
{n,} {n,}? loop n or more times greedy/lazy
? ?? loop 0…1 times greedy/lazy
{n,m} {n,m}? loop n…m times greedy/lazy
{n} {n}? loop n times greedy/lazy
( … ) capturing group
(?: … ) non-capturing group
(?> … ) atomic grouping
(?= … ) positive-lookahead
(?! … ) negative-lookahead . . . . . . . . . . . . . . . . . . . see [2]
(?<= … ) positive-lookbehind . . . . . . . . . . . . . . . . . . . see [1]
(?<! … ) negative-lookbehind . . . . . . . . . . . . . . . . . . . see [1], [2]
(?# … ) posix-comment
(?i) (?-i) ignorecase on/off
(?m) (?-m) multiline on/off
(?x) (?-x) extended on/off
^ \A begin of line, begin of string
$ \z \Z end of line, end of string (excl newline)
\b \B word boundary, nonword boundary
\d \D [[:digit:]] and the inverse [^[:digit:]]
\s \S [[:space:]] and the inverse [^[:space:]]
\w \W [[:word:]] and the inverse [^[:word:]]
\x20 hex . . . . . . . . . . . . . . . . . . . . . . . . . . . see [4]
\040 octal . . . . . . . . . . . . . . . . . . . . . . . . . . see [3], [4]
\x{deadbeef} widechar codepoint specified as hex
\n newline
\a bell
\ escape next char
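A few of the constructs in the table above, sketched in Ruby. This is only an illustration: it uses Ruby's built-in Regexp as a stand-in, on the assumption that the engine behaves the same way for these patterns. The API of the engine itself is not shown in this announcement, so none of its classes or methods appear here.

```ruby
# Stand-in: Ruby's built-in Regexp, not the engine announced here.
text = "<a><b>"

# Greedy vs. lazy quantifier.
greedy = text[/<.+>/]     # .+ grabs as much as it can
lazy   = text[/<.+?>/]    # .+? stops at the first possible ">"

# Atomic group: once (?>a+) has matched, it never gives characters back,
# so the trailing "a" has nothing left to match.
atomic_fails = ("aaa" =~ /(?>a+)a/).nil?
plain_ok     = ("aaa" =~ /a+a/) == 0

# Lookahead asserts what follows without consuming it.
price = "cost: 42 USD"[/\d+(?= USD)/]
no_b  = "ab ac"[/a(?!b)\w/]

# POSIX character class and backreference.
posix   = "abc123"[/[[:alpha:]]+/]
doubled = "hello"[/(.)\1/]

puts greedy, lazy, price, no_b, posix, doubled
```

With the built-in engine this prints the greedy match first and the lazy match second, which makes the difference between the two quantifier columns easy to see.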
precedence between operators:
() pattern memory
+ * ? {} number of occurrences
^ $ \b \B pattern anchors
| alternatives
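The precedence order can be checked with two small experiments. The sketch below again uses Ruby's built-in Regexp as a stand-in (an assumption, since the engine's own API is not described here); the variable names are made up for the illustration.

```ruby
# Alternation binds loosest: /\Aab|cd\z/ reads as (\Aab) | (cd\z),
# so the anchors apply to one branch each.
loose   = /\Aab|cd\z/
grouped = /\A(?:ab|cd)\z/

loose_hit   = loose.match?("abXX")    # first branch \Aab matches
grouped_hit = grouped.match?("abXX")  # whole string must be "ab" or "cd"

# A quantifier binds tighter than concatenation: it applies to the
# preceding atom only, unless a group widens its scope.
single_atom = /\Aab+\z/
whole_group = /\A(?:ab)+\z/

atom_abbb  = single_atom.match?("abbb")
atom_abab  = single_atom.match?("abab")
group_abab = whole_group.match?("abab")

p [loose_hit, grouped_hit, atom_abbb, atom_abab, group_abab]
```

Grouping with (?: … ) is the way to override the default order without creating a capture.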
[1] Variable-width lookbehind is fairly well supported by this engine.
    For instance, (?<=(a.*)g) is a valid expression.
    Beware that the leftmost-longest rule is reversed inside lookbehind,
    and that backreferences are not possible (yet).
[2] Subcaptures inside negative-lookahead/behind are empty
    at the moment.
[3] If one tries to backreference a nonexistent capture, then it
    will be interpreted as an octal symbol.
[4] When encoding is ASCII, you can specify hex/octal values in
    the range 0-255. However, when encoding is UTF8, only the
    range 0-127 is valid; in this case the range 128-255 is undefined.
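The note about hex/octal ranges follows from how UTF-8 encodes bytes: values 0-127 are complete characters on their own, while a lone byte in 128-255 is not a valid UTF-8 sequence. A minimal sketch using only Ruby's standard String methods (not the engine itself):

```ruby
# Build single-byte strings and ask Ruby whether they are valid UTF-8.
low_byte  = [0x7F].pack("C").force_encoding(Encoding::UTF_8)
high_byte = [0xE9].pack("C").force_encoding(Encoding::UTF_8)

low_valid  = low_byte.valid_encoding?    # 0x7F is plain ASCII
high_valid = high_byte.valid_encoding?   # a lone 0xE9 is an incomplete sequence

# The same byte is fine when the string is treated as raw 8-bit data.
binary_valid = [0xE9].pack("C").valid_encoding?

# Codepoint U+00E9 needs two bytes in UTF-8.
encoded_bytes = [0xE9].pack("U").bytes

p low_valid, high_valid, binary_valid, encoded_bytes
```

For codepoints above 127 in UTF8 mode, the \x{…} widechar form from the syntax table is presumably the intended route, though the announcement does not say so explicitly.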
Call For Help
Establish contact if you are interested in perl6 regexp.
Establish contact if you have knowledge about Asian text encodings.
--
Simon Strandgaard