[ann] regexp-engine 0.11

Phew… regexp now supports Unicode :wink:

download as TGZ:
http://rubyforge.org/frs/download.php/706/regexp-engine-0.11.tar.gz

download as ZIP:
http://rubyforge.org/frs/download.php/707/regexp-engine-0.11.zip

demo site:
http://neoneye.dk/regexp.rbx

changelog:
http://rubyforge.org/cgi-bin/viewcvs/cgi/viewcvs.cgi/projects/regexp_engine/CHANGES?rev=1.72&cvsroot=aeditor&content-type=text/vnd.viewcvs-markup

Overview

The regexp engine is written entirely in Ruby. It is highly compatible
with Ruby’s builtin regexp engine and carefully tested (2000+ tests).

Features

There are currently 3 parsers: perl5, perl6, xml.

Encodings supported: ASCII, UTF8.

Not yet supported stuff

Send me a mail if there is something you want, or if
you are a developer yourself, send me some patches.

  • subcaptures inside negative-lookahead/behind.
  • grammars.
  • UTF16 and other encodings.
  • inline-code.
  • named captures.
  • possessive quantifiers.
  • recursive expressions.

Perl5 syntax

a|b|c alternation
[…] [^…] character class… and inverse charclass
[[:alpha:]] posix character class
[[:^alpha:]] inverse posix character class
. dot matches anything except newline, same as [^\n]
\1 … \9 backreference . . . . . . . . . . . . . . . . . . . . . . see [3]

* *? loop 0 or more times greedy/lazy
+ +? loop 1 or more times greedy/lazy
{n,} {n,}? loop n or more times greedy/lazy
? ?? loop 0…1 times greedy/lazy
{n,m} {n,m}? loop n…m times greedy/lazy
{n} {n}? loop n times greedy/lazy
( … ) capturing group
(?: … ) non-capturing group
(?> … ) atomic grouping
(?= … ) positive-lookahead
(?! … ) negative-lookahead . . . . . . . . . . . . . . . . . . . see [2]
(?<= … ) positive-lookbehind . . . . . . . . . . . . . . . . . . . see [1]
(?<! … ) negative-lookbehind . . . . . . . . . . . . . . . . . . . see [1], [2]
(?# … ) posix-comment
(?i) (?-i) ignorecase on/off
(?m) (?-m) multiline on/off
(?x) (?-x) extended on/off
^ \A begin of line, begin of string
$ \z \Z end of line, end of string (excl newline)
\b \B word boundary, nonword boundary
\d \D [[:digit:]] and the inverse [^[:digit:]]
\s \S [[:space:]] and the inverse [^[:space:]]
\w \W [[:word:]] and the inverse [^[:word:]]
\x20 hex . . . . . . . . . . . . . . . . . . . . . . . . . . . see [4]
\040 octal . . . . . . . . . . . . . . . . . . . . . . . . . . see [3], [4]
\x{deadbeef} widechar codepoint specified as hex
\n newline
\a bell
\ escape next char
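
A few rows of the table above can be tried out directly. The sketch below uses Ruby's builtin Regexp as a stand-in, not regexp-engine itself (its API is not shown in this announcement), on the assumption that both behave the same for these constructs. The sample strings are made up.

```ruby
# Builtin Regexp used as a stand-in; all sample strings are invented.
text = "<b>bold</b> and <i>italic</i>"

greedy = text[/<.+>/]     # + takes as much as possible
lazy   = text[/<.+?>/]    # +? takes as little as possible

# Lookahead and lookbehind assert context without consuming it.
price  = "cost: 42 USD"[/\d+(?= USD)/]
amount = "total=17"[/(?<==)\d+/]

# An atomic group never gives back what it matched, so the second
# pattern fails where the plain group succeeds by backtracking.
plain  = "aaab" =~ /(?:a+)ab/
atomic = "aaab" =~ /(?>a+)ab/

# \x20 (hex) and \040 (octal) both denote a space.
hex_ok   = " " =~ /\A\x20\z/
octal_ok = " " =~ /\A\040\z/

p greedy, lazy, price, amount, plain, atomic, hex_ok, octal_ok
```

The atomic case is the one most likely to surprise: `(?>a+)` keeps all the a's it consumed, so the trailing `ab` can never match.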

precedence between operators:
() pattern memory
+ * ? {} number of occurrences
^ $ \b \B pattern anchors
| alternatives

  1. Variable-width lookbehind is fairly well supported by this engine.
    For instance, (?<=(a.*)g) is a valid expression.
    Beware that the left-most-longest rule is reversed inside lookbehind,
    and that backreferences are not possible (yet).

  2. Subcaptures inside negative-lookahead/behind are empty
    at the moment.

  3. If one tries to backreference a nonexistent capture, it
    will be interpreted as an octal escape.

  4. When the encoding is ASCII, you can specify hex/octal values in
    the range 0-255. However, when the encoding is UTF8, only the
    range 0-127 is valid; the range 128-255 is undefined.
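
The operator precedence listed above means that a quantifier applies only to the single preceding atom, and that alternation binds loosest of all. A short sketch, again using Ruby's builtin Regexp as a stand-in (an assumption, since regexp-engine's own API is not shown here) and invented inputs:

```ruby
# Builtin Regexp used as a stand-in; inputs are invented.

# The quantifier applies to "b" only, unless a group says otherwise.
simple  = "ababab"[/ab+/]
grouped = "ababab"[/(?:ab)+/]

# Anchors bind tighter than |, so this reads as (^ab)|(cd$).
loose = /^ab|cd$/
# Grouping makes both anchors apply to both alternatives.
tight = /^(?:ab|cd)$/

p simple, grouped
%w[abXX XXcd XabX cd].each do |s|
  puts "#{s}: loose=#{loose.match?(s)} tight=#{tight.match?(s)}"
end
```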

Call For Help

Get in touch if you are interested in perl6 regexp.
Get in touch if you know about Asian text encodings.


Simon Strandgaard


w00t! Thanks so much, Simon!

How is the performance of yours vs. built-in? (On features which they both support.)

On Jun 2, 2004, at 10:57 AM, Simon Strandgaard wrote:

Encodings supported: ASCII, UTF8.
  (?: ... ) non-capturing group
  (?> ... ) atomic grouping
  (?= ... ) positive-lookahead
  (?! ... ) negative-lookahead . . . . . . . . . . . . . . . . . . . see [2]
  (?<= ... ) positive-lookbehind . . . . . . . . . . . . . . . . . . . see [1]
  (?<! ... ) negative-lookbehind . . . . . . . . . . . . . . . . . . . see [1], [2]

--
(-, /\ \/ / /\/

> Encodings supported: ASCII, UTF8.
> (?: ... ) non-capturing group
> (?> ... ) atomic grouping
> (?= ... ) positive-lookahead
> (?! ... ) negative-lookahead . . . . . . . . . . . . . . . . . . . see [2]
> (?<= ... ) positive-lookbehind . . . . . . . . . . . . . . . . . . . see [1]
> (?<! ... ) negative-lookbehind . . . . . . . . . . . . . . . . . . . see [1], [2]

> w00t! Thanks so much, Simon!

I am happy you like it.. yesterday I added support for
UTF-16BE and UTF-16LE. Now I'm working on perl6 syntax.

> How is the performance of yours vs. built-in? (On features which they
> both support.)

Performance hasn't really been benchmarked yet.
However, we can compare test-suite run times against Ruby's builtin (GNU)
engine..

First engine 0.11
'test_blackbox_p5.rb' takes 16.86 seconds for ~400 tests.
'test_blackbox_rubicon.rb' takes 15.93 seconds for ~1520 tests.
In total ~33 seconds for about 1920 regexps.
On average we can execute about 59 regexps per second.

Then builtin GNU
'test_engine_builtin.rb' takes 2.96 seconds for 2000 tests.
The builtin can do 675 per second.

Let's calculate how many times GNU is faster
675 / 59 ≈ 11
So GNU can do about eleven times as many operations per second as mine.
This surprises me a little.. I thought my engine was way slower :wink:
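
The throughput figures above can be re-derived from the raw timings with a few lines of Ruby. The numbers are the ones quoted in this message; the variable names are only for illustration, and the test counts are approximate, so the result is a rough figure (the ratio lands between 11 and 12 depending on rounding).

```ruby
# Timings and test counts as quoted in this message.
engine_seconds  = 16.86 + 15.93   # test_blackbox_p5.rb + test_blackbox_rubicon.rb
engine_tests    = 400 + 1520      # approximate test counts
builtin_seconds = 2.96            # test_engine_builtin.rb
builtin_tests   = 2000

engine_rate  = engine_tests / engine_seconds     # regexps per second
builtin_rate = builtin_tests / builtin_seconds
ratio        = builtin_rate / engine_rate

printf("engine:  %.1f regexps/s\n", engine_rate)
printf("builtin: %.1f regexps/s\n", builtin_rate)
printf("builtin is about %.1f times faster\n", ratio)
```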
I am thinking about reimplementing only the scanner in C++, in order
to get better performance. But first I must implement some of the
most common regexp optimizations: fastmaps and single-repeat.
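
For readers unfamiliar with the term: in GNU regex a fastmap is a table of the characters that can begin a match, so the scanner can skip every start position whose character is not in the table. The toy sketch below illustrates the idea for a set of literal alternatives only. It is not code from regexp-engine, and every name in it is hypothetical.

```ruby
require "set"

# Toy illustration: find the first occurrence of any literal alternative.
# The fastmap holds every character that can start a match.
def build_fastmap(literals)
  literals.map { |lit| lit[0] }.to_set
end

def scan_with_fastmap(text, literals)
  fastmap = build_fastmap(literals)
  attempts = 0
  text.each_char.with_index do |ch, pos|
    next unless fastmap.include?(ch)   # cheap rejection, no match attempt
    attempts += 1
    hit = literals.find { |lit| text[pos, lit.size] == lit }
    return [pos, hit, attempts] if hit
  end
  [nil, nil, attempts]
end

pos, hit, attempts = scan_with_fastmap("xxxxxxbazxxbar", %w[foo bar])
puts "matched #{hit.inspect} at #{pos} after #{attempts} full attempts"
```

Only two of the fourteen start positions get a full match attempt here; the rest are rejected by the table lookup alone.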

Does anyone have experience with how much speed can be gained
by reimplementing a Ruby algorithm in C/C++?

My environment is:
bash-2.05b$ cat /proc/cpuinfo | grep MH
cpu MHz : 726.631
bash-2.05b$ uname -a
Linux server 2.4.25-gentoo-r1 #1 Sun Jun 6 18:09:28 CEST 2004 i686 AMD
Duron(TM)Processor AuthenticAMD GNU/Linux
bash-2.05b$ ruby18 -v
ruby 1.8.1 (2004-04-24) [i386-linux-gnu]
bash-2.05b$

On Friday 04 June 2004 17:46, Gavin Kistner wrote:

On Jun 2, 2004, at 10:57 AM, Simon Strandgaard wrote:

--
Simon Strandgaard

BTW: sorry for the 5 day delay.. I had to reinstall my system.
Actually I switched from FreeBSD to Gentoo Linux.