How serious are you about this? Several years ago I wrote a Python library that treats Python regular
expressions as semantic, not syntactic, objects, and that has been incredibly useful to me. I've started
to port it to Ruby, but simply don't have the time. If you do (you're probably looking at a couple of
weeks of full-time-equivalent hours to do a good job, including decent documentation), I'm happy to pass
on the Python code, the Ruby code, and give advice and so on.
To help you evaluate this, and also as a potential source of ideas in case you do something else, I've
appended my (probably out of date) intro text to the library at the bottom of this reply.
Ari Brown wrote:
So start writing! and research other DSLs as you go.
Ugh. If I must (which I must). What would you suggest as syntax?
Also, should I completely try to reinvent the wheel, or create a wrapper for current RegExp?
Man. I need a mentor on this
IMO, Arabic has THE most beautiful script.
Poetically, English is extremely beautiful. It's like a language of RegExp - except there are no rules!
Spoken, the most beautiful language is either French (sorry) or Esperanto.
Text from the _Python_ library (in retrospect, I would do quite a bit differently):
On Aug 6, 2007, at 9:40 PM, Phlip wrote:
'rex' provides regular expression and parsing facilities. It uses (and is intended to functionally
replace) the Python 're' module.
Regular expression functionality is provided through the '_Rexp' and 'MatchResult' classes,
and the CHAR, REP0, REP1, OPT, PAT, and ALT constructs.
These constructs can be used as functions, or provide functions, to create rexps, and
also define attributes for commonly used rexps. (For example, PAT.float provides a rexp
which matches basic floating-point numbers without an exponent.)
If you are familiar with regular expressions, the following will probably make at
least some sense. If you are not, skip this example for now. In either case, come
back to it once you have read the formal definitions of functions and
constructs provided by rex.
COMPLEX = PAT.float['re'] + \
          REP0.whitespace + \
          ALT("+", "-")['op'] + \
          REP0.whitespace + \
          PAT.float['im'] + \
          "i"
The above example defines a pattern which will match complex
numbers, of the form "-2.718 + 3.14i", for example. It uses the predefined
match expressions PAT.float and REP0.whitespace to
ease the definition. Applied to the example complex number string, the result will contain three
named substrings: 're' will map to "-2.718", "op" will map to "+", and "im" will map to "3.14".
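For comparison, here is roughly what the same pattern looks like with the standard 're' module. This is only a sketch: the FLOAT string below is my own guess at what "basic floating-point (no exponent)" means, not the actual definition behind PAT.float.

```python
import re

# Assumed stand-in for PAT.float: optional sign, digits, optional fraction.
FLOAT = r"[-+]?\d+(?:\.\d*)?"

# Named groups play the role of ['re'], ['op'] and ['im'] in the rex version.
COMPLEX_RE = re.compile(
    rf"(?P<re>{FLOAT})\s*(?P<op>[+-])\s*(?P<im>{FLOAT})i"
)

m = COMPLEX_RE.match("-2.718 + 3.14i")
groups = m.groupdict()
print(groups)
```

The named groups come back as a dictionary, which is the plain-'re' counterpart of the three named substrings described above.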
SEQ is an alternative form of joining rexps; the above is equivalent to:
This is an introduction to using the pattern-matching (regular-expression-related)
part of rex. See documentation associated
with a specific method/function/name for details on that entity.
In the following, we use the abbreviation RE to refer to standard regular
expressions defined as strings, and the word 'rexp' to refer to rex objects
which denote regular expressions.
The starting point for building a rexp is rex.PAT, which we'll just call PAT,
rex.CHAR, which we'll just call CHAR, or rex.LIT.
CHAR provides rexps defining a set of characters, which
will match a single-character string if that character is in the given
set. In addition to providing attributes for prebuilt character
sets, the CHAR function may be used to define your own character sets.
LIT builds rexps which match strings of varying lengths.
REP0 and REP1 denote zero or more and one or more repetitions.
- PAT._someattribute_ returns (for defined attributes) a corresponding rexp.
For example, PAT.stringstart returns a rexp matching at the start of a string.
- CHAR(a1, a2, . . .) returns a rexp matching a single character from a set
of characters defined by its arguments. For example, CHAR("-", ["0","9"], ".")
matches the characters necessary to build basic floating point numbers.
See CHAR docs for details.
- CHAR._someattribute_ returns (for defined attributes) a corresponding rexp
defining a set of characters.
For example, CHAR.digit returns a rexp matching a single digit.
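To make the CHAR(a1, a2, . . .) idea concrete, here is a small sketch that turns the same kind of arguments into an ordinary RE character class. The function name char_class and the rule that a two-element list means a range are assumptions of mine for illustration; the real CHAR may treat its arguments differently.

```python
import re

def char_class(*items):
    """Build an RE character-class string (hypothetical helper, not part of rex).

    A string contributes each of its characters; a [lo, hi] pair
    contributes the inclusive range lo-hi.
    """
    parts = []
    for item in items:
        if isinstance(item, str):
            # Escape every character so '-' and '.' are taken literally.
            parts.extend(re.escape(ch) for ch in item)
        else:
            lo, hi = item
            parts.append(f"{re.escape(lo)}-{re.escape(hi)}")
    return "[" + "".join(parts) + "]"

# Same arguments as the CHAR example above: minus sign, digits, decimal point.
float_chars = char_class("-", ["0", "9"], ".")
is_float_like = re.fullmatch(float_chars + "+", "-3.14") is not None
```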
Now assume that X, Y, Z, ... are rexps. The following Python expressions
(_not_ strings) may be used to build more complex rexps:
- X | Y | Z . . . : returns a rexp which matches a string if any of the operands
match that string. Similar to "X|Y|Z" in normal REs, except of course you can't
use Python code to define a normal RE.
- X + Y + Z ...: returns a rexp which matches a string if all of X, Y, Z match consecutive
substrings of the string in succession. Like "XYZ" in normal REs.
- X*n : returns a rexp which matches X a number of times as defined by n.
This replaces '?', '+', and '*' as used in normal REs. See docs for details.
'rex' defines constants which allow you to say X*REP0, X*REP1, or X*MAYBE,
indicating (0 or more matches), (1 or more matches), or (0 or 1 matches), respectively.
- X**n : Like X*n, but does nongreedy matching.
- +X : positive lookahead assertion: matches if X matches, but doesn't
consume any of the input.
- ~+X : negative lookahead assertion: matches if X _doesn't_ match,
but doesn't consume any of the input.
- -X, ~-X : positive and negative lookback assertions. Like lookahead assertions,
but in the other direction.
- X[name] : name must be a string; anything matched by X can be referred
to by the given name in the match result object. (This is the equivalent
of named groups in the re module.)
- X.group() : X will be in an unnamed group, referable by number.
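Here is a rough sketch of how operators like these could be mapped onto ordinary RE strings using Python's operator overloading. The class name Rexp, the tuple encoding of REP0/REP1/MAYBE, and the method bodies are all illustrative assumptions of mine, not the actual rex implementation; only a few of the operators are covered.

```python
import re

class Rexp:
    """Hypothetical stand-in for a rex pattern object; wraps an RE source string."""

    def __init__(self, source):
        self.source = source

    def _atom(self):
        # Non-capturing group so the pattern composes safely with its neighbours.
        return f"(?:{self.source})"

    def __or__(self, other):        # X | Y : alternation
        return Rexp(f"{self._atom()}|{other._atom()}")

    def __add__(self, other):       # X + Y : sequence
        return Rexp(self._atom() + other._atom())

    def __mul__(self, n):           # X * n : greedy repetition, n = (min, max)
        lo, hi = n
        hi_text = "" if hi is None else str(hi)
        return Rexp(f"{self._atom()}{{{lo},{hi_text}}}")

    def __pos__(self):              # +X : positive lookahead
        return Rexp(f"(?={self.source})")

    def __getitem__(self, name):    # X[name] : named group
        return Rexp(f"(?P<{name}>{self.source})")

    def match(self, text):
        return re.match(self.source, text)

# Assumed encoding of the repetition constants as (min, max) pairs.
REP0, REP1, MAYBE = (0, None), (1, None), (0, 1)

digit = Rexp(r"\d")
number = (digit * REP1)["int"] + Rexp(r"\.") + (digit * REP1)["frac"]
parts = number.match("3.14").groupdict()
```

Python's own precedence does the grouping here: '*' binds tighter than '+', which binds tighter than '|', so the expressions read much like the REs they replace.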
In addition, a few other operations may be performed:
- Some of the attributes defined in PAT have "natural inverses"; for such
attributes, the inverse may be taken. For example, ~PAT.digit is
a pattern matching any character except a digit.
- Character classes may be inverted: ~CHAR("aeiouAEIOU") returns a pattern
matching any character except a vowel.
- 'ALT' gives a different way to denote alternation: ALT(X, Y, Z,...) does
the same thing as X | Y | Z | . . ., except that none of the arguments
to ALT need be rexps; any which are normal strings will be converted
to a rexp using PAT.
- 'SEQ' can take multiple arguments: PAT(X, Y, Z,...), which gives the same
result as PAT(X) + PAT(Y) + PAT(Z) + . . . .
Finally, a very convenient shortcut is that only the first object in a sequence of
operator/method calls needs to be a rexp; all others will be automatically
converted as if LIT(...) had been called on them. For example, the
sequence X | "hello" is the same as X | LIT("hello").
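A minimal sketch of how that automatic conversion might work, assuming the operator methods simply check for plain strings and escape them. MiniRexp and lit are hypothetical names used only for this illustration, and only the '|' operator is shown.

```python
import re

class MiniRexp:
    """Hypothetical pattern object, used only to illustrate string coercion."""

    def __init__(self, source):
        self.source = source  # ordinary RE source string

    def __or__(self, other):
        # A plain string on the right is converted as if lit() had been called.
        if isinstance(other, str):
            other = lit(other)
        return MiniRexp(f"(?:{self.source})|(?:{other.source})")

def lit(text):
    """Assumed analogue of LIT: match the text exactly, metacharacters escaped."""
    return MiniRexp(re.escape(text))

X = MiniRexp(r"\d+")
implicit = X | "a.b"        # string converted automatically
explicit = X | lit("a.b")   # same thing spelled out
same = implicit.source == explicit.source
```

Because the conversion happens inside the operator method of the left operand, the first object in the chain has to be a pattern object already, which fits the rule stated above.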