[ANN] cursor-0.6

I am very interested in Cursor for use in Reg.

Thanks for the interest. If you want to help out, let me know.
I think the API is stabilizing. The only functional change in
my mind right now is having "position"s shift with insertions
and deletions before them. A lot of optimization is needed
(i.e. derived classes do one element at a time and #pos/#pos=
is typically O(n) instead of O(1)). I'm moving on to Grammar
right now.

It seems like
the sort
of thing to allow Reg's Array-based backtracking pattern
matcher to
operate on Strings or Files or whatever else.
(Did you know that I needed this by esp? :slight_smile: Since you're got
your own
lexer/parser/pattern matcher thing, (grammar?) maybe you
don't want to
help the competition...

Competition is good for the consumer :slight_smile: The competitors
enhance their own products and use each others ideas.

Since we are both doing parsing type things, I can see how this
might be useful to you.

Anyway, cursor isn't quite functional enough for my purposes.
Passing
patterns for the length parameter to #get and #set is great,
but there
aren't enough pattern types. I think I need the equivalent of
Cursor#===(regexp), which is to say, the ability to compare a
Regexp
to a position within a String. Or, if the backing store of
the cursor
is a file, you get the ability to compare a Regexp to a File.
Someone
was asking for that recently, it's not easy, but it would be
great if
your lib did it.... just the ability to regexp against the
middle of a
String would make me very happy, tho. I've hacked this up
before, but
I think that a reasonably efficient version would require a
bit of c
extension.

Funny you ask for that - cursor===regexp. Instead, my grammar
package will support grammar===cursor. I think the only case I
could fully implement cursor===regexp is when the data the
cursor is accessing is a string. I can't do it for the case
when the cursor is a IO (unless I read the whole file into a
String). Regexp is too tied to String. Also, right now,
nothing in Cursor is tied specifically to Strings (or
characters). I want to keep it that way - dealing with
elements and element sequences.

This === pattern matching would be better placed in the pattern
matching class if it is possible - regexp=== cursor,
grammar===cursor, or reg===cursor.

Ideally, of course, I want to see the ability to compare
against a
Reg, particularly the Regs that match multiple array
elements. If you
made it work with whatever the equivalent in your system is,
that
would be almost as good.

How does Grammar work? I would appreciate if you could write
a couple
of paragraphs of high-level overview of Cursor, Grammar,
their design
and capabilites and how they're intended to work together.

Cursor is a unification of many external iterator ideas -
iterators from C++, streams from Java, iterators from Java,
Ruby IO, and a text editor "cursor". Compared to most other
general external iterators, Cursor offers two interesting
features - ability to insert/delete and the ability to
save/restore a position.

Since I haven't released grammar yet, its predecessor will give
you an idea:

http://www.ruby-talk.org/cgi-bin/scat.rb/ruby/ruby-talk/138450

In this, I'm using RandomAccessStream instead of Cursor. And
Syntax is equivalent to the Grammar I'm working on.

A Grammar is used to match what's next in Cursor. Like Cursor,
it can deal with any objects, not just characters and Strings.

I provide some operator overloading to make specifying a
Grammar like BNF - "|" (Grammar::Alteration), "+"
(Grammar::Sequence), "*" (Grammar::Repeat), etc. I also put
some of these in built-in classes (String, Range) to make it
even easier (i.e. "hi"|"hello" makes a Grammar that matches
either "hi" or "hello").

With this you can make a parser that works directly on the
Cursor (from an IO, String, etc) or you can split it up into
lexer and parser. In this case, the parser (a Grammar) would
deal with a Cursor that spits out tokens. You can make your
own lexer for that or use Grammar again. Using Grammar to make
a lexer you'd have first need the Grammar for a token which
should be a big Alteration (or a table lookup). This token
Grammar would also need to tag these appropriately to
distinguish token types for the parser. The lexer would then
be a Cursor which would returns matches from the token Grammar
against the original Cursor. Here is some psuedo code to give
an idea of what was just said:

token = (tokenGrammar===ioCursor)

tokenCursor = Cursor of tokens (lexer)

parsetree = (parserGrammar===tokenCursor)

Like the ideas?

···

--- Caleb Clausen <vikkous@gmail.com> wrote:

__________________________________
Do you Yahoo!?
Yahoo! Small Business - Try our new Resources site

Eric Mahurin wrote:

Funny you ask for that - cursor===regexp. Instead, my grammar
package will support grammar===cursor. I think the only case I
could fully implement cursor===regexp is when the data the
cursor is accessing is a string. I can't do it for the case
when the cursor is a IO (unless I read the whole file into a
String). Regexp is too tied to String. Also, right now,
nothing in Cursor is tied specifically to Strings (or
characters). I want to keep it that way - dealing with
elements and element sequences.

This === pattern matching would be better placed in the pattern
matching class if it is possible - regexp=== cursor,
grammar===cursor, or reg===cursor.

This is the right way, with the pattern on the left side of === and
the data on the right. (I've done it the other way before, but it's
yukky.) However, I still want a more sophisticated matcher built-in to
Cursor, else I'm not sure how to make use of Cursors.

Regexp#===(FileCursor) is just as hard as the other way around. How
would that be implemented?

I provide some operator overloading to make specifying a
Grammar like BNF - "|" (Grammar::Alteration), "+"
(Grammar::Sequence), "*" (Grammar::Repeat), etc. I also put
some of these in built-in classes (String, Range) to make it
even easier (i.e. "hi"|"hello" makes a Grammar that matches
either "hi" or "hello").

I've done much the same with Reg. However, I don't mix-in these
capabilites to any standard class except Regexp. (The user can always
request that they be mixed in, tho, so my equivalent to your
"hi"|"hello" is "hi".reg|"hello", which I admit is a bit ugly.)

Are you allowing Regexps in your language? Or is the idea that users
can build up Regexp-like patterns with the (equivalent) mechanisms
you're supplying? Regexps are one of the few things in ruby that are
actually fast; doing your own will be much slower...

I suppose you could translate a Regexp into an equivalent Grammar (or
Reg), but again, it will be slow. (And a lot of work.)

I wish you could tell me more about Grammar (or Syntax) internals...
Is it an LL or LR thang? How is matching actually done at runtime? Do
you compile to a parse table? Or 'interpret' the parsing at runtime,
as I am doing. My own system is Regexp-like, featuring an mostly
unoptimized DFA engine. Reg::Constant makes possible recursive
patterns, which makes the whole thing LL-equivalent (I think).

What happens when this code runs:
  "foo"|"bar"|"baz"===cursor

Does cursor get scanned 3 times, for "foo", then "bar", then "baz"?
That's going to be slow... especially if cursor is a FileCursor.
Alternation doesn't need to be so slow... you can scan until you see
the first f or b, then attempt a match from there. This eliminates the
need to go through the cursor's data multiple times. But I don't see
support for this kind of thing in Cursor. I understand your need to
keep Cursor clean desire to get on to the fun stuff, but I beg you to
add this one additional feature: to read data from the cursor position
until you find a member of a set of items given to you by the user. I
think this capability makes possible an NFA engine atop Cursor, which
should be reasonably fast. And it ought to be easy enough for you.

In other words, I'd like to see Cursor#get(Set) or
Cursor#get(CharSet), where which one you pass depends on the type of
cursor. (This actually gets kind of complicated when it comes to
objects... in addition to a Set, you'd like to be able to take a Proc
too... or maybe just an object that responds to ===.

With this you can make a parser that works directly on the
Cursor (from an IO, String, etc)

Please tell me how this would work, because I have never been able to
envisage it properly, but I like the idea.

token = (tokenGrammar===ioCursor)

tokenCursor = Cursor of tokens (lexer)

parsetree = (parserGrammar===tokenCursor)

Like the ideas?

Well, yes, very much so. Actually, I'm suspecting esp again.

Have you given any thought to having a Cursor into the object graph,
or Cursors into graphs in general? This is related to something I want
to do with Reg, but I haven't figured it out yet...

Oh, and incidently, Grammar::Alteration should really be Grammar::Alternation.

Eric Mahurin wrote:
> Funny you ask for that - cursor===regexp. Instead, my
grammar
> package will support grammar===cursor. I think the only
case I
> could fully implement cursor===regexp is when the data the
> cursor is accessing is a string. I can't do it for the
case
> when the cursor is a IO (unless I read the whole file into
a
> String). Regexp is too tied to String. Also, right now,
> nothing in Cursor is tied specifically to Strings (or
> characters). I want to keep it that way - dealing with
> elements and element sequences.
>
> This === pattern matching would be better placed in the
pattern
> matching class if it is possible - regexp=== cursor,
> grammar===cursor, or reg===cursor.

This is the right way, with the pattern on the left side of
=== and
the data on the right. (I've done it the other way before,
but it's
yukky.) However, I still want a more sophisticated matcher
built-in to
Cursor, else I'm not sure how to make use of Cursors.

Regexp#===(FileCursor) is just as hard as the other way
around. How
would that be implemented?

I think you'd have to rewrite it from scratch in Ruby or do
stuff in C. Like I said, Regexp is too tied to String. If it
could handle String-like objects, you might be able to do it.
Or if it would have a callback when it reached the end of the
string (or the matched failed) that might help. Unfortunately,
there isn't enough control.

> I provide some operator overloading to make specifying a
> Grammar like BNF - "|" (Grammar::Alternation), "+"
> (Grammar::Sequence), "*" (Grammar::Repeat), etc. I also
put
> some of these in built-in classes (String, Range) to make
it
> even easier (i.e. "hi"|"hello" makes a Grammar that matches
> either "hi" or "hello").

I've done much the same with Reg. However, I don't mix-in
these
capabilites to any standard class except Regexp. (The user
can always
request that they be mixed in, tho, so my equivalent to your
"hi"|"hello" is "hi".reg|"hello", which I admit is a bit
ugly.)

Are you allowing Regexps in your language? Or is the idea
that users
can build up Regexp-like patterns with the (equivalent)
mechanisms
you're supplying? Regexps are one of the few things in ruby
that are
actually fast; doing your own will be much slower...

I have no plans to put Regexps in Grammar because I don't know
how to get them to work on anything but a String (my main
target would be parsing - an IO). The problem is you have to
everything you might want to match in a String before matching
with a Regexp. Think about a Regexp for a multi-line comment -
how many lines do you read into a String before attempting to
match? If you have any ideas of how to use them more generally
I'm listening. I would like to incorporate them because of the
speed issue you said.

With what I'm doing there is no reason that couldn't use
Regexps to cobble a lexer together that implements the Cursor
interface and then use Grammar as the parser.

I suppose you could translate a Regexp into an equivalent
Grammar (or
Reg), but again, it will be slow. (And a lot of work.)

I wish you could tell me more about Grammar (or Syntax)
internals...
Is it an LL or LR thang? How is matching actually done at
runtime? Do
you compile to a parse table? Or 'interpret' the parsing at
runtime,
as I am doing.

What I have now is quite simple right now. It is LL. But, it
doesn't do the conventional lookahead for determining which
choice to take on an alternation. Instead, it simply tries
each choice in order and rewinds when a choice fails. Later, I
might change this - doing LL(1) when there is no ambiguity.
But, right now it effectively has infinite lookahead. What I'm
providing here is just a set of convienences that allow you to
build a parser like you would do by hand. The behavior is very
predictable.

My own system is Regexp-like, featuring an
mostly
unoptimized DFA engine. Reg::Constant makes possible
recursive
patterns, which makes the whole thing LL-equivalent (I
think).

What happens when this code runs:
  "foo"|"bar"|"baz"===cursor

Does cursor get scanned 3 times, for "foo", then "bar", then
"baz"?
That's going to be slow... especially if cursor is a
FileCursor.

It does get scanned 3 times in this case. But, you will not
want to use Cursor::IO directly. You'll use Cursor::Buffered
instead so that you only read the file once and from the buffer
the other 2 times.

My take right now is to leave the details of how the parsing is
up to whoever is specifying the grammar. I would rather not
have too much magic going on. So in the above example, you
would be better off saying this to specify the Grammar:

"foo"|("ba"+("r"|"z"))

Alternation doesn't need to be so slow... you can scan until
you see
the first f or b, then attempt a match from there. This
eliminates the
need to go through the cursor's data multiple times. But I
don't see
support for this kind of thing in Cursor. I understand your
need to
keep Cursor clean desire to get on to the fun stuff, but I
beg you to
add this one additional feature: to read data from the cursor
position
until you find a member of a set of items given to you by the
user. I
think this capability makes possible an NFA engine atop
Cursor, which
should be reasonably fast. And it ought to be easy enough for
you.

The main reason I added the various options to #get was that
you have the same options in IO (implemented in C), you you'll
get a speed advantage. get(nil): IO#getc, get(n): IO#read(n),
get(eol): IO#gets(eol). I don't think adding what you are
talking about would add any efficiency if it was placed in
Cursor as opposed to the pattern-matching class (Grammar or
Reg).

In other words, I'd like to see Cursor#get(Set) or
Cursor#get(CharSet), where which one you pass depends on the
type of
cursor. (This actually gets kind of complicated when it comes
to
objects... in addition to a Set, you'd like to be able to
take a Proc
too... or maybe just an object that responds to ===.

Maybe just a Proc to handle all of these cases - passing the
current string/array. But, I still don't think you'd get much
more efficiency compared to having the Grammar/Reg do this (one
element at a time).

> With this you can make a parser that works directly on the
> Cursor (from an IO, String, etc)

Please tell me how this would work, because I have never been
able to
envisage it properly, but I like the idea.

> token = (tokenGrammar===ioCursor)
>
> tokenCursor = Cursor of tokens (lexer)
>
> parsetree = (parserGrammar===tokenCursor)
>
> Like the ideas?

Well, yes, very much so. Actually, I'm suspecting esp again.

Have you given any thought to having a Cursor into the object
graph,
or Cursors into graphs in general? This is related to
something I want
to do with Reg, but I haven't figured it out yet...

Not really. Cursor represents a movable pointer into a
sequential (ordered) data structure. arrays, strings, linked
lists, files, buffers, and the like make sense with this. I
guess you could assign order to the nodes in the graph and put
a Cursor around it, but I don't know if it would be that
useful.

If you want to take this off-line, you are welcome to email me
directly.

I'll try to release something for Grammar in the next week or
two.

···

--- Caleb Clausen <vikkous@gmail.com> wrote:

__________________________________
Do you Yahoo!?
Yahoo! Small Business - Try our new Resources site

[snip]

I think you'd have to rewrite it from scratch in Ruby or do
stuff in C. Like I said, Regexp is too tied to String. If it
could handle String-like objects, you might be able to do it.
Or if it would have a callback when it reached the end of the
string (or the matched failed) that might help. Unfortunately,
there isn't enough control.

[snip]

I have made a regexp lib where you can supply your own
iterators if you have string like classes.

http://rubyforge.org/frs/download.php/1311/regexp-engine-0.12.zip

Maybe it has interest?

···

On 5/28/05, Eric Mahurin <eric_mahurin@yahoo.com> wrote:

--
Simon Strandgaard

[snip]
> I think you'd have to rewrite it from scratch in Ruby or do
> stuff in C. Like I said, Regexp is too tied to String. If
it
> could handle String-like objects, you might be able to do
it.
> Or if it would have a callback when it reached the end of
the
> string (or the matched failed) that might help.
Unfortunately,
> there isn't enough control.
[snip]

I have made a regexp lib where you can supply your own
iterators if you have string like classes.

http://rubyforge.org/frs/download.php/1311/regexp-engine-0.12.zip

Maybe it has interest?

Possibly. From what I could tell this is pure ruby. So it
offers a lot more flexibility than the C Regexp. But, you
loose all of the speed advantage, right? For my purposes, that
was the only reason I wanted to incorporate Regexp - because
Grammar will be able to do Regexp like stuff - and more.

···

--- Simon Strandgaard <neoneye@gmail.com> wrote:

On 5/28/05, Eric Mahurin <eric_mahurin@yahoo.com> wrote:

__________________________________
Do you Yahoo!?
Yahoo! Small Business - Try our new Resources site

[snip]

> I have made a regexp lib where you can supply your own
> iterators if you have string like classes.
>
>
http://rubyforge.org/frs/download.php/1311/regexp-engine-0.12.zip
>
> Maybe it has interest?

Possibly. From what I could tell this is pure ruby. So it
offers a lot more flexibility than the C Regexp. But, you
loose all of the speed advantage, right?

Yes, its far from the same speed as C.

For my purposes, that
was the only reason I wanted to incorporate Regexp - because
Grammar will be able to do Regexp like stuff - and more.

A grammar class sounds interesting..
Any examples of usage ?

···

On 5/30/05, Eric Mahurin <eric_mahurin@yahoo.com> wrote:

--- Simon Strandgaard <neoneye@gmail.com> wrote:

--
Simon Strandgaard

[snip]
> > I have made a regexp lib where you can supply your own
> > iterators if you have string like classes.
> >
> >
>

http://rubyforge.org/frs/download.php/1311/regexp-engine-0.12.zip

> >
> > Maybe it has interest?
>
> Possibly. From what I could tell this is pure ruby. So it
> offers a lot more flexibility than the C Regexp. But, you
> lose all of the speed advantage, right?

Yes, its far from the same speed as C.

> For my purposes, that
> was the only reason I wanted to incorporate Regexp -
because
> Grammar will be able to do Regexp like stuff - and more.

A grammar class sounds interesting..
Any examples of usage ?

I still don't have a first release of grammar. But, you can
look at the predecessor:

http://www.ruby-talk.org/cgi-bin/scat.rb/ruby/ruby-talk/138450

I changed the name from syntax to grammar because there is
already a syntax package for syntax hilighting.

The usage for the Grammar classes will be very similar. Each
type of Grammar is a derived class - Sequence, Alternation,
Repeat, etc. === is used to match a Grammar against a Cursor.
And I also put stuff in String and Range to make creating a
Grammar even easier.

···

--- Simon Strandgaard <neoneye@gmail.com> wrote:

On 5/30/05, Eric Mahurin <eric_mahurin@yahoo.com> wrote:
> --- Simon Strandgaard <neoneye@gmail.com> wrote:

__________________________________
Do you Yahoo!?
Yahoo! Small Business - Try our new Resources site

[snip]

> A grammar class sounds interesting..
> Any examples of usage ?

I still don't have a first release of grammar. But, you can
look at the predecessor:

http://www.ruby-talk.org/cgi-bin/scat.rb/ruby/ruby-talk/138450

I changed the name from syntax to grammar because there is
already a syntax package for syntax hilighting.

The usage for the Grammar classes will be very similar. Each
type of Grammar is a derived class - Sequence, Alternation,
Repeat, etc. === is used to match a Grammar against a Cursor.
And I also put stuff in String and Range to make creating a
Grammar even easier.

Have you any unittests for your syntax/grammar code ?

Your === operator reminds me of my own #match methods, in
this file.

http://rubyforge.org/cgi-bin/viewcvs.cgi/projects/regexp_engine/source/scanner_nodes.rb?rev=1.22&cvsroot=aeditor&content-type=text/vnd.viewcvs-markup

···

On 5/30/05, Eric Mahurin <eric_mahurin@yahoo.com> wrote:

--- Simon Strandgaard <neoneye@gmail.com> wrote:

--
Simon Strandgaard

[snip]
> > A grammar class sounds interesting..
> > Any examples of usage ?
>
> I still don't have a first release of grammar. But, you
can
> look at the predecessor:
>
>

http://www.ruby-talk.org/cgi-bin/scat.rb/ruby/ruby-talk/138450

>
> I changed the name from syntax to grammar because there is
> already a syntax package for syntax hilighting.
>
> The usage for the Grammar classes will be very similar.
Each
> type of Grammar is a derived class - Sequence, Alternation,
> Repeat, etc. === is used to match a Grammar against a
Cursor.
> And I also put stuff in String and Range to make creating a
> Grammar even easier.

Have you any unittests for your syntax/grammar code ?

Not yet.

Your === operator reminds me of my own #match methods, in
this file.

http://rubyforge.org/cgi-bin/viewcvs.cgi/projects/regexp_engine/source/scanner_nodes.rb?rev=1.22&cvsroot=aeditor&content-type=text/vnd.viewcvs-markup

Yes, this is quite similar. You, Caleb, and I look to have
very similar code and needs - external iterators and pattern
matches/parsers.

···

--- Simon Strandgaard <neoneye@gmail.com> wrote:

On 5/30/05, Eric Mahurin <eric_mahurin@yahoo.com> wrote:
> --- Simon Strandgaard <neoneye@gmail.com> wrote:

__________________________________
Do you Yahoo!?
Yahoo! Small Business - Try our new Resources site