Creating a new Syntax tokenizer

I want to write my own wiki markup language. Pure regexp fails me, as I need a proper parser to keep track of state.
I thought I'd give Syntax a try, but I'm a little confused as to some of the specifics.

1) What is a 'region', and how do I use the start_region method? It's not documented in the API, or the source. (I think this is what I want for nesting tags.)

2) Do I have to close_group and close_region, or do they automatically get invoked under certain circumstances? (Does starting one group close the previous one? Do repeated calls to open the same group cause them to be aggregated together? Is that how accumulating text in :normal groups works?)

3) How do I keep track of state during successive calls to #step? I tried an instance variable, but that doesn't seem to exist across calls.

Following is my terrible, broken attempt at the basics of what I'm after. Am I totally misunderstanding how to use Syntax?

require 'rubygems'
require 'syntax'

class OWLScribble < Syntax::Tokenizer
   def step
         if heading = scan( /^={1,6}/ )
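             # Remember the level so the closing tag can name the same region
             $heading_level = heading.length
             start_region "heading level #{$heading_level}".intern
             $heading_end = Regexp.new( heading + "\\s*" )
         elsif $heading_end && scan( $heading_end )
             # The closing match may include trailing whitespace, so its length is not the level
             end_region "heading level #{$heading_level}".intern
             $heading_end = nil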
         elsif char = scan( /^[\r\n]/ )
             start_group :paragraph, char
         elsif scan( /\*\*/ )
             if $inbold
                 end_region :bold
                 $inbold = nil
             else
                 start_region :bold
                 $inbold = true
             end
         elsif char = scan( /./ )
             start_group :normal, char
         else
             scan( /[\r\n]/ )
         end
   end
end

Syntax::SYNTAX[ 'owlscribble' ] = OWLScribble

str = <<END
Intro paragraph

= Heading 1 =
First **paragraph** under the heading.

== Second **Heading** = very yes ==
Another paragraph.
END

tokenizer = Syntax.load( "owlscribble" )
tokenizer.tokenize( str ) do |token|
   puts "#{token.group} (#{token.instruction}) #{token}"
end

--
(-, /\ \/ / /\/

I want to write my own wiki markup language. Pure regexp fails me, as I need a proper parser to keep track of state.
I thought I'd give Syntax a try, but I'm a little confused as to some of the specifics.

1) What is a 'region', and how do I use the start_region method? It's not documented in the API, or the source. (I think this is what I want for nesting tags.)

Regions are groups that can contain other groups nested within them. Syntax's Ruby tokenizer uses regions to do syntax highlighting of strings, and interpolated expressions, for example.

start_region is used identically to start_group--you give it the name of the group you want to start (or continue, if that group is already open), and an optional string to get things started. (The string becomes the starter text for the group.)
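To make the group/region distinction concrete, here is a minimal sketch of the idea. It is not the Syntax library's code: RegionSketch, SketchToken, and the :region_open / :region_close labels are stand-ins invented for illustration, built on nothing but StringScanner. The point is only that a region stays open while other tokens (including other regions) are emitted inside it.

```ruby
require 'strscan'

# Hypothetical stand-in for illustration; not the Syntax library itself.
SketchToken = Struct.new(:text, :group, :instruction)

class RegionSketch
  def tokenize(text)
    @tokens = []
    @open = []   # stack of region names that are currently open
    scanner = StringScanner.new(text)
    until scanner.eos?
      if scanner.scan(/\*\*/)
        toggle(:bold)
      elsif scanner.scan(/__/)
        toggle(:underline)
      else
        # Plain text: everything up to the next ** or __ marker
        chunk = scanner.scan(/(?:(?!\*\*|__).)+/m)
        @tokens << SketchToken.new(chunk, :normal, :none)
      end
    end
    @tokens
  end

  private

  # Close the region if it is the innermost open one, otherwise open it.
  def toggle(name)
    if @open.last == name
      @open.pop
      @tokens << SketchToken.new("", name, :region_close)
    else
      @open.push(name)
      @tokens << SketchToken.new("", name, :region_open)
    end
  end
end

RegionSketch.new.tokenize("a **b __c__ d** e").each do |t|
  puts "#{t.group} (#{t.instruction}) #{t.text.inspect}"
end
```

In this sketch the underline region opens and closes while bold is still open, which is the kind of nesting a flat sequence of groups cannot express.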

2) Do I have to close_group and close_region, or do they automatically get invoked under certain circumstances? (Does starting one group close the previous one? Do repeated calls to open the same group cause them to be aggregated together? Is that how accumulating text in :normal groups works?)

close_group is automatically called when you start a new group. close_region is never automatically called, because regions can be nested, so unless you have a region that you want to persist to the end of your document, you need to explicitly call it at some point.

Multiple calls of start_group with the same group name do, indeed, get concatenated together into a single group.
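The aggregation behaviour can be sketched roughly like this. Again, this is an assumption-laden stand-in rather than Syntax's implementation: GroupSketch and its flush method are hypothetical names, and the only thing it demonstrates is "same group name appends, different group name flushes the previous one".

```ruby
# Hypothetical stand-in for illustration; not the Syntax library itself.
class GroupSketch
  attr_reader :tokens

  def initialize
    @tokens = []
    @group = nil
    @chunk = String.new
  end

  # Start (or continue) a group; switching to a different group
  # flushes whatever text the previous group had accumulated.
  def start_group(group, data = nil)
    flush if group != @group
    @group = group
    @chunk << data if data
  end

  # Emit the pending text, if any, as a single [group, text] pair.
  def flush
    @tokens << [@group, @chunk] unless @chunk.empty?
    @chunk = String.new
  end
end

sketch = GroupSketch.new
"ab\ncd".each_char do |c|
  sketch.start_group(c == "\n" ? :newline : :normal, c)
end
sketch.flush   # emit the final pending group
p sketch.tokens
# prints [[:normal, "ab"], [:newline, "\n"], [:normal, "cd"]]
```

Feeding one character at a time to the same group yields one token per run of characters, not one token per character.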

3) How do I keep track of state during successive calls to #step? I tried an instance variable, but that doesn't seem to exist across calls.

Instance variables should work--I use them successfully in the Ruby tokenizer, for instance. Feel free to contact me off-list and I can help troubleshoot this if it isn't working for you.
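As a sanity check of the general pattern (independent of Syntax), here is a small sketch in which an instance variable carries state between successive step calls on the same object. BoldTracker and its methods are hypothetical names for this example only, and it uses plain StringScanner rather than a Syntax::Tokenizer subclass.

```ruby
require 'strscan'

# Hypothetical stand-in for illustration; not the Syntax library itself.
class BoldTracker
  attr_reader :events

  def initialize(text)
    @scanner = StringScanner.new(text)
    @in_bold = false   # state that must survive between #step calls
    @events = []
  end

  # Consume one marker or one character per call.
  def step
    if @scanner.scan(/\*\*/)
      @in_bold = !@in_bold
      @events << (@in_bold ? :bold_open : :bold_close)
    else
      @scanner.getch
    end
  end

  def run
    step until @scanner.eos?
    @events
  end
end

p BoldTracker.new("a **b** c **d**").run
# prints [:bold_open, :bold_close, :bold_open, :bold_close]
```

If state seems to vanish between calls, one thing worth checking is whether every step call really runs on the same tokenizer instance.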

Following is my terrible, broken attempt at the basics of what I'm after. Am I totally misunderstanding how to use Syntax?

Without actually trying to run it, I'd say you've got the idea. This is an interesting use of Syntax--given that Syntax was intended for use as a highlighter, I wouldn't have thought to use it as a more general-purpose parser, but it can definitely be used for that. Clever! :)

- Jamis

On Jun 19, 2005, at 7:10 PM, Gavin Kistner wrote:

3) How do I keep track of state during successive calls to #step? I tried an instance variable, but that doesn't seem to exist across calls.

Instance variables should work--I use them successfully in the Ruby tokenizer, for instance. Feel free to contact me off-list and I can help troubleshoot this if it isn't working for you.

Hrm, I'll have to try it again sometime. Perhaps I screwed up when I tried it.

Following is my terrible, broken attempt at the basics of what I'm after. Am I totally misunderstanding how to use Syntax?

Without actually trying to run it, I'd say you've got the idea. This is an interesting use of Syntax--given that Syntax was intended for use as a highlighter, I wouldn't have thought to use it as a more general-purpose parser, but it can definitely be used for that. Clever! :)

Thanks for the help in your response. As noted in my OWLScratch post, after thinking about the problem more and more, I decided that I wanted a more state-based solution than Syntax seemed to support. Still, I want to thank you very much for the library, as the concepts in it really helped me in my thinking (and introduced me to StringScanner - what a gem!).