Writing a parser

Hello there,

I am struggling to get a grip on OO (and Ruby) and write a parser for
this particular data format described at the below get_record method.
The goal is to parse and emit records of this type and be able to
manipulate these with the methods in class Hash.

For now I am trying to get parsing to work, and the problem here is,
that the add methods is not defined for Hash. What is the best way of
doing that?

M

class Record < Hash
  SEPERATOR = "\n---\n"

  def initialize(path=nil)
    @path = path
    @ios = File.open(@path, 'r') if @path
  end

  # Reads the next 'record' from the I/O stream. The records are
separated by
  # lines of "---" and each record consists of a varying number of lines
with a
  # key/value pair separated by ": ". The record is returned as a hash.
  def get_record
    block = @ios.gets(SEPERATOR).chomp!(SEPERATOR)

    record = {}
     block.each_line do |line|
      fields = line.split(": ")
      raise ArgumentError, "Bad record line: #{line}" if fields.length
!= 2

      record.add(fields[0],fields[1])
    end

    record
  end

  # Add a key value pair to this record.
  def add(key, value)
    raise ArgumentError, "Bad key contains ': ' or '\n' ->#{key}" if
key.match(": |\n")
    raise ArgumentError, "Bad value contains ': ' or '\n' ->#{value}" if
value.match(": |\n")

    self[key] = value
  end

  # Return the content of this record as a string.
  def to_s
    return nil if empty?
    self.map { | key, val | "#{ key }: #{ val }" }.join( "\n" ) +
SEPERATOR
  end
end

···

--
Posted via http://www.ruby-forum.com/.

Martin Hansen wrote:

Hello there,

I am struggling to get a grip on OO (and Ruby) and write a parser for
this particular data format described at the below get_record method.
The goal is to parse and emit records of this type and be able to
manipulate these with the methods in class Hash.

For now I am trying to get parsing to work, and the problem here is,
that the add methods is not defined for Hash. What is the best way of
doing that?

    record = {}
     block.each_line do |line|
      fields = line.split(": ")
      raise ArgumentError, "Bad record line: #{line}" if fields.length
!= 2

      record.add(fields[0],fields[1])
    end

irb(main):001:0> record = {}
=> {}
irb(main):002:0> record[:name] = "Doe, John"
=> "Doe, John"
irb(main):003:0> record[:eye_color] = "Blue"
=> "Blue"
irb(main):004:0> record[:age] = 5
=> 5
irb(main):005:0> record
=> {:age=>5, :name=>"Doe, John", :eye_color=>"Blue"}

···

--
Posted via http://www.ruby-forum.com/\.

# Reads the next 'record' from the I/O stream. The records are
separated by
# lines of "---" and each record consists of a varying number of lines
with a
# key/value pair separated by ": ". The record is returned as a hash.
def get_record

I'm not sure why nobody bothered to point this out, or maybe I'm missing something...

require "yaml"
s = "---
a: b
c: d
"

p YAML.load(s)
# => {"a"=>"b", "c"=>"d"}

Also:

···

On Apr 16, 2010, at 05:33 , Martin Hansen wrote:

% eri YAML.load_documentse

(from ruby core)
------------------------------------------------------------------------------
  load_documents( io, &doc_proc )

------------------------------------------------------------------------------

Calls eblock with each consecutive document in the YAML stream contained
in eioe.

  File.open( 'many-docs.yaml' ) do |yf|
    YAML.load_documents( yf ) do |ydoc|
      ## ydoc contains the single object
      ## from the YAML document
    end
  end

Ah, yes, that is how a hash works :). Now I am interested in getting my
add method to work because it does some consistency checking on the key
and value. However, my add method is defined the wrong way/place.

···

--
Posted via http://www.ruby-forum.com/.

@Ryan,

This is not YAML. YAML allows arbitrarily complex data structures, and
the separator comes before the data (for some stupid reason). One could
possibly parse this data with a YAML parser, but my experience with the
YAML-syck parser (from Perl), which is written in C, is that it is much
slower than a simple parser. And I want speed.

@Robert,

I am still digesting :o) ...

M

···

--
Posted via http://www.ruby-forum.com/.

Martin Hansen wrote:

Ah, yes, that is how a hash works :). Now I am interested in getting my
add method to work because it does some consistency checking on the key
and value. However, my add method is defined the wrong way/place.

class Hash
  def add
    # your code here
  end
end

But monkey-patching can be a Bad Thing(tm) so be careful. I would
recommend doing your checks in your code, or maybe in a separate
function.
Or, you could even create a real Record object instead of just a hash!

class Record < Hash
  # studd
end

···

--
Posted via http://www.ruby-forum.com/\.

And you chose ruby? Ruby isn't exactly known for speed. And I can practically guarantee you that the syck parser will be faster for hashes of strings than your parser (as it is currently written and several minor iterations out).

But no, what you described IS yaml, it is just a subset of yaml. You can enforce the subset-ness if you feel like it, but I'm not a big fan of static typing, so I don't recommend it.

At the very least, you should look at StringScanner and a full rewrite, drop the silly Record class entirely and build up proper hashes of strings. And don't leave those dangling File objects everywhere. Read the whole file or use the block form of file.open.

···

On Apr 17, 2010, at 01:54 , Martin Hansen wrote:

This is not YAML. YAML allows arbitrarily complex data structures, and
the separator comes before the data (for some stupid reason). One could
possibly parse this data with a YAML parser, but my experience with the
YAML-syck parser (from Perl), which is written in C, is that it is much
slower than a simple parser. And I want speed.

Ah, of cause I am creating record of class Hash and not class Record.
This does not Err (I wonder if it is sane though):

class Record < Hash
  SEPERATOR = "\n---\n"

  include Enumerable

  def initialize(path=nil)
    @path = path
    @ios = File.open(@path, 'r') if @path
  end

  # Reads the next 'record' from the I/O stream. The records are
separated by
  # lines of "---" and each record consists of a varying number of lines
with
  # a key/value pair separated by ": ". The record is returned as a
hash.
  def get_record
    block = @ios.gets(SEPERATOR).chomp!(SEPERATOR)

    record = Record.new

    block.each_line do |line|
      fields = line.split(": ")
      raise ArgumentError, "Bad record line: #{line}" if fields.length
!= 2

      record.add(fields[0],fields[1])
    end

    record
  end

  # Add a key value pair to this record.
  def add(key, value)
    raise ArgumentError, "Bad key contains ': ' or '\n' ->#{key}" if
key.match(": |\n")
    raise ArgumentError, "Bad value contains ': ' or '\n' ->#{value}" if
value.match(": |\n")

    self[key] = value
  end

  # Return the content of this record as a string.
  def to_s
    return nil if empty?
    self.map { | key, val | "#{ key }: #{ val }" }.join( "\n" ) +
SEPERATOR
  end
end

···

--
Posted via http://www.ruby-forum.com/.

And you chose ruby? Ruby isn't exactly known for speed. And I can
practically guarantee you that the syck parser will be faster for hashes
of strings than your parser (as it is currently written and several
minor iterations out).

But no, what you described IS yaml, it is just a subset of yaml. You can
enforce the subset-ness if you feel like it, but I'm not a big fan of
static typing, so I don't recommend it.

Yes, for I want to use Ruby to juggle the records - and then let's see
about the speed :o). I know from Perl that there is a quite a difference
between a home rolled parser and YAML-syck (I never saw any subset-ness
features).

Also, I am learning here. This OO thinking is very exotic ...

At the very least, you should look at StringScanner and a full rewrite,
drop the silly Record class entirely and build up proper hashes of
strings. And don't leave those dangling File objects everywhere. Read
the whole file or use the block form of file.open.

I will gladly consider a rewrite - I am afraid that I find Robert's
contribution confusing (am I unfair?). I read about StringScanner in the
Cookbook IIRC, but I better explore further. Why is the Record class
silly?

M

···

--
Posted via http://www.ruby-forum.com/\.

Inheriting from core classes is rarely a good idea. Basically your
records do not need to be more than Hashes - you just need a proper
implementation of fetching records.

class RecordSource
  include Enumerable

  def self.foreach(file, &b)
    if block_given?
      File.open(file) do |io|
        rs = new(io)
        rs.each_record(&b)
      end
    else
      Enumerator.new(self, :foreach, file)
    end
  end

  def self.open(file)
    File.open(file) do |io|
      yield new(io)
    end
  end

  def initialize(io)
    @io = io
  end

  def each_record
    curr = {}

    @io.each_line do |line|
      case line
      when /^([^:]+):(.*)$/
        curr[$1] = $2
      when /^---$/
        yield curr unless curr.empty?
        curr = {}
      else
        raise "Illegal format: #{line}"
      end
    end

    yield curr unless curr.empty?

    self # conventionally
  end

  alias :each :each_record
end

# usage

RecordSource.open "myfile" do |rs|
  rs.each_record do |rec|
    p rec
  end
end

RecordSource.foreach("myfile") do |rec|
  p rec
end

p RecordSource.foreach("myfile").to_a

Kind regards

robert

···

2010/4/16 Martin Hansen <mail@maasha.dk>:

Ah, of cause I am creating record of class Hash and not class Record.
This does not Err (I wonder if it is sane though):

class Record < Hash
SEPERATOR = "\n---\n"

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

And you chose ruby? Ruby isn't exactly known for speed. And I can
practically guarantee you that the syck parser will be faster for hashes
of strings than your parser (as it is currently written and several
minor iterations out).

But no, what you described IS yaml, it is just a subset of yaml. You can
enforce the subset-ness if you feel like it, but I'm not a big fan of
static typing, so I don't recommend it.

Yes, for I want to use Ruby to juggle the records - and then let's see
about the speed :o). I know from Perl that there is a quite a difference
between a home rolled parser and YAML-syck (I never saw any subset-ness
features).

Also, I am learning here. This OO thinking is very exotic ...

Yeah, it took me a while as well to grok it initially. I think it's
worthwhile, once simply because OO is mainstream nowadays but even
more importantly because it is a very powerful concept which lends
itself very well to modeling problems - especially in a language with
a clean and nice syntax as Ruby.

At the very least, you should look at StringScanner and a full rewrite,
drop the silly Record class entirely and build up proper hashes of
strings. And don't leave those dangling File objects everywhere. Read
the whole file or use the block form of file.open.

I will gladly consider a rewrite - I am afraid that I find Robert's
contribution confusing (am I unfair?).

You have every right to be confused. If you continue the thread at
the other end and point out which of my attempts to explain what's
going on caused you trouble, I can try to refine the explanation and
help clear things up.

Why is the Record class silly?

Not sure which remark you are referring to. I said that inheriting
core classes is rarely a good idea and if Hash functionality is all
you need for your records then you can simply use a Hash and just need
some other place to put the parsing / reading functionality.

Kind regards

robert

···

2010/4/17 Martin Hansen <mail@maasha.dk>:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Robert,

This got pretty advanced fast :/. Multiple things are going on at the
same time - and I have grown up learning that a piece of code should
only do one thing at a time - only one! I think, I need to back up a
step or two to understand what is going on. Perhaps you could enlighten
me on how you addressed this problem strategically?

And how would the appropriate Unit Tests for this thing look like?

Cheers,

M

···

--
Posted via http://www.ruby-forum.com/.

@Robert,

I was wondering about the whole approach, keeping in mind what Ryan
suggested (drop class Record which apparently is silly, and loose
dangling IO stuff). It strikes me that the simplest way to solve this is
to create an external iterator as a mixin (O'Reilly, Ruby Programming
Language, p 138). This is pretty much what you already suggested,
without the foreach method, which I don't need.

module Record
  include Enumerable

  def each
    loop { yield self.next }
  end
end

···

--
Posted via http://www.ruby-forum.com/.

This got pretty advanced fast :/. Multiple things are going on at the same time - and I have grown up learning that a piece of code should only do one thing at a time - only one!

Hey, that's a good thing to learn early! :slight_smile:

I think, I need to back up a step or two to understand what is going on. Perhaps you could enlighten me on how you addressed this problem strategically?

Well, basically I oriented myself on the way IO (and File) and similar classes work (for example CSV): they iterate all elements which are individually handed over to a block. That's the interface part. Basically this is also the contract of Enumerable; #each hands off every item to the block provided.

I separated the file opening from the iteration because you may want to iterate other types of things that respond to #each_line (ARGF comes to mind, but String and StringIO as well). This gives you the flexibility to pull the data from wherever you like.

Parsing itself is pretty straightforward I'd say; at least I am using the idiom frequently. You read line by line and decide what kind of line you see and what you need to do with it. Regular expressions are nice for this combined with a "case".

Finally, I aliased #each to #each_record so including Enumerable makes sense (Enumerable relies on #each being present). That gives you all the nice methods like #select, #find etc. for free. And I added the convenience method #foreach to the class so you can use that IO#foreach / File#foreach.

Downside of this approach is of course that you cannot fetch records individually like in your code where you have a method that fetches only the next record.

And how would the appropriate Unit Tests for this thing look like?

You could create a IO mock that returns lines as that of valid (and invalid!) records and verify that parsing returns exactly those records that you expect. Of course you need various tests for valid and invalid data etc.

Kind regards

  robert

···

On 04/16/2010 03:57 PM, Martin Hansen wrote:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

@Robert,

I was wondering about the whole approach,

Erm, that's a rather broad statement. My approach had two parts:

a) Implementation of parsing which I changed from using gets() with a
delimiter because I found that approach too fragile (think of files on
different platforms with different line delimiters, in this case the
defined delimiter does not always work).

b) Embedding that in code that follows usual Ruby iterator patterns
and integrates seamless with Enumerable.

keeping in mind what Ryan
suggested (drop class Record which apparently is silly, and loose
dangling IO stuff). It strikes me that the simplest way to solve this is
to create an external iterator as a mixin (O'Reilly, Ruby Programming
Language, p 138). This is pretty much what you already suggested,
without the foreach method, which I don't need.

module Record
include Enumerable

def each
loop { yield self.next }
end
end

This implementation has flaws:

1. The loop does not terminate properly.

2. It does not return "self" which is convention for #each implementations.

3. Worst: there is a design issue here. An #each method of a Record
is expected to iterate through items _within_ (or contained in) the
Record. But what you are actually attempting here is to have
something that during iteration returns records (regardless of their
type). Since your method #each does not have arguments member
variables are needed to determine where to fetch records from. But
since you call this module "Record" you have a weird situation that
you need to create something Recordish (i.e. something with record
state) which also has non record state (for example an IO object) to
start returning other records from somewhere else. This is highly
confusing and hence something I would not let pass if I were doing the
code review on your code. :slight_smile:

Kind regards

robert

···

2010/4/19 Martin Hansen <mail@maasha.dk>:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

a) Implementation of parsing which I changed from using gets() with a
delimiter because I found that approach too fragile (think of files on
different platforms with different line delimiters, in this case the
defined delimiter does not always work).

I admit that you implementation will be more robust and more efficient.
However, this parser is meant to be run in a *NIX command line
environment so I can connect scripts with pipes (I should perhaps have
stated this in the "specs"). Actually, I have some more text on this
whole affair:

and

http://code.google.com/p/biopieces/wiki/Introduction#Getting_started

b) Embedding that in code that follows usual Ruby iterator patterns
and integrates seamless with Enumerable.

�def each
� �loop { yield self.next }
�end
end

This implementation has flaws:

1. The loop does not terminate properly.

But I think it does?!? From the Ruby book:

"External iterators are quite simple to use: just call next each time
you want another element. When there are no more elements left, next
will raise a StopIteration excep- tion. This may seem unusual—an
exception is raised for an expected termination condition rather than an
unexpected and exceptional event. (StopIteration is a de- scendant of
StandardError and IndexError; note that it is one of the only exception
classes that does not have the word “error” in its name.) Ruby follows
Python in this external iteration technique. By treating loop
termination as an exception, it makes your looping logic extremely
simple; there is no need to check the return value of next for a special
end-of-iteration value, and there is no need to call some kind of next?
predicate before calling next."

Not that I have been able to get the syntax right and make this work ...

2. It does not return "self" which is convention for #each
implementations.

I have heard of this convention before, but I seem to have missed it in
the Ruby book. I shall see if I can find it.

3. Worst: there is a design issue here. An #each method of a Record
is expected to iterate through items _within_ (or contained in) the
Record. But what you are actually attempting here is to have
something that during iteration returns records (regardless of their
type). Since your method #each does not have arguments member
variables are needed to determine where to fetch records from. But
since you call this module "Record" you have a weird situation that
you need to create something Recordish (i.e. something with record
state) which also has non record state (for example an IO object) to
start returning other records from somewhere else. This is highly
confusing and hence something I would not let pass if I were doing the
code review on your code. :slight_smile:

I was going to forget about the "silly" class Record and stick to hashes
of strings. So modifying the behavior of each to something like this:

include 'my_record'

File.open('my_file', 'r').each do |record|
  puts record.class # => Hash
end

Or is this completely nuts?

Thanks,

M

···

--
Posted via http://www.ruby-forum.com/\.

a) Implementation of parsing which I changed from using gets() with a
delimiter because I found that approach too fragile (think of files on
different platforms with different line delimiters, in this case the
defined delimiter does not always work).

I admit that you implementation will be more robust and more efficient.
However, this parser is meant to be run in a *NIX command line
environment so I can connect scripts with pipes (I should perhaps have
stated this in the "specs").

Well, there are also pipes on Windows. And even if your program is
run on a POSIX system the file might have been created on a Windows
box.

It does not matter that much whether you read from a pipe or file. If
you create something that only needs #each_line like I did you can
pass a wide range of sources - including ARGF which will happily read
from stdin if you do not provide file name arguments to your script.

Actually, I have some more text on this
whole affair:

www.biopieces.org

and

Google Code Archive - Long-term storage for Google Code Project Hosting.

b) Embedding that in code that follows usual Ruby iterator patterns
and integrates seamless with Enumerable.

�def each
� �loop { yield self.next }
�end
end

This implementation has flaws:

1. The loop does not terminate properly.

But I think it does?!?

Did you try it out?

From the Ruby book:

Which book?

"External iterators are quite simple to use: just call next each time
you want another element. When there are no more elements left, next
will raise a StopIteration excep- tion. This may seem unusual—an
exception is raised for an expected termination condition rather than an
unexpected and exceptional event. (StopIteration is a de- scendant of
StandardError and IndexError; note that it is one of the only exception
classes that does not have the word “error” in its name.) Ruby follows
Python in this external iteration technique. By treating loop
termination as an exception, it makes your looping logic extremely
simple; there is no need to check the return value of next for a special
end-of-iteration value, and there is no need to call some kind of next?
predicate before calling next."

Frankly, I dissent that logic. First, it uses exceptions for
*regular* control flow. This is bad in itself (I am pretty sure
you'll find plenty of articles on the web about this topic if you look
for it). Then, it does not even follow conventions implemented in the
Ruby standard library (for example, IO#gets returns nil on EOF).

Not that I have been able to get the syntax right and make this work ...

2. It does not return "self" which is convention for #each
implementations.

I have heard of this convention before, but I seem to have missed it in
the Ruby book. I shall see if I can find it.

3. Worst: there is a design issue here. An #each method of a Record
is expected to iterate through items _within_ (or contained in) the
Record. But what you are actually attempting here is to have
something that during iteration returns records (regardless of their
type). Since your method #each does not have arguments member
variables are needed to determine where to fetch records from. But
since you call this module "Record" you have a weird situation that
you need to create something Recordish (i.e. something with record
state) which also has non record state (for example an IO object) to
start returning other records from somewhere else. This is highly
confusing and hence something I would not let pass if I were doing the
code review on your code. :slight_smile:

I was going to forget about the "silly" class Record and stick to hashes
of strings. So modifying the behavior of each to something like this:

include 'my_record'

File.open('my_file', 'r').each do |record|
puts record.class # => Hash
end

Or is this completely nuts?

It does not work. File.open returns a File (or IO) instance.
Invoking #each on this will fill "record" with lines - not records.
Your line "record.class" will print "String" and not "Hash".

But I see that basically you want what I provided: a mechanism to
iterate records with #each.

At the moment I am not sure where we stand. You seem to insist that
something with my implementation is wrong and external iteration is
better - maybe because you have read about it in some book.
Realistically the need for external iteration rarely surfaces, mostly
it is if you need to iterate two or more collections in lock step. In
all other cases using #each and like mechanisms is superior because it
is less error prone.

Cheers

robert

···

2010/4/20 Martin Hansen <mail@maasha.dk>:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

At the moment I am not sure where we stand. You seem to insist that
something with my implementation is wrong and external iteration is
better - maybe because you have read about it in some book.
Realistically the need for external iteration rarely surfaces, mostly
it is if you need to iterate two or more collections in lock step. In
all other cases using #each and like mechanisms is superior because it
is less error prone.

Robert, your help is much appreciated. I have been researching and
trying to get a grasp at the topics at hand: mixins, enumerable, etc -
and trying to keep an open mind to alternatives. Todays task is to get
your suggestion up and running and play with it by writing a bunch of
unit tests (a great way to learn btw). Me needs to become friends with
stubs (stringio) and mocks (flexmock) ...

Cheers,

Martin

···

--
Posted via http://www.ruby-forum.com/\.