BibTeX parser

Hi all,

I have a difficult problem and I need some smart people to give me a hand.
So I knew where to go for that. :slight_smile:

I’m trying to figure out how to write a parser for BibTeX files. It’s not
easy. A single BibTeX entry might look like this:

@BOOK{texbook,
author = "Donald O'brian",
title = "The {{\TeX}book}",
publisher = 'Addison-Wesley',
year = 1984,
key = {Don's key}
}

I think you can see the problem.

There is a nested collection of {squiggly} brackets, as well as “double
quotes” and ‘single quotes’. I’m not sure either how to represent this
structure or how to parse it.

If I only had to deal with {brackets} I could use an n-ary tree. And to
parse it, I would start with one node, move one character at a time.
Every time I see a { I’d make a new node. Every time I saw a } I would
come back up.
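
The brace-only approach described above can be sketched in Ruby roughly like this. It is only an illustration: the method name parse_braces and the choice of nested arrays as the n-ary tree are assumptions, and quotes are not handled at all.

```ruby
# Hypothetical helper: turns a brace-delimited string into nested arrays.
# Each Array is a tree node; each String is a run of plain text.
def parse_braces(text)
  root = []
  stack = [root]                 # stack.last is the node being filled
  text.each_char do |ch|
    case ch
    when "{"
      child = []
      stack.last << child        # attach the new node to its parent
      stack.push(child)          # descend into it
    when "}"
      raise ArgumentError, "unbalanced '}'" if stack.size == 1
      stack.pop                  # come back up
    else
      node = stack.last
      if node.last.is_a?(String)
        node.last << ch          # extend the current text run
      else
        node << ch.dup           # start a new text run
      end
    end
  end
  raise ArgumentError, "unclosed '{'" unless stack.size == 1
  root
end

p parse_braces("The {{\\TeX}book}")
# => ["The ", [["\\TeX"], "book"]]
```

The explicit stack plays the role of "make a new node" and "come back up", so no recursion is needed.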

Now, when you add “double” quotes, the problem becomes more complicated,
but doable. I could first extract all the quotes and use an array where
quoted and non-quoted text alternates (for instance) and then parse using
the brackets to make an n-ary tree.

But if I have ‘single’ quotes also, things can get very complicated. I
will have to deal with things like:

{Dan's book}

and

"O'brian"

And at this point I am truly at a loss.

I hope one of the more experienced programmers here can offer some
insight.

Thanks a lot,

···


Daniel Carrera | No trees were harmed in the generation of this e-mail.
PhD student. | A significant number of electrons were, however, severely
Math Dept. UMD | inconvenienced.

Hi all,

I have a difficult problem and I need some smart people to give me a hand.
So I knew where to go for that. :slight_smile:

I’m trying to figure out how to write a parser for BibTeX files.
[…]
If I only had to deal with {brackets} I could use an n-ary tree. And to
parse it, I would start with one node, move one character at a time.
Every time I see a { I’d make a new node. Every time I saw a } I would
come back up.

Now, when you add “double” quotes, the problem becomes more complicated,
but doable. I could first extract all the quotes and use an array where
quoted and non-quoted text alternates (for instance) and then parse using
the brackets to make an n-ary tree.

But if I have ‘single’ quotes also, things can get very complicated.
[…]
And at this point I am truly at a loss.

Looks like you’re doing the parser by hand… wouldn’t it be easier
with, say, racc? As for the lexer, you could simply split (well, not
String#split but you get the idea) on spaces & special chars ({}"');
creating a grammar to handle this should be fairly easy. Another
advantage is that you could build an AST and use it to represent the
data. If needed you could simplify it later to transform “recursive”
nodes (i.e. those resulting from recursive productions) into arrays;
this is more convenient and IIRC it’s what you’d get with Rockit. You
might also want to try the latter, but in my past experience I found it
to be too buggy :frowning:
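
As a rough illustration of the "split on spaces & special chars" idea, here is a minimal lexer sketch using StringScanner from Ruby's standard library. The method name bibtex_tokens and the token types :space, :punct and :word are made up for this example, and treating @, = and , as special characters is an assumption on top of the {}"' set mentioned above. It is not what racc or any BibTeX library does.

```ruby
require "strscan"

# Hypothetical lexer: cuts the input into [type, text] pairs.
# Whitespace is kept as a token so field values can be reassembled later.
def bibtex_tokens(src)
  ss = StringScanner.new(src)
  tokens = []
  until ss.eos?
    if (t = ss.scan(/\s+/))
      tokens << [:space, t]
    elsif (t = ss.scan(/[{}"'@=,]/))
      tokens << [:punct, t]          # one special character per token
    elsif (t = ss.scan(/[^\s{}"'@=,]+/))
      tokens << [:word, t]           # a run of everything else
    end
  end
  tokens
end

bibtex_tokens("key = {Don's key}").each { |tok| p tok }
```

The three patterns between them match every possible character, so the loop always advances. Deciding whether a ' is a delimiter or just an apostrophe is left to the grammar that consumes these tokens.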

mmm I guess Coco/Rb could be a good option too, since you also get a
lexer, and LL(1) should be enough for this.
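
To give a feel for why one symbol of lookahead is enough, here is a small hand-written sketch (not Coco/Rb output) that reads a single field value. The first character alone decides which rule applies. The names read_value and read_until_close are hypothetical, and the sketch assumes that only {...} and "..." delimit values, so an apostrophe is always an ordinary character.

```ruby
require "strscan"

# Hypothetical helper: reads one field value at the scanner position.
# The single lookahead character picks the rule to use.
def read_value(ss)
  case ss.peek(1)
  when "{"
    ss.getch
    read_until_close(ss, "}")
  when '"'
    ss.getch
    read_until_close(ss, '"')
  else
    ss.scan(/[^\s,}]+/)            # bare value such as 1984
  end
end

# Collects text up to the closing delimiter seen at brace depth 0.
# Nested {...} groups are copied through unchanged.
def read_until_close(ss, closer)
  depth = 0
  out = String.new
  until ss.eos?
    ch = ss.getch
    return out if ch == closer && depth == 0
    if ch == "{"
      depth += 1
    elsif ch == "}"
      raise ArgumentError, "unbalanced '}'" if depth == 0
      depth -= 1
    end
    out << ch
  end
  raise ArgumentError, "missing closing #{closer}"
end

p read_value(StringScanner.new("{Don's key}"))   # => "Don's key"
p read_value(StringScanner.new("\"O'brian\""))   # => "O'brian"
```

Because the closing delimiter is fixed by the opening one, {Dan's book} and "O'brian" stop being ambiguous under this assumption.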

···

On Fri, Jan 23, 2004 at 03:34:27AM +0900, Daniel Carrera wrote:


Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

If loving linux is wrong, I dont wanna be right.
– Topic for #LinuxGER

Looks like you’re doing the parser by hand…

Yes, because I am a parser-newbie and I don’t know better.

wouldn’t it be easier with, say, racc? As for the lexer [snip]

What’s racc? Do you have a link?

What’s a lexer?

Where can I learn how to make good parsers? I’d really like to do this
right.

creating a grammar to handle this should be fairly easy.

Beautiful! I like easy. :slight_smile:

Another advantage is that you could build an AST and use it to represent
the data.

I’ll also need a link where I can learn what an AST is.

<snip: Rockit />

but in my past experience I found it to be too buggy :frowning:

Buggy is bad. I’ll stick to something reliable.

mmm I guess Coco/Rb could be a good option too, since you also get a
lexer, and LL(1) should be enough for this.

Links for Coco/Rb?

Thanks for all the help! I just knew I was doing something wrong.
Thanks for pointing me in the right direction.

I will do a google search for “lexer”, “grammar” and the other things you
mentioned and I didn’t understand. But if you have good links for me I’d
love to get them.

Thanks again.

Cheers,

···

On Fri, Jan 23, 2004 at 06:08:37AM +0900, Mauricio Fernández wrote:


Hi!

  • Daniel Carrera:

Where can I learn how to make good parsers?

There are a couple of books. A good one in English is

TI lex & yacc
AU John R. Levine, Tony Mason, Doug Brown
ISBN 1-56592-000-7

It deals with lex (a lexer generator) and yacc (Yet Another Compiler
Compiler, a parser generator). There is also a book in German by Helmut
Herold with the same title, ISBN 3-8273-2096-8. By using them you learn
not only the art itself but also two nice tools.

But you do not necessarily have to go through that. It would be
enough to learn using EBNF:

::= (('+' | ',') )*
::= ( '') >
::= | | | | |
::= 'month' 's'?
::= 'w' ('eek' 's'?)?
::= 'd' ('ay' 's'?)?
::= 'h' ('our' 's'?)?
::= 'm' ('in' ('ute')? 's'?)?
::= ('s' ('ec' ('ond')? 's'?)?)?

Take a look at the ‘Parsing periods of time: Code and questions’
thread - implementation of is left to Ruby.

Josef ‘Jupp’ SCHUGT

···

On Fri, Jan 23, 2004 at 06:08:37AM +0900, Mauricio Fernández wrote:

http://oss.erdfunkstelle.de/ruby/ - German comp.lang.ruby-FAQ
http://rubyforge.org/users/jupp/ - Ruby projects at Rubyforge
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Germany 2004: To boldly spy where no GESTAPO / STASI has spied before

Looks like you’re doing the parser by hand…

Yes, because I am a parser-newbie and I don’t know better.

wouldn’t it be easier with, say, racc? As for the lexer [snip]

What’s racc? Do you have a link?

http://i.loveruby.net/en/racc.html
It’s a parser generator similar to YACC (kind of de facto standard) for
Ruby.

What’s a lexer?

Normally parsing is done in two steps. In the typical example of an
arithmetic expression
    1 + 2
the lexer would break the input into tokens (syntactic atoms)
    type     value (semantic info)
    NUMBER   1
    OPPLUS
    NUMBER   2
which would be passed to the parser, which would recognize the expression
according to a number of rules (productions).
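
For what it's worth, the lexing step for that example could look something like this in Ruby. It is only a sketch: the method name lex is made up, the token types NUMBER and OPPLUS are taken from the table above, and only digits, '+' and whitespace are recognised.

```ruby
require "strscan"

# Hypothetical lexer for the "1 + 2" example.
# Returns [type, value] pairs; OPPLUS carries no semantic value.
def lex(src)
  ss = StringScanner.new(src)
  tokens = []
  until ss.eos?
    if ss.skip(/\s+/)
      next                                   # whitespace separates tokens
    elsif (t = ss.scan(/\d+/))
      tokens << [:NUMBER, t.to_i]
    elsif ss.scan(/\+/)
      tokens << [:OPPLUS, nil]
    else
      raise ArgumentError, "unexpected character #{ss.getch.inspect}"
    end
  end
  tokens
end

p lex("1 + 2")
# => [[:NUMBER, 1], [:OPPLUS, nil], [:NUMBER, 2]]
```

As far as I recall, a racc-generated parser consumes tokens in much the same [type, value] shape, so a list like this is what the second step would work from.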

Where can I learn how to make good parsers? I’d really like to do this
right.

There’s a billion books on this, essentially any on compiler
construction. Reading such a thing would probably be overkill, and you
can probably get by if you read the tutorial included in bison’s .info
documentation (bison is the GNU parser generator, compatible with yacc;
if you understand how to use it you can make use of racc similarly).

mmm I guess Coco/Rb could be a good option too, since you also get a
lexer, and LL(1) should be enough for this.

Links for Coco/Rb?

http://raa.ruby-lang.org/list.rhtml?name=coco-rb

You could also try Seattle.rb’s pure-Ruby port of Coco/R
zenspider projects | software projects | by ryan davis

Thanks for all the help! I just knew I was doing something wrong.
Thanks for pointing me in the right direction.

I will do a google search for “lexer”, “grammar” and the other things you
mentioned and I didn’t understand. But if you have good links for me I’d
love to get them.

You can start here

and then proceed to the examples… Even though they’re in C, they
should be very helpful if you use racc later.

···

On Fri, Jan 23, 2004 at 06:27:45AM +0900, Daniel Carrera wrote:

On Fri, Jan 23, 2004 at 06:08:37AM +0900, Mauricio Fernández wrote:


Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

Beeping is cute, if you are in the office :wink:
– Alan Cox