Simple parsing of sloppy HTML - LittleLexer


(John Carter) #1

There have been a couple of threads on parsing HTML.

If you looking for something excruciatingly simple (and simplistic) that
will parse the sloppiest of HTML pages, you could do worse than look at
LittleLexer.

http://littlelexer.rubyforge.org/

At a mere 44 lines of non-blank non-comment lines of Ruby, it has the
virtue of a certain elegant simplicity.

It includes a rudimentary HTML parser as an example.

John Carter Phone : (64)(3) 358 6639
Tait Electronics Fax : (64)(3) 359 4632
PO Box 1645 Christchurch Email : john.carter@tait.co.nz
New Zealand

Note to all marketers. If you want to sell things to me, buy Google words.

I refuse on principle to buy anything sold by spam or popup and I
never follow any links found in a spam. I do however use Google and
will often follow the neat non-irritating Google Word ads that are of
interest to me.


(Laurent Julliard) #2

I wonder if you’re going to add some more documentation. It seem that
littlelexer is someway interesting, but I could not understand it last
time I looked at it (note that I’m stupid)

···

il Wed, 11 Feb 2004 10:59:41 +0900, John Carter john.carter@tait.co.nz ha scritto::

http://littlelexer.rubyforge.org/

At a mere 44 lines of non-blank non-comment lines of Ruby, it has the
virtue of a certain elegant simplicity.

It includes a rudimentary HTML parser as an example.


(Clifford Heath) #3

John Carter wrote:

http://littlelexer.rubyforge.org/

Cute, but it’s not a parser, it’s a composite lexer.
A regular expression engine cannot parse context-free grammars (like
most programming languages), even if you can layer them like this.


(John Carter) #4

Yes and no.

It’s a while since I did the theory courses so forgive me if my
details as to whether something is LR or LL is wrong…

Note 1: If I remember correctly even a full “parser generator” like yacc/Bison
cannot recognize an arbitrary CFG. They only recognize a limited subset
like LL(1)

Thus LittleLexer is a parser in the same sense that yacc/Bison are,
but the subset of CFG’s it can recognize is more limited.

Note 2: The fundamental limitation on regex’s as a parser is they
don’t cope with nested structures.

So as a mental exercise I was trying to visualize how I would get
LittleLexer to chew on that simplest and best poster child example of
a “nested” language, s-expressions. (Lisp like expressions)

The answer is quite simple in that case, have a lexer pass, no surprise.
And then have a pass that recognizes unbracketed regions.
%r{([^()]*)} maps to ‘f’

eg.

(+ (- 30 20) 10)

Feed it to the lexer and it outputs…
’(+(-dd)d)’, [’(’,’+’,’(’,’-’,‘30’,‘20’,’)’,‘10’,’)’]

Feed it to the parser and it produces…

‘(+fd)’, [’(’,’+’,’(-dd)’,‘d’,’)’]

Feed it back into exactly the same little lexer instance and you have…
‘f’, ‘[’(+fd)’]

Now there is an option on the LittleLexer scan method that allows you
to pass in the resulting token list from the previous pass so what you
get is a AST (Abstract Syntax Tree) of the s-expression. Voila!
LittleLexer parsed a language that is not a regular expression.

The LispParser as a proof of concept is easy to do. What I want to do
for version 2 of LitteLexer is generalize the notion of handling
nested structures like that. But since the whole thing is a “late
night, can’t sleep” project, I’m stilling mulling over what I mean by
"nested structure".

For example consider Ruby if/else expressions…

if a
dostuff
if b
do_more_stuff
else
dothing
end
end

Would map to something like…
if => 'i’
statement => 's’
else => 'l’
end => 'e’
expression => ‘x’

So that would be…
‘ixsixslsee’

So we could have a regex that recognized “atomic” if/else statements…
[/ixs+(ls)?e/, ‘s’]

For example, a LL(1) parser must be able to decide what path it is
going down by one token of look ahead. In a sense this iterative
approach I’m using is a bit better off. I can use the entire string to
decide whether I have an atomic item or a nested item.

So why doesn’t Bison and the like do this? Because it is slow, the
whole art and science of compiler compiler’s evolved in days of very
slow computers trying to do far more than they had power for.

On the other hand it is not too slow as Ruby’s regex engine is really
very good and I’m only manipulating strings and pointers which are
typically an order of magnitude smaller than the original file.

John Carter Phone : (64)(3) 358 6639
Tait Electronics Fax : (64)(3) 359 4632
PO Box 1645 Christchurch Email : john.carter@tait.co.nz
New Zealand

Note to all marketers. If you want to sell things to me, buy Google words.

I refuse on principle to buy anything sold by spam or popup and I
never follow any links found in a spam. I do however use Google and
will often follow the neat non-irritating Google Word ads that are of
interest to me.

···

On Thu, 12 Feb 2004, Clifford Heath wrote:

John Carter wrote:

http://littlelexer.rubyforge.org/

Cute, but it’s not a parser, it’s a composite lexer.
A regular expression engine cannot parse context-free grammars (like
most programming languages), even if you can layer them like this.


(Clifford Heath) #5

Regular expressions can’t parse reentrant definitions, for example
stmt ::= while expr stmt | if expr stmt | …

Sure you can wrap them up in loops and other things, and that’s
what parsers do when they wrap themselves around a lexer, but
you must have a layer that is not a regular expression, or it
doesn’t work. This has been mathematically proven from the
definition of RE, so if you find a way around it, you don’t
have an RE.

So we could have a regex that recognized “atomic” if/else statements…

Yes, but if the if/else statement contains another, and that contains
another, to arbitrary depth, then no RE can parse it, with any amount
of look-ahead.


(John Carter) #6

Correct, and the fact that I can and do parse lisp via this trick does
demonstrate that I can recognize a larger set than a RE.

John Carter Phone : (64)(3) 358 6639
Tait Electronics Fax : (64)(3) 359 4632
PO Box 1645 Christchurch Email : john.carter@tait.co.nz
New Zealand

Note to all marketers. If you want to sell things to me, buy Google words.

I refuse on principle to buy anything sold by spam or popup and I
never follow any links found in a spam. I do however use Google and
will often follow the neat non-irritating Google Word ads that are of
interest to me.

···

On Fri, 13 Feb 2004, Clifford Heath wrote:

Sure you can wrap them up in loops and other things, and that’s
what parsers do when they wrap themselves around a lexer, but
you must have a layer that is not a regular expression, or it
doesn’t work. This has been mathematically proven from the
definition of RE, so if you find a way around it, you don’t
have an RE.