Htmltokenizer bug?

Horacio_Sanson1 · 28 November 2005 09:32

I am using htmltokenizer to extract the links from some web pages. My script
worked perfectly until I started parsing pages with "<" and ">" characters in the
text.

An HTML string like this

<a href="an_uri" > this is a <link> </a>

causes htmlparser to raise an exception: Error, tag is nil....

Is there a patch, or any way to make htmlparser parse this text?

regards,
Horacio

Dick_Davies · 28 November 2005 10:29

I think most *browsers* would choke on that

Have you tried using entities instead ?

(&lt; instead of < and &gt; instead of >)
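
When the markup is generated by your own code, the escaping can be done with Ruby's standard library. This is a minimal sketch, assuming you control how the link is built; `safe_link` is a hypothetical helper name and is not part of htmltokenizer.

```ruby
require "cgi"

# Hypothetical helper: builds an anchor whose href and text are entity-escaped,
# so literal < and > in the text cannot be mistaken for tags.
def safe_link(href, text)
  %(<a href="#{CGI.escapeHTML(href)}">#{CGI.escapeHTML(text)}</a>)
end

puts safe_link("an_uri", "this is a <link>")
```

This only helps for pages you produce yourself; it does nothing for pages fetched from elsewhere.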


On 28/11/05, Horacio Sanson <hsanson@moegi.waseda.jp> wrote:

I am using htmltokenizer to extract the links of some web pages, my script
worked perfectly until I started to parse pages with "<" and ">" chars in the
text.

a html string like this

<a href="an_uri" > this is a <link> </a>

causes the htmlparser to raise and exception; Error, tag is nil....

Is there a patch or any way to make htmlparser to parse this text??

--
Rasputin :: Jack of All Trades - Master of Nuns
http://number9.hellooperator.net/

Daniel_Schierbeck · 28 November 2005 12:52

Horacio Sanson wrote:

I am using htmltokenizer to extract the links of some web pages, my script worked perfectly until I started to parse pages with "<" and ">" chars in the text.

a html string like this

<a href="an_uri" > this is a <link> </a>

causes the htmlparser to raise and exception; Error, tag is nil....

Is there a patch or any way to make htmlparser to parse this text??

regards,
Horacio

Your HTML isn't valid. Either use the proper entities (< = &lt; and > = &gt;) or make a CDATA section, though the latter isn't really that well-supported in most browsers.

<a href="an_uri"><![CDATA[this is a <link>]]></a>

Cheers,
Daniel

Horacio_Sanson1 · 28 November 2005 13:40

Well, the problem is that this HTML is not mine; I am retrieving the pages from the
Internet.

I guess I will skip this page in my script.
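
Skipping a failing page without stopping the whole run could look roughly like the sketch below. It assumes a regex scan as a stand-in for the tokenizer-based extraction; `HREF_PATTERN`, `extract_links` and `links_per_page` are hypothetical names, and the regex is deliberately crude (it only reads quoted href values from opening `<a>` tags).

```ruby
# Hypothetical stand-in for the real extraction step. It only looks at the
# opening <a ...> tag, so stray < and > inside the link text are ignored.
HREF_PATTERN = /<a\b[^>]*?\bhref\s*=\s*["']([^"']*)["']/i

def extract_links(html)
  html.scan(HREF_PATTERN).flatten
end

# Process each page independently; a page that raises is reported on stderr
# and left out of the result instead of aborting the run.
def links_per_page(pages)
  pages.each_with_object({}) do |(name, html), result|
    begin
      result[name] = extract_links(html)
    rescue StandardError => e
      warn "skipping #{name}: #{e.message}"
    end
  end
end

pages = {
  "good.html" => '<a href="an_uri" > this is a <link> </a>',
  "bad.html"  => nil # nil has no #scan, so this page raises and is skipped
}
p links_per_page(pages)
```

With a real parser, the `rescue` clause would wrap the parser call in the same way.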

thanks,
Horacio

On Monday 28 November 2005 21:52, Daniel Schierbeck wrote:


Horacio Sanson wrote:
> I am using htmltokenizer to extract the links of some web pages, my
> script worked perfectly until I started to parse pages with "<" and ">"
> chars in the text.
>
> a html string like this
>
> <a href="an_uri" > this is a <link> </a>
>
> causes the htmlparser to raise and exception; Error, tag is nil....
>
>
> Is there a patch or any way to make htmlparser to parse this text??
>
>
> regards,
> Horacio

Your HTML isn't valid. Either use the proper entities (< = &lt; and > =
&gt;) or make a CDATA section, though the latter isn't really that
well-supported in most browsers.

<a href="an_uri"><![CDATA[this is a <link>]]></a>

Cheers,
Daniel

Daniel_Amelang · 2 December 2005 01:17

Sorry for the late reply.

I'm surprised no one mentioned RubyfulSoup:

If I understand your problem correctly, it's exactly what you need: a
forgiving html parser.

Dan


On 11/28/05, Horacio Sanson <hsanson@moegi.waseda.jp> wrote:

Well the problem is that this HTML is not mine, retrieving the pages from the
Internet.

Guess I will skip this page from my script.

thanks,
Horacio

On Monday 28 November 2005 21:52, Daniel Schierbeck wrote:
> Horacio Sanson wrote:
> > I am using htmltokenizer to extract the links of some web pages, my
> > script worked perfectly until I started to parse pages with "<" and ">"
> > chars in the text.
> >
> > a html string like this
> >
> > <a href="an_uri" > this is a <link> </a>
> >
> > causes the htmlparser to raise and exception; Error, tag is nil....
> >
> >
> > Is there a patch or any way to make htmlparser to parse this text??
> >
> >
> > regards,
> > Horacio
>
> Your HTML isn't valid. Either use the proper entities (< = &lt; and > =
> &gt;) or make a CDATA section, though the latter isn't really that
> well-supported in most browsers.
>
> <a href="an_uri"><![CDATA[this is a <link>]]></a>
>
>
> Cheers,
> Daniel

James_Britt4 · 2 December 2005 02:29

Daniel Amelang wrote:

Sorry for the late reply.

I'm surprised no one mentioned RubyfulSoup:

Rubyful Soup: "The brush has got entangled in it!"

If I understand your problem correctly, it's exactly what you need: a
forgiving html parser.

I recently tried using RubyfulSoup to parse a Web page, and it had some peculiar behavior, such as stripping all attributes. Either I was not using it correctly, or it was a bit too casual in making sense of the input.

I ended up using some crude string parsing to extract just the subset of the page I wanted, which gave me well-formed XML suitable for REXML manipulation. I got a phenomenal speed increase from that as well; RubyfulSoup seems quite slow.
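
The approach described above might look something like the following. This is only a sketch under assumptions: the sample page, the marker strings and the helper names `well_formed_slice` and `hrefs_from` are all hypothetical, and it relies on the chosen slice really being well-formed XML.

```ruby
require "rexml/document"

# Hypothetical page: messy overall, but one region is well-formed.
page = <<~HTML
  <html><body><p>unclosed paragraph
  <ul id="links">
    <li><a href="one.html">One</a></li>
    <li><a href="two.html">Two</a></li>
  </ul>
  <br> text with a stray < character
  </body></html>
HTML

# Crude extraction: slice from the opening marker through the first
# closing marker that follows it. Returns nil if either is missing.
def well_formed_slice(page, open_marker, close_marker)
  start = page.index(open_marker)
  return nil unless start
  finish = page.index(close_marker, start)
  return nil unless finish
  page[start...(finish + close_marker.length)]
end

# Parse the fragment with REXML and collect the href of every <a> element.
def hrefs_from(fragment)
  doc = REXML::Document.new(fragment)
  REXML::XPath.match(doc, "//a").map { |a| a.attributes["href"] }
end

fragment = well_formed_slice(page, '<ul id="links">', "</ul>")
p hrefs_from(fragment)
```

REXML raises a parse error if the slice turns out not to be well-formed, so the markers need to be chosen per page.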

James


--

http://www.ruby-doc.org - Ruby Help & Documentation
Ruby Code & Style - Ruby Code & Style: Writers wanted
http://www.rubystuff.com - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com - Playing with Better Toys
http://www.30secondrule.com - Building Better Tools
