HTML Parsing?


(Martin Hart) #1

Hi all,

I need to access an HTTP server and interpret some data from the page I get
back (basically for some minimal tests of a website). I know that I can use
the Net::HTTP class to connect and retrieve the page, but then I am left with
a string full of stuff.

What do people use to parse this into something useful? Is REXML an option
(although the html is not likely to be valid xml)? I have looked at the
html-parser on RAA but do not seem to be able to individually access the
components of the returned page (for example I need to see what the contents
of a text control are - or what the caption of the

tag is.

I suppose using regexps is an option as well, but just wondering if I am
missing some cool library that already does all this stuff?
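
For reference, a minimal sketch of the regexp route mentioned above, using only the standard library. The helper names `fetch_page` and `text_input_value` are made up for illustration, and the matching is deliberately naive: it assumes quoted attribute values and no `>` inside them, so it is a rough test aid rather than a real HTML parser.

```ruby
require 'net/http'
require 'uri'

# Fetch the raw page body as a String (hypothetical helper; no error handling).
def fetch_page(url)
  Net::HTTP.get(URI(url))
end

# Return the value attribute of the first <input> tag whose name attribute
# equals +name+, or nil if there is none. Only quoted attributes are handled.
def text_input_value(html, name)
  html.scan(/<input\b[^>]*>/i) do |tag|
    attrs = {}
    # Collect key="value" and key='value' pairs from this one tag.
    tag.scan(/([\w-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/) do |key, dq, sq|
      attrs[key.downcase] = dq || sq
    end
    return attrs['value'] if attrs['name'] == name
  end
  nil
end
```

With a page already in hand, `text_input_value(page, 'user')` would give the current contents of the text control named `user` (the control name is just an example).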

Thanks for any advice

Martin


Martin Hart
Arnclan Limited
53 Union Street
Dunstable, Beds
LU6 1EX
http://www.arnclanit.com


(Dave Lee) #2

Martin Hart wrote:

What do people use to parse this into something useful? Is REXML an option
(although the html is not likely to be valid xml)? I have looked at the
html-parser on RAA but do not seem to be able to individually access the
components of the returned page (for example I need to see what the contents
of a text control are - or what the caption of the

tag is.

see http://ruby-htmltools.rubyforge.org/

I used this library about a year ago, and found it pretty buggy.

Dave


(Emmanuel Touzery) #3

see the thread at
http://www.ruby-talk.org/cgi-bin/vframe.rb/ruby/ruby-talk/91265?91157-91621+split-mode-vertical

emmanuel

On Thursday 05 of February 2004 18:24, Martin Hart wrote:

What do people use to parse this into something useful? Is REXML an option
(although the html is not likely to be valid xml)? I have looked at the
html-parser on RAA but do not seem to be able to individually access the
components of the returned page (for example I need to see what the
contents of a text control are - or what the caption of the

tag is.


(Gavin Sinclair) #4

For the OP: you can use the above library to convert HTML into a
REXML::Document, then pull it apart as you please.
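
A rough sketch of the "pull it apart" half, assuming the markup has already been turned into well-formed XML (the htmltools conversion step itself is not shown here, since its API is not quoted in this thread). The function name `text_input_values` is invented for the example; the REXML calls are from the standard REXML library.

```ruby
require 'rexml/document'

# Map the name of each text input to its value attribute.
# Assumes +xhtml+ is well-formed XML; REXML will usually reject anything else.
def text_input_values(xhtml)
  doc = REXML::Document.new(xhtml)
  values = {}
  REXML::XPath.each(doc, "//input[@type='text']") do |input|
    values[input.attributes['name']] = input.attributes['value']
  end
  values
end
```

The same XPath style should work for other elements too, for example `REXML::XPath.first(doc, '//form')`.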

Gavin

On Friday, February 6, 2004, 5:39:15 AM, Dave wrote:

see http://ruby-htmltools.rubyforge.org/

I used this library about a year ago, and found it pretty buggy.


(Martin Hart) #5

thanks for all the advice - I can’t believe that I missed the similar thread
started by Gavin only 4 days ago :-(

Sorry for the noise.

Cheers,
Martin

On Thursday 05 February 2004 21:02, Gavin Sinclair wrote:

For the OP: you can use the above library to convert HTML into a
REXML::Document, then pull it apart as you please.

Gavin


(Martin Hart) #6

OK feel free to call me an idiot here, but what versions of html-parser and
htmltools are you running?

I downloaded both the html-parser and the patched-html-parser from RAA, which
installed themselves into site_ruby/ (not where I’d expect them -
site_ruby/1.8/…). I did this because htmltools appears to depend on one of
them - although this is not mentioned in the README (version 1.06).

Then I downloaded htmltools from rubyforge which first fails to install
because the sgml-parser.rb file is not in “html/sgml-parser” which is where
it is supposed(?) to be.
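
One way to check where (or whether) a file such as html/sgml-parser.rb is visible to `require` is to walk the load path by hand. This is only a sketch: `locate_feature` is a made-up helper, and it looks for plain .rb files only, ignoring compiled extensions and anything a package manager might add.

```ruby
# List every place on the load path where +feature+ (e.g. "html/sgml-parser")
# exists as a .rb file. An empty result suggests `require` would not find it.
def locate_feature(feature, load_path = $LOAD_PATH)
  load_path.map { |dir| File.join(dir.to_s, "#{feature}.rb") }
           .select { |candidate| File.file?(candidate) }
end
```

Running `puts locate_feature('html/sgml-parser')` after an install shows which copy, if any, would be picked up; more than one line of output means two copies are competing.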

Anyway, after moving files to where I presume they should be installed to, the
htmltools library fails to install because the tests do not run (all 15 unit
tests fail with “NameError: uninitialized constant
HTML::TestStackingParser”).

My environment is ruby 1.8.1 linux.

My next step is to just install the files by hand and then try again - but I
would be interested to hear if anybody else has experienced similar
installation problems - or if I am just missing something obvious?

Cheers,
Martin


(Dave Lee) #7

Martin Hart wrote:

My next step is to just install the files by hand and then try again - but I
would be interested to hear if anybody else has experienced similar
installation problems - or if I am just missing something obvious?

I had the same experience, so it’s not you, it’s the code. I had to tweak
some of the tests, and maybe even some of the code, to get the tests to
pass so that the installation would complete. Sorry, but I don’t have the
details recorded.

Dave


(Gavin Sinclair) #8

I got my stuff from http://bike-nomad.com/ruby/ and its linked
resources.

Cheers,
Gavin

On Saturday, February 7, 2004, 1:06:19 AM, Martin wrote:

OK feel free to call me an idiot here, but what versions of html-parser and
htmltools are you running?


(daz) #9

(Martin Hart) #10

Thanks - I got there in the end anyway by manually installing all the files I
had downloaded and tweaking them as necessary.

Just to append a note to the mini thread that started on packaging as a result
of this… while a packaging system with all the works would be great, it
seems to me what is really needed soonest is a definitive place where we can
take downloads from. I got the versions of code that I am using from RAA…

Where I came unstuck is that there appear to be two different(?) versions of
ruby-htmltools. One by Ned Konz that is linked to from RAA, and one by
Johannes Brodwall that is on rubyforge. I don’t know the history of these -
it may well be that they are the same product that has changed ownership etc,
but it does cause confusion (at least in my case :-) when two people download
the same thing from two different places. There is no common frame of
reference, we think that we are talking about the same code but we may not
be.

Cheers,
Martin

On Sunday 08 February 2004 10:40, daz wrote:

It seems that Johannes’ idea is to include sgml-parser with the
updated htmltools library.
[snip]


(Gavin Sinclair) #11

True, but this is an isolated case. I’ve never seen so much
fragmentation with a Ruby library as I’ve seen with htmltools :-)

Since there is an htmltools project on RubyForge, that should become
the definitive one, once it’s ensured that it’s fully up to date.
I’ll be doing more HTML parsing fairly soon, so I’ll try to do my bit
in this area.

Cheers,
Gavin


(daz) #12

“Martin Hart” wrote:

Where I came unstuck is that there appear to be two different(?) versions of
ruby-htmltools. One by Ned Konz that is linked to from RAA, and one by
Johannes Brodwall that is on rubyforge. I don’t know the history of these -
it may well be that they are the same product that has changed ownership etc,
but it does cause confusion (at least in my case :-) when two people download
the same thing from two different places. There is no common frame of
reference, we think that we are talking about the same code but we may not
be.

Until I read this thread, I was unaware of 1.06 on RubyForge which is
an “updated for Ruby 1.8” version of 1.04 from RAA. ((garbage sentence))

This problem isn’t too common atm, but you’re right - this example is in a bit
of a mess. The issue is understood by those who matter.

Sorry we had to share the same inconvenience :-)

Cheers,

daz


(Johannes Brodwall) #13

“daz” dooby@d10.karoo.co.uk wrote in message

Until I read this thread, I was unaware of 1.06 on RubyForge which is
an “updated for Ruby 1.8” version of 1.04 from RAA. ((garbage sentence))

This problem isn’t too common atm, but you’re right - this example is in a bit
of a mess. The issue is understood by those who matter.

Sorry we had to share the same inconvenience :-)

Thank you all for the feedback, and especially to daz for alerting me
directly (I haven’t paid attention to ruby-talk lately).

I have updated the tarball to include sgml-parser. Sorry about the
slip-up.

I will not have much time to work on the project for a long while. If anyone
wants to lend a hand, please speak up.

~Johannes


(daz) #14

“Johannes Brodwall” wrote:

[snip]
I have updated the tarball to include sgml-parser.

That’s greatly appreciated, Johannes, thank you for
this and for your previous updates to this library.

http://rubyforge.org/projects/ruby-htmltools/
(Version 1.07)

daz