I need to access an HTTP server and interpret some data from the page I get
back (basically for some minimal tests of a website). I know that I can use
the Net::HTTP class to connect and retrieve the page, but then I am left with
a string full of stuff.
What do people use to parse this into something useful? Is REXML an option
(although the HTML is not likely to be valid XML)? I have looked at the
html-parser on RAA but do not seem to be able to individually access the
components of the returned page (for example, I need to see what the contents
of a text control are, or what the caption of the
tag is).
I suppose using regexps is an option as well, but just wondering if I am
missing some cool library that already does all this stuff?
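For what it is worth, here is a minimal sketch of the regexp route mentioned above, written for a recent Ruby. The helper names (fetch_page, input_value_by_regexp) and the sample markup are made up for illustration, and the pattern assumes that name comes before value and that both are double-quoted, which real pages will not always honour.

```ruby
require 'net/http'
require 'uri'

# Fetch the raw body of a page as a String.
# (Not called below, so the sketch runs without network access.)
def fetch_page(url)
  Net::HTTP.get(URI.parse(url))
end

# Return the value attribute of the <input> with the given name, or nil.
# Assumption: name appears before value and both are double-quoted.
def input_value_by_regexp(html, name)
  pattern = /<input\b[^>]*\bname="#{Regexp.escape(name)}"[^>]*\bvalue="([^"]*)"/i
  match = html.match(pattern)
  match && match[1]
end

# Hypothetical page fragment standing in for what Net::HTTP would return.
SAMPLE_HTML = '<form><input type="text" name="user" value="martin"></form>'

puts input_value_by_regexp(SAMPLE_HTML, 'user')
```

This is fine for a handful of smoke tests, but it breaks as soon as the attribute order or quoting changes, which is the usual argument for a real parser.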
On Thursday 05 of February 2004 18:24, Martin Hart wrote:
What do people use to parse this into something useful? Is REXML an option
(although the html is not likely to be valid xml)? I have looked at the
html-parser on RAA but do not seem to be able to individually access the
components of the returned page (for example I need to see what the
contents of a text control are - or what the caption of the
For the OP: you can use the above library to convert HTML into a
REXML::Document, then pull it apart as you please.
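A rough sketch of the "pull it apart" half, assuming a recent Ruby with REXML available. The conversion step itself is not shown: the document here is built from a small well-formed string as a stand-in for whatever the HTML-to-REXML conversion produces, and the helper names and sample markup are hypothetical.

```ruby
require 'rexml/document'

# Stand-in for the converted page: already well-formed, so REXML parses it.
XHTML_PAGE = <<~PAGE
  <html>
    <head><title>Login</title></head>
    <body>
      <form action="/login">
        <input type="text" name="user" value="martin"/>
      </form>
    </body>
  </html>
PAGE

# Text of the first <title> element, or nil if there is none.
def page_title(doc)
  node = REXML::XPath.first(doc, '//title')
  node && node.text
end

# Value attribute of the <input> with the given name, or nil.
def input_value_by_xpath(doc, name)
  node = REXML::XPath.first(doc, "//input[@name='#{name}']")
  node && node.attributes['value']
end

doc = REXML::Document.new(XHTML_PAGE)
puts page_title(doc)
puts input_value_by_xpath(doc, 'user')
```

Once the page is a REXML::Document, XPath queries like these replace the hand-written regexps.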
Gavin
···
On Friday, February 6, 2004, 5:39:15 AM, Dave wrote:
Martin Hart wrote:
OK feel free to call me an idiot here, but what versions of html-parser and
htmltools are you running?
I downloaded both the html-parser and the patched-html-parser from RAA, which
installed themselves into site_ruby/ (not where I’d expect them:
site_ruby/1.8/…). I did this because htmltools appears to depend on one of
them, although this is not mentioned in the README (version 1.06).
Then I downloaded htmltools from RubyForge, which first fails to install
because the sgml-parser.rb file is not in “html/sgml-parser”, which is where
it is supposed(?) to be.
Anyway, after moving files to where I presume they should be installed to, the
htmltools library fails to install because the tests do not run (all 15 unit
tests fail with “NameError: uninitialized constant
HTML::TestStackingParser”).
My environment is ruby 1.8.1 linux.
My next step is to just install the files by hand and then try again - but I
would be interested to hear if anybody else has experienced similar
installation problems - or if I am just missing something obvious?
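One way to check where things ended up before moving files by hand is to ask Ruby itself. This is only a sketch for a recent Ruby (1.8.1 spells some of this differently); locate_feature is a made-up helper name, and 'html/sgml-parser' is simply the path quoted above.

```ruby
require 'rbconfig'

# Return every directory on the load path that contains "<feature>.rb".
def locate_feature(feature, load_path = $LOAD_PATH)
  load_path.select { |dir| File.file?(File.join(dir.to_s, "#{feature}.rb")) }
end

# Where this interpreter keeps site-wide and version-specific libraries.
puts "site dir:     #{RbConfig::CONFIG['sitedir']}"
puts "site lib dir: #{RbConfig::CONFIG['sitelibdir']}"

hits = locate_feature('html/sgml-parser')
if hits.empty?
  puts 'html/sgml-parser.rb is not on the load path'
else
  hits.each { |dir| puts "found under #{dir}" }
end
```

If the file turns up under the site dir but not under html/, that would match the failed require described above.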
Cheers,
Martin
···
On Friday 06 February 2004 12:40, Martin Hart wrote:
On Thursday 05 February 2004 21:02, Gavin Sinclair wrote:
On Friday, February 6, 2004, 5:39:15 AM, Dave wrote:
My next step is to just install the files by hand and then try again - but I
would be interested to hear if anybody else has experienced similar
installation problems - or if I am just missing something obvious?
I had the same experience, so it’s not you, it’s the code. I had to tweak
some of the tests, and maybe even some of the code, to get the tests to
pass and thus allow installation. Sorry, but I don’t have the details recorded.
Thanks - I got there in the end anyway by manually installing all the files I
had downloaded and tweaking them as necessary.
Just to append a note to the mini thread that started on packaging as a result
of this… while a packaging system with all the works would be great, it
seems to me what is really needed soonest is a definitive place where we can
take downloads from. I got the versions of code that I am using from RAA…
Where I came unstuck is that there appear to be two different(?) versions of
ruby-htmltools: one by Ned Konz that is linked to from RAA, and one by
Johannes Brodwall that is on RubyForge. I don’t know the history of these;
it may well be that they are the same product that has changed ownership etc.,
but it does cause confusion (at least in my case) when two people download
the same thing from two different places. There is no common frame of
reference: we think that we are talking about the same code but we may not
be.
Cheers,
Martin
···
On Sunday 08 February 2004 10:40, daz wrote:
It seems that Johannes’ idea is to include sgml-parser with the
updated htmltools library.
[snip]
True, but this is an isolated case. I’ve never seen so much
fragmentation with a Ruby library as I’ve seen with htmltools.
Since there is an htmltools project on RubyForge, that should become
the definitive one, once it’s ensured that it’s fully up to date.
I’ll be doing more HTML parsing fairly soon, so I’ll try to do my bit
in this area.
Cheers,
Gavin
···
On Sunday, February 8, 2004, 11:38:24 PM, Martin wrote:
On Sunday 08 February 2004 10:40, daz wrote:
It seems that Johannes’ idea is to include sgml-parser with the
updated htmltools library.
[snip]
Where I came unstuck is that there appear to be two different(?) versions of
ruby-htmltools: one by Ned Konz that is linked to from RAA, and one by
Johannes Brodwall that is on RubyForge. I don’t know the history of these;
it may well be that they are the same product that has changed ownership etc.,
but it does cause confusion (at least in my case) when two people download
the same thing from two different places. There is no common frame of
reference: we think that we are talking about the same code but we may not
be.
Until I read this thread, I was unaware of 1.06 on RubyForge which is
an “updated for Ruby 1.8” version of 1.04 from RAA. ((garbage sentence))
This problem isn’t too common atm, but you’re right - this example is in a bit
of a mess. The issue is understood by those who matter.