Hpricot problem

Henry_Maddocks · 17 December 2006 09:55

Sorry, try again...

Not sure where to send this, sorry if it's not the right place...

The html in the attached file renders 'correctly' in the 3 browsers I have tried but it tricks hpricot because of the second malformed comment. When I say correctly I mean I get to see 'Some text'. I guess it could be argued that this is incorrect. For my application it would be nice if hpricot behaved like a browser.

Henry

hpricot_comment_test.rb (192 Bytes)

Paul_Lutus · 17 December 2006 10:15

Henry Maddocks wrote:

Sorry, try again...

Not sure where to send this, sorry if it's not the right place...

The html in the attached file renders 'correctly' in the 3 browsers I
have tried but it tricks hpricot because of the second malformed
comment. When I say correctly I mean I get to see 'Some text'. I
guess it could be argued that this is incorrect. For my application
it would be nice if hpricot behaved like a browser.

You have created a new thread, and you have not attached any prior text.
This requires us to start over.

Tell us what you hoped would happen, what happened instead, and how they
differ.

If your goal is to filter particular content from HTML pages, just say so,
and be specific about what you want and don't want. Given this information,
I will show you how to extract the desired content with a few lines of
Ruby, no fuss, no undue complexity, no Hpricot.

IIRC, you had asked for help using Hpricot to extract text between <p> and
</p> tag pairs, but with the added requirement that there be an IMG tag
within the <p> ... </p> tag pair to validate the case. Is this still the
goal? If so, how did my previously posted, simple solution work out for
you?

This is a scene in a much larger play, one in which someone says, "Wow, I
had no idea there was such a powerful library, so carefully designed, so
complete. But, notwithstanding its extraordinary features, notwithstanding
the hundreds of man-hours expended creating it ... I can't get it to do
what I want."

This is a very common refrain. I think I can solve your problem with a few
lines of Ruby code, code that you can easily understand and adapt to
specific and evolving requirements. And if I cannot do this, I will say so.

···

--
Paul Lutus
http://www.arachnoid.com

Peter_Szinek3 · 17 December 2006 13:44

The html in the attached file renders 'correctly' in the 3 browsers I
have tried but it tricks hpricot because of the second malformed
comment. When I say correctly I mean I get to see 'Some text'. I guess
it could be argued that this is incorrect.

What are you trying to do? Matching that comment? Or matching the text
'Some text'? Which version of Hpricot do you use (svn head or 0.4)? What
exactly is the problem?

For my application it would be nice if hpricot behaved like a browser.

Well, if this is the goal, then use a browser :-). Hpricot is not a
browser and it does not try to be one.

I am working on a Java project where we use Mozilla/Firefox
XULRunner to parse the HTML (and to communicate with FF), and it's
really, really robust and fast and reliable and and and. However, AFAIK
this is not doable in Ruby ATM (I would be really happy if it were,
but from what I have seen it's not: there was an initial attempt to
implement rbXPCOM, but it was abandoned in 2001). Maybe some other
browser (Safari, Opera?)

Btw. which feature of 'browser-like'-ness would you like to use? What
are your exact requirements?

Peter

···

__
http://www.rubyrailways.com

_why · 18 December 2006 06:16

Great stuff! Thank you. This is going to be a fun one to work on, so I'll get
back to you when I've got the medicine.

_why

···

On Sun, Dec 17, 2006 at 06:55:48PM +0900, Henry Maddocks wrote:

The html in the attached file renders 'correctly' in the 3 browsers I
have tried but it tricks hpricot because of the second malformed
comment.

Peter_Szinek3 · 17 December 2006 10:51

Hello,

Given this information,
I will show you how to extract the desired content with a few lines of
Ruby, no fuss, no undue complexity, no Hpricot.

Why should it be complicated? What fuss? Who needs a few lines? With the
current version of Hpricot this is exactly one line:

doc//p[img]//text()

This is a scene in a much larger play, one in which someone says, "Wow, I
had no idea there was such a powerful library, so carefully designed, so
complete. But, notwithstanding its extraordinary features, notwithstanding
the hundreds of man-hours expended creating it ... I can't get it to do
what I want."

You know, software evolves. 3 (or 4, or something like
this) days ago the above was not available in Hpricot, and since
it was such a common query, and requested by people, voilà: now it is
there.

Of course there will always be some missing features - no framework or
library can solve all the problems of all mankind - but after some time,
useful feedback (i.e. not 'forget about every framework since you can do
it in a few lines of Ruby' but rather feature requests, bug reports etc)
a framework can reach a maturity level where it solves most of the
problems of its users.

Btw. ever heard of 'reinventing the wheel'?

Also your (otherwise great) code snippets always assume that the
underlying HTML is well formed, and x and y and z - which is in real
life almost never the case. Of course the posters here are not pasting
200K of HTML against which they run their production code, but a few
lines of example which is usually an oversimplification of the problem.

This is another point where such libraries are great: they handle 844747
special cases (if your case is not among them, see the current-2nd
paragraph, or add it there on your own) which is always a problematic
thing in case of hand written stuff.

I could state here 100 other points which would prove that in
production, libraries are almost always a better choice than hand written
code on the fly - of course learning Ruby, playing with some features
etc is another thing. I am not arguing that in this case one should not
code everything on his own. However, there are some cases when people
need a stable, working solution for something and don't want to play
around with hand coded regexps against crappy HTML. In this case, IMHO,
using a framework is absolutely OK.

Cheers,
Peter

···

__
http://www.rubyrailways.com


Henry_Maddocks · 18 December 2006 06:16

Henry Maddocks wrote:

Sorry, try again...

Not sure where to send this, sorry if it's not the right place...

The html in the attached file renders 'correctly' in the 3 browsers I
have tried but it tricks hpricot because of the second malformed
comment. When I say correctly I mean I get to see 'Some text'. I
guess it could be argued that this is incorrect. For my application
it would be nice if hpricot behaved like a browser.

Paul,

before I address your response directly I will say that I am aware of your crusade against html parsing libraries and while I believe you are entitled to your opinion, I disagree with it. I have done enough of this sort of thing to know that, for me, the level of abstraction that these libraries give is beneficial in both development time and maintenance. I am neither an html nuby, nor a ruby nuby. I am also aware that my needs may not match those of someone else so I'm not going to ram my opinions down their throat every time they ask for a little help.

You have created a new thread, and you have not attached any prior text.
This requires us to start over.

As this is the first time I have posted on this subject, that much is obvious. Unless I am missing something.

Tell us what you hoped would happen, what happened instead, and how they
differ.

Run the script and that too will be obvious.

If your goal is to filter particular content from HTML pages, just say so,
and be specific about what you want and don't want. Given this information,
I will show you how to extract the desired content with a few lines of
Ruby, no fuss, no undue complexity, no Hpricot.

My goal is to highlight an issue I found with a particular library and provide some sample code that shows the problem with the minimum amount of code. I posted it here so that there may be some discussion with interested people as to the desired behaviour.

IIRC, you had asked for help using Hpricot to extract text between <p> and
</p> tag pairs, but with the added requirement that there be an IMG tag
within the <p> ... </p> tag pair to validate the case. Is this still the
goal? If so, how did my previously posted, simple solution work out for
you?

What IMG tag? There isn't one in the sample code. What previous solution? You do not recall correctly.

This is a scene in a much larger play, one in which someone says, "Wow, I
had no idea there was such a powerful library, so carefully designed, so
complete. But, notwithstanding its extraordinary features, notwithstanding
the hundreds of man-hours expended creating it ... I can't get it to do
what I want."

The incident that prompted my post went thus...
I had a page that seemed to render fine in a browser but when parsing it my code failed. I inspected the html and found a malformed comment to be the problem. Probably put there to stop screen scraping. I wrote a bit of code, using regexps no less, that removed the offending comment and hpricot then went on its merry way. Job done.
I thought others may be interested so I posted some sample code. I am now regretting that decision.
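
For anyone wanting to try the same kind of workaround, here is a minimal sketch of a regexp-based cleanup step run before parsing. The attached test file is not reproduced in this thread, so the shape of the malformed comment is an assumption: the sketch treats anything that starts with "<!-" and runs to the next ">" as a broken comment. The names strip_comments, WELL_FORMED_COMMENT and DANGLING_COMMENT are hypothetical and are not part of Hpricot.

```ruby
# Hypothetical pre-parse cleanup. The patterns are assumptions; the actual
# malformed comment from the attached test file is not shown in the thread.
WELL_FORMED_COMMENT = /<!--.*?-->/m # ordinary <!-- ... --> comments, may span lines
DANGLING_COMMENT    = /<!-[^>]*>/   # assumed malformed shape: "<!-" up to the next ">"

# Returns a copy of html with both kinds of comment removed.
# Limitation: a malformed comment that begins with "<!--" and is followed
# later by a real "-->" would be swallowed by the first pattern, together
# with everything in between.
def strip_comments(html)
  html.gsub(WELL_FORMED_COMMENT, '').gsub(DANGLING_COMMENT, '')
end

sample  = "<p><!-- fine --></p><!- broken -><p>Some text</p>"
cleaned = strip_comments(sample)
puts cleaned # prints "<p></p><p>Some text</p>"
```

The cleaned string can then be handed to the parser as usual, for example with Hpricot(cleaned). A declaration such as <!DOCTYPE html> is left alone because it does not start with "<!-".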

This is a very common refrain. I think I can solve your problem with a few
lines of Ruby code, code that you can easily understand and adapt to
specific and evolving requirements. And if I cannot do this, I will say so.

I could too, but I don't care.

--
Paul Lutus

Thanks for hijacking my thread. Thanks for nothing.

···

On 17/12/2006, at 11:15 PM, Paul Lutus wrote:

Henry_Maddocks · 18 December 2006 09:03

It's not a big deal. Like I said, it's easy to work around. Just thought you'd like to know.

···

On 18/12/2006, at 7:16 PM, _why wrote:

On Sun, Dec 17, 2006 at 06:55:48PM +0900, Henry Maddocks wrote:

The html in the attached file renders 'correctly' in the 3 browsers I
have tried but it tricks hpricot because of the second malformed
comment.

Great stuff! Thank you. This is going to be a fun one to work on, so I'll get
back to you when I've got the medicine.

Paul_Lutus · 17 December 2006 19:35

Peter Szinek wrote:

/ ...

Btw. ever heard of 'reinventing the wheel'?

I don't generally reinvent the wheel until the existing wheel breaks. This
is one of those cases.

Also your (otherwise great) code snippets always assume that the
underlying HTML is well formed, and x and y and z - which is in real
life almost never the case.

Yes, true, my code is typically quite fragile and can only handle
essentially perfect HTML, and I generally offer that exact warning.
Ironically, though, in this case, my naive solution parsed the HTML that
caused Hpricot to fail.

Of course the posters here are not pasting
200K of HTML against which they run their production code, but a few
lines of example which is usually an oversimplification of the problem.

Almost always. But in this case Hpricot failed on the provided short
example, with a single deviant tag syntax.

This is another point where such libraries are great: they handle 844747
special cases (if your case is not among them, see the current-2nd
paragraph, or add it there on your own) which is always a problematic
thing in case of hand written stuff.

Absolutely. I don't generally post my offer of a few lines of code unless
and until a library has failed. In this case, it failed.

I could state here 100 other points which would prove that in
production, libraries are almost always a better choice than hand written
code on the fly -

Yes, unfortunately none of them would successfully answer this OP's call
from the real world. Libraries are the obvious solution to this kind of
task. They have everything going for them, up to, but not including, the
moment when they fail to meet the user's requirements.

I have to say that I see a lot of posts that follow this pattern. The
library seems to be able to solve any number of difficult problems except
the specific problem the user happens to be facing.

And the simple solution I typically offer is not meant to, and cannot, stand in
for the 2^32 special cases that have been laboriously programmed into the
library. It can only meet an overlooked special need that the library
cannot. It's surprising to me how often this happens.

···

--
Paul Lutus
http://www.arachnoid.com

Henry_Maddocks · 18 December 2006 06:16

Maybe I'm going mad but there is no img tag in the sample code. I am not interested in extracting anything. I know how to do that. I am trying to highlight a problem I discovered in hpricot.

···

On 17/12/2006, at 11:51 PM, Peter Szinek wrote:

Given this information,
I will show you how to extract the desired content with a few lines of
Ruby, no fuss, no undue complexity, no Hpricot.

Why should it be complicated? What fuss? Who needs a few lines? With the
current version of Hpricot this is exactly one line:

doc//p[img]//text()

Chris_Carter · 18 December 2006 13:01

Henry, there was someone just a few days ago who had a problem using
Hpricot with IMG elements in P tags. Paul must have gotten you two
confused.

···

On 12/18/06, Henry Maddocks <henryj@paradise.net.nz> wrote:


--
Chris Carter
concentrationstudios.com
brynmawrcs.com

Henry_Maddocks · 18 December 2006 06:17

Yes, unfortunately none of them would successfully answer this OP's call
from the real world. Libraries are the obvious solution to this kind of
task. They have everything going for them, up to, but not including, the
moment when they fail to meet the user's requirements.

I have to say that I see a lot of posts that follow this pattern. The
library seems to be able to solve any number of difficult problems except
the specific problem the user happens to be facing.

Every solution works up until the point that it doesn't. If life wasn't like that we wouldn't have much to do.

And the simple solution I typically offer is not meant to, and cannot, stand in
for the 2^32 special cases that have been laboriously programmed into the
library. It can only meet an overlooked special need that the library
cannot. It's surprising to me how often this happens.

Which is why I posted my test case. To knock one more special case off the list.

···

On 18/12/2006, at 8:35 AM, Paul Lutus wrote:
