On 17/12/2006, at 11:15 PM, Paul Lutus wrote:
> Henry Maddocks wrote:
>
>> Sorry, try again...
>>
>> Not sure where to send this, sorry if it's not the right place...
>>
>> The html in the attached file renders 'correctly' in the 3 browsers I
>> have tried but it tricks hpricot because of the second malformed
>> comment. When I say correctly I mean I get to see 'Some text'. I
>> guess it could be argued that this is incorrect. For my application
>> it would be nice if hpricot behaved like a browser.
Paul,
before I address your response directly, I will say that I am aware of
your crusade against html parsing libraries, and while I believe you
are entitled to your opinion, I disagree with it. I have done enough
of this sort of thing to know that, for me, the level of abstraction
that these libraries give is beneficial in both development time and
maintenance. I am neither an html nuby nor a ruby nuby. I am also
aware that my needs may not match those of someone else, so I'm not
going to ram my opinions down their throat every time they ask for a
little help.
> You have created a new thread, and you have not attached any prior
> text. This requires us to start over.
As this is the first time I have posted on this subject, that much is
obvious. Unless I am missing something.
> Tell us what you hoped would happen, what happened instead, and how
> they differ.
Run the script and that too will be obvious.
> If your goal is to filter particular content from HTML pages, just
> say so, and be specific about what you want and don't want. Given
> this information, I will show you how to extract the desired content
> with a few lines of Ruby, no fuss, no undue complexity, no Hpricot.
My goal is to highlight an issue I found with a particular library
and provide some sample code that shows the problem with the minimum
amount of code. I posted it here so that there may be some discussion
with interested people as to the desired behaviour.
> IIRC, you had asked for help using Hpricot to extract text between
> <p> and </p> tag pairs, but with the added requirement that there be
> an IMG tag within the <p> ... </p> tag pair to validate the case. Is
> this still the goal? If so, how did my previously posted, simple
> solution work out for you?
What IMG tag? There isn't one in the sample code. What previous
solution? You do not recall correctly.
> This is a scene in a much larger play, one in which someone says,
> "Wow, I had no idea there was such a powerful library, so carefully
> designed, so complete. But, notwithstanding its extraordinary
> features, notwithstanding the hundreds of man-hours expended creating
> it ... I can't get it to do what I want."
The incident that prompted my post went thus...
I had a page that seemed to render fine in a browser, but when I
parsed it my code failed. I inspected the html and found a malformed
comment to be the problem. Probably put there to stop screen scraping.
I wrote a bit of code, using regexps no less, that removed the
offending comment and hpricot then went on its merry way. Job done.
I thought others may be interested so I posted some sample code. I am
now regretting that decision.
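
A minimal sketch of that kind of regexp pre-filter, for illustration
only. The attached file is not reproduced in this message, so the
actual form of the malformed comment is unknown; the sketch assumes a
comment that closes with `--!>` or `-- >` instead of `-->`. The name
strip_comments and the pattern are hypothetical, not the code used for
the original page.

```ruby
# Hypothetical pre-filter: removes HTML comments before the markup is
# handed to a parser. The pattern is an assumption about what
# "malformed" means here: it accepts "-->", "--!>" and "-- >" as the
# end of a comment.
COMMENT = /<!--.*?--\s*!?>/m

def strip_comments(html)
  # The /m flag lets "." match newlines, so multi-line comments go too.
  html.gsub(COMMENT, "")
end

html = "<p><!-- ok --><!-- broken --!>Some text</p>"
cleaned = strip_comments(html)
puts cleaned
```

The cleaned string can then be passed to the parser in place of the
raw page. A comment broken in some other way would need a different
pattern.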
> This is a very common refrain. I think I can solve your problem with
> a few lines of Ruby code, code that you can easily understand and
> adapt to specific and evolving requirements. And if I cannot do this,
> I will say so.
I could too, but I don't care.
> --
> Paul Lutus
Thanks for hijacking my thread. Thanks for nothing.