I'm having trouble understanding Hpricot (thanks to an abominable lack
of documentation). I'm trying to parse HTML of the following nature:
This is one line of text<br />
This is another line of text<br />
It keeps going on like this<br />
<br />
Until a new paragraph is started<br />
Otherwise, it's just more of the same<br />
I know, it looks simple but, frankly, I have no clue how to parse this
with Hpricot. Particularly, I don't know how to single out the lines of
text in between the "br" tags. This is important 'cause I need to know
where the line breaks are in the text, as well as the new paragraphs.
Does anyone know how to do this with Hpricot?
Thank you...
You can try each_child.
I will use each_child_with_index to show you what I mean:
Put your raw HTML text into @text, then:
require 'rubygems'
require 'hpricot'

@parsed_html = Hpricot(@text)
@parsed_html.each_child_with_index do |c, i|
  puts "Line #{i}: #{c.to_s.strip}"
end
Produces:
Line 0: This is one line of text
Line 1: <br />
Line 2: This is another line of text
Line 3: <br />
Line 4: It keeps going on like this
Line 5: <br />
Line 6:
Line 7: <br />
Line 8: Until a new paragraph is started
Line 9: <br />
Line 10: Otherwise, it's just more of the same
Line 11: <br />
Line 12:
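The empty entries (Line 6 and Line 12) are the whitespace-only text nodes: one sits between the two consecutive br tags, the other after the last tag. A rough sketch of how each entry could be told apart, using plain strings copied from the output above instead of a live Hpricot document. The classify helper and the NODES constant are made-up names for illustration, not part of Hpricot, and the sketch assumes each child stringifies exactly as shown:

```ruby
# Stringified children, copied from the output above. This assumes
# c.to_s.strip returns exactly these strings for each child node.
NODES = [
  "This is one line of text", "<br />",
  "This is another line of text", "<br />",
  "It keeps going on like this", "<br />",
  "", "<br />",
  "Until a new paragraph is started", "<br />",
  "Otherwise, it's just more of the same", "<br />",
  ""
]

# Hypothetical helper (not an Hpricot method): labels one stripped node.
def classify(node)
  case node
  when ""                 then :blank  # whitespace-only text node
  when %r{\A<br\s*/?>\z}i then :break  # a line-break tag
  else                         :text   # an actual line of text
  end
end

NODES.each_with_index do |node, i|
  puts "Line #{i}: #{classify(node)}"
end
```

A :blank entry that falls between two :break entries is where a new paragraph starts; every other :break is an ordinary line break.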
Hope that helps.
Mikel
···
On 10/25/07, Just Another Victim of the Ambient Morality <ihatespam@hotmail.com> wrote:
I'm having trouble understanding Hpricot (thanks to an abominable lack
of documentation). I'm trying to parse HTML of the following nature:
Try http://code.whytheluckystiff.net/hpricot/wiki/AnHpricotShowcase
for examples and some better documentation. It helped me a lot in
solving my own problems.
···
2007/10/25, Just Another Victim of the Ambient Morality <ihatespam@hotmail.com>:
I'm having trouble understanding Hpricot (thanks to an abominable lack
of documentation). I'm trying to parse HTML of the following nature:
Of course... you could also do:
require 'rubygems'
require 'hpricot'

text = <<HERE
This is one line of text<br />
This is another line of text<br />
It keeps going on like this<br />
<br />
Until a new paragraph is started<br />
Otherwise, it's just more of the same<br />
HERE

class String
  # True for a bare line-break tag, which carries no text of its own.
  def not_needed?
    strip == "<br />"
  end
end

@parsed_html = Hpricot(text)
@paragraphs = Array.new
@parsed_html.each_child_with_index do |c, i|
  line = c.to_s.strip
  if line == ""
    # A whitespace-only node sits between two consecutive <br /> tags
    # (and after the last one), so it marks the end of a paragraph.
    # join is explicit because Array#to_s stopped concatenating in Ruby 1.9.
    puts "<p>#{@paragraphs.join}</p>"
    @paragraphs.clear
  else
    @paragraphs << "#{line} " unless line.not_needed?
  end
end
Which produces:
<p>This is one line of text This is another line of text It keeps going on like this </p>
<p>Until a new paragraph is started Otherwise, it's just more of the same </p>
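Since the original question asks for the line breaks as well as the paragraphs, here is a rough variation that keeps each line separate inside its paragraph. It is only a sketch: paragraphs_from is a made-up helper, not an Hpricot method, and it works on an array of plain strings on the assumption that each child's to_s looks like the output shown earlier in the thread. With Hpricot you would presumably fill that array from the each_child loop.

```ruby
# Hypothetical helper (not part of Hpricot): groups stringified child
# nodes into paragraphs, keeping every line as its own array element.
def paragraphs_from(nodes)
  paragraphs = [[]]
  nodes.each do |node|
    line = node.strip
    if line.empty?
      # A whitespace-only node ends the current paragraph.
      paragraphs << [] unless paragraphs.last.empty?
    elsif line !~ %r{\A<br\s*/?>\z}i
      # Anything that is not a bare <br /> tag is a line of text.
      paragraphs.last << line
    end
  end
  paragraphs.reject(&:empty?)
end

# Sample input, assumed to match what c.to_s returns for each child.
nodes = [
  "This is one line of text", "<br />",
  "This is another line of text", "<br />",
  "It keeps going on like this", "<br />",
  "", "<br />",
  "Until a new paragraph is started", "<br />",
  "Otherwise, it's just more of the same", "<br />",
  ""
]

paragraphs_from(nodes).each do |lines|
  puts "<p>#{lines.join('<br />')}</p>"
end
```

Collecting into nested arrays also means the last paragraph is kept even when the input does not end with a trailing whitespace node.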
Now... don't pick on my favorite HTML parser again!
Just ask nicely :)
Mikel