Need help parsing HTML with Hpricot

I'm having trouble understanding Hpricot (thanks to an abominable lack
of documentation). I'm trying to parse HTML of the following nature:

This is one line of text<br />
This is another line of text<br />
It keeps going on like this<br />
<br />
Until a new paragraph is started<br />
Otherwise, it's just more of the same<br />

    I know, it looks simple but, frankly, I have no clue how to parse this
with Hpricot. Particularly, I don't know how to single out the lines of
text in between the "br" tags. This is important 'cause I need to know
where the line breaks are in the text, as well as the new paragraphs.
    Does anyone know how to do this with Hpricot?
    Thank you...

You can try each_child.

I will use each_child_with_index to show you what I mean:

Put your raw HTML text into @text

@parsed_html = Hpricot(@text)
@parsed_html.each_child_with_index do |c,i|
  puts "Line #{i}: #{c.to_s.strip}"
end

Produces:

Line 0: This is one line of text
Line 1: <br />
Line 2: This is another line of text
Line 3: <br />
Line 4: It keeps going on like this
Line 5: <br />
Line 6:
Line 7: <br />
Line 8: Until a new paragraph is started
Line 9: <br />
Line 10: Otherwise, it's just more of the same
Line 11: <br />
Line 12:
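
Reading the listing above: text nodes and `<br />` elements alternate, and an empty text node (Line 6) sits between two consecutive breaks, which is where the new paragraph starts. Here is a minimal sketch of that reading in plain Ruby, with no Hpricot dependency. The NODES array is copied by hand from the output above (with Hpricot you would collect c.to_s.strip inside the each_child block), and classify is a hypothetical helper name, not part of Hpricot.

```ruby
# Stringified child nodes, copied by hand from the listing above.
# With Hpricot these would come from c.to_s.strip inside each_child.
NODES = [
  "This is one line of text", "<br />",
  "This is another line of text", "<br />",
  "It keeps going on like this", "<br />",
  "", "<br />",
  "Until a new paragraph is started", "<br />",
  "Otherwise, it's just more of the same", "<br />",
  ""
]

# Hypothetical helper: label one stringified node.
def classify(node)
  case node
  when %r{\A<br\s*/?>\z}i then :line_break  # a <br> element
  when ""                 then :blank       # whitespace-only text node
  else                         :text        # an actual line of text
  end
end

# One label per node, in document order.
labels = NODES.map { |n| classify(n) }

# Only the lines of text, with the <br /> and blank nodes dropped.
text_lines = NODES.select { |n| classify(n) == :text }

labels.each_with_index { |label, i| puts "Node #{i}: #{label}" }
```

A :blank node that is followed by another :line_break is the paragraph boundary; the :blank at the very end is just the trailing newline.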

Hope that helps.

Mikel

···

On 10/25/07, Just Another Victim of the Ambient Morality <ihatespam@hotmail.com> wrote:

    I'm having trouble understanding Hpricot (thanks to an abominable lack
of documentation). I'm trying to parse HTML of the following nature:

This is one line of text<br />
This is another line of text<br />
It keeps going on like this<br />
<br />
Until a new paragraph is started<br />
Otherwise, it's just more of the same<br />

    I know, it looks simple but, frankly, I have no clue how to parse this
with Hpricot. Particularly, I don't know how to single out the lines of
text in between the "br" tags. This is important 'cause I need to know
where the line breaks are in the text, as well as the new paragraphs.
    Does anyone know how to do this with Hpricot?
    Thank you...

Try http://code.whytheluckystiff.net/hpricot/wiki/AnHpricotShowcase
for examples and some better documentation. It helped me a lot to
solve my problems.

···

2007/10/25, Just Another Victim of the Ambient Morality <ihatespam@hotmail.com>:

    I'm having trouble understanding Hpricot (thanks to an abominable lack
of documentation). I'm trying to parse HTML of the following nature:

Of course... you could also do:

require 'rubygems'
require 'hpricot'

text =<<HERE
This is one line of text<br />
This is another line of text<br />
It keeps going on like this<br />
<br />
Until a new paragraph is started<br />
Otherwise, it's just more of the same<br />
HERE

class String
  def not_needed?
    strip == "<br />"
  end
end

@parsed_html = Hpricot(text)
@paragraphs = Array.new
@parsed_html.each_child_with_index do |c,i|
  line = c.to_s.strip
  if line == ""
    # join explicitly: Array#to_s only concatenates the elements on Ruby 1.8
    puts "<p>#{@paragraphs.join}</p>"
    @paragraphs.clear
  else
    @paragraphs << "#{line} " unless line.not_needed?
  end
end

Which produces:

<p>This is one line of text This is another line of text It keeps
going on like this </p>
<p>Until a new paragraph is started Otherwise, it's just more of the same </p>
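
The version above joins each paragraph into a single string, so the individual line breaks are lost. If you need both, one option is to keep each paragraph as an array of lines. This is a rough sketch only: it uses String#split as a stand-in for Hpricot's traversal, so it assumes flat text separated by br tags, as in the sample, and paragraphs_from is a hypothetical helper name.

```ruby
HTML = <<HERE
This is one line of text<br />
This is another line of text<br />
It keeps going on like this<br />
<br />
Until a new paragraph is started<br />
Otherwise, it's just more of the same<br />
HERE

# Hypothetical helper: returns an array of paragraphs,
# where each paragraph is an array of its lines.
def paragraphs_from(html)
  paragraphs = [[]]
  # Split on <br> tags; each piece is the text that preceded one break.
  html.split(%r{<br\s*/?>}i).each do |piece|
    line = piece.strip
    if line.empty?
      # Nothing but whitespace before this break: start a new paragraph,
      # unless the current one is still empty.
      paragraphs << [] unless paragraphs.last.empty?
    else
      paragraphs.last << line
    end
  end
  # Drop the empty paragraph left over from the trailing newline.
  paragraphs.reject(&:empty?)
end

paragraphs_from(HTML).each do |lines|
  puts "<p>#{lines.join("<br />\n")}</p>"
end
```

Each inner array holds the lines of one paragraph, so the line breaks and the paragraph breaks can both be recovered afterwards.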

Now... don't pick on my favorite HTML parser again! :D Just ask nicely :)

Mikel