Need help parsing HTML with Hpricot

Just_Another_Victim1 · 25 October 2007 07:00

I'm having trouble understanding Hpricot (thanks to an abominable lack
of documentation). I'm trying to parse HTML of the following nature:

This is one line of text<br />
This is another line of text<br />
It keeps going on like this<br />
<br />
Until a new paragraph is started<br />
Otherwise, it's just more of the same<br />

    I know, it looks simple but, frankly, I have no clue how to parse this
with Hpricot. Particularly, I don't know how to single out the lines of
text in between the "br" tags. This is important 'cause I need to know
where the line breaks are in the text, as well as the new paragraphs.
    Does anyone know how to do this with Hpricot?
    Thank you...

Mikel_Lindsaar1 · 25 October 2007 07:37

You can try each_child.

I will use each_child_with_index to show you what I mean:

Put your raw HTML text into @text

@parsed_html = Hpricot(@text)
@parsed_html.each_child_with_index do |c,i|
  puts "Line #{i}: #{c.to_s.strip}"
end

Produces:

Line 0: This is one line of text
Line 1: <br />
Line 2: This is another line of text
Line 3: <br />
Line 4: It keeps going on like this
Line 5: <br />
Line 6:
Line 7: <br />
Line 8: Until a new paragraph is started
Line 9: <br />
Line 10: Otherwise, it's just more of the same
Line 11: <br />
Line 12:

Hope that helps.

Mikel

Thomas_Wieczorek · 25 October 2007 07:47

Try http://code.whytheluckystiff.net/hpricot/wiki/AnHpricotShowcase
for examples and some better documentation. It helped me a lot to
solve my problems.

Mikel_Lindsaar1 · 25 October 2007 07:49

Of course... you could also do:

require 'rubygems'
require 'hpricot'

text = <<HERE
This is one line of text<br />
This is another line of text<br />
It keeps going on like this<br />
<br />
Until a new paragraph is started<br />
Otherwise, it's just more of the same<br />
HERE

class String
  # True when the string is nothing but a line-break tag.
  def not_needed?
    strip == "<br />"
  end
end

@parsed_html = Hpricot(text)
@paragraphs = Array.new
@parsed_html.each_child_with_index do |c,i|
  line = c.to_s.strip
  if line == ""
    # Whitespace-only text node: the current paragraph has ended.
    # join is explicit; "#{array}" only joins on Ruby 1.8.
    puts @paragraphs.join
    @paragraphs.clear
  else
    @paragraphs << "#{line} " unless line.not_needed?
  end
end

Which produces:

This is one line of text This is another line of text It keeps going on like this
Until a new paragraph is started Otherwise, it's just more of the same
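
If you need to keep the individual lines inside each paragraph rather than joining them, here is a minimal plain-Ruby sketch of the same grouping idea. It does not use Hpricot at all. It assumes lines are separated by <br> or <br /> tags and that two tags in a row mark a paragraph break. The method name paragraphs_from is made up for illustration.

```ruby
# Plain-Ruby sketch (no Hpricot): group text into paragraphs of lines.
# Assumption: lines are separated by <br> or <br /> tags, and two
# consecutive tags (an empty line) mark a paragraph boundary.
# paragraphs_from is a hypothetical helper name.
def paragraphs_from(html)
  paragraphs = [[]]
  html.split(%r{<br\s*/?>}i).each do |chunk|
    line = chunk.strip
    if line.empty?
      # Nothing between two tags: start a new paragraph,
      # unless the current one is still empty.
      paragraphs << [] unless paragraphs.last.empty?
    else
      paragraphs.last << line
    end
  end
  # Drop the empty paragraph left behind by a trailing tag.
  paragraphs.reject(&:empty?)
end

html = <<HERE
This is one line of text<br />
This is another line of text<br />
It keeps going on like this<br />
<br />
Until a new paragraph is started<br />
Otherwise, it's just more of the same<br />
HERE

paragraphs_from(html).each_with_index do |lines, i|
  puts "Paragraph #{i + 1}:"
  lines.each { |line| puts "  #{line}" }
end
```

Splitting on a regex like this is only reasonable for input this flat. For real-world HTML with nested tags, walking the parsed children as above is the safer route.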

Now... don't pick on my favorite HTML parser again! Just ask nicely

Mikel
