Yet another Hpricot question

Rick_DeNatale1 · 11 October 2006 17:13

I'm trying to scan an html file using Hpricot to produce a table of
links within the file.

Right now I've got something like this.

   require 'hpricot'
   require 'open-uri'

   doc = Hpricot(open(url))
   doc.search('a').each do |element|
     puts "#{element.inner_html}"
     puts "  #{element.attributes['href']}"
   end

This works, but in this document some of the a tags use markup on
their contents. Something like
<a href="http://blah.org/blah.htm">blah blah blah</a>

I'd like to strip out the markup tags so that I'd get

blah blah blah
http://blah.org/blah.htm

Is there some way to search for or iterate over the leaf elements of
the tree rooted by an element in Hpricot?

···

--
Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/

Aaron_Patterson2 · 11 October 2006 17:28

I had to do something similar in Mechanize, and this is what I came up
with:

class Hpricot::Elem
  def all_text
    text = ''
    children.each do |child|
      if child.respond_to? :content
        text << child.content
      end
      if child.respond_to? :all_text
        text << child.all_text
      end
    end
    text
  end
end

doc = Hpricot("<a href=\"http://blah.org/blah.htm\">blah blah blah</a>")
doc.search('a').each do |e|
  puts "#{e.all_text}"
  puts " #{e.attributes['href']}"
end
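
To see how that recursion behaves without Hpricot installed, here is a minimal sketch on stand-in node classes. TextNode and ElemNode are hypothetical names invented for this illustration, not Hpricot classes, and the nested <b> tag in the sample is an assumption; only the content / children / all_text shape mirrors the code above.

```ruby
# Stand-in node types (hypothetical; not part of Hpricot).
TextNode = Struct.new(:content)

ElemNode = Struct.new(:name, :children) do
  # Concatenate the text of every descendant, depth first.
  def all_text
    children.each_with_object(+'') do |child, text|
      text << child.content  if child.respond_to?(:content)   # text leaf
      text << child.all_text if child.respond_to?(:all_text)  # nested element
    end
  end
end

# Models <a>blah <b>blah</b> blah</a> (the <b> is assumed for illustration).
LINK = ElemNode.new('a', [
  TextNode.new('blah '),
  ElemNode.new('b', [TextNode.new('blah')]),
  TextNode.new(' blah')
])

puts LINK.all_text
```

Text leaves answer to content and elements answer to all_text, so each child takes exactly one of the two branches and the markup drops out.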

Hope that helps!

--Aaron

--
Aaron Patterson
http://tenderlovemaking.com/

Gregory_Seidman · 11 October 2006 17:49

There is a simpler implementation of all_text:

class Hpricot::Elem
  def all_text
    text = ''
    traverse_text { |t| text << t.content }
    text
  end
end
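
For anyone who hasn't met traverse_text: the idea is that it yields each text node beneath an element in document order. Below is a rough sketch of that idea on stand-in classes. TextLeaf and Branch are hypothetical names, and this is not Hpricot's own implementation, just the same shape.

```ruby
# Stand-in node types (hypothetical; not Hpricot's own classes).
TextLeaf = Struct.new(:content)

Branch = Struct.new(:name, :children) do
  # Yield every text leaf below this node, in document order.
  def traverse_text(&block)
    children.each do |child|
      if child.is_a?(TextLeaf)
        block.call(child)
      else
        child.traverse_text(&block)
      end
    end
  end

  # Same shape as the all_text shown above.
  def all_text
    text = +''
    traverse_text { |t| text << t.content }
    text
  end
end

# Models <a><em>blah</em> blah <b>blah</b></a> (tags assumed for illustration).
ANCHOR = Branch.new('a', [
  Branch.new('em', [TextLeaf.new('blah')]),
  TextLeaf.new(' blah '),
  Branch.new('b', [TextLeaf.new('blah')])
])

puts ANCHOR.all_text
```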

--Greg

Rick_DeNatale1 · 12 October 2006 13:29

Thanks Aaron and Greg, works a treat!

--
Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/

why_the_lucky_stiff1 · 12 October 2006 17:38

If three of you have independently used this, let's check it in. Elements#text,
which parallels Elements#html (I think jQuery also has this), and Elem#inner_text
as well.

_why

Gregory_Seidman · 12 October 2006 22:52

Actually, I haven't used it. I just knew how to go about it. What I would
find much more useful is an #inject_text (and a corresponding
#inject_elements, though that isn't as important since the / notation
retrieves an Enumerable).
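
A sketch of what such an #inject_text might look like, written against stand-in classes so it runs without Hpricot. TextItem and TreeNode are hypothetical names, and the seed-plus-block signature is only a guess at the intended semantics, modelled on Enumerable#inject; nothing here is an existing Hpricot method.

```ruby
# Stand-in node types (hypothetical; not Hpricot's own classes).
TextItem = Struct.new(:content)

TreeNode = Struct.new(:name, :children) do
  # Yield every text item below this node, in document order.
  def traverse_text(&block)
    children.each do |child|
      child.is_a?(TextItem) ? block.call(child) : child.traverse_text(&block)
    end
  end

  # Fold over the text items, like Enumerable#inject with an explicit seed.
  def inject_text(initial)
    acc = initial
    traverse_text { |t| acc = yield(acc, t) }
    acc
  end
end

# Models <a>blah <b>blah</b> blah</a> (the <b> is assumed for illustration).
TREE = TreeNode.new('a', [
  TextItem.new('blah '),
  TreeNode.new('b', [TextItem.new('blah')]),
  TextItem.new(' blah')
])

joined = TREE.inject_text(+'') { |text, t| text << t.content }
count  = TREE.inject_text(0)   { |n, _t| n + 1 }

puts joined
puts count
```

With a fold like this, all_text becomes a one-liner, and other summaries (counts, lengths, filtered joins) need no accumulator variable outside the block.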

--Greg
