Hpricot parsing

Marc_Farber · 19 April 2009 16:12

Ruby newbie here

Have successfully used hpricot to scrape correct <div> from desired page
http://www.montgomeryadvertiser.com/section/obits using

doc = Hpricot(uri above)
...
@grab1 = doc.search("//div[@class='article-bodytext']")

target data is in following logical form

<div>
<h3>name of funeral home</h3>
<p>deceased1</p>
<div>advertising crap</div>
<h3>funeral home 2</h3>
<p>deceased 2</p>
<p>deceased 3</p>
</div>

I'm struggling to iterate thru this div, plucking an array or hash that
I can feed to a database, with each record being a funeral home and person.
I was thinking I could go thru each of the @grab1 elements and process
according to tag type, and establish the "record" logic simply by
knowing that a new record starts with each new h3 tag.

Any help for a newbie with first Ruby script?

Thx

--
Posted via http://www.ruby-forum.com/.

7stud · 19 April 2009 23:20

Marc Farber wrote:

Ruby newbie here

Have successfully used hpricot to scrape correct <div> from desired page
http://www.montgomeryadvertiser.com/section/obits using

doc = Hpricot(uri above)
...
@grab1 = doc.search("//div[@class='article-bodytext']")

target data is in following logical form

<div>
<h3>name of funeral home</h3>
<p>deceased1</p>
<div>advertising crap</div>
<h3>funeral home 2</h3>
<p>deceased 2</p>
<p>deceased 3</p>
</div>

I'm struggling to iterate thru this div..
I [want to insert a record into a table with each] record being a funeral home and person.
I was thinking I could go thru each of the @grab1 elements and process
according to tag type:

These methods seem like the ones you need:

elm.next_sibling (skips the newlines in the html)
elm.name

How about this:

require "rubygems"
require 'hpricot'

str = <<ENDOFSTRING
<div>
  <h3>name of funeral home</h3>
  <p>deceased1</p>
  <div>advertising crap</div>
  <h3>funeral home 2</h3>
  <p>deceased 2</p>
  <p>deceased 3</p>
</div>
ENDOFSTRING

doc = Hpricot(str)
h3_tags = doc.search("h3")

h3_tags.each do |h3|
  elm = h3

  while elm = elm.next_sibling
    break if elm.name != 'p'

    puts h3.inner_text
    puts "\t #{elm.inner_text}"
  end
end

--output:--
name of funeral home
         deceased1
funeral home 2
         deceased 2
funeral home 2
         deceased 3
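
Since the goal is to feed a database rather than print, the same loop can collect one record per funeral home/person pair. This is only a rough sketch: Node, build_siblings and h3_nodes are made-up stand-ins so it runs without the hpricot gem, and the :funeral_home/:deceased keys are my own naming. collect_records only calls name, inner_text and next_sibling, so I'd expect it to behave the same if handed doc.search("h3"), but I haven't checked that against a real page.

```ruby
# Node is a hypothetical stand-in for an Hpricot element. It only mimics
# the three methods the loop relies on: name, inner_text and next_sibling.
Node = Struct.new(:name, :inner_text, :next_sibling)

# Link [tag, text] pairs into a sibling chain and return the first node.
def build_siblings(pairs)
  pairs.reverse.inject(nil) do |following, (name, text)|
    Node.new(name, text, following)
  end
end

# Walk the chain and pick out the h3 nodes (stand-in for doc.search("h3")).
def h3_nodes(first)
  found = []
  node = first
  while node
    found << node if node.name == 'h3'
    node = node.next_sibling
  end
  found
end

# Same sibling walk as above, but collecting hashes instead of printing.
def collect_records(h3_tags)
  records = []
  h3_tags.each do |h3|
    funeral_home = h3.inner_text
    elm = h3
    while (elm = elm.next_sibling)
      break if elm.name != 'p'
      records << { :funeral_home => funeral_home, :deceased => elm.inner_text }
    end
  end
  records
end

SAMPLE = [
  ['h3',  'name of funeral home'],
  ['p',   'deceased1'],
  ['div', 'advertising crap'],
  ['h3',  'funeral home 2'],
  ['p',   'deceased 2'],
  ['p',   'deceased 3']
]

collect_records(h3_nodes(build_siblings(SAMPLE))).each do |record|
  puts "#{record[:funeral_home]}: #{record[:deceased]}"
end
```

Each hash in the returned array corresponds to one row to insert.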


Wang_Jian · 20 April 2009 02:04

Makes me wonder if ReXML, Hpricot or Nokogiri has a to_hash method...not yet
found.
I'd also be glad to know.


7stud · 19 April 2009 23:40

7stud -- wrote:

h3_tags.each do |h3|
  elm = h3

  while elm = elm.next_sibling
    break if elm.name != 'p'

    puts h3.inner_text
    puts "\t #{elm.inner_text}"
  end

end

To avoid having to look up the inner_text of the funeral home for each
deceased person at that funeral home, this would be more efficient:

h3_tags.each do |elm|
  funeral_home = elm.inner_text

  while elm = elm.next_sibling
    break if elm.name != 'p'

    puts funeral_home
    puts "\t #{elm.inner_text}"
  end
end
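
One thing to watch: the loop stops at the first sibling that isn't a p, so a p that comes after an advertising div but before the next h3 would be dropped. A single pass that remembers the most recent h3, which is roughly what the original post describes, doesn't have that problem. A minimal sketch in plain Ruby, assuming the div's child elements have already been turned into [tag name, text] pairs (group_by_home and the sample pairs are hypothetical):

```ruby
# Group the deceased under the most recently seen funeral home.
# children is a list of [tag_name, text] pairs in document order.
def group_by_home(children)
  homes = Hash.new { |hash, key| hash[key] = [] }
  current = nil

  children.each do |name, text|
    case name
    when 'h3'
      current = text            # a new h3 starts a new funeral home
    when 'p'
      homes[current] << text if current
    end
    # anything else (e.g. the advertising div) is skipped
  end

  homes
end

children = [
  ['h3',  'name of funeral home'],
  ['p',   'deceased1'],
  ['div', 'advertising crap'],
  ['h3',  'funeral home 2'],
  ['p',   'deceased 2'],
  ['p',   'deceased 3']
]

group_by_home(children).each do |home, names|
  names.each { |person| puts "#{home}\t#{person}" }
end
```

The result is a hash keyed by funeral home, which is easy to flatten into one row per person.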


Marc_Farber · 19 April 2009 23:50

Thanks so much 7-stud

I had been fixated on next_child thinking that next_sibling would skip
over the "p" tags. I really appreciate your thoughtfulness to provide a
working code snippet.

Marc


Phlip1 · 20 April 2009 02:25

Wang Jian wrote:

Makes me wonder if ReXML, Hpricot or Nokogiri has a to_hash method...not yet
found.

Try to write it. I hope I'm wrong, but I suspect that starting will be easy, and hitting your own target XML will be easy...

...but making it generic enough to publish will be another story!
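
To give an idea of where it gets hard, here is a naive attempt using REXML from the standard distribution. element_to_hash is a made-up helper, not something any of these libraries provide, and it assumes well-formed XML, which real-world HTML often isn't. It ignores attributes and mixed text, and it groups children by tag name, so the h3/p ordering that the funeral home problem depends on is lost.

```ruby
require 'rexml/document'

# Naive conversion: a leaf element becomes its stripped text, any other
# element becomes a hash of child tag name => array of converted children.
# Attributes, mixed content and sibling order across tags are all dropped.
def element_to_hash(element)
  return element.text.to_s.strip unless element.has_elements?

  result = Hash.new { |hash, key| hash[key] = [] }
  element.elements.each do |child|
    result[child.name] << element_to_hash(child)
  end
  result
end

xml = <<ENDOFSTRING
<div>
  <h3>name of funeral home</h3>
  <p>deceased1</p>
  <div>advertising crap</div>
  <h3>funeral home 2</h3>
  <p>deceased 2</p>
  <p>deceased 3</p>
</div>
ENDOFSTRING

doc = REXML::Document.new(xml)
p element_to_hash(doc.root)
```

All the h3 texts end up in one array and all the p texts in another, so you can no longer tell who belongs to which funeral home.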

--
Phlip
