Nokogiri help parsing HTML

7stud2 · 26 March 2013 22:40

I'm relatively new to Ruby (and therefore Nokogiri) and am trying to
parse some HTML that will ultimately be written to a MySQL database. In
the interim, I'm writing it to a text file for troubleshooting purposes.

Here's the relevant piece of the HTML I'd like to parse:

<div class="mail">
<address class="headers">
<span id="from">
<dfn>From</dfn>: Paul David Mena <<a
href="mailto:pauldavidmena_at_gmail.com?Subject=Re:%20twilight">pauldavidmena_at_gmail.com</a>>
</span><br />
<span id="date"><dfn>Date</dfn>: Tue, 26 Mar 2013 18:13:21
-0400</span><br />
</address>
<p>
Line 1
<br />
Line 2
<br />
Line 3
<br />
<p><pre>

--
Paul David Mena
--------------------
pauldavidmena_at_gmail.com
</pre>
<span id="received"><dfn>Received on</dfn> Tue Mar 26 2013 - 22:13:23
EDT</span>
</div>

My goal is to strip out the "address" and "pre" elements and to output
only the text between them:

Line 1

Line 2

Line 3

My code, however, is stripping out one or the other, depending upon
where I place the definition. Here is the code:

#!/usr/bin/env ruby

require "nokogiri"

class PlainTextExtractor < Nokogiri::XML::SAX::Document
  attr_reader :plaintext
  # Initialize the state of interest variable with false
  def initialize
    @interesting = false
    @pre = false
    @address = false
    @plaintext = ""
  end

  def start_element(name, attrs = [])
    if name == "address"
      @address = true
    end
  end

  def end_element(name, attrs = [])
    if name == "address"
      @address = false
    end
  end

  def start_element(name, attrs = [])
    if name == "pre"
      @pre = true
    end
  end

  def end_element(name, attrs = [])
    if name == "pre"
      @pre = false
    end
  end

  # This method is called whenever a comment occurs and
  # the comments text is passed in as string.
  def comment(string)
    case string.strip # strip leading and trailing whitespaces
      when /^body="start"/ # match starting comment
        @interesting = true
      when /^body="end"/
        @interesting = false # match closing comment
    end
  end

  # This callback method is called with any string between
  # a tag.
  def characters(string)
    if @interesting and not @pre
      if @interesting and not @address
        @plaintext << string
      end
    end
  end
end

fname = ARGV[0]
start_column = 4
end_column = 6

target_range = (start_column-1)..(end_column-1)
IO.foreach(fname) do |line|
  if line.match(/<dfn>Date<\/dfn>/)
    pieces = line.split(" ")

    @date_string = pieces[target_range].join("-")
    # puts @date_string
  end
end

pte = PlainTextExtractor.new
parser = Nokogiri::HTML::SAX::Parser.new(pte)
parser.parse_file ARGV[0]

# puts pte.plaintext

begin
  file = File.open("snippet.txt", "w")
  file.write(@date_string)
  file.write(pte.plaintext)
rescue IOError => e
  #some error occur, dir not writable etc.
ensure
  file.close unless file == nil
end

--
Posted via http://www.ruby-forum.com/.

Jesus_Gabriel_y_Gala · 26 March 2013 23:05

OK, so you want every node that is a sibling of address and pre and lies
between the two. I found this Stack Overflow answer:

html - XPath Expression: Select elements between A HREF="expr" tags - Stack Overflow

which, applied to your problem, gives:

1.9.2p290 :001 > require 'nokogiri'
=> true
1.9.2p290 :002 > s = <<END
1.9.2p290 :003"> 
1.9.2p290 :004"> <div class="mail">
[...snip...]
1.9.2p290 :031 > doc = Nokogiri::HTML(s)
1.9.2p290 :039 > doc.xpath("//address/following-sibling::node()[count(.|//pre/preceding-sibling::node())=count(//pre/preceding-sibling::node())]")
=> [#<Nokogiri::XML::Text:0xdb0114 "\n">,
#<Nokogiri::XML::Element:0xdaff84 name="p"
children=[#<Nokogiri::XML::Text:0xdafc28 "\nLine 1\n">,
#<Nokogiri::XML::Element:0xdafa5c name="br">,
#<Nokogiri::XML::Text:0xdaf728 "\nLine 2\n">,
#<Nokogiri::XML::Element:0xdaf5ac name="br">,
#<Nokogiri::XML::Text:0xdaf228 "\nLine 3\n">,
#<Nokogiri::XML::Element:0xdaf0ac name="br">]>,
#<Nokogiri::XML::Element:0xdaea80 name="p">]

This returns a node set that contains the required nodes.

Hope this helps,

Jesus.


7stud2 · 27 March 2013 14:57

Thanks to all for the help. The following seems to do most of what I
want:

dom.xpath('//address/following-sibling::p//text()').each {|n| p n}

My next task is to capture only the desired text, and to write it to a
file. Specifically:

Line 1\n
Line 2\n
Line 3\n

The above code writes the following to standard out:

#<Nokogiri::XML::Text:0xba8f9c "\nLine 1\n">
#<Nokogiri::XML::Text:0xba8eac "\nLine 2\n">
#<Nokogiri::XML::Text:0xba8dd0 "\nLine 3\n">

So close!


7stud2 · 27 March 2013 15:18

Isn't that as simple as "p n.text"?


7stud2 · 27 March 2013 16:44

I should probably include the whole revised program for context. The
argument is the path to an HTML file that contains the relevant text between
the body="start" and body="end" comment markers.

#!/usr/bin/env ruby

# some initializations

@interesting = false
@my_text = ""

# read the file between the two "body" tags and stash in "my_text"

fname = ARGV[0]

IO.foreach(fname) do |line|
  if line.match(/body="start"/)
    @interesting = true
  end

  # meanwhile let's grab the date string and process it

  start_column = 4
  end_column = 6

  target_range = (start_column-1)..(end_column-1)

  if line.match(/<dfn>Date<\/dfn>/)
    pieces = line.split(" ")
    @date_string = pieces[target_range].join("-")
  end

  if line.match(/body="end"/)
    @interesting = false
  end

  if @interesting
    @my_text << line
  end
end
# puts @my_text

require "nokogiri"

doc = Nokogiri::HTML(@my_text)
doc.xpath('//address/following-sibling::p//text()').each {|n| p n}

# puts doc

begin
  file = File.open("snippet.txt", "w")
  file.write(@date_string)
  file.write(doc)
rescue IOError => e
  #some error occur, dir not writable etc.
ensure
  file.close unless file == nil
end


7stud2 · 27 March 2013 16:56

How about this?

doc = Nokogiri::HTML(@my_text)

output = @date_string + $/
doc.xpath('//address/following-sibling::p//text()').each { |line| output << ( line.text.strip + $/ ) }

File.write("snippet.txt", output)
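
In case it helps, File.write opens, writes and closes the file in one call, so it stands in for the whole begin/rescue/ensure block in the earlier script. When the output has to be written in several steps, the block form of File.open closes the file for you as well. A rough sketch, using a temporary directory and made-up content so it can be run anywhere:

```ruby
require "tmpdir"

output = "26-Mar-2013\nLine 1\nLine 2\n" # made-up sample content

contents = Dir.mktmpdir do |dir|
  path = File.join(dir, "snippet.txt")

  # One call: open, write, close. Returns the number of bytes written.
  File.write(path, output)

  # Block form: the file is closed when the block exits, even if it raises.
  File.open(path, "a") { |file| file.write("Line 3\n") }

  File.read(path)
end

puts contents
```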


7stud2 · 27 March 2013 17:55

Just out of curiosity I had a go at writing this myself, apart from
that complicated XPath, because I don't really understand XPath yet.

This is what I came up with:

require 'nokogiri'
doc = Nokogiri::HTML File.read(ARGV[0])
output = doc.css('span[@id="date"]').first.text[/\d+ \w+ \d+/].gsub(' ', '-') + $/
path = '//address/following-sibling::p//text()'
doc.xpath(path).each { |line| output << line.text.strip << $/ }
File.write("snippet.txt", output)


7stud2 · 27 March 2013 18:05

This actually worked perfectly:

require "nokogiri"

doc = Nokogiri::HTML(@my_text)
output = @date_string + $/
doc.xpath('//address/following-sibling::p//text()').each { |line| output << ( line.text.strip + $/ ) }

File.write("snippet.txt", output)


7stud2 · 28 March 2013 08:35

You still have @my_text and @date_string there, which leads me to
suspect that's only the last part of your script.
The example I gave is the entire script...


7stud2 · 29 March 2013 09:28

Since you're using Nokogiri anyway, you'd be better off using that for
the whole process rather than looping through the HTML "manually". This
is the sort of thing it's worth getting in the habit of: using the tools
available to their fullest potential.
You can do the whole thing in 5 lines (barring error-checking) as I
demonstrated earlier.


7stud2 · 29 March 2013 16:39

That makes quite a difference! Here's what I have after ripping out the
old logic and extracting the date using Nokogiri:

#!/usr/bin/env ruby

require "nokogiri"

# get the date

doc = Nokogiri::HTML File.read(ARGV[0])
output = doc.css('span[@id="date"]').first.text[/\d+ \w+ \d+/].gsub(' ', '-') + $/

# get the remaining text

path = '//address/following-sibling::p//text()'
doc.xpath(path).each { |line| output << line.text.strip << $/ }

File.write("snippet.txt", output)


7stud2 · 29 March 2013 16:50

Looks good. Ruby's pretty amazing when it comes to finding simple ways
to do complex things.
Are there any parts of that code you need clarifying? It'll help to
understand all the methods used here so you can write your own more
easily in future.
For example, you can test regular expressions here:
http://www.rubular.com/
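
As one example of the regular expression part, String#[] with a regex returns the first match (or nil if there is none), which is how the date gets pulled out of the span text. A small sketch using the date line from the sample HTML:

```ruby
# Text of the date span, as it appears in the sample HTML.
date_text = "Date: Tue, 26 Mar 2013 18:13:21 -0400"

# \d+ = one or more digits, \w+ = one or more word characters.
# The first place "digits, space, word, space, digits" occurs is "26 Mar 2013".
date = date_text[/\d+ \w+ \d+/]

# Swap the spaces for dashes.
dashed = date.gsub(" ", "-")

# No match gives nil rather than an error.
missing = "no date here"[/\d+ \w+ \d+/]

puts date   # 26 Mar 2013
puts dashed # 26-Mar-2013
p missing   # nil
```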


7stud2 · 29 March 2013 16:55

It really helped to run the code in IRB to see what Nokogiri was doing.
It will take me a little bit longer to wrap my mind around how Ruby does
regular expressions, but it certainly seems worth the effort.

Thanks for the link, and for all of the help!


7stud2 · 29 March 2013 22:49

I do have a follow-up question, if you don't mind. I can see how the
"address" tag is stripped, but not the "pre" tag. Amazing how much
heavy lifting is accomplished with a simple (or, at least to me, not so
simple) line of code.


7stud2 · 29 March 2013 23:02

I'm not sure what you mean by stripping the tags. Firstly, the XPath
selects the <p> elements after <address>, which doesn't include <pre> in your
example. Secondly, if you ask Nokogiri for the "text", it won't include
any HTML tags.


Robert_K1 · 27 March 2013 08:25

Your version also outputs <p> tags, doesn't it? A modified version of yours:

irb(main):036:0> dom.xpath('//address/following-sibling::*//text()').each {|n| p n}
#<Nokogiri::XML::Text:0x..fc01de63e "\nLine 1\n">
#<Nokogiri::XML::Text:0x..fc01de4d6 "\nLine 2\n">
#<Nokogiri::XML::Text:0x..fc01de36e "\nLine 3\n">
#<Nokogiri::XML::Text:0x..fc01de206 "\n">
#<Nokogiri::XML::Text:0x..fc01ddf04 "\n">
=> 0

irb(main):037:0> dom.xpath('//address/following-sibling::p//text()').each {|n| p n}
#<Nokogiri::XML::Text:0x..fc01de63e "\nLine 1\n">
#<Nokogiri::XML::Text:0x..fc01de4d6 "\nLine 2\n">
#<Nokogiri::XML::Text:0x..fc01de36e "\nLine 3\n">
#<Nokogiri::XML::Text:0x..fc01de206 "\n">
=> 0

Here's another approach: find everything under <div class="mail"> but
not under <address>:

irb(main):032:0> dom.xpath('//div[@class="mail"]//text()[not(ancestor::address)]').each {|n| p n}
#<Nokogiri::XML::Text:0x..fc01c07c4 "\n">
#<Nokogiri::XML::Text:0x..fc01de7b0 "\n">
#<Nokogiri::XML::Text:0x..fc01de63e "\nLine 1\n">
#<Nokogiri::XML::Text:0x..fc01de4d6 "\nLine 2\n">
#<Nokogiri::XML::Text:0x..fc01de36e "\nLine 3\n">
#<Nokogiri::XML::Text:0x..fc01de206 "\n">
#<Nokogiri::XML::Text:0x..fc01ddf04 "\n">
=> 0

TIMTOWTDI

Kind regards

robert

On Wed, Mar 27, 2013 at 12:05 AM, Jesús Gabriel y Galán <jgabrielygalan@gmail.com> wrote:

On Tue, Mar 26, 2013 at 11:40 PM, Paul Mena <lists@ruby-forum.com> wrote:

My goal is to strip out the "address" and "pre" elements and to output
only the text between them:

OK, so you want every tag that is a sibling of address and pre and is
within those two. I have found this StackOverFlow answer:

html - XPath Expression: Select elements between A HREF="expr" tags - Stack Overflow


--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

7stud2 · 28 March 2013 12:07

Joel Pearson wrote in post #1103448:

You still have @my_text and @date_string there, which leads me to
suspect that's only the last part of your script.
The example I gave is the entire script...

You're right. Here's the whole thing:

#!/usr/bin/env ruby

# some initializations

@interesting = false
@my_text = ""

# read the file between the two "body" tags

fname = ARGV[0]

IO.foreach(fname) do |line|
  if line.match(/body="start"/)
    @interesting = true
  end

  # meanwhile let's grab the date string and process it

  start_column = 4
  end_column = 6

  target_range = (start_column-1)..(end_column-1)

  if line.match(/<dfn>Date<\/dfn>/)
    pieces = line.split(" ")
    @date_string = pieces[target_range].join("-")
  end

  if line.match(/body="end"/)
    @interesting = false
  end

  if @interesting
    @my_text << line
  end
end

require "nokogiri"

doc = Nokogiri::HTML(@my_text)
output = @date_string + $/
doc.xpath('//address/following-sibling::p//text()').each { |line| output << ( line.text.strip + $/ ) }

File.write("snippet.txt", output)


7stud2 · 29 March 2013 15:17

Joel Pearson wrote in post #1103626:

Since you're using Nokogiri anyway, you'd be better off using that for
the whole process rather than looping through the HTML "manually". This
is the sort of thing it's worth getting in the habit of: using the tools
available to their fullest potential.
You can do the whole thing in 5 lines (barring error-checking) as I
demonstrated earlier.

I completely missed that in your earlier post! It definitely makes
sense to use Nokogiri to do exactly what it's good at.

Thanks!

