I'm relatively new to Ruby (and therefore Nokogiri) and am trying to
parse some HTML that will ultimately be written to a MySQL database. In
the interim, I'm writing it to a text file for troubleshooting purposes.
Here's the relevant piece of the HTML I'd like to parse:
<!-- body="start" -->
<div class="mail">
<address class="headers">
<span id="from">
<dfn>From</dfn>: Paul David Mena <<a
href="mailto:pauldavidmena_at_gmail.com?Subject=Re:%20twilight">pauldavidmena_at_gmail.com</a>>
</span><br />
<span id="date"><dfn>Date</dfn>: Tue, 26 Mar 2013 18:13:21
-0400</span><br />
</address>
<p>
Line 1
<br />
Line 2
<br />
Line 3
<br />
<p><pre>
···
--
Paul David Mena
--------------------
pauldavidmena_at_gmail.<!--nospam-->com
</pre>
<span id="received"><dfn>Received on</dfn> Tue Mar 26 2013 - 22:13:23
EDT</span>
</div>
<!-- body="end" -->
My goal is to strip out everything inside the "address" and "pre"
elements and to output only:
Line 1
Line 2
Line 3
My code, however, strips out only one or the other, depending on the
order in which I define the start_element/end_element methods. Here is
the code:
#!/usr/bin/env ruby
require "nokogiri"
class PlainTextExtractor < Nokogiri::XML::SAX::Document
  attr_reader :plaintext

  # Initialize the state flags to false
  def initialize
    @interesting = false
    @pre = false
    @address = false
    @plaintext = ""
  end

  def start_element(name, attrs = [])
    if name == "address"
      @address = true
    end
  end

  def end_element(name, attrs = [])
    if name == "address"
      @address = false
    end
  end

  def start_element(name, attrs = [])
    if name == "pre"
      @pre = true
    end
  end

  def end_element(name, attrs = [])
    if name == "pre"
      @pre = false
    end
  end

  # This method is called whenever a comment occurs, and
  # the comment's text is passed in as a string.
  def comment(string)
    case string.strip # strip leading and trailing whitespace
    when /^body="start"/ # match starting comment
      @interesting = true
    when /^body="end"/ # match closing comment
      @interesting = false
    end
  end

  # This callback method is called with the text found
  # between tags.
  def characters(string)
    if @interesting and not @pre
      if @interesting and not @address
        @plaintext << string
      end
    end
  end
end
fname = ARGV[0]
start_column = 4
end_column = 6
target_range = (start_column-1)..(end_column-1)

IO.foreach(fname) do |line|
  if line.match(/<dfn>Date<\/dfn>/)
    pieces = line.split(" ")
    @date_string = pieces[target_range].join("-")
    # puts @date_string
  end
end

pte = PlainTextExtractor.new
parser = Nokogiri::HTML::SAX::Parser.new(pte)
parser.parse_file ARGV[0]
# puts pte.plaintext

begin
  file = File.open("snippet.txt", "w")
  file.write(@date_string)
  file.write(pte.plaintext)
rescue IOError => e
  # some error occurred, e.g. directory not writable
ensure
  file.close unless file == nil
end
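
A possible explanation, offered as a minimal sketch rather than a tested fix for the script above: Ruby keeps only the most recent definition of a method, so the second start_element/end_element pair replaces the first, and only the flag set by the later pair is ever updated. The sketch below folds both tags into a single pair of callbacks. SkippingTextCollector, SKIPPED_TAGS and run_demo are hypothetical names, and the class deliberately does not inherit from Nokogiri::XML::SAX::Document, so the events are fed in by hand instead of by a parser.

```ruby
require "set"

# Hypothetical stand-in for the SAX handler: same callback names, but no
# Nokogiri dependency, so the state logic can be run on its own.
class SkippingTextCollector
  # Elements whose text should be dropped (assumption: just these two).
  SKIPPED_TAGS = Set["address", "pre"].freeze

  attr_reader :plaintext

  def initialize
    @interesting = false # true between the body="start"/"end" comments
    @skip_depth = 0      # greater than zero while inside a skipped element
    @plaintext = +""
  end

  # A single definition handles every tag of interest.
  def start_element(name, _attrs = [])
    @skip_depth += 1 if SKIPPED_TAGS.include?(name)
  end

  def end_element(name)
    @skip_depth -= 1 if SKIPPED_TAGS.include?(name) && @skip_depth > 0
  end

  def comment(string)
    case string.strip
    when /\Abody="start"/ then @interesting = true
    when /\Abody="end"/   then @interesting = false
    end
  end

  def characters(string)
    @plaintext << string if @interesting && @skip_depth.zero?
  end
end

# Feed a hand-written event sequence that loosely mimics the sample markup.
def run_demo
  handler = SkippingTextCollector.new
  handler.comment(' body="start" ')
  handler.start_element("address")
  handler.characters("From: someone")
  handler.end_element("address")
  handler.characters("Line 1\n")
  handler.start_element("br")
  handler.end_element("br")
  handler.characters("Line 2\n")
  handler.start_element("pre")
  handler.characters("signature")
  handler.end_element("pre")
  handler.comment(' body="end" ')
  handler.characters("after the end marker")
  handler.plaintext
end

puts run_demo
```

With a real parser, the same method bodies could presumably sit in a subclass of Nokogiri::XML::SAX::Document. Note that in the sample HTML the "Received on" span lies outside both "address" and "pre", so its text would still be collected unless it is skipped as well.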
--
Posted via http://www.ruby-forum.com/.