I want to thank everyone for their quick replies and helpful
suggestions. I realized that I should probably be using the real (and
admittedly poorly formed) HTML for this question, not the test HTML
I had concocted as an example. The real HTML was generated by
the Hypermail program, which converts an email from mbox format to
HTML. Here is one such file:
<html>
<head>
<title>haiku_archive: watching the news</title>
<meta name="Author" content="Paul David Mena (pauldavidmena@gmail.com)">
<meta name="Subject" content="watching the news">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<h1>watching the news</h1>
<strong>From:</strong> Paul David Mena (<a
href="mailto:pauldavidmena@gmail.com?Subject=Re:%20watching%20the%20news&In-Reply-To=<CAOJ9yjPRsvJ8%2BtMKjCeUnKGKcuHGQ3kuakE%2BL%2BHS1gWCMEh8jQ@mail.gmail.com>"><em>pauldavidmena@gmail.com</em></a>)<br>
<strong>Date:</strong> Fri Dec 14 2012 - 18:51:14 EST
<p>
<hr noshade><p>
<!-- body="start" -->
<p>
watching the news
<br>
I feel guilty
<br>
for being alive
<br>
<p><pre>
···
--
Paul David Mena
--------------------
<a
href="mailto:pauldavidmena@gmail.com?Subject=Re:%20watching%20the%20news&In-Reply-To=<CAOJ9yjPRsvJ8%2BtMKjCeUnKGKcuHGQ3kuakE%2BL%2BHS1gWCMEh8jQ@mail.gmail.com>">pauldavidmena@gmail.com</a>
</pre>
<p><!-- body="end" -->
</body>
</html>
My ultimate goal is to extract all of the text between the
<!-- body="start" --> and <!-- body="end" --> comments, but *not* what
is between the two "pre" tags. So far I've been able to extract all of
the text between the comments but not exclude the "pre" text, using the
following code:
#!/usr/bin/env ruby
require "rubygems"
require "nokogiri"
class PlainTextExtractor < Nokogiri::XML::SAX::Document
  attr_reader :plaintext
  # Start outside the region of interest, with no text collected yet
  def initialize
    @interesting = false
    @plaintext = ""
  end
  # This method is called whenever a comment occurs; the
  # comment's text is passed in as a string.
  def comment(string)
    case string.strip # strip leading and trailing whitespace
    when /^body="start"/ # match starting comment
      @interesting = true
    when /^body="end"/ # match closing comment
      @interesting = false
    end
  end
  # This callback method is called with any text that appears
  # between tags.
  def characters(string)
    @plaintext << string if @interesting
  end
end
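# parse the file named on the command line
abort "usage: #{$0} FILE" if ARGV.empty?
pte = PlainTextExtractor.new
parser = Nokogiri::HTML::SAX::Parser.new(pte)
parser.parse_file ARGV[0]
# write to the screen
# puts pte.plaintext
# write to a file (the block form closes the file for us)
begin
  File.open("snippet.txt", "w") { |file| file.write pte.plaintext }
rescue SystemCallError, IOError => e
  # some error occurred, e.g. the directory is not writable
  warn "could not write snippet.txt: #{e.message}"
end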
# get the date written
fname = ARGV[0]
start_column = 3
end_column = 5
target_range = (start_column-1)..(end_column-1)
IO.foreach(fname) do |line|
  if line.match(/<strong>Date:<\/strong>/)
    pieces = line.split(" ")
    puts pieces[target_range].join("-")
  end
end
# remove blank lines from file
File.foreach('snippet.txt') do |line|
  # remove leading and trailing blanks (including the newline)
  line = line.strip
  # skip empty lines
  next if line.empty?
  # convert tab chars to blanks
  line = line.gsub(/\t/, ' ')
  # substitute a single blank for a sequence of blanks
  line = line.squeeze(' ')
  # add code to process line if needed
  puts line
end
exit(0)
The output is as follows:
pablo@cochituate=> ./extract_haiku.rb /export/www/html/haikupoet/archive/0925.html
watching the news
I feel guilty
for being alive
--
Paul David Mena
--------------------
pauldavidmena@gmail.com
Basically I want to omit the signature (the "--" line and everything
below it), which is wrapped in the "pre" tags.
--
Posted via http://www.ruby-forum.com/.