Print - and strip text between tags using Nokogiri

I'm a Ruby Newbie trying to write a program to process thousands of HTML
files, extracting pertinent text and inserting it into a MySQL database.
Ruby seems ideally suited to the task in general, and I've already used
Nokogiri to extract comment text. What I need to do next is to print -
and then ultimately delete or strip - the text between "pre" tags.

Picture some html like this:

<html>
<head>
<title>My Title</title>
</head>
<body>
<h1>My Heading</h1>
<strong>From:</strong>Me<br>
<strong>Date:</strong> Wed Dec 05 2012 - 18:17:49 EST
<!-- body="start" -->
<p>
text line 1
<br>
text line 2
<br>
text line 3
<br>
<p><pre>
very important text
more important text
would you believe even more important text?
</pre>
<p><!-- body="end" -->
</body>
</html>

I basically need to do 2 things: 1) to print only the text between the 2
"pre" tags, and then 2) to print all of the non-tagged text between the
"body" comments - minus the text between the "pre" tags. I've been
messing with this for a couple of hours - unsuccessfully - but I'm still
convinced that this is the right tool for the job.

--
Posted via http://www.ruby-forum.com/.

If you need to do more HTML and XML manipulation, learning XPath is a
good investment! You can look here for a start:
http://www.w3schools.com/Xpath/default.asp

_One_ way to achieve what you want:

require 'nokogiri'

text = <<HTML
<html>
<head>
<title>My Title</title>
</head>
<body>
<h1>My Heading</h1>
<strong>From:</strong>Me<br>
<strong>Date:</strong> Wed Dec 05 2012 - 18:17:49 EST
<!-- body="start" -->
<p>
text line 1
<br>
text line 2
<br>
text line 3
<br>
<p><pre>
very important text
more important text
would you believe even more important text?
</pre>
<p><!-- body="end" -->
</body>
</html>
HTML

dom = Nokogiri.HTML(text)

puts dom.xpath('/html/body//pre/text()').map(&:to_s)

puts '---'

puts dom.xpath('/html/body//text()[not(ancestor::pre)]').map(&:to_s)

You can also process nodes individually if you replace ".map..." with
".each" and a block which receives the node and does something with
it.

Kind regards

robert

On Sun, Dec 16, 2012 at 12:10 AM, Paul Mena <lists@ruby-forum.com> wrote:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Thank you for the swift reply! I tried running the above against my
"test.html" snippet and ended up getting the following:

pablo@cochituate=> ./extract_text.rb ./test.html

---
./test.html

There should be a way to match the text of the first comment, but I
couldn't get this to work:

comment()[text()='body="start"']

This ugly xpath will select the comment based on its text:

my_xpath = %Q{/html/body/comment()[. =' body="start" ']/following-sibling::*[not(self::pre)]}

I want to thank everyone for their quick replies and helpful
suggestions. I realized that I should probably be using the real - and
admittedly poorly-formed - HTML for this question and not the test HTML
I've tried to concoct for this example. The real HTML was generated by
the Hypermail program, basically converting an email from mbox form to
HTML. Here is one such file:

<html>
<head>
<title>haiku_archive: watching the news</title>
<meta name="Author" content="Paul David Mena (pauldavidmena@gmail.com)">
<meta name="Subject" content="watching the news">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<h1>watching the news</h1>
<strong>From:</strong> Paul David Mena (<a
href="mailto:pauldavidmena@gmail.com?Subject=Re:%20watching%20the%20news&In-Reply-To=&lt;CAOJ9yjPRsvJ8%2BtMKjCeUnKGKcuHGQ3kuakE%2BL%2BHS1gWCMEh8jQ@mail.gmail.com&gt;"><em>pauldavidmena@gmail.com</em></a>)<br>
<strong>Date:</strong> Fri Dec 14 2012 - 18:51:14 EST
<p>
<hr noshade><p>
<!-- body="start" -->
<p>
watching the news
<br>
I feel guilty
<br>
for being alive
<br>
<p><pre>

···

--
Paul David Mena
--------------------
<a
href="mailto:pauldavidmena@gmail.com?Subject=Re:%20watching%20the%20news&In-Reply-To=&lt;CAOJ9yjPRsvJ8%2BtMKjCeUnKGKcuHGQ3kuakE%2BL%2BHS1gWCMEh8jQ@mail.gmail.com&gt;">pauldavidmena@gmail.com</a>
</pre>
<p><!-- body="end" -->
</body>
</html>

My ultimate goal is to extract all of the comment text between <!--
body="start" --> and <!-- body="end" --> but *not* what is between the
two "pre" tags. So far I've been able to extract all of the comment
text but not exclude the "pre" text, using the following code:

#!/usr/bin/env ruby

require "rubygems"
require "nokogiri"

class PlainTextExtractor < Nokogiri::XML::SAX::Document

  attr_reader :plaintext

  # Initialize the state of interest variable with false
  def initialize
    @interesting = false
    @plaintext = ""
  end

  # This method is called whenever a comment occurs and
  # the comments text is passed in as string.
  def comment(string)
    case string.strip # strip leading and trailing whitespaces
    when /^body="start"/ # match starting comment
      @interesting = true
    when /^body="end"/
      @interesting = false # match closing comment
    end
  end

  # This callback method is called with any string between
  # a tag.
  def characters(string)
    @plaintext << string if @interesting
  end
end

# write to the screen
pte = PlainTextExtractor.new
parser = Nokogiri::HTML::SAX::Parser.new(pte)
parser.parse_file ARGV[0]
# puts pte.plaintext

# write to a file
begin
  file = File.open("snippet.txt", "w")
  file.write pte.plaintext
rescue IOError => e
  #some error occur, dir not writable etc.
ensure
  file.close unless file == nil
end

# get the date written
fname = ARGV[0]
start_column = 3
end_column = 5

target_range = (start_column-1)..(end_column-1)

IO.foreach(fname) do |line|
  if line.match(/<strong>Date:<\/strong>/)
    pieces = line.split(" ")
    puts pieces[target_range].join("-")
  end
end

# remove blank lines from file
fh = File.open('snippet.txt')
while( !fh.eof)
    line = fh.readline.chomp
    # remove leading and trailing blanks
    line.strip!
    # skip empty lines
    next if line == ''
    # convert tab chars to blanks
    line.gsub!(/\t/,' ')
    # substitute a single blank for a sequence of blanks
    line.squeeze!(' ')
    # add code to process line if needed
    puts line
end
fh.close
exit(0)

The output is as follows:

pablo@cochituate=> ./extract_haiku.rb
/export/www/html/haikupoet/archive/0925.html
watching the news
I feel guilty
for being alive
--
Paul David Mena
--------------------
pauldavidmena@gmail.com

Basically I want to omit the signature (everything below the "--",
inclusive), which is wrapped in the "pre" tags.

require "nokogiri"

class PlainTextExtractor < Nokogiri::XML::SAX::Document
  attr_reader :plaintext
  # Initialize the state of interest variable with false
  def initialize
    @interesting = false
    @pre = false
    @plaintext = ""
  end

  def start_element(name, attrs = [])
    if name == "pre"
      @pre = true
    end
  end

  # This method is called whenever a comment occurs and
  # the comments text is passed in as string.
  def comment(string)
    case string.strip # strip leading and trailing whitespaces
      when /^body="start"/ # match starting comment
        @interesting = true
      when /^body="end"/
        @interesting = false # match closing comment
    end
  end

  # This callback method is called with any string between
  # a tag.
  def characters(string)
    if @interesting and not @pre
      @plaintext << string
    end
  end
end

pte = PlainTextExtractor.new
parser = Nokogiri::HTML::SAX::Parser.new(pte)
parser.parse_file ARGV[0]

p pte.plaintext

--output:--
"\n\nwatching the news\n\nI feel guilty\n\nfor being alive\n\n"

7stud - that's perfect. Thank you so much!

You passed the file name string to Nokogiri, not the file's contents.

On Sat, Dec 15, 2012 at 8:23 PM, Paul Mena <lists@ruby-forum.com> wrote:

Robert Klemme wrote in post #1089225:

puts dom.xpath('/html/body//pre/text()').map(&:to_s)

Calling map() is redundant because puts calls to_s on its arguments.

2) to print all of the non-tagged text between the
"body" comments

Your html doesn't even test your requirements because there is no text
after the body="end" comment. And there is no non-tagged text:

require 'nokogiri'

html = <<HTML
<html>
<head>
<title>My Title</title>
</head>
<body>
<h1>My Heading</h1>
<strong>From:</strong>Me<br>
<strong>Date:</strong> Wed Dec 05 2012 - 18:17:49 EST
<!-- body="start" -->
<p>
text line 1
<br>
text line 2
<br>
text line 3
<br>
<p><pre>
very important text
more important text
would you believe even more important text?
</pre>
<p><!-- body="end" -->
<p>
text line 4
<br>
text line 5
</body>
</html>
HTML

doc = Nokogiri.HTML(html)

my_xpath = "/html/body/comment()[1]/following-sibling::*"

doc.xpath(my_xpath).each do |node|
  puts node.name
  puts node.text
  puts '*' * 20
end

--output:--
p

text line 1

text line 2

text line 3

********************
p

********************
pre

very important text
more important text
would you believe even more important text?
********************
p

********************
p

text line 4

text line 5
********************

--output:--
text line 1
text line 2
text line 3

doc = Nokogiri.HTML(html)

my_xpath =
"/html/body/comment()[1]/following-sibling::*[not(self::pre)]"

catch :found_ending_text do
  doc.xpath(my_xpath).each do |node|
    node.children.each do |child|
      text = child.text
      throw :found_ending_text if text.include? %q{body}
      next if text.empty?
      puts text.strip
    end
  end
end

--output:--
text line 1
text line 2
text line 3

Paul Mena wrote in post #1089283:

# remove blank lines from file
fh = File.open('snippet.txt')
while( !fh.eof)
    line = fh.readline.chomp
    # remove leading and trailing blanks
    line.strip!
    # skip empty lines
    next if line == ''
    # convert tab chars to blanks
    line.gsub!(/\t/,' ')
    # substitute a single blank for a sequence of blanks
    line.squeeze!(' ')
    # add code to process line if needed
    puts line
end
fh.close
exit(0)

I forgot to mention: there are several ways to read a file line by line,
but your loop is particularly ugly. My favorite is:

IO.foreach(fname) do |line|
  line = line.chomp
  ...
  ..
end

The added benefit is that the file is automatically closed when the
block exits.
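
Filled in for the blank-line cleanup from the original script, it might look roughly like this. The temp file stands in for snippet.txt, and the "cleaned" array is just for illustration.

```ruby
require 'tempfile'

# Hypothetical stand-in for snippet.txt.
snippet = Tempfile.new('snippet')
snippet.write("  watching the news\n\n\tI feel   guilty\n\nfor being alive \n")
snippet.close

cleaned = []
File.foreach(snippet.path) do |line|
  line = line.strip       # drop the newline and surrounding blanks
  next if line.empty?     # skip blank lines
  # convert tabs to blanks, then collapse runs of blanks into one
  cleaned << line.tr("\t", ' ').squeeze(' ')
end

puts cleaned
```

File.foreach behaves like IO.foreach for ordinary paths; no explicit close is needed.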

7stud -- wrote in post #1089314:

I forgot to mention. There are several ways to read line by line from a
file,

Here are some others:

f = File.new(fname)

f.each do |line|
  line = line.chomp

end

f.close

====

File.open(fname) do |f|
  while line = f.gets #gets() returns nil at eof
  ...
  end
end #file is automatically closed here

Robert Klemme wrote in post #1089225:

puts dom.xpath('/html/body//pre/text()').map(&:to_s)

Calling map() is redundant because puts calls to_s on its arguments.

Right, thanks for the reminder! That was an artifact of IRB testing. :-)

2) to print all of the non-tagged text between the
"body" comments

Your html doesn't even test your requirements because there is no text
after the body="end" comment. And there is no non-tagged text:

I overlooked the requirement about the body comments.

#!/usr/bin/ruby

require 'nokogiri'
# require 'irb'

text = <<HTML
<html>
<head>
<title>My Title</title>
</head>
<body>
<h1>My Heading</h1>
<strong>From:</strong>Me<br>
<strong>Date:</strong> Wed Dec 05 2012 - 18:17:49 EST
<!-- body="start" -->
<p>
text line 1
<br>
text line 2
<br>
text line 3
<br>
<p><pre>
very important text
more important text
would you believe even more important text?
</pre>
<p><!-- body="end" -->
not to print
</body>
</html>
HTML

dom = Nokogiri.HTML(text)

puts dom.xpath('/html/body//pre/text()')

puts '---'

puts dom.xpath('//text()[contains(preceding::comment(),"start") and
contains(following::comment(),"end") and not(ancestor::pre)]')

Kind regards

robert

On Sun, Dec 16, 2012 at 11:18 AM, 7stud -- <lists@ruby-forum.com> wrote:
