Print - and strip text between tags using Nokogiri

I'm a Ruby Newbie trying to write a program to process thousands of HTML
files, extracting pertinent text and inserting it into a MySQL database.
Ruby seems ideally suited to the task in general, and I've already used
Nokogiri to extract comment text. What I need to do next is to print -
and then ultimately delete or strip - the text between "pre" tags.

Picture some html like this:

<title>My Title</title>
<h1>My Heading</h1>
<strong>Date:</strong> Wed Dec 05 2012 - 18:17:49 EST
<!-- body="start" -->
text line 1
text line 2
text line 3
very important text
more important text
would you believe even more important text?
<p><!-- body="end" -->

I basically need to do 2 things: 1) to print only the text between the 2
"pre" tags, and then 2) to print all of the non-tagged text between the
"body" comments - minus the text between the "pre" tags. I've been
messing with this for a couple of hours - unsuccessfully - but I'm still
convinced that this is the right tool for the job.


If you need to do more HTML and XML manipulation, learning XPath is a
good investment! You can look here for a start:

_One_ way to achieve what you want:

require 'nokogiri'

text = <<HTML
<title>My Title</title>
<h1>My Heading</h1>
<strong>Date:</strong> Wed Dec 05 2012 - 18:17:49 EST
<!-- body="start" -->
text line 1
text line 2
text line 3
very important text
more important text
would you believe even more important text?
<p><!-- body="end" -->
HTML

dom = Nokogiri.HTML(text)

puts dom.xpath('/html/body//pre/text()').map(&:to_s)

puts '---'

puts dom.xpath('/html/body//text()[not(ancestor::pre)]').map(&:to_s)

You can also process nodes individually if you replace ".map..." with
".each" and a block which receives the node and does something with it.

Kind regards



remember.guy do |as, often| as.you_can - without end

Thank you for the swift reply! I tried running the above against my
"test.html" snippet and ended up getting the following:

pablo@cochituate=> ./extract_text.rb ./test.html



There should be a way to match the text of the first comment, but I
couldn't get this to work:



This ugly xpath will select the comment based on its text:

my_xpath = %Q{/html/body/comment()[. =' body="start"


I want to thank everyone for their quick replies and helpful
suggestions. I realized that I should probably be using the real - and
admittedly poorly-formed - HTML for this question and not the test HTML
I've tried to concoct for this example. The real HTML was generated by
the Hypermail program, basically converting an email from mbox form to
HTML. Here is one such file:

<title>haiku_archive: watching the news</title>
<meta name="Author" content="Paul David Mena (">
<meta name="Subject" content="watching the news">
<body bgcolor="#FFFFFF" text="#000000">
<h1>watching the news</h1>
<strong>From:</strong> Paul David Mena (<a
<strong>Date:</strong> Fri Dec 14 2012 - 18:51:14 EST
<hr noshade><p>
<!-- body="start" -->
watching the news
I feel guilty
for being alive


Paul David Mena
<p><!-- body="end" -->

My ultimate goal is to extract all of the comment text between <!--
body="start" --> and <!-- body="end" --> but *not* what is between the
two "pre" tags. So far I've been able to extract all of the comment
text but not exclude the "pre" text, using the following code:

#!/usr/bin/env ruby

require "rubygems"
require "nokogiri"

class PlainTextExtractor < Nokogiri::XML::SAX::Document

  attr_reader :plaintext

  # Initialize the state of interest variable with false
  def initialize
    @interesting = false
    @plaintext = ""
  end

  # This method is called whenever a comment occurs and
  # the comment's text is passed in as a string.
  def comment(string)
    case string.strip # strip leading and trailing whitespace
    when /^body="start"/ # match starting comment
      @interesting = true
    when /^body="end"/ # match closing comment
      @interesting = false
    end
  end

  # This callback method is called with any text between
  # tags.
  def characters(string)
    @plaintext << string if @interesting
  end
end
# write to the screen
pte = PlainTextExtractor.new
parser = Nokogiri::HTML::SAX::Parser.new(pte)
parser.parse_file ARGV[0]
# puts pte.plaintext

# write to a file
begin
  file = File.open("snippet.txt", "w")
  file.write pte.plaintext
rescue IOError => e
  # some error occurred, dir not writable etc.
ensure
  file.close unless file == nil
end

# get the date written
fname = ARGV[0]
start_column = 3
end_column = 5

target_range = (start_column-1)..(end_column-1)

IO.foreach(fname) do |line|
  if line.match(/<strong>Date:<\/strong>/)
    pieces = line.split(" ")
    puts pieces[target_range].join("-")
  end
end

# remove blank lines from file
fh = File.open('snippet.txt')
while( !fh.eof)
    line = fh.readline.chomp
    # remove leading and trailing blanks
    line.strip!
    # skip empty lines
    next if line == ''
    # convert tab chars to blanks
    line.gsub!(/\t/,' ')
    # substitute a single blank for a sequence of blanks
    line.squeeze!(' ')
    # add code to process line if needed
    puts line
end
fh.close

The output is as follows:

pablo@cochituate=> ./extract_haiku.rb
watching the news
I feel guilty
for being alive
Paul David Mena

Basically I want to omit the signature (everything below the "--",
inclusive), which is wrapped in the "pre" tags.
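Since the end goal is a MySQL insert, the date handling in the script above could also be pulled into a small helper that returns a real Date. This is only a sketch: extract_date is a hypothetical name, and it assumes the "Date:" line always has the layout shown in the sample file.

```ruby
require 'date'

# Hypothetical helper: returns a Date for a Hypermail "Date:" line,
# or nil when the line is not a date line.
def extract_date(line)
  return nil unless line.include?('<strong>Date:</strong>')

  # e.g. ["<strong>Date:</strong>", "Fri", "Dec", "14", "2012", "-", ...]
  pieces = line.split(' ')
  Date.strptime(pieces[2..4].join(' '), '%b %d %Y')
end

line = '<strong>Date:</strong> Fri Dec 14 2012 - 18:51:14 EST'
puts extract_date(line).strftime('%Y-%m-%d')
```

The strftime call formats the date as YYYY-MM-DD, which is the layout MySQL expects for a DATE column.
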

require "nokogiri"

class PlainTextExtractor < Nokogiri::XML::SAX::Document
  attr_reader :plaintext
  # Initialize the state of interest variable with false
  def initialize
    @interesting = false
    @pre = false
    @plaintext = ""
  end

  # Called for every opening tag; remember when a <pre> starts.
  def start_element(name, attrs = [])
    if name == "pre"
      @pre = true
    end
  end

  # This method is called whenever a comment occurs and
  # the comment's text is passed in as a string.
  def comment(string)
    case string.strip # strip leading and trailing whitespace
      when /^body="start"/ # match starting comment
        @interesting = true
      when /^body="end"/ # match closing comment
        @interesting = false
    end
  end

  # This callback method is called with any text between
  # tags.
  def characters(string)
    if @interesting and not @pre
      @plaintext << string
    end
  end
end

pte = PlainTextExtractor.new
parser = Nokogiri::HTML::SAX::Parser.new(pte)
parser.parse_file ARGV[0]

p pte.plaintext

"\n\nwatching the news\n\nI feel guilty\n\nfor being alive\n\n"


7stud - that's perfect. Thank you so much!


You passed in the file name string to Nokogiri, not the contents.



Paul Mena wrote in post #1089283:

# remove blank lines from file
fh = File.open('snippet.txt')
while( !fh.eof)
    line = fh.readline.chomp
    # remove leading and trailing blanks
    line.strip!
    # skip empty lines
    next if line == ''
    # convert tab chars to blanks
    line.gsub!(/\t/,' ')
    # substitute a single blank for a sequence of blanks
    line.squeeze!(' ')
    # add code to process line if needed
    puts line
end
fh.close

I forgot to mention. There are several ways to read line by line from a
file, but your loop is particularly ugly. If you use my favorite:

IO.foreach(fname) do |line|
  line = line.chomp
  # process the line here
end

The added benefit is that the file is automatically closed when the
block exits.
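Applied to the blank-line cleanup quoted above, that might look roughly like this. It is a sketch: clean_lines is a hypothetical helper name, and a temporary file stands in for snippet.txt.

```ruby
require 'tempfile'

# Hypothetical helper: returns the cleaned, non-empty lines of a file.
def clean_lines(fname)
  cleaned = []
  IO.foreach(fname) do |line|
    line = line.strip            # drop the newline and surrounding blanks
    next if line.empty?          # skip empty lines
    line = line.gsub(/\t/, ' ')  # convert tab chars to blanks
    cleaned << line.squeeze(' ') # collapse runs of blanks into one
  end
  cleaned
end

# Stand-in for snippet.txt
file = Tempfile.new('snippet')
file.write("\n\nwatching the news\n\nI feel\t guilty\n\n  for being alive\n\n")
file.close

result = clean_lines(file.path)
puts result

file.unlink
```

There is no explicit close for the file being read, because IO.foreach takes care of it when the block is done.
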



Robert Klemme wrote in post #1089225:

I overlooked the comment thing.


require 'nokogiri'
# require 'irb'

text = <<HTML
<title>My Title</title>
<h1>My Heading</h1>
<strong>Date:</strong> Wed Dec 05 2012 - 18:17:49 EST
<!-- body="start" -->
text line 1
text line 2
text line 3
very important text
more important text
would you believe even more important text?
<p><!-- body="end" -->
not to print
HTML

dom = Nokogiri.HTML(text)

puts dom.xpath('/html/body//pre/text()')

puts '---'

puts dom.xpath('//text()[contains(preceding::comment(),"start") and
contains(following::comment(),"end") and not(ancestor::pre)]')

Kind regards



On Sun, Dec 16, 2012 at 11:18 AM, 7stud -- <> wrote:

