Print - and strip text between tags using Nokogiri

7stud2 · 15 December 2012 23:10

I'm a Ruby Newbie trying to write a program to process thousands of HTML
files, extracting pertinent text and inserting it into a MySQL database.
Ruby seems ideally suited to the task in general, and I've already used
Nokogiri to extract comment text. What I need to do next is to print -
and then ultimately delete or strip - the text between "pre" tags.

Picture some html like this:

<html>
<head>
<title>My Title</title>
</head>
<body>
<h1>My Heading</h1>
From:Me 
Date: Wed Dec 05 2012 - 18:17:49 EST


text line 1
 
text line 2
 
text line 3
 
<pre>
very important text
more important text
would you believe even more important text?
</pre>

</body>
</html>

I basically need to do 2 things: 1) to print only the text between the 2
"pre" tags, and then 2) to print all of the non-tagged text between the
"body" comments - minus the text between the "pre" tags. I've been
messing with this for a couple of hours - unsuccessfully - but I'm still
convinced that this is the right tool for the job.

--
Posted via http://www.ruby-forum.com/.

Robert_K1 · 15 December 2012 23:28

If you need to do more HTML and XML manipulation, learning XPath is a
good investment! You can look here for a start:
http://www.w3schools.com/Xpath/default.asp

_One_ way to achieve what you want:

require 'nokogiri'

text = <<HTML
<html>
<head>
<title>My Title</title>
</head>
<body>
<h1>My Heading</h1>
From:Me 
Date: Wed Dec 05 2012 - 18:17:49 EST


text line 1
 
text line 2
 
text line 3
 
<pre>
very important text
more important text
would you believe even more important text?
</pre>

</body>
</html>
HTML

dom = Nokogiri.HTML(text)

puts dom.xpath('/html/body//pre/text()').map(&:to_s)

puts '---'

puts dom.xpath('/html/body//text()[not(ancestor::pre)]').map(&:to_s)

You can also process nodes individually if you replace ".map..." with
".each" and a block which receives the node and does something with
it.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

7stud2 · 16 December 2012 02:23

Thank you for the swift reply! I tried running the above against my
"test.html" snippet and ended up getting the following:

pablo@cochituate=> ./extract_text.rb ./test.html

---
./test.html

7stud2 · 16 December 2012 10:37

There should be a way to match the text of the first comment, but I
couldn't get this to work:

comment()[text()='body="start"']

7stud2 · 16 December 2012 10:52

This ugly xpath will select the comment based on its text:

my_xpath = %Q{/html/body/comment()[. =' body="start"
']/following-sibling::*[not(self::pre)]}

7stud2 · 16 December 2012 16:11

I want to thank everyone for their quick replies and helpful
suggestions. I realized that I should probably be using the real - and
admittedly poorly-formed - HTML for this question and not the test HTML
I've tried to concoct for this example. The real HTML was generated by
the Hypermail program, basically converting an email from mbox form to
HTML. Here is one such file:

<html>
<head>
<title>haiku_archive: watching the news</title>
<meta name="Author" content="Paul David Mena (pauldavidmena@gmail.com)">
<meta name="Subject" content="watching the news">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<h1>watching the news</h1>
From: Paul David Mena (<a
href="mailto:pauldavidmena@gmail.com?Subject=Re:%20watching%20the%20news&In-Reply-To=<CAOJ9yjPRsvJ8%2BtMKjCeUnKGKcuHGQ3kuakE%2BL%2BHS1gWCMEh8jQ@mail.gmail.com>">pauldavidmena@gmail.com</a>) 
Date: Fri Dec 14 2012 - 18:51:14 EST

<hr noshade>


watching the news
 
I feel guilty
 
for being alive
 
<pre>

--
Paul David Mena
--------------------
<a
href="mailto:pauldavidmena@gmail.com?Subject=Re:%20watching%20the%20news&In-Reply-To=<CAOJ9yjPRsvJ8%2BtMKjCeUnKGKcuHGQ3kuakE%2BL%2BHS1gWCMEh8jQ@mail.gmail.com>">pauldavidmena@gmail.com</a>
</pre>

</body>
</html>

My ultimate goal is to extract all of the text between the body="start" and
body="end" comments but *not* what is between the two "pre" tags. So far I've
been able to extract all of that text but not exclude the "pre" text, using
the following code:

#!/usr/bin/env ruby

require "rubygems"
require "nokogiri"

class PlainTextExtractor < Nokogiri::XML::SAX::Document

  attr_reader :plaintext

  # Initialize the state of interest variable with false
  def initialize
    @interesting = false
    @plaintext = ""
  end

  # This method is called whenever a comment occurs and
  # the comment's text is passed in as a string.
  def comment(string)
    case string.strip # strip leading and trailing whitespace
    when /^body="start"/ # match starting comment
      @interesting = true
    when /^body="end"/ # match closing comment
      @interesting = false
    end
  end

  # This callback is called with the text found between tags.
  def characters(string)
    @plaintext << string if @interesting
  end
end

# write to the screen
pte = PlainTextExtractor.new
parser = Nokogiri::HTML::SAX::Parser.new(pte)
parser.parse_file ARGV[0]
# puts pte.plaintext

# write to a file
begin
  file = File.open("snippet.txt", "w")
  file.write pte.plaintext
rescue SystemCallError, IOError => e
  # e.g. the directory is not writable
  warn "could not write snippet.txt: #{e.message}"
ensure
  file.close if file
end

# get the date written
fname = ARGV[0]
start_column = 3
end_column = 5

target_range = (start_column-1)..(end_column-1)

IO.foreach(fname) do |line|
  if line.match(/Date:<\/strong>/)
    pieces = line.split(" ")
    puts pieces[target_range].join("-")
  end
end

# remove blank lines from file
fh = File.open('snippet.txt')
while( !fh.eof)
    line = fh.readline.chomp
    # remove leading and trailing blanks
    line.strip!
    # skip empty lines
    next if line == ''
    # convert tab chars to blanks
    line.gsub!(/\t/,' ')
    # substitute a single blank for a sequence of blanks
    line.squeeze!(' ')
    # add code to process line if needed
    puts line
end
fh.close
exit(0)

The output is as follows:

pablo@cochituate=> ./extract_haiku.rb
/export/www/html/haikupoet/archive/0925.html
watching the news
I feel guilty
for being alive
--
Paul David Mena
--------------------
pauldavidmena@gmail.com

Basically I want to omit the signature (everything below the "--",
inclusive), which is wrapped in the "pre" tags.

7stud2 · 16 December 2012 21:17

require "nokogiri"

class PlainTextExtractor < Nokogiri::XML::SAX::Document
  attr_reader :plaintext

  # Initialize the state of interest variable with false
  def initialize
    @interesting = false
    @pre = false
    @plaintext = ""
  end

  def start_element(name, attrs = [])
    if name == "pre"
      @pre = true
    end
  end

  # This method is called whenever a comment occurs and
  # the comment's text is passed in as a string.
  def comment(string)
    case string.strip # strip leading and trailing whitespace
    when /^body="start"/ # match starting comment
      @interesting = true
    when /^body="end"/ # match closing comment
      @interesting = false
    end
  end

  # This callback is called with the text found between tags.
  def characters(string)
    if @interesting and not @pre
      @plaintext << string
    end
  end
end

pte = PlainTextExtractor.new
parser = Nokogiri::HTML::SAX::Parser.new(pte)
parser.parse_file ARGV[0]

p pte.plaintext

--output:--
"\n\nwatching the news\n\nI feel guilty\n\nfor being alive\n\n"

7stud2 · 16 December 2012 23:57

7stud - that's perfect. Thank you so much!

Tamara_Temple1 · 16 December 2012 03:21

You passed the file name string to Nokogiri, not the file's contents.

7stud2 · 16 December 2012 10:18

Robert Klemme wrote in post #1089225:

puts dom.xpath('/html/body//pre/text()').map(&:to_s)

Calling map() is redundant because puts calls to_s on its arguments.
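
A quick plain-Ruby illustration of that point, with a made-up Node class standing in for a Nokogiri node: puts flattens an array and calls to_s on each element, so both calls below write the same thing. This sketch only shows the array case; that a Nokogiri node set behaves the same way is what the thread reports.

```ruby
require 'stringio'

# Made-up stand-in for a node; only to_s matters here.
class Node
  def initialize(text)
    @text = text
  end

  def to_s
    @text
  end
end

nodes = [Node.new("very important text"), Node.new("more important text")]

# Capture what puts writes so the two variants can be compared.
def captured
  out = StringIO.new
  yield out
  out.string
end

with_map    = captured { |io| io.puts nodes.map(&:to_s) }
without_map = captured { |io| io.puts nodes }

puts with_map == without_map
puts without_map
```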

2) to print all of the non-tagged text between the
"body" comments

Your html doesn't even test your requirements because there is no text
after the body="end" comment. And there is no non-tagged text:

require 'nokogiri'

html = <<HTML
<html>
<head>
<title>My Title</title>
</head>
<body>
<h1>My Heading</h1>
From:Me 
Date: Wed Dec 05 2012 - 18:17:49 EST


text line 1
 
text line 2
 
text line 3
 
<pre>
very important text
more important text
would you believe even more important text?
</pre>


text line 4
 
text line 5
</body>
</html>
HTML

doc = Nokogiri.HTML(html)

my_xpath = "/html/body/comment()[1]/following-sibling::*"

doc.xpath(my_xpath).each do |node|
  puts node.name
  puts node.text
  puts '*' * 20
end

--output:--
p

text line 1

text line 2

text line 3

********************
p

********************
pre

very important text
more important text
would you believe even more important text?
********************
p

********************
p

text line 4

text line 5
********************

doc = Nokogiri.HTML(html)

my_xpath =
"/html/body/comment()[1]/following-sibling::*[not(self::pre)]"

catch :found_ending_text do
  doc.xpath(my_xpath).each do |node|
    node.children.each do |child|
      text = child.text
      throw :found_ending_text if text.include? %q{body}
      next if text.empty?
      puts text.strip
    end
  end
end

--output:--
text line 1
text line 2
text line 3

7stud2 · 17 December 2012 01:01

Paul Mena wrote in post #1089283:

# remove blank lines from file
fh = File.open('snippet.txt')
while( !fh.eof)
    line = fh.readline.chomp
    # remove leading and trailing blanks
    line.strip!
    # skip empty lines
    next if line == ''
    # convert tab chars to blanks
    line.gsub!(/\t/,' ')
    # substitute a single blank for a sequence of blanks
    line.squeeze!(' ')
    # add code to process line if needed
    puts line
end
fh.close
exit(0)

I forgot to mention: there are several ways to read a file line by
line, but your loop is particularly ugly. My favorite is:

IO.foreach(fname) do |line|
  line = line.chomp
  ...
  ..
end

The added benefit is that the file is automatically closed when the
block exits.
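
As a sketch, the blank-line cleanup from the quoted loop could be written with this idiom as follows. The method name clean_lines and the temporary file are invented for the example; the cleaning steps are the ones from the quoted code.

```ruby
require 'tempfile'

# Hypothetical helper: returns the cleaned, non-empty lines of a file.
def clean_lines(fname)
  cleaned = []
  IO.foreach(fname) do |line|
    line = line.strip             # drop the newline and surrounding blanks
    next if line.empty?           # skip empty lines
    line = line.gsub(/\t/, ' ')   # convert tab chars to blanks
    cleaned << line.squeeze(' ')  # collapse runs of blanks into one
  end
  cleaned
end

# Temporary stand-in for snippet.txt.
sample = Tempfile.new('snippet')
sample.write("\n\nwatching the news\n\n  I feel\t guilty \n\nfor  being alive\n\n")
sample.close

result = clean_lines(sample.path)
puts result

sample.unlink
```

Because strip already removes the trailing newline, the separate chomp from the original loop is not needed here.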

7stud2 · 17 December 2012 06:12

7stud -- wrote in post #1089314:

I forgot to mention. There are several ways to read line by line from a
file,

Here are some others:

f = File.new(fname)

f.each do |line|
  line = line.chomp

end

f.close

====

File.open(fname) do |f|
  while line = f.gets #gets() returns nil at eof
  ...
  end
end #file is automatically closed here

Robert_K1 · 17 December 2012 08:11

Robert Klemme wrote in post #1089225:

puts dom.xpath('/html/body//pre/text()').map(&:to_s)

Calling map() is redundant because puts calls to_s on its arguments.

Right, thanks for the reminder! That was an artifact of IRB testing.

2) to print all of the non-tagged text between the
"body" comments

Your html doesn't even test your requirements because there is no text
after the body="end" comment. And there is no non-tagged text:

I overlooked the comment thing.

#!/usr/bin/ruby

require 'nokogiri'
# require 'irb'

text = <<HTML
<html>
<head>
<title>My Title</title>
</head>
<body>
<h1>My Heading</h1>
From:Me 
Date: Wed Dec 05 2012 - 18:17:49 EST


text line 1
 
text line 2
 
text line 3
 
<pre>
very important text
more important text
would you believe even more important text?
</pre>

not to print
</body>
</html>
HTML

dom = Nokogiri.HTML(text)

puts dom.xpath('/html/body//pre/text()')

puts '---'

puts dom.xpath('//text()[contains(preceding::comment(),"start") and
contains(following::comment(),"end") and not(ancestor::pre)]')

Kind regards

robert
