Parsing through downloaded html

Hi all,

I've collected several thousand .html documents and I need to
know how to parse all of them (they are in one folder)
automatically.

I want to copy certain parts of each of these .html documents (for
example the header), but the pages are offline, on my hard disk,
instead of online.

What's the way to go?

···

--
Posted via http://www.ruby-forum.com/.

http://nokogiri.org/ is great for this. Since you need to parse HTML, look at
the tutorial on their site:
Parsing an HTML/XML document - Nokogiri

···

2012/9/6 Sybren Kooistra <lists@ruby-forum.com>


Thank you Ivan.

I am familiar with nokogiri (and open-uri), but have only used it to
download online websites. I am a complete newb, so I was curious how to
start:

1. How to automatically open (parse) all the .htmls in a folder, one by
one?
2. When one of these files is opened, how to parse through it, copy
certain parts to an excel like document, and close it again?

···


Thanks Jesus.

So something like:

require 'nokogiri'
Dir['*.html'].each do |file|
  document = Nokogiri::HTML(File.read(file))
  variable = document.xpath("//div/h2")
  # (and something to PUT the variable in a specified Excel column/CSV file)
end

??

(again, I am a newb =))

···


Hi Ivan, thanks.

I'm doing my best here, and I do search before I post. I was just
checking if I was working in the right direction. Excel library is step
two.

···


Well, this topic took an interesting turn =)

Ivan, thanks for the 'spreadsheet' tip + code. It got me a lot further,
but I'm still running into some walls. Mostly, at the moment I need to
know how to specify the column and row for variables, so that for
every next document I parse, the variables are put in the same
columns, but in the next row.

so column a, column b
first document: variable 1 = column a, row 1 | variable 2 = column b,
row 1
second document: variable 1 = column a, row 2 | variable 2 = column b,
row 2.
etcetera.

the code so far:

# First the basic code, including the opening of a new spreadsheet:
require 'nokogiri'
require 'spreadsheet'
Spreadsheet.client_encoding = 'UTF-8'
book = Spreadsheet::Workbook.new
sheet1 = book.create_worksheet

# Now to parse through all downloaded .htmls:

Dir.chdir("anattempt")
Dir.glob('*.html').each do |document|
  f = File.open(document)
  searchablefile = Nokogiri::HTML(f)
  variabelebasedonaxpath = searchablefile.xpath("//h1[contains(text(), 'Harbers')]")

  # Now to save the variable(s) in the spreadsheet (..but how to?)
  # row = ? (push ?)
  # column = ? (column.push ?)
end

book.write 'htmltoexcel.xls'

···


I get the error 'block in irb_binding' 'in 'each'' ?

code =

require 'nokogiri'
require 'spreadsheet'
Spreadsheet.client_encoding = "UTF-8"
book = Spreadsheet::Workbook.new
sheet1 = book.create_worksheet
row = 0
Dir.chdir("anattempt")
Dir.glob('*.html').each do |document| # apparently something goes wrong here
  f = File.open(document)
  searchablefile = Nokogiri::HTML(f)

  variabelekoloma = searchablefile.xpath("//h1")
  sheet1[row,0] = variabelekoloma.content
  row += 1
end
book.write 'htmltoexcel.xls'

# Plus, how could I save a boolean? (for example: if the xpath has
# content/returns text, then 'yes').

···


035 > Dir.glob('*.html').each do |document|
036 > f = File.open(document)
037?> searchablefile = Nokogiri::HTML(f)
038?> variabelekoloma = searchablefile.xpath("//h1")
039?> sheet1[row, 0] = variabelekoloma.content
040?> row += 1
041?> end
NoMethodError: undefined method 'content' for #<Nokogiri::XML::NodeSet:0xa501bbc>
from (irb):39: in 'block in irb_binding'
from (irb):35: in 'each'
from (irb):35

···


Alright, perfect. It works.

However, many variables that I create are based on the POSSIBILITY that
a certain xpath/string exists in a file. If it doesn't, it should
preferably return nil or a 'no'.

With at_xpath followed by .content, if an xpath (for example pvda =
searchablefile.at_xpath("//h1[contains(text(), 'pvda')]")) returns
nothing, then variable.content (in this case sheet1[row,1] =
pvda.content) does not work, and the rest of the code no longer runs
(error = undefined method 'content' for nil:NilClass (NoMethodError)).

How can I work around this and/or save a boolean on the basis of an xpath?

···


Works like a charm :)

Any idea how I can save a boolean? (In some cases I do not want the content
of a node; I only want to know IF a certain word IS or IS NOT part
of the html.)

···


Thank you Ivan.

I am familiar with nokogiri (and open-uri), but have only used it to
download online websites. I am a complete newb, so I was curious how to
start:

1. How to automatically open (parse) all the .htmls in a folder, one by
one?

Dir['*.html'].each do |file|
  ...
end

2. When one of these files is opened, how to parse through it, copy
certain parts to an excel like document, and close it again?

You can parse through it with Nokogiri and extract the parts you want
using Nokogiri methods to search for content inside the HTML. If you want
to write an Excel file you will need to use an Excel API (no idea), or
if you can generate a CSV you can use the stdlib CSV class.

Jesus.
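
A rough sketch of the CSV route, using only the standard library. The helper name folder_to_csv and the sample files are made up, and the second column is just a placeholder (the file size); in practice you would put there whatever you extract from each page.

```ruby
require 'csv'
require 'tmpdir'

# Writes one CSV row per .html file found in `folder`.
# Column 2 is a placeholder value (file size in bytes).
def folder_to_csv(folder, csv_path)
  CSV.open(csv_path, 'w') do |csv|
    csv << %w[file bytes]   # header row
    Dir.glob(File.join(folder, '*.html')).sort.each do |path|
      csv << [File.basename(path), File.size(path)]
    end
  end
end

# Demo on a throwaway folder with two tiny files.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'a.html'), '<h1>A</h1>')
  File.write(File.join(dir, 'b.html'), '<h1>Bee</h1>')
  out = File.join(dir, 'out.csv')
  folder_to_csv(dir, out)
  puts File.read(out)
end
```

A CSV file written this way opens directly in Excel, so a dedicated Excel library is only needed if you want a real .xls file.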

···

On Thu, Sep 6, 2012 at 11:30 AM, Sybren Kooistra <lists@ruby-forum.com> wrote:

It's OK if you are new to Ruby, but don't be so passive.
You are not the first one looking for an Excel library (in Ruby terms it's
called a gem). Look for that first.
After you find these libraries, pick one and search Google for usage, or go
to their site, where you will find support or documentation.
That's the way all of us work.

···

2012/9/6 Sybren Kooistra <lists@ruby-forum.com>


I've done a similar job recently. I've copied part of the code, so look at
it; maybe it could help you:
require 'nokogiri'
require 'spreadsheet'

class XmlParsing
  def test
    row = 0
    column = 0
    book = Spreadsheet::Workbook.new
    sheet = book.create_worksheet

    Dir.chdir("xml")
    puts Dir.pwd
    Dir.glob("*.xml") do |file|
      f = File.open(file)
      doc = Nokogiri::XML(f)
      fullname_node = doc.at_xpath("//full_name")
      sheet[row,column] = fullname_node.content
      f.close
      row += 1
    end
    book.write 'spreadsheet.xls'
  end
end

xml = XmlParsing.new
xml.test

I'm not sure if it works, because I've deleted some code from the original
script, but this is the general idea. The script parses an XML document and
puts the information into Excel.

You also need the spreadsheet gem. You can install it with:
sudo gem install spreadsheet

Or there is a better way to do it. Make a file named Gemfile and put this
content into it:
source 'https://rubygems.org'

gem 'nokogiri'
gem 'spreadsheet'

Save the file, go to the console, cd to the directory where the Gemfile is
located, and run:
bundle install

This is the preferred way to go because there is no need to install gems one
by one manually. Instead, when you want to run your software on a different
machine, you just invoke bundle install.

Hope it helps.

···

2012/9/6 Sybren Kooistra <lists@ruby-forum.com>


require 'nokogiri'
require 'spreadsheet'

Spreadsheet.client_encoding = 'UTF-8'
book = Spreadsheet::Workbook.new
sheet1 = book.create_worksheet

# Numbering is zero based. This means that the first row is labeled 0 and
# the first column 0.
row = 0

Dir.chdir("anattempt")
Dir.glob('*.html').each do |document|
  f = File.open(document)
  searchablefile = Nokogiri::HTML(f)
  f.close

  # Use at_xpath rather than xpath: at_xpath returns just the first matching
  # element, while xpath returns a NodeSet (an array-like collection) of all
  # elements matching the criteria.
  var1 = searchablefile.at_xpath("your xpath here..")
  var2 = searchablefile.at_xpath("your xpath here..")

  # The first pass saves data to the first row, in columns A and B.
  # Every next pass increments the row by 1, but the columns stay A and B.
  sheet1[row, 0] = var1.content
  sheet1[row, 1] = var2.content

  # After saving the data, increment the row position by 1.
  row += 1
end

book.write 'htmltoexcel.xls'

I haven't tested this, but if something goes wrong, ask here.
Also read http://nokogiri.org/tutorials to learn how to parse XML/HTML
documents; it's a short but useful resource.

Copy the full error here.

···

2012/9/10 Sybren Kooistra <lists@ruby-forum.com>


Please read the tutorial I gave you and copy the code exactly as I gave it.
In my code there is at_xpath("//h1"), in yours xpath("//h1"). at_xpath
returns the first item matching the criteria, while xpath returns a NodeSet
(an array-like collection) of elements. So this error tells you that there
is no content method for the NodeSet class.

···

2012/9/10 Sybren Kooistra <lists@ruby-forum.com>


Sure, that could be an issue. Here is a possible solution:
node = searchablefile.at_xpath("//h1[contains(text(), 'pvda')]")
unless node.nil?
  sheet[row,column] = node.content
end

···

2012/9/11 Sybren Kooistra <lists@ruby-forum.com>


Find a Ruby regex tutorial on Google.
Example tutorial

Here is example code from that tutorial:

#!/usr/bin/ruby

line1 = "Cats are smarter than dogs";
line2 = "Dogs also like meat";

if ( line1 =~ /Cats(.*)/ )
  puts "Line1 starts with Cats"
end
if ( line2 =~ /Dogs(.*)/ )
  puts "Line2 starts with Dogs"
end

···

2012/9/11 Sybren Kooistra <lists@ruby-forum.com>


You can use Regex for that.
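
A minimal sketch of what that could look like, assuming you just want yes/no for a whole word anywhere in the raw file. The method name mentions? is made up; note that this searches the markup too (tag names, attributes), not only the visible text.

```ruby
# True when `word` occurs as a whole word anywhere in `html`, ignoring case.
def mentions?(html, word)
  html.match?(/\b#{Regexp.escape(word)}\b/i)
end

page = '<html><body><p>De PvdA wint de verkiezingen</p></body></html>'
puts mentions?(page, 'pvda') ? 'yes' : 'no'
puts mentions?(page, 'vvd') ? 'yes' : 'no'
```

Inside the loop you could then store the result with something like sheet1[row, 2] = mentions?(File.read(document), 'pvda') ? 'yes' : 'no'.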

"Иван Бишевац" <ivan.bisevac@gmail.com> wrote in post #1074911:


Good stuff! This ran nicely, with the exception of the "content" call
raising a nil exception, but that was because my XML file did not comply
with XML rules.

Ruby is a nice language for beginners. I have been a user for 1 day now and
I have learned so much from everyone! Try to avoid saying that you are
new. This is due to a user named Ryan Daves<sp?>. He talks very negatively
to users who do that. Another Ruby forum you might want to
reference is at tek-tips.com. The users are helpful there as well. Good
luck. -Michelle

···

2012/9/6 Sybren Kooistra <lists@ruby-forum.com>


Another possibility would be to use each_with_index.

Dir.glob('*.html').each_with_index do |document, row|

Jesus.
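
A small stdlib-only sketch of that idea: each_with_index hands you the row number, so no separate counter is needed. The helper name files_with_rows and the sample files are made up.

```ruby
require 'tmpdir'

# Pairs each .html file in `folder` with a zero-based row number.
def files_with_rows(folder)
  Dir.glob(File.join(folder, '*.html')).sort.each_with_index.map do |path, row|
    [row, File.basename(path)]
  end
end

# Demo on a throwaway folder.
Dir.mktmpdir do |dir|
  %w[b.html a.html notes.txt].each { |name| File.write(File.join(dir, name), '') }
  files_with_rows(dir).each do |row, name|
    puts "row #{row}: #{name}"
  end
end
```

Note that Dir.glob takes its pattern in parentheses; the square-bracket form is Dir['*.html'].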

···

On Sun, Sep 9, 2012 at 2:31 PM, Иван Бишевац <ivan.bisevac@gmail.com> wrote:
