Help missing something BASIC

Don_Norcott · 20 October 2010 01:40

This code is conceptually what I want to do with the nokogiri code below
s1 = [1,2,3] ; s2 = [4,5,6]; s3 = [7,8,9]
str = [s1,s2,s3]
str.each do |itm|
  puts "********"
  puts " #{itm[2]}" # Select middle item from each s1, s2, s3
  puts "*********"
end
Results as expected

********
3
*********
********
6
*********
********
9
*********

I have an html page with multiple <table>...</table> elements
(equivalent to str above) and want to process each table (equivalent to
s1, s2, s3) and extract one item <td class="itemNumbr" ...> from the
table (equivalent to extracting the middle element in any of s1 s2 s3).

I initially thought this was straightforward - but I am missing
something very fundamental when I move the concept to Nokogiri objects
---------------------- NOKOGIRI CODE ----------------------
require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open("c:/RUBY_OUT.TXT")) # file containing web page
doc.xpath("//table[@class='result']").each do |node| # select a table
  puts "*************"
  puts node.to_html # as expected
  puts node.xpath("//td[@class='itemNumbr']") # prints all 15 cells on each pass
  puts "*************"
end
---------------------- NOKOGIRI CODE ----------------------

The output below displays each table's HTML as expected, but not the itemNumbr cells I expected
***********
<table ..................../table> for item 1
<td class ="itemNumbr.....<b1> 1.</b>...../td>
<td class ="itemNumbr.....<b1> 2.</b>...../td>
......
<td class ="itemNumbr.....<b1> 15.</b>...../td>
**********
**********
<table ..................../table> for item 2
<td class ="itemNumbr.....<b1> 1.</b>...../td>
<td class ="itemNumbr.....<b1> 2.</b>...../td>
......
<td class ="itemNumbr.....<b1> 15.</b>...../td>
**********
**********
<table ..................../table> for item 3
<td class ="itemNumbr.....<b1> 1.</b>...../td>
.......

Each table is printed as expected, but every table is followed by the
itemNumbr cells 1 to 15 in sequence.
The node.xpath("//td[@class='itemNumbr']") acts as if node contains all
15 tables, although printing node shows only one. I think node should
always contain HTML for a single table only, but I appear to be wrong.

Also, if I put a subscript on the first xpath
doc.xpath("//table[@class='result'][5]").each do |node|
to ensure only one table is found, I still get the itemNumbr cells for
all 15 table elements

WHAT AM I MISSING HERE

--
Posted via http://www.ruby-forum.com/.

Don_Norcott · 20 October 2010 01:50

I posted incorrect code for the number array and should have said last
item, not middle item

s1 = [1,2,3] ; s2 = [4,5,6]; s3 = [7,8,9]
str = [s1,s2,s3]
str.each do |itm|
  puts "********"
  puts itm.to_s # ADDED LINE
  puts " #{itm[2]}" # Select last item from each s1, s2, s3
  puts "*********"
end

Giving the output below - more in line with table & item printout

********
[1, 2, 3] # table
3 # item selected
*********


Robert_K1 · 20 October 2010 07:47

Please do not shout. You are probably missing how XPath works. With
the queries given by you above you will always get *all* td nodes with
class "itemNumbr" in the document. You need a two level approach
using *relative* queries:

irb(main):004:0> html = "<body>" << (1..3).map {|i|
"<table><td>#{i}</td></table>"}.join(" ") << "</body>"
=> "<body><table><td>1</td></table> <table><td>2</td></table>
<table><td>3</td></table></body>"

irb(main):005:0> doc = Nokogiri.parse html
=> #<Nokogiri::XML::Document:0x8231e0c name="document"
children=[#<Nokogiri::XML::Element:0x8231b6c name="body"
children=[#<Nokogiri::XML::Element:0x82319e4 name="table"
children=[#<Nokogiri::XML::Element:0x823185c name="td"
children=[#<Nokogiri::XML::Text:0x82316d4 "1">]>]>,
#<Nokogiri::XML::Text:0x8231496 " ">,
#<Nokogiri::XML::Element:0x82313fc name="table"
children=[#<Nokogiri::XML::Element:0x8231274 name="td"
children=[#<Nokogiri::XML::Text:0x82310ec "2">]>]>,
#<Nokogiri::XML::Text:0x8230eae " ">,
#<Nokogiri::XML::Element:0x8230e14 name="table"
children=[#<Nokogiri::XML::Element:0x8230c8c name="td"
children=[#<Nokogiri::XML::Text:0x8230b04 "3">]>]>]>]>

irb(main):006:0> doc.xpath '//td'
=> [#<Nokogiri::XML::Element:0x823185c name="td"
children=[#<Nokogiri::XML::Text:0x82316d4 "1">]>,
#<Nokogiri::XML::Element:0x8231274 name="td"
children=[#<Nokogiri::XML::Text:0x82310ec "2">]>,
#<Nokogiri::XML::Element:0x8230c8c name="td"
children=[#<Nokogiri::XML::Text:0x8230b04 "3">]>]

irb(main):007:0> doc.xpath('//table').each do |tab|
irb(main):008:1* p tab, tab.xpath('td') # relative!
irb(main):009:1> puts '----'
irb(main):010:1> end
#<Nokogiri::XML::Element:0x82319e4 name="table"
children=[#<Nokogiri::XML::Element:0x823185c name="td"
children=[#<Nokogiri::XML::Text:0x82316d4 "1">]>]>
[#<Nokogiri::XML::Element:0x823185c name="td"
children=[#<Nokogiri::XML::Text:0x82316d4 "1">]>]

----
#<Nokogiri::XML::Element:0x82313fc name="table"
children=[#<Nokogiri::XML::Element:0x8231274 name="td"
children=[#<Nokogiri::XML::Text:0x82310ec "2">]>]>
[#<Nokogiri::XML::Element:0x8231274 name="td"
children=[#<Nokogiri::XML::Text:0x82310ec "2">]>]
----
#<Nokogiri::XML::Element:0x8230e14 name="table"
children=[#<Nokogiri::XML::Element:0x8230c8c name="td"
children=[#<Nokogiri::XML::Text:0x8230b04 "3">]>]>
[#<Nokogiri::XML::Element:0x8230c8c name="td"
children=[#<Nokogiri::XML::Text:0x8230b04 "3">]>]
----
=> 0

Cheers

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Don_Norcott · 20 October 2010 14:14

Thanks Robert

I now have the code working with one additional line.

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open("c:/RUBY_OUT.TXT"));
doc.xpath("//table[@class='result']").each do |node|
  # next line has been added.
  doc2 = Nokogiri::HTML("<body>" << "#{node}" << "</body>")

  puts "*************"
  puts doc2.xpath("//td[@class='itemNumbr']")
  puts "*************"
end

I realized (even though I can not figure out why) early on that I had to
save off each table in the "do" before processing it to get around this
problem. I tried many things including an array which worked fine to
save the <table>s but I could not xpath the saved <table>s.

What I did not realize was that if I took the <table> raw it was no
longer valid XML. What twigged me was your code adding in <body> to give
the correct html header <!DOCTYPE .....>

Now if you can shed some light on my underlying question.
Above I embed the node in a new html object, the new object contains
nothing other than a single <table>.

Also, if I print the node it contains only a single <table>.

Yet if I attempt to execute
"puts node.xpath("//td[@class='itemNumbr']")"
it finds the "itemNumbr" cells for all <table> items even though they do
not exist in node. Does node actually contain the entire html, just not
visibly?

Thanks for any insight you can provide.
Don


Robert_K1 · 20 October 2010 15:24

> Thanks Robert
>
> I now have the code working with one additional line.
>
> require 'open-uri'
> require 'nokogiri'
>
> doc = Nokogiri::HTML(open("c:/RUBY_OUT.TXT"));
> doc.xpath("//table[@class='result']").each do |node|
>
> # next line has been added.
> doc2 = Nokogiri::HTML("<body>" << "#{node}" << "</body>")

I'm sorry but this is ridiculous!

> puts "*************"
> puts doc2.xpath("//td[@class='itemNumbr']")
> puts "*************"
> end
>
> I realized (even though I can not figure out why) early on that I had to
> save off each table in the "do" before processing it to get around this
> problem. I tried many things including an array which worked fine to
> save the <table>s but I could not xpath the saved <table>s.

As I said: you need a *relative XPath*. Your problem is the global
XPath. You need to shave off the leading "//" or prefix it with ".".

doc = Nokogiri::HTML(open("c:/RUBY_OUT.TXT"));
doc.xpath("//table[@class='result']").each do |node|
   puts "*************"
   puts node.xpath("td[@class='itemNumbr']")
   puts "*************"
end

doc = Nokogiri::HTML(open("c:/RUBY_OUT.TXT"));
doc.xpath("//table[@class='result']").each do |node|
   puts "*************"
   puts node.xpath(".//td[@class='itemNumbr']")
   puts "*************"
end

> What I did not realize that if I took the <table> raw it was no longer
> valid XML. What twigged me is your code adding in <body> to give the
> correct html header <!DOCTYPE .....>
>
> Now if you can shed some light on my underlying question.
> Above I embed the node in a new html object, the new object contains
> nothing other than a single <table>.
>
> Also if print the node it contains only a single <table>.
>
> Yet if I attempt to execute
> "puts node.xpath("//td[@class='itemNumbr']")"
> it finds the "itemnumbr" for all <table> items even though they do not
> exist in node. Does node actually contain the entire html, just not
> visible.

http://www.w3schools.com/xpath/xpath_syntax.asp

Cheers

robert


Don_Norcott · 20 October 2010 16:10

Thanks for your help

puts node.xpath(".//td[@class='itemNumbr']")

which selects current node first works nicely

I will spend some time at
http://www.w3schools.com/xpath/xpath_syntax.asp

Thanks Don


Robert_K1 · 20 October 2010 16:50

There's also

http://zvon.org/xxl/XPathTutorial/General/examples.html
http://www.tizag.com/xmlTutorial/xpathtutorial.php

And tons more.

Cheers

robert


Don_Norcott · 20 October 2010 19:46

I have actually taken this first tutorial plus a few more

XPath Tutorial

From everything I did to verify the contents of
node (parse, to_s, to_html), I thought it contained a single table
selected from the html page and could not prove otherwise.

That is why I was not looking at relative paths since I thought I was
dealing with only a single table on "each". That is why I mistakenly
used the array concept which is obviously not a parallel.

Obviously I still do not understand the contents of node.


Robert_K1 · 20 October 2010 20:40

From this posting of yours it's not clear to me what issue you have. Any XML or HTML parser that rips a document apart and builds a DOM of some kind will create a nested, strictly hierarchical tree structure. The only thing that may seem odd is that XPath queries beginning with "//" search through the complete document regardless of the node you invoke the method on.

Cheers

robert


Don_Norcott · 20 October 2010 22:18

My issue is not with anything you have shown me and not with my ability
to get the code working or get the next piece of code working predictably.

My only issue is conceptual. I have no problem understanding that node
contains the entire tree structure since I am able to return all the
itemNumbr elements with
node.xpath("//td[@class='itemNumbr']")

What I am not able to do is output proof to convince myself that the
node contains anything other than the selected table (i.e. inspect,
parse, to_s, to_html all return the single table).

This is my last post - I really have no issue other than the conceptual
one above and I will re-visit the contents of node again when I have
time.

Thanks again for all your help.


Robert_K1 · 21 October 2010 07:55

> My issue is not with anything you have shown me and not with my ability
> to get the code working or get the next piece of code working predictably.
>
> My only issue is conceptual. I have no problem understanding that node
> contains the entire tree structure since I am able to return all the
> itemNumbr elements with
> node.xpath("//td[@class='itemNumbr']")
>
> What I am not able to do is output proof to convince myself that the
> node contains other than the selected table (ie inspect, parse, to_s,
> to_html all return the single table).

The problem might lie in the term "contains". Conceptually one would
probably say that a node contains all its sub nodes. Technically a
node can also (indirectly) contain the whole document. This happens
if you include a reference to the parent node or the document. Here's
an example with parent node inclusion.

https://gist.github.com/rklemme/638085

node.rb (truncated)

require 'pp'

Node = Struct.new :value, :parent, :children do
  def initialize(value = nil, parent = nil)
    self.value = value
    self.children = []
    yield self if block_given?
  end

  def add(child)

If you add a line "pp ch" to the iteration code at the end of the
file, you will see that each node "contains" all the rest of the
document.
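
Since the gist excerpt above is cut off, here is a separate, self-contained sketch of the same idea. It is not the code from the gist: the Node struct, its add method and the sample values are all made up for illustration. Each child keeps a reference to its parent, so from any one node the rest of the tree is reachable.

```ruby
require 'pp'

# Hypothetical minimal tree node; NOT the code from the linked gist.
Node = Struct.new(:value, :parent, :children) do
  # Creates a child that points back at this node and returns it.
  def add(value)
    child = Node.new(value, self, [])
    children << child
    child
  end
end

root  = Node.new("body", nil, [])
first = root.add("table 1")
root.add("table 2")

# "first" has no children of its own, yet its sibling is reachable via the parent.
siblings_via_parent = first.parent.children.map(&:value)
p siblings_via_parent  # => ["table 1", "table 2"]

# The dump of a single child includes the parent and, through it, the sibling.
pp first
```

This is the sense in which a Nokogiri node "contains" the whole page: not in its own subtree, but through the references it holds.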

> This is my last post - I really have no issue other than the conceptual

Hopefully not.

> Thanks again for all your help.

You're welcome!

Kind regards

robert

