This code is conceptually what I want to do with the Nokogiri code below:
s1 = [1,2,3]; s2 = [4,5,6]; s3 = [7,8,9]
str = [s1,s2,s3]
str.each do |itm|
  puts "********"
  puts " #{itm[2]}" # select one item (index 2) from each of s1, s2, s3
  puts "*********"
end
Results as expected:
********
3
*********
********
6
*********
********
9
*********
I have an HTML page with multiple <table>...</table> elements
(equivalent to str above) and want to process each table (equivalent to
s1, s2, s3) and extract one item, <td class="itemNumbr" ...>, from the
table (equivalent to picking one element from any of s1, s2, s3).
I initially thought this was straightforward, but I am missing
something very fundamental when I move the concept to Nokogiri objects.
---------------------- NOKOGIRI CODE ----------------------
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open("c:/RUBY_OUT.TXT")) # file containing the web page
doc.xpath("//table[@class='result']").each do |node| # select a table
  puts "*************"
  puts node.to_html # as expected
  puts node.xpath("//td[@class='itemNumbr']") # 15 per each
  puts "*************"
end
---------------------- NOKOGIRI CODE ----------------------
The output below displays the table HTML as expected, but not the itemNumbrs:
***********
<table ..................../table> for item 1
<td class ="itemNumbr.....<b1> 1.</b>...../td>
<td class ="itemNumbr.....<b1> 2.</b>...../td>
......
<td class ="itemNumbr.....<b1> 15.</b>...../td>
**********
**********
<table ..................../table> for item 2
<td class ="itemNumbr.....<b1> 1.</b>...../td>
<td class ="itemNumbr.....<b1> 2.</b>...../td>
......
<td class ="itemNumbr.....<b1> 15.</b>...../td>
**********
**********
<table ..................../table> for item 3
<td class ="itemNumbr.....<b1> 1.</b>...../td>
.......
The tables are output as expected: the tables with itemNumbr 1 to 15, in
sequence.
The node.xpath("//td[@class='itemNumbr']") call acts as if node contains all
15 tables, but the to_html output indicates otherwise. I think node should
always contain the HTML for a single table only, but I appear to be wrong.
Also, if I put a subscript on the first XPath
doc.xpath("//table[@class='result'][5]").each do |node|
to ensure only one table is found, I still get the itemNumbrs for all 15
table elements.
Please do not shout. You are probably missing how XPath works. With
the queries you gave above you will always get *all* td nodes with
class "itemNumbr" in the document. You need a two-level approach
using *relative* queries:
irb(main):007:0> doc.xpath('//table').each do |tab|
irb(main):008:1* p tab, tab.xpath('td') # relative!
irb(main):009:1> puts '----'
irb(main):010:1> end
#<Nokogiri::XML::Element:0x82319e4 name="table"
children=[#<Nokogiri::XML::Element:0x823185c name="td"
children=[#<Nokogiri::XML::Text:0x82316d4 "1">]>]>
[#<Nokogiri::XML::Element:0x823185c name="td"
children=[#<Nokogiri::XML::Text:0x82316d4 "1">]>]
···
On Wed, Oct 20, 2010 at 3:40 AM, Don Norcott <dnorcott@mts.net> wrote:
I now have the code working with one additional line.
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open("c:/RUBY_OUT.TXT"))
doc.xpath("//table[@class='result']").each do |node|
  # next line has been added.
  doc2 = Nokogiri::HTML("<body>" << "#{node}" << "</body>")
  puts "*************"
  puts doc2.xpath("//td[@class='itemNumbr']")
  puts "*************"
end
I realized early on (even though I cannot figure out why) that I had to
save off each table inside the "do" block before processing it to get
around this problem. I tried many things, including an array, which worked
fine for saving the <table>s, but I could not xpath the saved <table>s.
What I did not realize was that if I took the <table> raw it was no longer
valid XML. What twigged me was your code adding <body> to give the
correct HTML header <!DOCTYPE .....>
Now, if you can, please shed some light on my underlying question.
Above I embed the node in a new HTML object; the new object contains
nothing other than a single <table>.
Also, if I print the node it contains only a single <table>.
Yet if I attempt to execute
"puts node.xpath("//td[@class='itemNumbr']")"
it finds the "itemNumbr" for all <table> items even though they do not
exist in node. Does node actually contain the entire HTML, just not
visibly?
> I now have the code working with one additional line.
> require 'open-uri'
> require 'nokogiri'
> doc = Nokogiri::HTML(open("c:/RUBY_OUT.TXT"));
> doc.xpath("//table[@class='result']").each do |node|
> # next line has been added.
> doc2 = Nokogiri::HTML("<body>" << "#{node}" << "</body>")
I'm sorry but this is ridiculous!
> puts "*************"
> puts doc2.xpath("//td[@class='itemNumbr']")
> puts "*************"
> end
> I realized (even though I can not figure out why) early on that I had to
> save off each table in the "do" before processing it to get around this
> problem. I tried many things including an array which worked fine to
> save the <table>s but I could not xpath the saved <table>s.
As I said: you need a *relative XPath*. Your problem is the global
XPath. You need to shave off the leading "//" or prefix it with ".".
Without the leading "//", only <td> elements that are direct children of
the table match:
doc = Nokogiri::HTML(open("c:/RUBY_OUT.TXT"))
doc.xpath("//table[@class='result']").each do |node|
  puts "*************"
  puts node.xpath("td[@class='itemNumbr']") # child <td> elements only
  puts "*************"
end
With the "." prefix, <td> elements at any depth below the table match
(cells nested inside <tr> need this form):
doc = Nokogiri::HTML(open("c:/RUBY_OUT.TXT"))
doc.xpath("//table[@class='result']").each do |node|
  puts "*************"
  puts node.xpath(".//td[@class='itemNumbr']") # descendants of this table
  puts "*************"
end
From everything I did to verify the contents of node (parse, to_s,
to_html), I thought it contained a single table selected from the HTML
page, and I could not prove otherwise.
That is why I was not looking at relative paths: I thought I was
dealing with only a single table on each pass. That is also why I
mistakenly used the array concept, which is obviously not a parallel.
Obviously I still do not understand the contents of node.
From this posting of yours it's not clear to me what issue you have. Any XML or HTML parser that rips a document apart and builds a DOM of some kind will create a nested, strictly hierarchical tree structure. The only thing that may seem odd is that XPath queries beginning with "//" search through the complete document regardless of the node you invoke the method on.
Cheers
robert
···
On 20.10.2010 21:46, Don Norcott wrote:
I have actually taken this first tutorial plus a few more.
My issue is not with anything you have shown me, nor with my ability
to get the code working or to get the next piece of code working predictably.
My only issue is conceptual. I have no problem understanding that node
contains the entire tree structure, since I am able to return all the
itemNumbr elements with
node.xpath("//td[@class='itemNumbr']")
What I am not able to do is output proof to convince myself that the
node contains anything other than the selected table (i.e. inspect, parse,
to_s, to_html all return the single table).
This is my last post. I really have no issue other than the conceptual
one above, and I will revisit the contents of node when I have
time.
> My issue is not with anything you have shown me, nor with my ability
> to get the code working or to get the next piece of code working predictably.
> My only issue is conceptual. I have no problem understanding that node
> contains the entire tree structure, since I am able to return all the
> itemNumbr elements with
> node.xpath("//td[@class='itemNumbr']")
> What I am not able to do is output proof to convince myself that the
> node contains anything other than the selected table (i.e. inspect, parse,
> to_s, to_html all return the single table).
The problem might lie in the term "contains". Conceptually one would
probably say that a node contains all its sub nodes. Technically a
node can also (indirectly) contain the whole document. This happens
if you include a reference to the parent node or the document. Here's
an example with parent node inclusion.
If you add a line "pp ch" to the iteration code at the end of the
file, you will see that each node "contains" all the rest of the
document.
> This is my last post - I really have no issue other than the conceptual
Hopefully not.
> Thanks again for all your help.
You're welcome!
Kind regards
robert
···
On Thu, Oct 21, 2010 at 12:18 AM, Don Norcott <dnorcott@mts.net> wrote: