I’m missing some important point(s) (gestalt) about Nokogiri.
I want to scrape data from a website.
Assume the HTML looks something like the following:
<a href="/agents/Alice">Alice</a>
<a href="tel:(800) 555-9600"></a>
<a href="/agents/Bob">Bob</a>
<a href="/agents/Charlie">Charlie</a>
<a href="tel:(800) 555-9700"></a>
What should be noted is that Alice and Charlie have phone numbers but Bob does not. I want to pick up the data associated with Alice and Charlie but not Bob’s.
I can construct an XPath expression that picks up an array of 3 elements:
//div[@class="data"]
nokogiri_object = Nokogiri::HTML(html_data)
elements = nokogiri_object.xpath('//div[@class="data"]')
Please! Do not send HTML-only e-mails to the mailing list! It is hard to
read on terminal clients like mutt! Configure your mail program to write
plain text e-mails.
What should be noted is that Alice and Charlie have phone numbers but
Bob does not. I want to pick up the data associated with Alice and
Charlie but not Bob's.
[...]
I am stumped how to look inside of e so that I only pick up Alice and
Charlie.
Get all the data nodes, then check if the phone subnode is there, and if
so, go down and read the name nodes. Example:
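require 'nokogiri'

# data holds the HTML document as a string.
html = Nokogiri::HTML(data)
html.xpath("/html/body/div[@class='data']").each do |datanode|
  # Skip data nodes that have no phone subnode.
  next unless datanode.at_xpath("div[@class='phone']")
  puts datanode.at_xpath("div[@class='name']/a").text
end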
I added the base structure to make your HTML snippet a valid HTML
document, but other than that, this snippet should give you the idea. It
looks for the data nodes, iterates over them, and skips any node that
does not have a phone subnode. For the remaining nodes, it extracts the
names and prints them to standard output.
Greetings
Marvin
···
On 16 December 2017 at 23:53 -0700, Ralph Shnelvar <ralphs@dos32.com> wrote:
Please! Do not send HTML-only e-mails to the mailing list! It is hard to
read on terminal clients like mutt! Configure your mail program to write
plain text e-mails.
+1
This can be done elegantly with XPath:
html.xpath("/html/body/div[@class='data' and ./div[@class='phone']]").each do |datanode|
  puts datanode.at_xpath("div[@class='name']/a/text()")
end
Note: I find this site highly helpful in learning XPath:
Kind regards
robert
···
On Sun, Dec 17, 2017 at 11:27 AM, Marvin Gülker <m-guelker@phoenixmail.de> wrote:
On 16 December 2017 at 23:53 -0700, Ralph Shnelvar <ralphs@dos32.com> wrote: