Nokogiri html xpath gestalt

Ralph_Shnelvar · 17 December 2017 06:53

Hi Ruby folks,

I’m missing some important point(s) (gestalt) about Nokogiri.

I want to do scraping from a website.

Assume the HTML looks something like the following:

    <a href="/agents/Alice">Alice</a>
    <a href="tel:(800) 555-9600"></a>
    <a href="/agents/Bob">Bob</a>
    <a href="/agents/Charlie">Charlie</a>
    <a href="tel:(800) 555-9700"></a>

What should be noted is that Alice and Charlie have phone numbers but Bob does not. I want to pick up the data associated with Alice and Charlie but not Bob’s.

I can construct an xpath that picks up an array of 3 elements:

    //div[@class="data"]

    nokogiri_object = Nokogiri::HTML(html_data)
    elements = nokogiri_object.xpath('//div[@class="data"]')
    byebug # elements.size = 3; elements.class = Nokogiri::XML::NodeSet
    elements.each_with_index do |e, ii| # e.class = Nokogiri::XML::Element
      puts ii  # ii = 0,1,2
    end

I am stumped how to look inside of e so that I only pick up Alice and Charlie.

Ralph

Marvin_Gulker · 17 December 2017 10:27

Hi,

> Hi Ruby folks,

Please! Do not send HTML-only e-mails to the mailing list! It is hard to
read on terminal clients like mutt! Configure your mail program to write
plain text e-mails.

> What should be noted is that Alice and Charlie have phone numbers but
> Bob does not. I want to pick up the data associated with Alice and
> Charlie but not Bob's.
> [...]
> I am stumped how to look inside of e so that I only pick up Alice and
> Charlie.

Get all the data nodes, then check if the phone subnode is there, and if
so, go down and read the name nodes. Example:

    require "nokogiri"

    data=<<EOF
    <!DOCTYPE HTML>
    <html>
      <body>
        <div class="data">
          <div class="name">
            <a href="/agents/Alice">Alice</a>
          </div>
          <div class="phone">
           <a href="tel:(800) 555-9600"></a>
          </div>
        </div>
        <div class="data">
          <div class="name">
            <a href="/agents/Bob">Bob</a>
          </div>
        </div>
        <div class="data">
          <div class="name">
            <a href="/agents/Charlie">Charlie</a>
          </div>
          <div class="phone">
           <a href="tel:(800) 555-9700"></a>
          </div>
        </div>
      </body>
    </html>
    EOF

    html = Nokogiri::HTML(data)
    html.xpath("/html/body/div[@class='data']").each do |datanode|
      next unless datanode.at_xpath("div[@class='phone']")
      puts datanode.at_xpath("div[@class='name']/a").text
    end

I added the base structure to make your HTML snippet a valid HTML
document, but other than that, this snippet should give you the idea. It
looks for the data nodes, iterates over them, and skips any node that
does not have a phone subnode. For the remaining nodes, it extracts the
names and prints them to standard output.

Greetings
Marvin

--
Blog: https://www.guelkerdev.de
PGP/GPG ID: F1D8799FBCC8BC4F

Robert_K1 · 17 December 2017 11:14

>> Hi Ruby folks,
>
> Please! Do not send HTML-only e-mails to the mailing list! It is hard to
> read on terminal clients like mutt! Configure your mail program to write
> plain text e-mails.

+1

>> What should be noted is that Alice and Charlie have phone numbers but
>> Bob does not. I want to pick up the data associated with Alice and
>> Charlie but not Bob's.
>> [...]
>> I am stumped how to look inside of e so that I only pick up Alice and
>> Charlie.
>
> Get all the data nodes, then check if the phone subnode is there, and if
> so, go down and read the name nodes. Example:
>
>     html = Nokogiri::HTML(data)
>     html.xpath("/html/body/div[@class='data']").each do |datanode|
>       next unless datanode.at_xpath("div[@class='phone']")
>       puts datanode.at_xpath("div[@class='name']/a").text
>     end

This can be done elegantly with XPath:

    html.xpath("/html/body/div[@class='data' and ./div[@class='phone']]").each do |datanode|
      puts datanode.at_xpath("div[@class='name']/a/text()")
    end

Note: I find this site highly helpful in learning XPath:

Kind regards

robert

--
[guy, jim, charlie].each {|him| remember.him do |as, often| as.you_can
- without end}
http://blog.rubybestpractices.com/
