Parse both string and url using Nokogiri xpath

7stud2 · 11 May 2013 23:37

ruby 1.9.3
nokogiri 1.5.5

Say, a web page has a link,

<a href="http://example.com">reference</a>

I would like to get both the url and text, "http://example.com" and
"reference".

First, I open the page that contains this link:

doc = Nokogiri::HTML(open(url))

then,

name = doc.xpath('//div.../a').text
url = doc.xpath('//div.../a/@href').text

It works, but the problem is that this searches the document twice, once
for the text and once for the href. If you want to apply the same
procedure to many links on a single page, it seems inefficient.

Is there any way to get both the url and the text in a single pass? Something like

def parse_link_and_text (xpath)
...
end

p parse_link_and_text('//div...')

gives a hash

=> {'reference' => 'http://example.com'}

?

--
Posted via http://www.ruby-forum.com/.

Robert_K1 · 12 May 2013 08:45

Just search for <a> and go from there.

$ irb -r nokogiri
irb(main):001:0> dom = Nokogiri.HTML('<x><a href="link">text</a></x>')
=> #<Nokogiri::HTML::Document:0x434197c name="document"
children=[#<Nokogiri::XML::DTD:0x43411d4 name="html">,
#<Nokogiri::XML::Element:0x433df20 name="html"
children=[#<Nokogiri::XML::Element:0x433daac name="body"
children=[#<Nokogiri::XML::Element:0x433d48a name="x"
children=[#<Nokogiri::XML::Element:0x433cfee name="a"
attributes=[#<Nokogiri::XML::Attr:0x433b086 name="href" value="link">]
children=[#<Nokogiri::XML::Text:0x433be5a "text">]>]>]>]>]>
irb(main):002:0> node = dom.at_xpath '//a'
=> #<Nokogiri::XML::Element:0x433cfee name="a"
attributes=[#<Nokogiri::XML::Attr:0x433b086 name="href" value="link">]
children=[#<Nokogiri::XML::Text:0x433be5a "text">]>
irb(main):003:0> node[:href]
=> "link"
irb(main):004:0> node.text
=> "text"
irb(main):005:0>

Now, what is so difficult about that? You can easily find out more via
documentation.

Cheers

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

windwiny · 12 May 2013 09:13

Maybe use a temp variable?

links = doc.xpath('//div/a[@href]')
links.map { |x| [x.text, x['href']] }
# => [["reference", "http://example.com"]]

7stud2 · 16 May 2013 05:51

Thanks, both replies are helpful!
