Using Nokogiri to scrape multiple websites

Hi everyone,

I'm looking to use Nokogiri to scrape about 10 websites for their anchor
texts and output them on screen. What would be the best way to achieve
this?

I have tried doing something like this without much luck...

def index
  sites = Array.new("site1.com","site2.com","site3.com")
  sites.each do |site|
    @textlinks << scrape(site)
  end
end

def scrape(website)
  require 'open-uri'
  require 'nokogiri'

  doc = Nokogiri::HTML(open(website))
  return doc.xpath('//a')
end

Thanks

···

--
Posted via http://www.ruby-forum.com/.

What exactly is the problem? You need to write the full URL, starting
with http://, so that open-uri works correctly. After that, scrape will
return a set of Nokogiri elements, each of them representing a link.
You are then pushing each of these sets into another array called
@textlinks. In order to output the links to the screen, take a look
at the to_html method of Nokogiri::XML::Element.
This worked for me:

sites = %w{http://www.google.com http://www.yahoo.com}
links = []
sites.each {|site| links.concat(scrape(site))} # the scrape method is the one you wrote above
links.each {|link| puts link.to_html}

Hope this helps,

Jesus.

···

On Sat, Sep 4, 2010 at 4:24 PM, Ryan Mckenzie <ryan@souliss.com> wrote:


Hi Jesús,

I'm looking to output the information to an .html document (using the
Rails framework) and I'm getting the following error: can't convert
Fixnum into Array

Also, what I'm actually trying to do is scrape each of the websites
to see if they contain a specific URL, so I would need to pass in a
list of about 3-4 keywords for each of the domains.

So something like

def index
  keywords = %w{accounts resources membership}
  sites = %w{http://www.google.com http://www.yahoo.com}
  links = []
  sites.each {|site| links.concat(scrape(site, keywords))}
end

def scrape(website,inputtext)
  require 'open-uri'
  require 'nokogiri'

  doc = Nokogiri::HTML(open(website))

  for sample in doc.xpath('//a')
    if sample.text == inputtext
      keywords = doc.xpath('//a')
    else
      keywords = "MISSING"
    end
  end
end

Thanks for your time.

McKenzie

···

--
Posted via http://www.ruby-forum.com/.

So you want a double iteration: for each site, search for the links
that contain each of the specified words? And do you also want to keep
track of which word and site each result comes from? If so, I'd do
something like:

def index
  keywords = %w{accounts resources membership}
  sites = %w{http://www.google.com http://www.yahoo.com}
  links_by_site = Hash.new {|h,k| h[k] = {}}
  sites.each do |site|
    keywords.each do |keyword|
      links_by_site[site][keyword] = scrape(site, keyword)
    end
  end
  links_by_site
end
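The block passed to Hash.new there is what lets you index two levels deep without initializing each inner hash by hand: every new site key automatically gets its own empty hash. A quick standalone illustration (the keys are just examples):

```ruby
# Auto-vivifying hash: the default block runs on first access to a
# missing key and stores an empty inner hash under it
links_by_site = Hash.new { |h, k| h[k] = {} }

links_by_site["http://www.google.com"]["accounts"] = ["<a>accounts</a>"]
links_by_site["http://www.google.com"]["resources"] = []

puts links_by_site["http://www.google.com"]["accounts"].inspect
# => ["<a>accounts</a>"]
```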

def scrape(website,inputtext)
  require 'open-uri' #these could maybe go at the start of the script
  require 'nokogiri'

  regex = /#{inputtext}/
  links_that_match = []
  doc = Nokogiri::HTML(open(website))
  doc.xpath('//a').each do |link|
    if regex =~ link.inner_text
      links_that_match << link.to_html
    end
  end
  links_that_match
end
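One caveat with interpolating the keyword straight into the regex (my addition, not something the plain words above need): if a keyword ever contains regex metacharacters, escape it first with Regexp.escape:

```ruby
keyword = "c++"                       # contains regex metacharacters
regex = /#{Regexp.escape(keyword)}/i  # becomes /c\+\+/i instead of a broken pattern
puts(regex =~ "Learn C++ today" ? "match" : "no match")  # => match
```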

Untested, but it can give you some ideas. The resulting hash will have
something like:

{"http://www.google.com" => {"accounts"  => [<links containing "accounts">],
                             "resources" => [<idem for "resources">],
                             ...},
 ...}

Jesus.

···

On Mon, Sep 6, 2010 at 5:01 PM, Ryan Mckenzie <ryan@souliss.com> wrote:



That works great! Thank you.

Instead of having to pull the items from a hash, though, I would really
like to try to pull them from a database for when the list gets extremely
large. I've tried using the hash to pull from a variable, but it produces
an error which says the hash is an odd length. It is only going to be a
flat-table database, so all of the data will be called under
@backlinks.title (the keyword(s)) and @backlinks.permalink (for the site)

def index
  @links = Hash.new { |ha,lnk| ha[lnk] = {} }
  @backlinks = Backlink.find(:all)
  keywords = %w{@backlinks.concat(title)}
  sites = %w{@backlinks.concat(permalink)}
  links_by_site = Hash.new {|h,k| h[k] = {}}
  sites.each do |site|
    keywords.each do |keyword|
      @links[site][keyword] = scrape(site, keyword)
    end
  end
end

Thanks again.

McKenzie

···

On Mon, Sep 6, 2010 at 5:01 PM, Ryan Mckenzie <ryan@souliss.com> wrote:



That works great! Thank you.

Instead of having to pull the items from a hash though I would really
like to try pull them from a database for when the list gets extremely
large. I've tried using the hash to pull from a variable but it produces
an error which says the hash is an odd length.

I don't understand what you mean here.

It is only going to be a
flat table database so all of the data will be called under
@backlinks.title (the keyword(s)), @backlinks.permalink (for the site)

def index
  @links = Hash.new { |ha,lnk| ha[lnk] = {} }
  @backlinks = Backlink.find(:all)
  keywords = %w{@backlinks.concat(title)}
  sites = %w{@backlinks.concat(permalink)}

irb(main):004:0> keywords = %w{@backlinks.concat(title)}
=> ["@backlinks.concat(title)"]

You probably mean:

keywords = @backlinks.map {|bl| bl.title}
sites = @backlinks.map {|bl| bl.permalink}

but I don't know exactly what @backlinks is (probably an ActiveRecord?)
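For instance, faking the model with a Struct (Backlink here is only a stand-in for whatever your real class is, so the records are invented):

```ruby
# Stand-in for the real model class, purely for illustration
Backlink = Struct.new(:title, :permalink)

backlinks = [Backlink.new("accounts",  "http://www.google.com"),
             Backlink.new("resources", "http://www.yahoo.com")]

# map pulls one attribute out of each record into a plain array
keywords = backlinks.map { |bl| bl.title }
sites    = backlinks.map { |bl| bl.permalink }

puts keywords.inspect  # => ["accounts", "resources"]
puts sites.inspect     # => ["http://www.google.com", "http://www.yahoo.com"]
```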

links_by_site = Hash.new {|h,k| h[k] = {}}
sites.each do |site|
  keywords.each do |keyword|
    @links[site][keyword] = scrape(site, keyword)
  end
end

This code produces the error: "hash is an odd number length"?

Jesus.

···

On Tue, Sep 7, 2010 at 11:42 AM, Ryan Mckenzie <ryan@souliss.com> wrote: