Hi everyone,
I'm looking to use Nokogiri to scrape about 10 websites for their anchor
texts and output them on screen. What would be the best way to achieve
this?
I have tried doing something like this without much luck...
def index
sites = Array.new("site1.com","site2.com","site3.com")
sites.each do |site|
@textlinks << scrape(site)
end
end
def scrape(website)
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open(website))
return doc.xpath('//a')
end
Thanks
···
--
Posted via http://www.ruby-forum.com/.
What exactly is the problem? You need to write the full URL starting
with http:// so that open-uri works correctly. After that, scrape will
return a collection of Nokogiri elements, each of them representing a link.
You are then putting each of these collections into another array called
@textlinks. In order to output the links to the screen, take a look
at the to_html method of Nokogiri::XML::Element.
This worked for me:
sites = %w{http://www.google.com http://www.yahoo.com}
links = []
sites.each {|site| links.concat(scrape(site))} # the scrape method is the one you wrote above
links.each {|link| puts link.to_html}
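As a side note, the reason the original sites = Array.new(...) line fails: Array.new takes a size (and an optional default value), not a list of elements, so passing three strings raises an ArgumentError. An array literal or %w is the idiomatic way to build the list:

```ruby
# Array.new expects (size) or (size, default) -- not a list of elements:
begin
  Array.new("site1.com", "site2.com", "site3.com")
rescue ArgumentError => e
  puts e.message # wrong number of arguments
end

# Use an array literal or %w instead:
sites = %w{site1.com site2.com site3.com}
puts sites.inspect # => ["site1.com", "site2.com", "site3.com"]
```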
Hope this helps,
Jesus.
···
On Sat, Sep 4, 2010 at 4:24 PM, Ryan Mckenzie <ryan@souliss.com> wrote:
Hi everyone,
I'm looking to use Nokogiri to scrape about 10 websites for their anchor
texts and output them on screen. What would be the best way to achieve
this?
I have tried doing something like this without much luck...
def index
sites = Array.new("site1.com","site2.com","site3.com")
sites.each do |site|
@textlinks << scrape(site)
end
end
def scrape(website)
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open(website))
return doc.xpath('//a')
end
Hi Jesús,
I'm looking to output the information to an .html document (using the
Rails framework) and I'm getting the following error: can't convert
Fixnum into Array
Also, what I'm actually trying to do is scrape each of the websites
to see if they contain a specific URL, so I would need to pass in a list
of about 3-4 keywords for each of the domains.
So something like
def index
keywords = %w{accounts resources membership}
sites = %w{http://www.google.com http://www.yahoo.com}
links = []
sites.each {|site| links.concat(scrape(site, keywords))}
end
def scrape(website,inputtext)
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open(website))
for sample in doc.xpath('//a')
if sample.text == inputtext
keywords = doc.xpath('//a')
else
keywords = "MISSING"
end
end
end
Thanks for your time.
McKenzie
···
So you want to iterate twice, in each site search for a link that
contains the specified word? Do you want to also organize for which
word and site each result comes from? If so, I'd do something like:
def index
  keywords = %w{accounts resources membership}
  sites = %w{http://www.google.com http://www.yahoo.com}
  links_by_site = Hash.new {|h,k| h[k] = {}}
  sites.each do |site|
    keywords.each do |keyword|
      links_by_site[site][keyword] = scrape(site, keyword)
    end
  end
  links_by_site
end
def scrape(website, inputtext)
  require 'open-uri' # these could maybe go at the start of the script
  require 'nokogiri'
  regex = /#{inputtext}/
  links_that_match = []
  doc = Nokogiri::HTML(open(website))
  doc.xpath('//a').each do |link|
    if regex =~ link.inner_text
      links_that_match << link.to_html
    end
  end
  links_that_match
end
Untested, but it can give you some ideas. The resulting hash will look
something like:
{"http://www.google.com" => {"accounts" => [<some links containing the
word accounts>], "resources" => [<idem for resources>]},
...
}
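The regex step in the scrape method above can be tried on plain strings without fetching anything. One caveat (my addition, not in the code above): if a keyword could contain regex metacharacters, wrap it in Regexp.escape first:

```ruby
keyword = "accounts"
regex = /#{Regexp.escape(keyword)}/ # escape guards against metacharacters

link_texts = ["my accounts page", "resources", "user accounts", "home"]
matching = link_texts.select {|t| regex =~ t}
puts matching.inspect
# => ["my accounts page", "user accounts"]
```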
Jesus.
···
On Mon, Sep 6, 2010 at 5:01 PM, Ryan Mckenzie <ryan@souliss.com> wrote:
Hi Jesús,
I'm looking to output the information to an .html document (using the
Rails framework) and I'm getting the following error: can't convert
Fixnum into Array
Also, what I'm actually trying to do is scrape each of the websites
to see if they contain a specific URL, so I would need to pass in a list
of about 3-4 keywords for each of the domains.
So something like
def index
keywords = %w{accounts resources membership}
sites = %w{http://www.google.com http://www.yahoo.com}
links = []
sites.each {|site| links.concat(scrape(site, keywords))}
end
def scrape(website,inputtext)
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open(website))
for sample in doc.xpath('//a')
if sample.text == inputtext
keywords = doc.xpath('//a')
else
keywords = "MISSING"
end
end
end
Thanks for your time.
Jesús Gabriel y Galán wrote:
So you want to iterate twice, in each site search for a link that
contains the specified word? Do you want to also organize for which
word and site each result comes from? If so, I'd do something like:
def index
  keywords = %w{accounts resources membership}
  sites = %w{http://www.google.com http://www.yahoo.com}
  links_by_site = Hash.new {|h,k| h[k] = {}}
  sites.each do |site|
    keywords.each do |keyword|
      links_by_site[site][keyword] = scrape(site, keyword)
    end
  end
  links_by_site
end
def scrape(website, inputtext)
  require 'open-uri' # these could maybe go at the start of the script
  require 'nokogiri'
  regex = /#{inputtext}/
  links_that_match = []
  doc = Nokogiri::HTML(open(website))
  doc.xpath('//a').each do |link|
    if regex =~ link.inner_text
      links_that_match << link.to_html
    end
  end
  links_that_match
end
Untested, but it can give you some ideas. The resulting hash will look
something like:
{"http://www.google.com" => {"accounts" => [<some links containing the
word accounts>], "resources" => [<idem for resources>]},
...
}
Jesus.
That works great! Thank you.
Instead of having to pull the items from a hash, though, I would really
like to try to pull them from a database for when the list gets extremely
large. I've tried using the hash to pull from a variable but it produces
an error which says the hash is an odd length. It is only going to be a
flat-table database, so all of the data will be called under
@backlinks.title (the keyword(s)) and @backlinks.permalink (for the site).
def index
@links = Hash.new { |ha,lnk| ha[lnk] = {} }
@backlinks = Backlink.find(:all)
keywords = %w{@backlinks.concat(title)}
sites = %w{@backlinks.concat(permalink)}
links_by_site = Hash.new {|h,k| h[k] = {}}
sites.each do |site|
keywords.each do |keyword|
@links[site][keyword] = scrape(site, keyword)
end
end
Thanks again.
McKenzie
···
On Mon, Sep 6, 2010 at 5:01 PM, Ryan Mckenzie <ryan@souliss.com> wrote:
That works great! Thank you.
Instead of having to pull the items from a hash, though, I would really
like to try to pull them from a database for when the list gets extremely
large. I've tried using the hash to pull from a variable but it produces
an error which says the hash is an odd length.
I don't understand what you mean here.
It is only going to be a
flat table database so all of the data will be called under
@backlinks.title (the keyword(s)), @backlinks.permalink (for the site)
def index
@links = Hash.new { |ha,lnk| ha[lnk] = {} }
@backlinks = Backlink.find(:all)
keywords = %w{@backlinks.concat(title)}
sites = %w{@backlinks.concat(permalink)}
irb(main):004:0> keywords = %w{@backlinks.concat(title)}
=> ["@backlinks.concat(title)"]
You probably mean:
keywords = @backlinks.map {|bl| bl.title}
sites = @backlinks.map {|bl| bl.permalink}
but I don't know exactly what @backlinks is (probably an ActiveRecord?)
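To see why %w doesn't work there and map does, here is a network-free sketch with a Struct standing in for the Backlink model (the field names title and permalink are taken from your post; the sample values are made up):

```ruby
# Stand-in for the Backlink ActiveRecord model from the post
Backlink = Struct.new(:title, :permalink)

backlinks = [
  Backlink.new("accounts", "http://www.google.com"),
  Backlink.new("resources", "http://www.yahoo.com")
]

# %w only splits a literal string on whitespace; it never evaluates code:
puts %w{@backlinks.concat(title)}.inspect
# => ["@backlinks.concat(title)"]

# map calls the method on each record and collects the results:
keywords = backlinks.map {|bl| bl.title}
sites    = backlinks.map {|bl| bl.permalink}
puts keywords.inspect # => ["accounts", "resources"]
puts sites.inspect    # => ["http://www.google.com", "http://www.yahoo.com"]
```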
links_by_site = Hash.new {|h,k| h[k] = {}}
sites.each do |site|
keywords.each do |keyword|
@links[site][keyword] = scrape(site, keyword)
end
end
This code produces the error: "hash is an odd number length"?
Jesus.
···
On Tue, Sep 7, 2010 at 11:42 AM, Ryan Mckenzie <ryan@souliss.com> wrote: