I work for Cisco Systems in San Jose, CA. I proposed a project to perform
a screen-scrape/spider hack to go out and look for websites with the
Cisco name in their domain names (e.g. usedcisco.com, ciscoequipment.com,
etc.) and see if those companies are selling Cisco equipment. I want to
look for specific products (e.g. WIC-1T, NM-4E, WS-2950-24) on these
websites and see if they are being sold for under 60% of their MSRP. We
are trying to track down companies that are selling counterfeit
equipment. I started by downloading the DNS list of all domain names so
I could read through it and extract all the domain names with Cisco in
them. Once I do that, I want to go to each page and search/scrape for
these products, but I don't really know the best approach to take. Can
anyone give me advice? Should I just do keyword searches for those 20+
products? Or is there a better approach?
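(For the domain-extraction step, a minimal Ruby sketch, assuming the zone
data has already been flattened to a plain-text file with one domain per
line; the file names are illustrative only:)

  # Keep only the domains that contain "cisco", ignoring case.
  File.open("cisco_domains.txt", "w") do |out|
    File.foreach("all_domains.txt") do |line|
      domain = line.strip
      out.puts(domain) if domain =~ /cisco/i
    end
  end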
Doesn't sound like much scraping, just searching text for a string. You
could even do a lot of that work with Google. Just download each page
and search it for the string, and create a data file of your own that
records which line you found the string on. Scraping is really for
getting data from other sites, using the DOM structure they have to get
(for example) the weather report.
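(A rough sketch of that download-and-search idea, using only the Ruby
standard library; the site list, output format, and product list are
illustrative assumptions:)

  require 'open-uri'

  products = %w[WIC-1T NM-4E WS-2950-24]   # a subset, for illustration

  File.foreach("sites.txt") do |line|
    url = line.strip
    begin
      html = URI.open(url).read
    rescue StandardError => e
      warn "#{url}: #{e.message}"
      next
    end
    html.lines.each_with_index do |text, idx|
      products.each do |product|
        puts "#{url}:#{idx + 1}: #{product}" if text.include?(product)
      end
    end
  end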
···
On Sep 17, 2007, at 12:25 PM, Charles Pareto wrote:
> I work for Cisco Systems in San Jose, CA. [...] I want to look for
> specific products (e.g. WIC-1T, NM-4E, WS-2950-24) on these websites
> and see if they are being sold for under 60% of their MSRP. [...]
> Should I just do keyword searches for those 20+ products? Or is there
> a better approach?
If someone knows of a super library that can recognize and interact
with arbitrary search forms, I would love to see it. My first
suggestion would be to write a simple script using Mechanize to connect
to the homepage of each site in an input list and check for any forms.
Bin the sites into three groups (no forms; at least one form whose name
or action matches the regex /search/i; and at least one form, but none
matching). Then start by just focusing on the ones which appear to have
some sort of search form (which may be a small or a large subset :-).
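(A minimal sketch of that first pass, assuming Mechanize is installed,
the site list lives in a file named sites.txt, and present-day Mechanize
rather than the old WWW::Mechanize namespace:)

  require 'mechanize'

  agent = Mechanize.new
  bins  = { no_forms: [], search_form: [], other_forms: [] }

  File.foreach("sites.txt") do |line|
    url = line.strip
    begin
      page = agent.get(url)
    rescue StandardError => e
      warn "#{url}: #{e.message}"
      next
    end
    forms = page.forms
    if forms.empty?
      bins[:no_forms] << url
    elsif forms.any? { |f| "#{f.name} #{f.action}" =~ /search/i }
      bins[:search_form] << url
    else
      bins[:other_forms] << url
    end
  end

  bins.each { |group, urls| puts "#{group}: #{urls.size} site(s)" }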
···
On 9/17/07, Charles Pareto <chuckdawit@gmail.com> wrote:
> I work for Cisco Systems in San Jose, CA. [...] Can anyone give me
> advice? Should I just do keyword searches for those 20+ products? Or
> is there a better approach?
Scraping might not be the best approach, because each site/page uses a
different layout, so the same scrape recipe probably won't work for
another page.
You could scrape Froogle (Google Products?) or some other aggregate
consumer-sales site instead: it has one interface and probably a lot of
data. You might also want to see if there are web services for Froogle;
those are usually better than scraping.
···
On Sep 17, 1:25 pm, Charles Pareto <chuckda...@gmail.com> wrote:
> We are trying to track down companies that are selling counterfeit
> equipment. [...] Can anyone give me advice? Should I just do keyword
> searches for those 20+ products? Or is there a better approach?
···
I'm slightly biased, but scrubyt should be able to do most of the
remaining heavy lifting for you.
John Joyce wrote:
> Doesn't sound like much scraping, just searching text for a string.
> You could even do a lot of that work with Google. [...] Scraping is
> really for getting data from other sites, using the DOM structure
> they have to get (for example) the weather report.
Well, I disagree. Once I have all the websites with Cisco in their
domain names and I look through them, there are lots of pages that
won't show me any info unless I do a search within the page itself
(e.g. usedcisco.com). To search for specific items on such a site I
would have to use the search bar located on the page to search for,
say, "WIC-1T" and then check for a price below a specific amount for
that item.
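(Mechanize can drive that kind of on-site search box too. A rough
sketch, assuming the form and its text field have been identified by
inspecting the page first; the URL, field-name regex, and MSRP figure
are illustrative assumptions, not details of any real site:)

  require 'mechanize'

  agent = Mechanize.new
  page  = agent.get("http://www.example-reseller.com/")  # placeholder URL

  # Pick the form that looks like a search form, then its text field.
  form  = page.forms.find { |f| "#{f.name} #{f.action}" =~ /search/i } ||
          page.forms.first
  field = form.fields.find { |f| f.name.to_s =~ /q|query|search|key/i }
  field.value = "WIC-1T"

  results = agent.submit(form)

  # Very naive price check: flag any dollar amount below 60% of MSRP.
  threshold = 0.60 * 800.0   # assumed MSRP, for illustration
  results.body.scan(/\$\s*([\d,]+(?:\.\d{2})?)/) do |(price)|
    value = price.delete(",").to_f
    puts "possible hit: $#{value}" if value < threshold
  end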
···
Do a search on Froogle for "cisco productname" with the max price set
at 60% of MSRP. Should turn up a few hits.
HTH,
···
What I mean is, scraping usually relies on the document's structure in
some way. Without looking at the structure that a given site uses (a
given page, if it isn't a templated, dynamically generated page), there
is no way to know what corresponds to what. Page structure is pretty
arbitrary. Presentation and structure don't necessarily correspond
well, or in a way you could guess.
Ironically, the better their web designers, the easier it will be.
But if you are talking about searching a dynamically generated site,
you still have to find out whether it has a search mechanism, and what
it calls the form fields and submit buttons. The names in HTML can be
arbitrary, especially if they use graphic buttons.
If you have a long list of products to search for, you will still save
yourself some work, but scraping involves some visual inspection of
pages and page source to get things going. Be aware that their sysadmin
may spot you doing a big blast of searches all at once and block you
from the site. If they check their logs and see that somebody is
searching for all Cisco stuff in an automated fashion, they might just
block you anyway, whether or not they are legit themselves. Many
sysadmins don't like bots searching their databases! They might see it
as searching for exploits.
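(One simple way to look less like a burst of automated searches is to
pace the crawl. A tiny sketch; the delay and the per-site work are
placeholders:)

  File.foreach("sites.txt") do |line|
    url = line.strip
    puts "checking #{url}"     # stand-in for the real per-site check
    sleep(5 + rand(10))        # wait 5-15 seconds between requests
  end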
···
On Sep 17, 2007, at 1:52 PM, Chuck Dawit wrote:
> Well, I disagree. Once I have all the websites with Cisco in their
> domain names and I look through them, there are lots of pages that
> won't show me any info unless I do a search within the page itself
> (e.g. usedcisco.com).
It's by no means a silver bullet, but it could very well get you 80% of
the way there. Set up a basic learning extractor that is fairly
generic, looking for terms you know will exist on the domains you want
(say, a model number and a dollar sign?), have it loop over the URLs
with products on them, output the learner to a production extractor,
and then tweak it for the sites that aren't giving you the exact
results you want.
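(For reference, the learning-extractor style described above looked
roughly like this in scrubyt's examples of that era; the URL and the
example values are placeholders, and the exact method names should be
double-checked against the scrubyt docs:)

  require 'rubygems'
  require 'scrubyt'

  # "Learning" extractor: show scrubyt one example of a model number and
  # a price as they appear on a real page, and it generalizes a pattern.
  cisco_data = Scrubyt::Extractor.define do
    fetch 'http://www.example-reseller.com/catalog'  # placeholder URL
    listing do
      model 'WIC-1T'     # example value copied from the live page
      price '$450.00'    # example value copied from the live page
    end
  end

  puts cisco_data.to_xml
  # The learner can then be exported as a production extractor (XPaths
  # instead of examples) and tweaked per site, as described above.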
Or, make life easier if you can and let Froogle put it all into a
single format for you.
Best of luck,
Glenn