Anyone scraping dynamic AJAX sites?

Hello.

Is there anyone who has successfully found a way to scrape a dynamically
generated AJAX web site? If I view the source, it gives me the
variables. If I use Firebug to view the DOM, it gives me the actual
values. Any ideas?

Thanks.

···

--
Posted via http://www.ruby-forum.com/.

The problem is that you need a DOM-aware Javascript interpreter in your
code to execute the Javascript, manipulate the DOM in the HTML, and
then let you extract the data you need.

There are projects like Rhino, a Javascript engine you can embed in
other apps, but as far as I understand you still won't have the DOM of
the page, nor will you be able to manipulate it and then extract the
values.

You could use something like Ruby driving some sort of WebKit
interface on Mac OS or Linux, but I have no idea where to start. That,
to me, seems like the best answer. Maybe even a Ruby-based Cocoa app
would be the trick.

···

On Nov 29, 5:25 pm, Becca Girl <csch...@yahoo.com> wrote:


http://code.google.com/p/firewatir/

···

On Sat, Nov 29, 2008 at 7:25 PM, Becca Girl <cschall@yahoo.com> wrote:


scRUBYt! - http://scrubyt.org

e.g. scraping your linkedin contacts:

require 'rubygems'
require 'scrubyt'

property_data = Scrubyt::Extractor.define :agent => :firefox do

   fetch 'https://www.linkedin.com/secure/login'
   fill_textfield 'session_key', '****'
   fill_textfield 'session_password', '****'
   submit

   click_link_and_wait 'Connections', 5

   vcard "//li[@class='vcard']" do
     first_name "//span[@class='given-name']"
     second_name "//span[@class='family-name']"
     email "//a[@class='email']"
   end

end

puts property_data.to_xml

Cheers,
Peter

···

___
http://www.rubyrailways.com
http://scrubyt.org

On 2008.11.30., at 1:25, Becca Girl wrote:


As gf pointed out, the problem is that you need a full DOM and working
javascript for this, sometimes even working css; to really do it
properly, you need a full-blown, fully supported web browser.

Short story: use the WATIR library to interact with your browser's DOM
to do this.

http://wtr.rubyforge.org/

I used to do this all the time for work, in a testing capacity. I
tried a number of different solutions, and found WATIR far superior to
anything else out there, including the very pricey pay packages. If
you cut through all the marketing BS, half the pay packages are
functionally the same as WATIR, and the other half are more primitive.

--Kyle

···

On Sat, Nov 29, 2008 at 6:25 PM, Becca Girl <cschall@yahoo.com> wrote:


If the site is truly AJAX, i.e. the data is loaded by an HTTP call from
JavaScript, you could monitor the HTTP requests made by the browser. On
Firefox, I use the LiveHTTPHeaders extension. Just go to
View --> Sidebar --> HTTP Headers, load the page with whatever data, and look
through the requests for anything interesting.

I used this method to get Facebook contact info and it worked fairly well.
As a bonus, any data found with this method is usually in a very
machine-understandable format like JSON or RSS. There are Ruby libraries
for both.
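To sketch the replay step in Ruby: the payload below is made up for illustration; a real one would come back from Net::HTTP against whatever URL LiveHTTPHeaders showed you. The stdlib json library handles the parsing.

```ruby
require 'json'

# Hypothetical JSON body, standing in for what an AJAX endpoint
# (spotted via LiveHTTPHeaders) would return; in practice you'd
# fetch it with Net::HTTP.get(URI(...)).
body = '{"contacts":[{"name":"Alice","email":"alice@example.com"},
                     {"name":"Bob","email":"bob@example.com"}]}'

contacts = JSON.parse(body)["contacts"]
contacts.each { |c| puts "#{c['name']} <#{c['email']}>" }
```

Once you have plain JSON like this, there's no DOM or Javascript to fight with at all.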

Dan

···

On Sat, Nov 29, 2008 at 7:25 PM, Becca Girl <cschall@yahoo.com> wrote:


Just for completeness' sake: scRUBYt! (since 0.4.05) uses FireWatir as the agent (or mechanize - you can choose whether you want to scrape AJAX or not), so you can do full-blown AJAX scraping - but with a scraping DSL, which usually speeds up scraper creation, especially in the case of complicated scrapers.

Cheers,
Peter

···

___
http://www.rubyrailways.com
http://scrubyt.org

On 2008.12.01., at 3:39, Kyle Schmitt wrote:


Peter
Neat. I'll have to give that a try next time I need to revisit scraping.

···

On Mon, Dec 1, 2008 at 3:48 AM, Peter Szinek <peter@rubyrailways.com> wrote:


Actually, firewatir and scRUBYt! are nice.

But is there a possibility to start firefox with a second profile (so that it circumvents the "one instance" rule) and render to a hidden display? [1][2]

Otherwise, this really hurts testability (as the browser might retain your personal session) and usability on a deployment server.

Regards,
Florian Gilcher

[1]: Preferably a virtual one on a console-only machine.
[2]: Sadly, afaik, firefox has no hidden mode.

···

On Dec 1, 2008, at 10:48 AM, Peter Szinek wrote:


http://coderrr.wordpress.com/2007/10/15/patch-to-firewatir-and-jssh-to-support-testing-with-multiple-concurrent-firefox-browsers/
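Short of patching jssh as in that post, Firefox's own flags may get you part of the way - a sketch, with an arbitrary profile name and display number, assuming Xvfb is installed: -no-remote circumvents the one-instance rule, and Xvfb gives you a display that never appears on screen.

```shell
# One-time: create a dedicated scraping profile (the name is arbitrary)
firefox -no-remote -CreateProfile scraper

# Start an off-screen X server, then point a second Firefox at it
Xvfb :99 -ac &
DISPLAY=:99 firefox -no-remote -P scraper &
```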

···

On Mon, Dec 1, 2008 at 10:23 AM, Florian Gilcher <flo@andersground.net> wrote:



You could try using a virtual frame buffer if you are using Linux or
similar.

Xvfb :99 -ac &
export DISPLAY=:99

Will

···



I've never used it, but Celerity appears to have Javascript support:

http://celerity.rubyforge.org/

Or, instead of Xvfb, start a vncserver with xstartup set to launch the scraper script.

···

On Sat, Dec 20, 2008 at 7:27 AM, Will Simpson <will1@wjrs.co.uk> wrote: