Scraping data from a webpage where the data is loaded dynamically

7stud2 · 6 February 2014 12:06

Hi

I am parsing a web page where the data is loaded dynamically once a
search criterion is given, but the browser URL does not change; it
stays "https://www.kleyntrucks.com/trucks/tractorunit/". For example,
if I set the search field "Matriculation year" to 2003 to 2005, the URL
is still "https://www.kleyntrucks.com/trucks/tractorunit/". So the code
below does not fetch the data that matches the search criterion.

How can I handle this situation?

require 'nokogiri'
require 'open-uri'
# URI.open needs Ruby 2.5+; older versions used a bare open(...) here.
doc = Nokogiri::HTML(URI.open("https://www.kleyntrucks.com/trucks/tractorunit/"))

On the other hand, this website
(http://www.flipkart.com/mobiles/samsung~brand/pr?sid=tyy,4io&otracker=hp_nmenu_sub_electronics_0_Samsung)
behaves well. Suppose I want to scrape the page with **Price** set to
**5001-10000**: the browser then shows the equivalent URL
(http://www.flipkart.com/mobiles/samsung~brand/pr?p[]=facets.price_range%5B%5D%3DRs.%2B5001%2B-%2BRs.%2B10000&p[]=sort%3Dfeatured&sid=tyy%2C4io&ref=3de01d19-4e68-4ba9-9b3f-9c43497931b0).

So I can use that URL as below, and my code gets the correct data:

require 'nokogiri'
require 'open-uri'
# The URL must be a quoted string. URI.open needs Ruby 2.5+.
url = "http://www.flipkart.com/mobiles/samsung~brand/pr?p[]=facets.price_range%5B%5D%3DRs.%2B5001%2B-%2BRs.%2B10000&p[]=sort%3Dfeatured&sid=tyy%2C4io&ref=3de01d19-4e68-4ba9-9b3f-9c43497931b0"
doc = Nokogiri::HTML(URI.open(url))
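
A minimal sketch of how such a filter URL could be assembled in code rather than copied from the browser. It assumes the site keeps accepting the same query parameters shown above; the helper name price_filter_url and the constant FLIPKART_BASE are hypothetical, and the ref parameter is left out on the assumption that it is an optional tracking value.

```ruby
require 'uri'

FLIPKART_BASE = 'http://www.flipkart.com/mobiles/samsung~brand/pr'

# Hypothetical helper: rebuild the query string seen in the browser
# for a given price range. The literal "+" characters are part of the
# value, so they are percent-encoded as %2B just like in the browser URL.
def price_filter_url(min, max)
  query = URI.encode_www_form([
    ['p[]', "facets.price_range[]=Rs.+#{min}+-+Rs.+#{max}"],
    ['p[]', 'sort=featured'],
    ['sid', 'tyy,4io']
  ])
  "#{FLIPKART_BASE}?#{query}"
end

puts price_filter_url(5001, 10000)
```

URI.encode_www_form also percent-encodes the brackets in the p[] key, which a server should decode to the same parameter name.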

How should I then proceed with the first case
(https://www.kleyntrucks.com/trucks/tractorunit/)? Is there any way?

--
Posted via http://www.ruby-forum.com/.

Hansem1 · 6 February 2014 12:47

My first thought is that you will have to open a browser, interact
with the page, and then grab the HTML source.
I would use watir-webdriver, but there are other options.

Here is an example:

require 'watir-webdriver'

@browser = Watir::Browser.new :chrome
@browser.goto 'https://www.kleyntrucks.com/trucks/tractorunit/'

sleep 2 # give the page time to load
xpath_matriculation_year = '//*[@id="imprp0"]/div[1]'
@browser.div(xpath: xpath_matriculation_year).click

xpath_beginning_year = '//*[@id="imprp0"]/div[2]/div/div[5]/div[1]/input'
@browser.text_field(xpath: xpath_beginning_year).set '2003'

xpath_ending_year = '//*[@id="imprp0"]/div[2]/div/div[5]/div[2]/input'
@browser.text_field(xpath: xpath_ending_year).set '2005'

# Odd, but needed: if you don't leave the field, the page refresh
# resets the value you just set.
@browser.text_field(xpath: xpath_beginning_year).click

sleep 5 # wait for the filtered results to load
page_html = @browser.html

Then parse page_html with Nokogiri.
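
The fixed sleep calls above are a guess at how long the page needs. One alternative, sketched here with plain Ruby only, is to poll for a condition until it holds or a timeout expires. The helper name wait_until is made up for this sketch, and the element id in the usage comment is an assumption, not something taken from the real page.

```ruby
# Poll the given block until it returns a truthy value, or raise once
# the timeout (in seconds) has passed.
def wait_until(timeout: 10, interval: 0.2)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout}s"
    end
    sleep interval
  end
end

# Hypothetical usage with the browser object from the example;
# the 'results' id is an assumption:
# wait_until(timeout: 15) { @browser.div(id: 'results').exists? }
```

Polling returns as soon as the page is ready, and it fails loudly instead of silently scraping a half-loaded page.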

Michael

________
Michael Hansen

Shane · 6 February 2014 12:18

You could use Mechanize, which lets you click buttons, fill in forms, etc. before parsing:
http://mechanize.rubyforge.org/GUIDE_rdoc.html

Or, if JavaScript support is required, you could use Watir to load the page before parsing it with Nokogiri:
http://watirwebdriver.com/

On 2014-02-06 12:06, Arup Rakshit wrote:
[original post quoted in full]

7stud2 · 6 February 2014 13:37

unknown wrote in post #1135823:

My first thought is that you are going to have to open the browser,
interact with the page, and then grab the html source.
I would use watir-webdriver but there are other options.

Yes, 'selenium-webdriver' or 'watir-webdriver' would help here, but I
would like to know whether this can be done without driving a browser.
Would Net::HTTP (Ruby 2.1.0) be helpful for this?

Otherwise, please tell me which other options you had in mind.

7stud2 · 6 February 2014 13:55

If no JavaScript is required, then mechanize is a quick and invisible
alternative to watir. Have you tried that yet?

Joseph_Phillips · 8 February 2014 05:56

Hi,

I do this kind of web data aggregation daily, though at the moment I'm using Python.

This is more of a workflow issue than one solved with code.

What I do is open Firebug (or DevTools, etc.) and watch the network requests as I interact with the page.

You will have to locate which response brings back the partial markup/json/xml that the page renders in this case. Once you know the URL that returns the data, you need to look at the request to see the parameters it passes. Then you've got what you need to scrape the data using the right partial/api call.
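
As a rough sketch of that last step in Ruby (standard library only): build the URL with the parameters seen in the network tab, then decide how to parse the body from its Content-Type. The helper names endpoint_uri and decode_body, the example.com address, and the year_from parameter are all placeholders, not anything taken from a real site.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Build the URI for an endpoint seen in the network tab, adding the
# query parameters the browser sent (both arguments are placeholders).
def endpoint_uri(base, params = {})
  uri = URI.parse(base)
  uri.query = URI.encode_www_form(params) unless params.empty?
  uri
end

# Decide how to handle a response body from its Content-Type header.
def decode_body(body, content_type)
  case content_type.to_s
  when /json/ then [:json, JSON.parse(body)]
  when /xml/  then [:xml, body]   # hand this to an XML parser
  else             [:html, body]  # hand this to an HTML parser
  end
end

# Usage (performs a real request, so it is left commented out):
# uri = endpoint_uri('https://example.com/search/results', 'year_from' => 2003)
# res = Net::HTTP.get_response(uri)
# kind, data = decode_body(res.body, res['Content-Type'])
```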

Cheers,

Joe

7stud2 · 8 February 2014 06:48

The problem is that when I set the search field "Matriculation year" to
2005 to 2014, the Firebug network tab (request) shows the URL
"https://www.kleyntrucks.com/truck/add-facet-value/field/imprp0/from/2005",
and the response tab shows the correct HTML, the same as what is
displayed on the page.

But I get completely different response HTML when I run the code
below:

require "net/http"
require "uri"

uri = URI.parse("https://www.kleyntrucks.com/truck/add-facet-value/field/imprp0/from/2005")

response = Net::HTTP.get_response(uri)

File.open("/home/kirti/input.txt", 'w') do |file|
  file.puts response.body
end

And that's the main problem. Why am I not getting the same response
that the Firebug network tab shows?
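
One possible cause, which is only a guess and not confirmed anywhere in this thread, is that the browser sends things a bare Net::HTTP call does not: a session cookie from the first page load, and an X-Requested-With header marking the call as an XHR. A minimal sketch of replaying the request with both follows; the helper names cookie_header and xhr_request are made up for this sketch.

```ruby
require 'net/http'
require 'uri'

# Reduce Set-Cookie header values to a single Cookie header value,
# dropping attributes such as Path and Expires.
def cookie_header(set_cookie_values)
  Array(set_cookie_values).map { |c| c.split(';', 2).first.strip }.join('; ')
end

# Build a GET request that resembles the browser's XHR call.
def xhr_request(uri, cookies)
  req = Net::HTTP::Get.new(uri)
  req['X-Requested-With'] = 'XMLHttpRequest'
  req['Cookie'] = cookies unless cookies.empty?
  req
end

# Usage (left commented out because it hits the live site):
# page  = URI.parse('https://www.kleyntrucks.com/trucks/tractorunit/')
# facet = URI.parse('https://www.kleyntrucks.com/truck/add-facet-value/field/imprp0/from/2005')
# Net::HTTP.start(page.host, page.port, use_ssl: true) do |http|
#   first   = http.request(Net::HTTP::Get.new(page))
#   cookies = cookie_header(first.get_fields('Set-Cookie'))
#   puts http.request(xhr_request(facet, cookies)).body
# end
```

If the response still differs, comparing every request header in the Firebug network tab against the ones sent here would be the next thing to check.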

7stud2 · 8 February 2014 06:15

Joseph Phillips wrote in post #1136006:

You will have to locate which response brings back the partial
markup/json/xml that the page renders in this case. Once you know the
URL that returns the data, you need to look at the request to see the
parameters it passes. Then you've got what you need to scrape the data
using the right partial/api call.

Cheers,

Joe
