Net::HTTP.get has a 50K limit?

I'm trying to write a screen scraper and am running into a 50K limit on the data
returned.

require "net/http"
begin
  Net::HTTP.start("www.washingtonpost.com", 80) { |http|
    response, = http.get("/wl/jobs/JS_JobSearch?TS=1012409733026")
    data = response.body
    puts data.length
  }
rescue => err
  puts "Error: #{err}"
  exit
end

The last line prints 52166 (the file is considerably bigger). What did I
do wrong?

Just a small change required.

require "net/http"
begin
  Net::HTTP.start("www.washingtonpost.com", 80) { |http|
    response, = http.get("/wl/jobs/JS_JobSearch?TS=1012409733026")

    File.open("/some/file/", "wb+") { |f|
      resp, = http.get(url, nil) { |gotit|
        f.print(gotit)
      }
    }

    data = response.body
    ^^^^^^^^^^^^^^^^^^^^

The data is not part of the response any more; this behaviour has changed
since Ruby 1.6.
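
A minimal sketch of that block form, for illustration (the output filename
here is just a placeholder):

  require "net/http"

  # Sketch only: stream the body to a local file in chunks instead of
  # relying on response.body, and count the bytes as they arrive.
  Net::HTTP.start("www.washingtonpost.com", 80) { |http|
    total = 0
    File.open("page.html", "wb") { |f|
      http.get("/wl/jobs/JS_JobSearch?TS=1012409733026", nil) { |chunk|
        total += chunk.length
        f.print(chunk)
      }
    }
    puts "received #{total} bytes"
  }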

regs
Vivek


I tried your suggestion (i.e. using a block and writing it to a file; see the
copy below), but it still cuts the page short just as before. The actual web
page is ~58K but I'm only getting ~51K. Any more suggestions?

require "net/http"

# using block
Net::HTTP.start("www.washingtonpost.com", 80) { |http|
  File.open('result.txt', 'wb+') { |f|
    resp, = http.get('/wl/jobs/JS_JobSearch?TS=1012409733026', nil) { |str|
      f.print(str)
    }
  }
}


Yes, the server completes the document. The browser displays it fine, and I also
verified with wget, which gives the complete copy. I still don't know why
Net::HTTP prematurely cuts off the document.

meihua

"Robert Klemme" bob.news@gmx.net wrote in message
news:bve0ff$qsaln$1@ID-52924.news.uni-berlin.de

"Meihua Liang" mliang@cox.net wrote in message
news:YXuSb.11834$fZ6.7987@lakeread06...

I tried your suggestion (i.e. using a block and writing it to a file; see
the copy below), but it still cuts the page short just as before. The
actual web page is ~58K but I'm only getting ~51K. Any more suggestions?

Did you verify with wget that the server actually serves the complete
document? If not, that's what I'd do.

robert
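
If you'd rather do that check from Ruby instead of wget, one possible sketch
is to issue a HEAD request and compare the advertised Content-Length (if the
server sends one at all) with the number of bytes actually received:

  require "net/http"

  # Sketch: compare the size the server claims with what we actually get.
  # Note: the server may omit Content-Length (e.g. for chunked responses).
  Net::HTTP.start("www.washingtonpost.com", 80) { |http|
    path = "/wl/jobs/JS_JobSearch?TS=1012409733026"

    head = http.head(path)
    puts "Content-Length header: #{head['content-length'].inspect}"

    response, = http.get(path)
    puts "bytes actually received: #{response.body.length}"
  }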



Hi!

Meihua Liang wrote:

Yes, the server completes the document. The browser displays it fine, and I
also verified with wget, which gives the complete copy. I still don't know
why Net::HTTP prematurely cuts off the document.

I played around with your script a bit and noticed something strange: when
trying to fetch the file via telnet, it is also cut off early. However, as
you said, wget correctly retrieves the whole document. Why? wget sends a
User-Agent header field, and only then is the whole document served. So, by
adding a User-Agent header field to your request, it works for me (with
Ruby 1.8):

response = http.get("/wl/jobs/JS_JobSearch?TS=1012409733026",
                    {"user-agent" => "blub"})

This returns around 60000 bytes in response.body.
When writing web spiders, you sometimes have to outsmart the web servers ;).
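
Put together as a complete sketch of Daniel's fix (the User-Agent value is
arbitrary, as in his example):

  require "net/http"

  begin
    Net::HTTP.start("www.washingtonpost.com", 80) { |http|
      # Some servers send a truncated page unless a User-Agent header is
      # present, so supply one explicitly; the value here is arbitrary.
      response = http.get("/wl/jobs/JS_JobSearch?TS=1012409733026",
                          {"user-agent" => "blub"})
      puts response.body.length
    }
  rescue => err
    puts "Error: #{err}"
    exit
  end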

Hth,
Daniel