Net::HTTP.get has a 50K limit?

I'm trying to write a screen scraper and am running into a 50K limit on the data
returned.

require "net/http"
begin
  Net::HTTP.start("www.washingtonpost.com", 80) { |http|
    response, = http.get("/wl/jobs/JS_JobSearch?TS=1012409733026")
    data = response.body
    puts data.length
  }
rescue => err
  puts "Error: #{err}"
  exit
end

The last line prints 52166 (the file is considerably bigger). What did I
do wrong?

Just a small change required.

require "net/http"
begin
  Net::HTTP.start("www.washingtonpost.com", 80) { |http|
    response, = http.get("/wl/jobs/JS_JobSearch?TS=1012409733026")

    File.open("/some/file/", "wb+") { |f|
      resp, = http.get(url, nil) { |gotit|
        f.print(gotit)
      }
    }

    data = response.body
    ^^^^^^^^^^^^^^^^^^^^

The data is not part of the response any more; this behaviour has changed
since Ruby 1.6.
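
A minimal sketch of that block form, for illustration (the output filename
here is just a placeholder):

  require "net/http"

  # Sketch only: stream the body to a local file in chunks instead of
  # relying on response.body, and count the bytes as they arrive.
  Net::HTTP.start("www.washingtonpost.com", 80) { |http|
    total = 0
    File.open("page.html", "wb") { |f|
      http.get("/wl/jobs/JS_JobSearch?TS=1012409733026", nil) { |chunk|
        total += chunk.length
        f.print(chunk)
      }
    }
    puts "received #{total} bytes"
  }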

regs
Vivek


I tried your suggestion (i.e. using a block and writing it to a file; see the
copy below), but it still cuts the page short just as before. The actual web
page is ~58K but I'm only getting ~51K. Any more suggestions?

require "net/http"

# using block
Net::HTTP.start("www.washingtonpost.com", 80) { |http|
  File.open('result.txt', 'wb+') { |f|
    resp, = http.get('/wl/jobs/JS_JobSearch?TS=1012409733026', nil) { |str|
      f.print(str)
    }
  }
}


Yes, the server completes the document. The browser displays it fine, and I also
verified with wget, which gives the complete copy. I still don't know why
Net::HTTP prematurely cuts off the document.

meihua

"Robert Klemme" bob.news@gmx.net wrote in message
news:bve0ff$qsaln$1@ID-52924.news.uni-berlin.de

"Meihua Liang" mliang@cox.net wrote in message
news:YXuSb.11834$fZ6.7987@lakeread06...

I tried your suggestion (i.e. using a block and writing it to a file; see
the copy below), but it still cuts the page short just as before. The
actual web page is ~58K but I'm only getting ~51K. Any more suggestions?

Did you verify with wget that the server actually serves the complete
document? If not, that's what I'd do.

robert
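
If you'd rather do that check from Ruby instead of wget, one possible sketch
is to issue a HEAD request and compare the advertised Content-Length (if the
server sends one at all) with the number of bytes actually received:

  require "net/http"

  # Sketch: compare the size the server claims with what we actually get.
  # Note: the server may omit Content-Length (e.g. for chunked responses).
  Net::HTTP.start("www.washingtonpost.com", 80) { |http|
    path = "/wl/jobs/JS_JobSearch?TS=1012409733026"

    head = http.head(path)
    puts "Content-Length header: #{head['content-length'].inspect}"

    response, = http.get(path)
    puts "bytes actually received: #{response.body.length}"
  }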



Hi!

Meihua Liang wrote:

Yes, the server completes the document. The browser displays it fine, and I
also verified with wget, which gives the complete copy. I still don't know
why Net::HTTP prematurely cuts off the document.

I played around with your script a bit and noticed something strange: when
trying to fetch the file via telnet, it is also cut off early. However, as
you said, wget correctly retrieves the whole document. Why? wget sends a
User-Agent header field, and only then is the whole document served. So, by
adding a User-Agent header field to your request, it works for me (with
Ruby 1.8):

response = http.get("/wl/jobs/JS_JobSearch?TS=1012409733026",
                    {"user-agent" => "blub"})

This returns around 60000 bytes in response.body.
When writing web spiders, you sometimes have to outsmart the web servers ;).
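
Put together as a complete sketch of Daniel's fix (the User-Agent value is
arbitrary, as in his example):

  require "net/http"

  begin
    Net::HTTP.start("www.washingtonpost.com", 80) { |http|
      # Some servers send a truncated page unless a User-Agent header is
      # present, so supply one explicitly; the value here is arbitrary.
      response = http.get("/wl/jobs/JS_JobSearch?TS=1012409733026",
                          {"user-agent" => "blub"})
      puts response.body.length
    }
  rescue => err
    puts "Error: #{err}"
    exit
  end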

Hth,
Daniel