I've been playing with Ruby and Nokogiri to crawl some websites to get
text information, but after a while I realized that some of those
websites block my access while the script is running. From the moment
they block the access, the script keeps running (because I handle the
exception) but isn't getting what it's supposed to.
After the block, if I try to access the site in a browser, I just
can't, so I guess they block the IP address, right?
Bombing a webserver in the fashion you describe is not advisable in any way. They're clearly not happy with what you're doing... so either lower the frequency of the requests or ask them directly about your needs - they might be willing to let you run your script more often or even give you the raw data directly.
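For what it's worth, "lower the frequency" can be as simple as sleeping
between requests and backing off as soon as the server starts refusing,
instead of letting the rescue swallow the error and carrying on. A
minimal sketch, assuming open-uri and Nokogiri as in the original
script (the URL list, delay, and user-agent string are made up for
illustration):

  require 'open-uri'
  require 'nokogiri'

  DELAY = 5  # seconds between requests; be generous
  urls  = ['http://example.com/page1', 'http://example.com/page2']  # hypothetical

  urls.each do |url|
    begin
      html = open(url, 'User-Agent' => 'MyCrawler/0.1 (me@example.com)')  # URI.open on newer Rubies
      doc  = Nokogiri::HTML(html)
      # ... extract what you need from doc ...
    rescue OpenURI::HTTPError => e
      status = e.io.status.first  # e.g. "403" or "503" once you are blocked
      warn "Got #{status} for #{url}, stopping instead of hammering the server"
      break
    end
    sleep DELAY
  end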
···
--
Andrea Dallera
Run your crawler in steps (don't crawl the whole site, and only grab
what is new; that's what the Last-Modified header and conditional
requests are for!), and respect robots.txt.
Otherwise, well, you get what you deserve, if you hog a server's CPU
cycles and create a Denial of Service attack (nobody cares if it is by
accident or by design).
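One way to "only grab what is new" is a conditional GET: send
If-Modified-Since with the timestamp from your last visit and skip the
page when the server answers 304. A rough sketch with Net::HTTP (the
URL and the stored timestamp are illustrative, not from the thread):

  require 'net/http'
  require 'uri'
  require 'time'

  # Re-fetch a page only if it changed since we last saw it.
  def fetch_if_modified(url, last_fetched_at)
    uri = URI.parse(url)
    req = Net::HTTP::Get.new(uri.request_uri)
    req['If-Modified-Since'] = last_fetched_at.httpdate
    req['User-Agent']        = 'MyCrawler/0.1 (me@example.com)'

    res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }

    case res
    when Net::HTTPNotModified then nil       # 304: nothing new, skip parsing
    when Net::HTTPSuccess     then res.body  # 200: hand the body to Nokogiri
    else res.error!                          # 4xx/5xx: raise so you notice a block
    end
  end

  body = fetch_if_modified('http://example.com/page', Time.now - 86_400)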
···
On Tue, Nov 23, 2010 at 12:19 PM, Luis G. <l17339@gmail.com> wrote:
Hi there...
But I still have the same problem: it works in the beginning, but after
a while it stops working.
I could just run the crawler in steps, so it doesn't make lots of calls
to the website at the same time, but that's kinda boring...
Have any of you faced the same problem? Do any of you have a solution for this?
--
Phillip Gawlowski
Though the folk I have met,
(Ah, how soon!) they forget
When I've moved on to some other place,
There may be one or two,
When I've played and passed through,
Who'll remember my song or my face.
Hey guys... Thanks for your replies.
I thought the program I built wasn't so heavy on the website I'm
trying to get info from.
The thing is, I'm accessing the website to get information, but I only
access specific pages in that domain, so I'm not really crawling
everything.
I build the URL based on some info I have in my DB, and once I have the
URL I access it directly and collect the information on that specific
page. What's more, the pages I'm accessing have just some <p> HTML
tags, so there's not much info to look through. And I'm only accessing
the web pages I didn't access before (just the new ones).
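As a rough sketch of that workflow (the URL pattern, record fields, and
selector below are hypothetical -- the thread never shows them), it
boils down to something like:

  require 'open-uri'
  require 'nokogiri'

  # Build the URL from data already in the DB, fetch that one page,
  # and collect the text of its <p> tags.
  def paragraphs_for(record)
    url  = "http://example.com/articles/#{record[:slug]}"  # built from DB info
    html = open(url, 'User-Agent' => 'MyCrawler/0.1')       # URI.open on newer Rubies
    Nokogiri::HTML(html).css('p').map { |p| p.text.strip }
  end

  paragraphs_for(:slug => 'some-new-page')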
So, of course I understand that they need to protect the webserver, but
I think my program is not really a threat.
I'm gonna run the script in steps and on different days, like I thought
before and like you told me.
Yeah, that's one of the reasons I asked this question: I thought we
could solve this issue just by changing the user agent or the headers
or something... like they have here: http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/
Anyway, right now I'm using an empty user agent, but I did try to
define a user agent and the result was the same. I tried something like:
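(The actual snippet didn't survive in the archive. As an illustration
only -- not the poster's code -- passing request headers through
open-uri, as on the rdoc page linked above, looks roughly like the
following; the header values are placeholders.)

  require 'open-uri'
  require 'nokogiri'

  # Illustration only: open-uri accepts request headers as a hash.
  html = open('http://example.com/some/page',
              'User-Agent' => 'Mozilla/5.0 (compatible; MyCrawler/0.1)',
              'From'       => 'me@example.com',
              'Referer'    => 'http://example.com/')
  doc = Nokogiri::HTML(html)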
Actually, I was blocked before, and it lasted around 24 hours or so.
But the thing is, I'm running the crawlers on a test server, not on the
production one. And they are not on the same network, so the IPs are
different.
One more thought: what are you using for the user-agent? Some sites
block empty user-agents or ones known to be automated.
Regards,
Ammar
···
It was just a guess. But, as you mentioned, your IP is being blocked,
so it's too late to change agents now. You may have been blocked for
any reason, really: frequency of requests, user-agent, or something
else entirely.
Usually such blocks are temporary (it could be a dynamic IP) so you could
try again later. But who knows how long it will take, or if you will be
blocked again.
Andrea's suggestion is probably your best bet: contact the owners of the
site and request access. You might find out why you got blocked and avoid it
in the future.
Regards,
Ammar