Ruby-doc.org content snarfing

I received an alert E-mail today telling me that ruby-doc.org had exceeded its allotted bandwidth. There could be all sorts of reasons for this, and if it were simply due to popularity I'd be thrilled. But it appears that someone has been running wget and snarfing the site wholesale.

This is a bad thing. I'm in the process of blocking IP addresses and domain names. I've also turned off access to the Euroko 2003 videos until next month.

I really don't think this abuse is coming from any regular reader of this list, but on the off chance that I'm wrong: please stop it.

Thanks,

James Britt

jbritt AT ruby-doc DOT org

James -- is there a possibility of you providing a tarball of the whole
site for download?

I'd be more than willing to host a tracker and seed for a BitTorrent
copy of it.

Ari

···

On Sat, 2004-08-21 at 06:25 +0900, James Britt wrote:

I received an alert E-mail today telling me that ruby-doc.org had
exceeded its allotted bandwidth. There could be all sorts of reasons for
this, and if it were simply due to popularity I'd be thrilled. But it
appears that someone has been running wget and snarfing the site wholesale.

This is a bad thing. I'm in the process of blocking IP addresses and
domain names. I've also turned off access to the Euroko 2003 videos
until next month.

I really don't think this abuse is coming from any regular reader of
this list, but on the off chance that I'm wrong: please stop it.

James Britt wrote:

this, and if it were simply due to popularity I'd be thrilled. But it appears that someone has been running wget and snarfing the site wholesale.

Not sure if this is the problem, but I noticed that http://www.ruby-doc.org/robots.txt doesn't exist. Is it getting hit by search bots?

James Britt wrote:

I received an alert E-mail today telling me that ruby-doc.org had exceeded its allotted bandwidth. There could be all sorts of reasons for this, and if it were simply due to popularity I'd be thrilled. But it appears that someone has been running wget and snarfing the site wholesale.

You should look at mod_throttle and set up a few .htaccess rules against annoying web spiders:

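# Tag requests whose User-Agent header matches a known site-copying tool.
# Patterns are case-sensitive regular expressions; ^ anchors at the start.
SetEnvIf user-agent MSIECrawler keep_out
SetEnvIf user-agent ^Teleport keep_out
SetEnvIf user-agent ^WebStripper keep_out
SetEnvIf user-agent ^Offline keep_out
SetEnvIf user-agent HTTrack keep_out
SetEnvIf user-agent Xaldon keep_out
SetEnvIf user-agent WebCopier keep_out

# Allow everyone except requests tagged keep_out above.
<Limit GET POST>
order allow,deny
allow from all
deny from env=keep_out
</Limit>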
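
To see which clients rules like these would catch, here is a minimal Ruby sketch that applies the same patterns to a few User-Agent strings. The sample strings and the `keep_out?` helper are made up for illustration; only the patterns come from the rules above.

```ruby
# Patterns copied from the SetEnvIf rules above. Like SetEnvIf, they are
# case-sensitive regular expressions; ^ anchors at the start of the value.
KEEP_OUT_PATTERNS = [
  /MSIECrawler/,
  /^Teleport/,
  /^WebStripper/,
  /^Offline/,
  /HTTrack/,
  /Xaldon/,
  /WebCopier/
].freeze

# Hypothetical helper: true if the User-Agent value would be tagged keep_out.
def keep_out?(user_agent)
  ua = user_agent.to_s
  KEEP_OUT_PATTERNS.any? { |pattern| pattern.match?(ua) }
end

# Hypothetical User-Agent strings, for illustration only.
samples = [
  "Teleport Pro/1.29",
  "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)",
  "Wget/1.9.1",
  "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
]

samples.each do |ua|
  puts format("%-55s %s", ua, keep_out?(ua) ? "deny" : "allow")
end
```

Note that a stock wget user agent (which starts with `Wget/`) is not matched by any of these patterns, and any client can send whatever User-Agent it likes, so this kind of filter only stops tools that announce themselves.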

James Britt wrote:

I received an alert E-mail today telling me that ruby-doc.org had exceeded its allotted bandwidth. There could be all sorts of reasons for this, and if it were simply due to popularity I'd be thrilled. But it appears that someone has been running wget and snarfing the site wholesale.

This is a bad thing. I'm in the process of blocking IP addresses and domain names. I've also turned off access to the Euroko 2003 videos until next month.

I really don't think this abuse is coming from any regular reader of this list, but on the off chance that I'm wrong: please stop it.

Thanks,

James Britt

jbritt AT ruby-doc DOT org

I highly recommend you look into mod_dosevasive.

If there were a decent way to verify the authenticity of the source IP addresses (i.e., that they are not spoofed), then blocking would be a great first step.

The next step might be posting the verified abusive addresses online (so the rest of us can take appropriate action like blocking them from our sites) or submitting them to dshield.org. This might be annoying enough for them to move on to other targets.
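
For anyone unfamiliar with this style of throttling, the general idea is to count requests per client address over a short interval and refuse service once a threshold is passed. The Ruby sketch below illustrates that idea only; it is not mod_dosevasive's actual implementation, and the class name, threshold, and interval are assumptions picked for the example.

```ruby
# Illustrative per-IP request counter over a sliding time window.
# Not taken from any Apache module; names and limits are hypothetical.
class RequestWindow
  def initialize(max_requests:, interval:)
    @max_requests = max_requests   # hits allowed per interval
    @interval = interval           # window length in seconds
    @hits = Hash.new { |hash, ip| hash[ip] = [] }
  end

  # Records a hit from ip at time now (in seconds) and returns true while
  # the client is within the limit, false once it has exceeded it.
  def allow?(ip, now)
    recent = @hits[ip]
    # Drop timestamps that have fallen out of the window.
    recent.shift while recent.any? && now - recent.first >= @interval
    recent << now
    recent.size <= @max_requests
  end
end

window = RequestWindow.new(max_requests: 3, interval: 1.0)

# 192.0.2.1 is a documentation-only address (RFC 5737).
[0.0, 0.2, 0.4, 0.6, 2.0].each do |t|
  puts "t=#{t} allowed=#{window.allow?('192.0.2.1', t)}"
end
```

With these made-up numbers the fourth request inside one second is refused, and the client is allowed again once the window has passed. A real module also has to decide how long to keep blocking and how to bound memory, which this sketch ignores.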

David Morton wrote:

James Britt wrote:

this, and if it were simply due to popularity I'd be thrilled. But it appears that someone has been running wget and snarfing the site wholesale.

Not sure if this is the problem, but I noticed that http://www.ruby-doc.org/robots.txt doesn't exist. Is it getting hit by search bots?

Crawlers and spiders and such are fine. They hit the site often.

I've never seen this sort of downloading, though. And the legit spiders
(i.e., the ones likely to respect a robots.txt file) tend to identify
themselves as such. This one didn't.

That's not to say that a robots.txt file would be a bad thing, just
that spiders have never been a problem.

James
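
When a client does not identify itself, volume is usually easier to spot than the user agent. The following is a minimal Ruby sketch that totals response bytes per client host from an access log. It assumes the log is in Apache's Common Log Format; the sample lines, addresses, and the `bytes_by_host` helper are hypothetical and not taken from the ruby-doc.org logs.

```ruby
# Matches the start of a Common Log Format line:
#   host ident user [time] "request" status bytes
LOG_LINE = /\A(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) (\d+|-)/

# Hypothetical helper: sums response bytes per client host.
# A byte count of "-" means no response body was sent.
def bytes_by_host(lines)
  totals = Hash.new(0)
  lines.each do |line|
    match = LOG_LINE.match(line)
    next unless match   # skip lines that are not in the expected format
    host, bytes = match[1], match[3]
    totals[host] += bytes == "-" ? 0 : bytes.to_i
  end
  totals
end

# Made-up log lines using documentation-only addresses (RFC 5737).
sample = [
  '192.0.2.10 - - [20/Aug/2004:10:00:00 +0000] "GET /docs/a.html HTTP/1.0" 200 12000',
  '192.0.2.10 - - [20/Aug/2004:10:00:01 +0000] "GET /docs/b.html HTTP/1.0" 200 18000',
  '198.51.100.7 - - [20/Aug/2004:10:00:02 +0000] "GET /index.html HTTP/1.1" 200 4000',
  '198.51.100.7 - - [20/Aug/2004:10:00:03 +0000] "GET /index.html HTTP/1.1" 304 -'
]

# Print hosts from heaviest to lightest.
bytes_by_host(sample).sort_by { |_host, total| -total }.each do |host, total|
  puts "#{host} #{total}"
end
```

To run it against a real log, something like `bytes_by_host(File.foreach("access.log"))` should work, where the file name is a placeholder for wherever the server writes its access log.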

Aredridel wrote:

James -- is there a possibility of you providing a tarball of the whole
site for download?

I'd be more than willing to host a tracker and seed for a Bit Torrent
copy of it.

A tarball of the whole site would be close to 5 GB. That's including the
videos. Torrents for the videos might be a good idea; I'm not sure a
torrent for anything else is all that useful. There are assorted
bundles and stand-alone files that are easy to download as needed. Much
of the site's content consists of links to other places. There's the
HTML version of Programming Ruby, and the core and standard lib docs.

Each of these gets updated on a different schedule. A monolithic
tarball would be out of date fairly quickly.

In actual practice, the traffic has been fine. It's just this one time
somebody decided to go grab what appears to be *everything*, all in one day.

Thank you for the offer, though. When I first started hosting videos on
the site I was unfamiliar with BitTorrent. Now, though, it seems as if
it would be a good option for the larger files.

James

From what has been said, I doubt that the person who did this was being malicious... Just ignorant. That's something that I might have done, before I got experience as a webmaster and realized how rotten it can be :) So it might be a little bit of overkill to share their IP addresses for mass banning. Maybe just a good slap on the wrist, like redirecting all their page requests to very_stern_warning.text

Of course, I might be wrong, and they might actually be *wanting* to cause problems. In which case, they should be taken out and shot :D

cheers,
Mark

···

On Aug 21, 2004, at 6:25 PM, Ruby Script wrote:

James Britt wrote:

I received an alert E-mail today telling me that ruby-doc.org had exceeded its allotted bandwidth. There could be all sorts of reasons for this, and if it were simply due to popularity I'd be thrilled. But it appears that someone has been running wget and snarfing the site wholesale.
This is a bad thing. I'm in the process of blocking IP addresses and domain names. I've also turned off access to the Euroko 2003 videos until next month.
I really don't think this abuse is coming from any regular reader of this list, but on the off chance that I'm wrong: please stop it.
Thanks,
James Britt
jbritt AT ruby-doc DOT org

I highly recommend you look into mod_dosevasive.

If there were a decent way to verify the authenticity of the source IP addresses (i.e., that they are not spoofed), then blocking would be a great first step.

The next step might be posting the verified abusive addresses online (so the rest of us can take appropriate action like blocking them from our sites) or submitting them to dshield.org. This might be annoying enough for them to move on to other targets.

Aredridel wrote:
>
> James -- is there a possibility of you providing a tarball of the whole
> site for download?
>
> I'd be more than willing to host a tracker and seed for a Bit Torrent
> copy of it.

A tarball of the whole site would be close to 5 GB. That's including the
videos. Torrents for the videos might be a good idea; I'm not sure a
torrent for anything else is all that useful. There are assorted
bundles and stand-alone files that are easy to download as needed. Much
of the site's content consists of links to other places. There's the
HTML version of Programming Ruby, and the core and standard lib docs.

Yeah, torrents make sense for anything over 10 MB.

Each of these gets updated on a different schedule. A monolithic
tarball would be out of date fairly quickly.

Yeah -- a few large ones might appease those wanting a copy. I know I've
toyed with the idea. I'd love a copy of good chunks of the site for
offline reading.

In actual practice, the traffic has been fine. It's just this one time
somebody decided to go grab what appears to be *everything*, all in one day.

Ouch. That happened to me once...

Thank you for the offer, though. When I first started hosting videos on
the site I was unfamiliar with BitTorrent. Now, though, it seems as if
it would be a good option for the larger files.

Yeah.

Have a good one.

Ari

···

On Sat, 2004-08-21 at 15:49 +0900, James Britt wrote:

Mark Hubbart wrote:

From what has been said, I doubt that the person who did this was being malicious... Just ignorant. That's something that I might have done, before I got experience as a webmaster and realized how rotten it can be :) So it might be a little bit of overkill to share their IP addresses for mass banning. Maybe just a good slap on the wrist, like redirecting all their page requests to very_stern_warning.text

I've banned specific IP addresses and domain names. If I get complaints then perhaps I'll undo it. I don't really expect that.

As I don't know the identity of the people responsible, nor their motives, I'm focusing on protecting the site. If I find reason to believe someone is being malicious or willfully thoughtless, I'll consider other action.

It may very well have been a site-grabber script gone bad. I don't know. I'm mainly concerned with preventing it in the future, and I appreciate the helpful comments I've received here.

Thanks,

James