Please e-mail Google to help the Ruby Garden Wiki

Hello,

I have sent the following e-mail message to suggestions@google.com. It's an
idea that could dramatically help reduce the Wiki Spam problem on the Ruby
Garden Wiki. The spam problem on that Wiki is getting really bad:

···

*********

Hello Google,

Because of the Google PageRank land grab, there are web sites running
scripts to deface popular Wikis with links to their sites. For a dramatic
example, look at the revision history page for the Ruby Garden Wiki:

http://www.rubygarden.org/ruby?RecentChanges

The problem is that even though we diligently delete the spam as it shows up,
most Wikis archive the old revisions in a revision list. Google (you) crawls
these revision list pages and finds the deleted spam links. In fact, you
find a lot of them, because the spammers keep coming back and we keep
deleting them, creating lots of revision history pages for you to crawl.

Here's a VERY SIMPLE way for you to help out the thousands of Wikis out
there.

Allow Wiki owners to add an HTML tag called <NO_EXTERNAL_PAGE_RANK/> to a
web page. If you find this tag on a page, then, for off-site URLs only, you
neither follow the external links on the page nor pass any PageRank to the
target web sites.

The Wiki owners could place this HTML tag in the non-editable portion of a
web page. On the main page they would NOT use the tag, so that PageRank
would be passed on properly. But they would put it on any revision history
page, so that PageRank would not be passed on from there.
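For example, the top of a revision history page might look like this (the
tag is the one I'm proposing; the rest of the markup is just an
illustration):

<html>
<head>
<title>SomePage - Revision History</title>
<NO_EXTERNAL_PAGE_RANK/>
</head>
...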

Once word of this gets around, the spammers' scripts would probably check
for the tag too, and leave such Wikis alone. It's kind of like the "this
house protected by Almar security" sticker.

This would be a very useful feature for bloggers and their blog comments,
and for discussion forums too.

*********

If you like the idea, then some of you might want to send a similar e-mail
too, to apply (positive) pressure on Google.

Thanks.
--
Robert

This would make much more sense as a 'meta' tag value, or even as a
field in the robots.txt file for a site, than it would as an extension
to HTML.

Speaking of which, has anyone considered just rewriting the robots.txt
to block access to everything but the current version of each page?
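For a UseMod-style wiki like Ruby Garden, where the history, diff, and edit
views all go through an action= query argument (an assumption about the URL
scheme; check the links your wiki actually generates), something like this
might do it:

User-agent: *
Disallow: /ruby?action=

A Disallow line matches any URL that begins with the given string, so this
would block every action= view while leaving the plain /ruby?PageName URLs
crawlable.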

Lennon

Robert Oschler wrote:

Because of the Google PageRank land grab, there are web sites running
scripts to deface popular Wikis with links to their sites. For a dramatic
example, look at the revision history page for the Ruby Garden Wiki:

http://www.rubygarden.org/ruby?RecentChanges

The problem is that even though we diligently delete the spam as it shows up,
most Wikis archive the old revisions in a revision list. Google (you) crawls
these revision list pages and finds the deleted spam links. In fact, you
find a lot of them, because the spammers keep coming back and we keep
deleting them, creating lots of revision history pages for you to crawl.

Here's a VERY SIMPLE way for you to help out the thousands of Wikis out
there.

robots.txt ?

Google adheres to that very strongly, and I notice there's no http://www.rubygarden.org/robots.txt

- Greg

This is already adequately handled by both robots.txt and
<META NAME="ROBOTS">. (Described elsewhere in this thread.)

Even with an extension to HTML, spammers will still spam wikis
because Google is not the only search engine, and the spammers are too
lazy to check for such extensions.

(I get Referer spammers on my personal website, probably because my
stats are publicly accessible. They continue to spam despite the
Disallow in my robots.txt.)

···

Robert Oschler (no_replies@fake_email_address.invalid) wrote:

I have sent the following e-mail message to suggestions@google.com. It's an
idea that could dramatically help reduce the Wiki Spam problem on the Ruby
Garden Wiki. The spam problem on that Wiki is getting really bad:

*********

Allow Wiki owners to add an HTML tag called <NO_EXTERNAL_PAGE_RANK/> to a
web page. If you find this tag on a page, then, for off-site URLs only, you
neither follow the external links on the page nor pass any PageRank to the
target web sites.

--
Eric Hodel - drbrain@segment7.net - http://segment7.net
All messages signed with fingerprint:
FEC2 57F1 D465 EB15 5D6E 7C11 332A 551C 796C 9F04

Glancing at the specs, it seems the benefit a spammer gets from posting external links could be removed by a combination of wise robots.txt settings and a redirect page for external links. Or one could use the meta tag that does the same thing:

<META NAME="ROBOTS" CONTENT="NOFOLLOW">

This should keep any compliant search engine (including Google) from analyzing a page for links, which should prevent any PageRank from being passed.

If, however, some external links should be respected, there's the redirect trick. External links go to a page which redirects to the link. That way, you can allow certain URLs (links to rubycentral, ruby-lang, etc.) to be followed, while links to unknown sites are filtered out by placing the meta tags correctly.
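As a minimal sketch of that redirect trick (the script name, the url
parameter, and the trusted-site list are all made up for illustration):

#!/usr/bin/env ruby
# redirect.rb - hypothetical external-link redirector for a wiki.
require 'cgi'
require 'uri'

# Sites allowed to receive PageRank from us (illustrative list).
TRUSTED = %w[ruby-lang.org rubycentral.com]

cgi    = CGI.new
target = cgi['url'].to_s
host   = (URI.parse(target).host rescue nil)

if host && TRUSTED.any? { |ok| host == ok || host.end_with?(".#{ok}") }
  # Trusted destination: a plain redirect that crawlers may follow.
  print cgi.http_header('status' => 'REDIRECT', 'location' => target)
else
  # Unknown destination: an interstitial page that robots must not
  # follow or index; human readers get bounced onward after a second.
  print cgi.http_header('type' => 'text/html')
  print <<~PAGE
    <html><head>
    <META NAME="ROBOTS" CONTENT="NOFOLLOW, NOINDEX">
    <meta http-equiv="refresh" content="1; url=#{CGI.escapeHTML(target)}">
    </head><body>
    <p>Leaving the wiki for #{CGI.escapeHTML(target)} ...</p>
    </body></html>
  PAGE
end

A crawler that honors the meta tag never gives the unknown site any
credit, while trusted links behave like ordinary hyperlinks.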

cheers,
Mark

···

On Jul 20, 2004, at 10:07 AM, Greg Millam wrote:

Robert Oschler wrote:

Because of the Google PageRank land grab, there are web sites running
scripts to deface popular Wikis with links to their sites. For a dramatic
example, look at the revision history page for the Ruby Garden Wiki:
http://www.rubygarden.org/ruby?RecentChanges
The problem is that even though we diligently delete the spam as it shows up,
most Wikis archive the old revisions in a revision list. Google (you) crawls
these revision list pages and finds the deleted spam links. In fact, you
find a lot of them, because the spammers keep coming back and we keep
deleting them, creating lots of revision history pages for you to crawl.
Here's a VERY SIMPLE way for you to help out the thousands of Wikis out
there.

robots.txt ?

Google adheres to that very strongly, and I notice there's no http://www.rubygarden.org/robots.txt

FAQ: Crawling and Indexing in Google Search | Google Search Central | Support | Google for Developers

"Greg Millam" <walker@lethalcode.net> wrote in message
news:40FD513D.7040006@lethalcode.net...

robots.txt ?

Google adheres to that very strongly, and I notice there's no
http://www.rubygarden.org/robots.txt

FAQ: Crawling and Indexing in Google Search | Google Search Central | Support | Google for Developers

- Greg

Greg,

I thought of that, but robots.txt works by directory only, doesn't it? Or
can you specify individual pages?
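(Looking at the spec again, it appears a Disallow line matches any URL
prefix, not just a directory, so an individual page could presumably be
blocked with a line like

Disallow: /ruby?RecentChanges

though I haven't tried it.)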

I was going for a solution that almost any wikimaster of any skill level
could implement. Also, if <NO_EXTERNAL_PAGE_RANK/> tag support were added
to the base Wiki software install (RuWiki, moin-moin, etc.), then a newbie
Wikimaster wouldn't have to do anything at all.

Thanks.

···

--
Robert

>Robert Oschler wrote:
>
>>Because of the Google PageRank land grab, there are web sites
>>running scripts to deface popular Wikis with links to their sites.
>>For a dramatic example, look at the revision history page for the
>>Ruby Garden Wiki:
>>
>>http://www.rubygarden.org/ruby?RecentChanges
>>
>>The problem is that even though we diligently delete the spam as it
>>shows up, most Wikis archive the old revisions in a revision list.
>>Google (you) crawls these revision list pages and finds the deleted
>>spam links. In fact, you find a lot of them, because the spammers
>>keep coming back and we keep deleting them, creating lots of
>>revision history pages for you to crawl.
>>
>>Here's a VERY SIMPLE way for you to help out the thousands of Wikis
>>out there.
>
>robots.txt ?
>
>Google adheres to that very strongly, and I notice there's no
>http://www.rubygarden.org/robots.txt
>
>FAQ: Crawling and Indexing in Google Search | Google Search Central | Support | Google for Developers

Glancing at the specs, it seems the benefit a spammer gets from posting
external links could be removed by a combination of wise robots.txt
settings and a redirect page for external links. Or one could use the
meta tag that does the same thing:

<META NAME="ROBOTS" CONTENT="NOFOLLOW">

<META NAME="ROBOTS" CONTENT="NOFOLLOW, NOINDEX">

The latter should be inserted into every page served with any query
arguments beyond the page name. Those pages do nothing but give the search
engine more work for zero benefit, and they cost rubygarden.org bandwidth
money every time they are crawled.
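A minimal sketch of how the wiki software could emit the tag automatically
(the method and the 'page' parameter name are made up for illustration):

# Emit a robots META tag for any request that carries query
# arguments beyond the page name itself (history, diff, edit, ...).
def robots_meta_tag(query_params)
  extra = query_params.keys - ['page']   # assumed page-name parameter
  return '' if extra.empty?              # normal view: index and follow
  %{<META NAME="ROBOTS" CONTENT="NOFOLLOW, NOINDEX">}
end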

···

Mark Hubbart (discord@mac.com) wrote:

On Jul 20, 2004, at 10:07 AM, Greg Millam wrote:

This should keep any compliant search engine (including Google) from
analyzing a page for links, which should prevent any PageRank from
being passed.

If, however, some external links should be respected, there's the
redirect trick. External links go to a page which redirects to the
link. That way, you can allow certain URLs (links to rubycentral,
ruby-lang, etc.) to be followed, while links to unknown sites are
filtered out by placing the meta tags correctly.

--
Eric Hodel - drbrain@segment7.net - http://segment7.net
All messages signed with fingerprint:
FEC2 57F1 D465 EB15 5D6E 7C11 332A 551C 796C 9F04