01-17-2008
Grant
Re: Crawler ignores robots.txt with 302 code. How to make it 200?

On Thu, 17 Jan 2008 01:43:48 +0000 (UTC), Patrick Nolan <pln@glast2.Stanford.EDU> wrote:

>I recently discovered that a web-crawler is ignoring my robots.txt.
>Surprisingly, it was easy to contact a technical person at the company.
>The gist of the matter is that their crawling software ignores robots.txt
>unless it has a 200 status code. Mine gives 302 sometimes. He's going
>to take it up with the crawler writers, but I don't have much hope that
>it will be fixed soon. So I want to solve it on my end so I always
>send 200.
>
>This is Apache 2.2.3 on Linux.
>
>This machine has two names with the same IP address. One address (let's
>call it goodhost) causes a 200 response, and the other (badhost) gets 302.
>Both names have been exposed to the internet, so I expect to get traffic
>on both forever. I don't explicitly redirect robots.txt. ServerName in
>the Apache configuration file is goodhost. I turned UseCanonicalName off
>for some reason which seemed valid at the time, although I can't remember
>it now.
>
>I don't know if it's relevant, but there's a VirtualHost like this:
><VirtualHost <dotted.ip.address>:80>
> ServerName badhost.domain
> DocumentRoot /var/www/html
> ServerAlias badhost
> Redirect permanent /online https://goodhost.domain/online
></VirtualHost>
>and another one
><VirtualHost <dotted.ip.address>:80>
> Redirect permanent /online https://goodhost.domain/online
></VirtualHost>
>Neither of these mentions robots.txt explicitly, but I wonder if there
>might be a side effect.


I have 'badhost' going to an empty site that contains only robots.txt,
and let the crawlers get 404 for all the old content. After a couple of
years I still get hits on robots.txt, but rarely get hits on content.
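
A minimal sketch of that kind of setup, assuming Apache 2.2 syntax and
name-based virtual hosting already enabled for the address. The directory
/var/www/badhost-empty is a hypothetical name, and <dotted.ip.address> is
the same placeholder used in the quoted config, so adjust both before use:

```apache
# Hypothetical sketch: badhost serves nothing but robots.txt.
<VirtualHost <dotted.ip.address>:80>
    ServerName badhost.domain
    ServerAlias badhost
    # Assumed path: a directory that contains only robots.txt
    DocumentRoot /var/www/badhost-empty
    # No Redirect directives in this host, so a request for /robots.txt
    # should be served as a plain file, and any other path should get 404.
</VirtualHost>
```

With no redirect rules in the badhost VirtualHost, nothing in that block
should produce a 302, though it is worth checking the response for
http://badhost.domain/robots.txt after reloading the server.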

Grant.
--
http://bugsplatter.mine.nu/