"Patrick Nolan" <pln@glast2.Stanford.EDU> wrote in message
news:slrnfotcmk.444.pln@glast2.Stanford.EDU...
> I recently discovered that a web-crawler is ignoring my robots.txt.
> Surprisingly, it was easy to contact a technical person at the company.
> The gist of the matter is that their crawling software ignores robots.txt
> unless it has a 200 status code. Mine gives 302 sometimes. He's going
> to take it up with the crawler writers, but I don't have much hope that
> it will be fixed soon. So I want to solve it on my end so I always
> send 200.
>
> This is Apache 2.2.3 on Linux.
>
> This machine has two names with the same IP address. One address (let's
> call it goodhost) causes a 200 response, and the other (badhost) gets 302.
> Both names have been exposed to the internet, so I expect to get traffic
> on both forever. I don't explicitly redirect robots.txt. ServerName in
> the Apache configuration file is goodhost. I turned UseCanonicalName off
> for some reason which seemed valid at the time, although I can't remember
> it now.
>
> I don't know if it's relevant, but there's a VirtualHost like this:
> <VirtualHost <dotted.ip.address>:80>
> ServerName badhost.domain
> DocumentRoot /var/www/html
> ServerAlias badhost
> Redirect permanent /online https://goodhost.domain/online
> </VirtualHost>
> and another one
> <VirtualHost <dotted.ip.address>:80>
> Redirect permanent /online https://goodhost.domain/online
> </VirtualHost>
> Neither of these mentions robots.txt explicitly, but I wonder if there
> might be a side effect.
>
What you seem to be missing comes from the "/robots.txt" specification:
either the resource exists or it doesn't. That means it may not be
redirected, which is what your server is doing for badhost.
'This file must be accessible via HTTP on the local URL "/robots.txt".'
(http://www.robotstxt.org/orig.html, section labelled "The Method,"
paragraph 1.)
That said, the robot is misbehaving too. When it receives any code other
than 200, 304, 404, or 410 for "/robots.txt", it needs to treat the virtual
domain as bad and fetch nothing more from it. If it is indeed fetching
other resources, it is a misbehaving robot. 200 and 304 indicate that the
resource exists (it was served, or has not changed since the last read);
404 and 410 mean that there is no "/robots.txt" resource and everything at
the site is fair game. Everything else is a server configuration error.
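
To make the status-code rules above concrete, here is a minimal Python
sketch of how a robot could classify the response it gets for
"/robots.txt". It only encodes the reading of the specification given in
this post; the names RobotsPolicy and classify_robots_status are made up
for illustration and do not come from any real crawler.

```python
from enum import Enum


class RobotsPolicy(Enum):
    """Hypothetical outcomes for a "/robots.txt" fetch (illustrative names)."""
    USE_RULES = "use rules"              # resource exists; obey its contents
    NO_RESTRICTIONS = "no restrictions"  # no such resource; site is fair game
    SKIP_HOST = "skip host"              # server configuration error


def classify_robots_status(status: int) -> RobotsPolicy:
    """Map the HTTP status of a "/robots.txt" request to a crawl policy.

    Assumption: this follows the rules as stated in this post, not the
    behaviour of any particular crawler.
    """
    if status in (200, 304):
        # Served, or unchanged since the last read: the resource exists.
        return RobotsPolicy.USE_RULES
    if status in (404, 410):
        # Not found or gone: there is no "/robots.txt" resource.
        return RobotsPolicy.NO_RESTRICTIONS
    # Anything else, including a 302 redirect, counts as a misconfigured
    # server, so the robot should fetch nothing more from this host.
    return RobotsPolicy.SKIP_HOST
```

Under these rules, the 302 that badhost returns falls into the last
branch, so a well-behaved robot would stop fetching from badhost
altogether rather than ignore "/robots.txt" and carry on.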