I recently discovered that a web crawler is ignoring my robots.txt.
Surprisingly, it was easy to contact a technical person at the company.
The gist of the matter is that their crawling software ignores robots.txt
unless it is served with a 200 status code. Mine sometimes returns 302. He's
going to take it up with the crawler's developers, but I don't have much hope
that it will be fixed soon, so I want to solve it on my end by always
sending 200.
This is Apache 2.2.3 on Linux.
This machine has two names with the same IP address. One name (let's
call it goodhost) gets a 200 response, and the other (badhost) gets 302.
Both names have been exposed to the internet, so I expect to get traffic
on both forever. I don't explicitly redirect robots.txt. ServerName in
the Apache configuration file is goodhost. I turned UseCanonicalName off
for some reason which seemed valid at the time, although I can't remember
it now.
I don't know if it's relevant, but there's a VirtualHost like this:
<VirtualHost <dotted.ip.address>:80>
ServerName badhost.domain
DocumentRoot /var/www/html
ServerAlias badhost
Redirect permanent /online https://goodhost.domain/online
</VirtualHost>
and another one:
<VirtualHost <dotted.ip.address>:80>
Redirect permanent /online https://goodhost.domain/online
</VirtualHost>
Neither of these mentions robots.txt explicitly, but I wonder if there
might be a side effect.
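
In case it helps anyone reproduce this, here is a minimal Python sketch of how the two names can be compared from the outside. It connects to the server's address and sends a chosen Host header, without following redirects, so a 302 and its Location header show up as-is. The function name robots_status and its parameters are my own choices for illustration, not part of any tool mentioned above.

```python
import http.client


def robots_status(address, host_name, port=80, timeout=10):
    """Fetch /robots.txt from `address`, sending `host_name` as the Host header.

    Returns a (status_code, location) tuple, where location is the value of
    the Location header or None if the response has none. http.client does
    not follow redirects, so a 302 is reported rather than hidden.
    """
    conn = http.client.HTTPConnection(address, port, timeout=timeout)
    try:
        # Supplying our own Host header stops http.client from adding one,
        # which lets us test both names against the same IP address.
        conn.request("GET", "/robots.txt", headers={"Host": host_name})
        response = conn.getresponse()
        response.read()  # drain the body so the connection closes cleanly
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

For example, calling robots_status("<dotted.ip.address>", "goodhost.domain") and then the same with "badhost.domain" should show the 200 and the 302 side by side. The Location value on the 302 may hint at which directive is issuing the redirect.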