Crawler ignores robots.txt with 302 code. How to make it 200?
I recently discovered that a web-crawler is ignoring my robots.txt. Surprisingly, it was easy to contact a technical person at the company. The gist of the matter is that their crawling software ignores robots.txt unless it has a 200 status code. Mine gives 302 sometimes. He's going to take it up with the crawler writers, but I don't have much hope that it will be fixed soon. So I want to solve it on my end so I always send 200.

This is Apache 2.2.3 on Linux.

This machine has two names with the same IP address. One name (let's call it goodhost) gets a 200 response, and the other (badhost) gets 302. Both names have been exposed to the internet, so I expect to get traffic on both forever. I don't explicitly redirect robots.txt. ServerName in the Apache configuration file is goodhost. I turned UseCanonicalName off for some reason which seemed valid at the time, although I can't remember it now.

I don't know if it's relevant, but there's a VirtualHost like this:

    <VirtualHost <dotted.ip.address>:80>
        ServerName badhost.domain
        DocumentRoot /var/www/html
        ServerAlias badhost
        Redirect permanent /online https://goodhost.domain/online
    </VirtualHost>

and another one

    <VirtualHost <dotted.ip.address.:80>
        Redirect permanent /online https://goodhost.domain/online
    </VirtualHost>

Neither of these mentions robots.txt explicitly, but I wonder if there might be a side effect.
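One way to confirm which name returns which code is to request /robots.txt under each name without following redirects. The following is a minimal sketch in Python, not something from the original post. The function name `robots_status` and its `address` parameter are made up for illustration, and the host names are the placeholders used above.

```python
import http.client


def robots_status(host, address=None, port=80, timeout=10):
    """Fetch /robots.txt for `host` and return (status, Location header).

    http.client never follows redirects, so a 302 is reported as 302.
    `address` is a hypothetical convenience: connect to that IP or name
    but send `host` in the Host header, which lets two names that share
    one IP address be compared directly.
    """
    conn = http.client.HTTPConnection(address or host, port, timeout=timeout)
    try:
        # An explicit Host header replaces the one http.client would add.
        conn.request("GET", "/robots.txt", headers={"Host": host})
        resp = conn.getresponse()
        resp.read()  # drain the body before closing the connection
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()
```

Calling `robots_status("goodhost.domain")` and `robots_status("badhost.domain")` in turn should show the difference described above. If the second call reports a 302, the Location value in the result shows where the server is redirecting, which may help locate the directive responsible.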
On Thu, 17 Jan 2008 01:43:48 +0000 (UTC), Patrick Nolan <pln@glast2.Stanford.EDU> wrote:
>I recently discovered that a web-crawler is ignoring my robots.txt.
>Surprisingly, it was easy to contact a technical person at the company.
>The gist of the matter is that their crawling software ignores robots.txt
>unless it has a 200 status code. Mine give 302 sometimes. He's going
>to take it up with the crawler writers, but I don't have much hope that
>it will be fixed soon. So I want to solve it on my end so I always
>send 200.
>
>This is Apache 2.2.3 on Linux.
>
>This machine has two names with the same IP address. One address (let's
>call it goodhost) causes a 200 response, and the other (badhost) gets 302.
>Both names have been exposed to the internet, so I expect to get traffic
>on both forever. I don't explicitly redirect robots.txt. ServerName in
>the Apache configuration file is goodhost. I turned UseCanonicalName off
>for some reason which seemed valid at the time, although I can't remember
>it now.
>
>I don't know if it's relevant, but there's a VirtualHost like this:
><VirtualHost <dotted.ip.address>:80>
> ServerName badhost.domain
> DocumentRoot /var/www/html
> ServerAlias badhost
> Redirect permanent /online https://goodhost.domain/online
></VirtualHost>
>and another one
><VirtualHost <dotted.ip.address.:80>
> Redirect permanent /online https://goodhost.domain/online
></VirtualHost>
>Neither of these mentions robots.txt explicitly, but I wonder if there
>might be a side effect.

I have 'badhost' going to an empty site with only robots.txt, and let the crawlers get 404 for all the old content. After a couple of years I still get hits on robots.txt, but rarely on content.

Grant.
--
http://bugsplatter.mine.nu/
"Patrick Nolan" <pln@glast2.Stanford.EDU> wrote in message news:slrnfotcmk.444.pln@glast2.Stanford.EDU...
> I recently discovered that a web-crawler is ignoring my robots.txt.
> Surprisingly, it was easy to contact a technical person at the company.
> The gist of the matter is that their crawling software ignores robots.txt
> unless it has a 200 status code. Mine give 302 sometimes. He's going
> to take it up with the crawler writers, but I don't have much hope that
> it will be fixed soon. So I want to solve it on my end so I always
> send 200.
>
> This is Apache 2.2.3 on Linux.
>
> This machine has two names with the same IP address. One address (let's
> call it goodhost) causes a 200 response, and the other (badhost) gets 302.
> Both names have been exposed to the internet, so I expect to get traffic
> on both forever. I don't explicitly redirect robots.txt. ServerName in
> the Apache configuration file is goodhost. I turned UseCanonicalName off
> for some reason which seemed valid at the time, although I can't remember
> it now.
>
> I don't know if it's relevant, but there's a VirtualHost like this:
> <VirtualHost <dotted.ip.address>:80>
> ServerName badhost.domain
> DocumentRoot /var/www/html
> ServerAlias badhost
> Redirect permanent /online https://goodhost.domain/online
> </VirtualHost>
> and another one
> <VirtualHost <dotted.ip.address.:80>
> Redirect permanent /online https://goodhost.domain/online
> </VirtualHost>
> Neither of these mentions robots.txt explicitly, but I wonder if there
> might be a side effect.

The thing that you seem to be missing is from the "/robots.txt" specification: either the resource exists or it doesn't. That means that it may not be redirected like you're trying to do.

'This file must be accessible via HTTP on the local URL "/robots.txt".'
http://www.robotstxt.org/orig.html, section labelled "The Method," paragraph 1.

Now, the robot is also misbehaving. When a code other than 200, 304, 404, or 410 is returned, it needs to treat the virtual domain as bad and not fetch anything more from it. If it is indeed fetching other resources, it is a misbehaving robot. 200 and 304 indicate that the resource exists (and was either served or has not changed since the last read), while 404 and 410 mean that there is no "/robots.txt" resource and everything at the site is fair game. Everything else is a server configuration error.
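The status-code rule described above can be written down as a small decision function. This is only a sketch of the rule as stated in this reply, in Python. The name `robots_policy` and its return labels are invented for illustration and do not describe how any particular crawler is implemented.

```python
def robots_policy(status):
    """Map the HTTP status of a /robots.txt fetch to a crawler decision,
    following the rule given in the reply above (labels are hypothetical).
    """
    if status in (200, 304):
        # The file exists: it was served, or is unchanged since the last read.
        return "obey_rules"
    if status in (404, 410):
        # There is no /robots.txt, so nothing on the site is excluded.
        return "allow_all"
    # Anything else, including a 302 redirect, is treated as a server
    # configuration error: fetch nothing more from this host.
    return "skip_site"
```

Under this reading, the 302 that badhost returns falls into the last branch, so a crawler following the rule would stop fetching from badhost rather than ignore robots.txt and carry on.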