Crawler ignores robots.txt with 302 code. How to make it 200?

  #1 (permalink)  
Old 01-17-2008
Patrick Nolan
 

I recently discovered that a web-crawler is ignoring my robots.txt.
Surprisingly, it was easy to contact a technical person at the company.
The gist of the matter is that their crawling software ignores robots.txt
unless it has a 200 status code. Mine gives 302 sometimes. He's going
to take it up with the crawler writers, but I don't have much hope that
it will be fixed soon. So I want to solve it on my end so I always
send 200.

This is Apache 2.2.3 on Linux.

This machine has two names with the same IP address. One name (let's
call it goodhost) causes a 200 response, and the other (badhost) gets 302.
Both names have been exposed to the internet, so I expect to get traffic
on both forever. I don't explicitly redirect robots.txt. ServerName in
the Apache configuration file is goodhost. I turned UseCanonicalName off
for some reason which seemed valid at the time, although I can't remember
it now.

I don't know if it's relevant, but there's a VirtualHost like this:
<VirtualHost <dotted.ip.address>:80>
ServerName badhost.domain
DocumentRoot /var/www/html
ServerAlias badhost
Redirect permanent /online https://goodhost.domain/online
</VirtualHost>
and another one
<VirtualHost <dotted.ip.address>:80>
Redirect permanent /online https://goodhost.domain/online
</VirtualHost>
Neither of these mentions robots.txt explicitly, but I wonder if there
might be a side effect.

  #2 (permalink)  
Old 01-17-2008
Grant
 

On Thu, 17 Jan 2008 01:43:48 +0000 (UTC), Patrick Nolan <pln@glast2.Stanford.EDU> wrote:

>I recently discovered that a web-crawler is ignoring my robots.txt.
>Surprisingly, it was easy to contact a technical person at the company.
>The gist of the matter is that their crawling software ignores robots.txt
>unless it has a 200 status code. Mine give 302 sometimes. He's going
>to take it up with the crawler writers, but I don't have much hope that
>it will be fixed soon. So I want to solve it on my end so I always
>send 200.
>
>This is Apache 2.2.3 on Linux.
>
>This machine has two names with the same IP address. One address (let's
>call it goodhost) causes a 200 response, and the other (badhost) gets 302.
>Both names have been exposed to the internet, so I expect to get traffic
>on both forever. I don't explicitly redirect robots.txt. ServerName in
>the Apache configuration file is goodhost. I turned UseCanonicalName off
>for some reason which seemed valid at the time, although I can't remember
>it now.
>
>I don't know if it's relevant, but there's a VirtualHost like this:
><VirtualHost <dotted.ip.address>:80>
> ServerName badhost.domain
> DocumentRoot /var/www/html
> ServerAlias badhost
> Redirect permanent /online https://goodhost.domain/online
></VirtualHost>
>and another one
><VirtualHost <dotted.ip.address>:80>
> Redirect permanent /online https://goodhost.domain/online
></VirtualHost>
>Neither of these mentions robots.txt explicitly, but I wonder if there
>might be a side effect.


I have 'badhost' going to an empty site with only robots.txt, and let the
crawlers get 404 for all the old content. After a couple of years I still
get hits on robots.txt, but rarely get hits on content.
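
A minimal sketch of the setup described above, written for Apache 2.2 to
match the original post. The address 192.0.2.10 and the directory
/var/www/badhost are placeholders, not values from this thread; the
directory is assumed to contain nothing but robots.txt.

```apache
# Apache 2.2 needs NameVirtualHost for name-based virtual hosts
# (skip this line if it is already declared elsewhere in the config).
NameVirtualHost 192.0.2.10:80

<VirtualHost 192.0.2.10:80>
    ServerName badhost.domain
    ServerAlias badhost
    # Placeholder directory holding only robots.txt
    DocumentRoot /var/www/badhost
    <Directory /var/www/badhost>
        # Apache 2.2 access-control syntax
        Order allow,deny
        Allow from all
    </Directory>
</VirtualHost>
```

With no Redirect or rewrite rule in this virtual host, /robots.txt should be
served directly with a 200, and requests for the old content should get 404.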

Grant.
--
http://bugsplatter.mine.nu/
  #3 (permalink)  
Old 4 Weeks Ago
D. Stussy
 

"Patrick Nolan" <pln@glast2.Stanford.EDU> wrote in message
news:slrnfotcmk.444.pln@glast2.Stanford.EDU...
> I recently discovered that a web-crawler is ignoring my robots.txt.
> Surprisingly, it was easy to contact a technical person at the company.
> The gist of the matter is that their crawling software ignores robots.txt
> unless it has a 200 status code. Mine give 302 sometimes. He's going
> to take it up with the crawler writers, but I don't have much hope that
> it will be fixed soon. So I want to solve it on my end so I always
> send 200.
>
> This is Apache 2.2.3 on Linux.
>
> This machine has two names with the same IP address. One address (let's
> call it goodhost) causes a 200 response, and the other (badhost) gets 302.
> Both names have been exposed to the internet, so I expect to get traffic
> on both forever. I don't explicitly redirect robots.txt. ServerName in
> the Apache configuration file is goodhost. I turned UseCanonicalName off
> for some reason which seemed valid at the time, although I can't remember
> it now.
>
> I don't know if it's relevant, but there's a VirtualHost like this:
> <VirtualHost <dotted.ip.address>:80>
> ServerName badhost.domain
> DocumentRoot /var/www/html
> ServerAlias badhost
> Redirect permanent /online https://goodhost.domain/online
> </VirtualHost>
> and another one
> <VirtualHost <dotted.ip.address>:80>
> Redirect permanent /online https://goodhost.domain/online
> </VirtualHost>
> Neither of these mentions robots.txt explicitly, but I wonder if there
> might be a side effect.
>


The thing that you seem to be missing is from the "/robots.txt"
specification: Either the resource exists or it doesn't. That means that
it may not be redirected like you're trying to do.

'This file must be accessible via HTTP on the local URL "/robots.txt".'
http://www.robotstxt.org/orig.html section labelled "The Method,"
paragraph 1.

Now, the robot is also misbehaving. When a code other than 200, 304, 404,
or 410 is returned, it needs to treat the virtual domain as bad and not
fetch anything more from it. If it is indeed fetching other resources, it
is a misbehaving robot. 200 and 304 are valid to indicate the resource
exists (and was served or has not changed since the last read), and 404 and
410 mean that there is no "/robots.txt" resource and everything at the site
is fair game. Everything else is a server configuration error.
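
If the goal is for /robots.txt to be a local resource under both names, one
option is to serve both names from a single virtual host. This is only a
sketch: 192.0.2.10 is a placeholder address, the short alias "goodhost" is
an assumption, and it presumes the 302 comes from how requests for badhost
are matched to a virtual host, which this thread does not confirm.

```apache
# Apache 2.2 needs NameVirtualHost for name-based virtual hosts
# (skip this line if it is already declared elsewhere in the config).
NameVirtualHost 192.0.2.10:80

<VirtualHost 192.0.2.10:80>
    ServerName goodhost.domain
    # Answer to every name the machine is known by
    ServerAlias goodhost badhost.domain badhost
    DocumentRoot /var/www/html
    # Only /online is redirected; /robots.txt is served from DocumentRoot
    Redirect permanent /online https://goodhost.domain/online
</VirtualHost>
```

Note that "Redirect permanent" answers with a 301, not a 302, so if a 302
still appears for /robots.txt, some directive outside the configuration
quoted in this thread is likely responsible.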


 

