Preventing Web crawlers from accessing your development site with robots.txt

We recently started a development site for a new project. Soon, we noticed an AWS instance regularly requesting URLs deep within the site. The site is access-protected (you need to be logged in to see the pages), so those requests failed, but they were annoying nonetheless.

A bit more digging revealed that this was the Alexa bot trying to crawl our site. I am not sure how they found it so quickly (we have no incoming links and this was a dev.****.*** subdomain); they probably analyze DNS entries to find sites more efficiently. I am also not sure how they found the deep URLs (they are not exposed on the public part of the dev site), but Alexa was here nonetheless.

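This reminded us that it’s a good idea to ask crawlers to stay out of your development sites. Crawlers look up robots.txt separately for each host, so the file has to sit at the root of the development subdomain itself, not just the main domain, with the following content: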

User-agent: *
Disallow: /

Of course, this will only keep out well-behaved web crawlers that actually respect the robots.txt file, and it is no substitute for the login protection, but at least you have a few fewer visitors to worry about.
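
If you want to confirm that the two-line file really blocks everything for a crawler that honors it, here is a minimal sketch using Python's standard urllib.robotparser module. The host dev.example.com, the paths, the crawler name "ExampleBot", and the helper is_allowed are all made up for illustration; substitute your own values.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt discussed in this post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honors robots_txt may fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network
    # request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical host, paths, and crawler name, for illustration only.
    urls = (
        "http://dev.example.com/",
        "http://dev.example.com/projects/42/edit",
    )
    for url in urls:
        verdict = "allowed" if is_allowed(ROBOTS_TXT, "ExampleBot", url) else "blocked"
        print(f"{url}: {verdict}")
```

This only models what a compliant crawler would decide. It says nothing about bots that ignore robots.txt, which is why the login protection still matters.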
