
Crawling all (most) of the web's robots.txt comments

Starting from this tweet …


… I hacked together a few-line robots.txt comment parser. I thought it was fun enough to drop here.

[Image: a sample robots.txt snippet]

Crawling the web for all robots.txt file comments

curl -s -N http://s3.amazonaws.com/alexa-static/top-1m.csv.zip \
  >top1m.zip && unzip -p top1m.zip  >top1m.csv

while read -r li; do
  d=$(echo "$li" | cut -f2 -d,)
  # -L follows redirects, -m10 gives up after 10 seconds
  curl -Ls -m10 "$d/robots.txt" | grep "^#" | sed "s/^/$d: /" | tee -a allrobots.txt
done < top1m.csv

The first line (curl -s ...) fetches a list of the top 1 million domains from Amazon’s Alexa. Alexa, before it became the ubiquitous home appliance, used to track popular websites. My guess is this list isn’t updated anymore, but for our purposes it’s ok if it’s older.

The while ... done loop pulls out the domain names, fetches the robots.txt, and places all comments into the allrobots.txt file.
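
To try the filtering step without hitting the network, here is a minimal sketch that runs the same grep and sed idea on an inline sample. The function name `extract_comments` and the domain `example.com` are made up for illustration. It also strips carriage returns and accepts indented comments, which the loop above does not do; that is my assumption about what messy real-world files might need, not part of the original script.

```shell
# Hypothetical helper (not part of the original script): read a robots.txt
# from stdin and print its full-line comments, prefixed with the domain.
extract_comments() {
  domain=$1
  tr -d '\r' |                 # drop Windows line endings
    grep '^[[:space:]]*#' |    # keep lines that start with a comment
    sed "s/^/$domain: /"       # prefix each line with the domain
}

# Inline sample instead of a live fetch
printf '# Hello, robots\r\nUser-agent: *\nDisallow: /tmp # inline\n  # indented\n' |
  extract_comments example.com
```

This prints the first and last sample lines with `example.com: ` in front. The `Disallow: /tmp # inline` line is skipped, because only lines that begin with `#` count, so trailing comments after a directive are not collected.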

This takes approximately forever, but the list is ordered by popularity, so you’ll start to see results from the best-known sites fairly quickly.

I stopped at about 265k sites. The results are in this Gist.

Unique entries

Since robots.txt files are surprisingly uncreative, you can just filter for unique entries:

grep -o "#.*" allrobots.txt | awk '!x[$0]++' >unique-comments.txt

(This drops the domain name – otherwise we can’t tell if a line is unique – so if you want to map it back to a site, search through the allrobots.txt file.)
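
If the awk part looks cryptic: it uses an array keyed by the whole line as a "seen" counter and prints a line only the first time it shows up, which keeps the original order (unlike `sort -u`). A small sketch on made-up sample data, with a hypothetical function name `unique_comments`:

```shell
# Hypothetical helper: same idea as the one-liner above, reading from stdin.
unique_comments() {
  grep -o '#.*' |        # keep only the comment, dropping the domain
    awk '!seen[$0]++'    # print a line only the first time it appears
}

# Inline sample in the "domain: # comment" shape of allrobots.txt
# (the domains here are made up).
printf '%s\n' \
  'a.example: # robots.txt' \
  'b.example: # Hello' \
  'c.example: # robots.txt' |
  unique_comments
```

The sample prints `# robots.txt` and then `# Hello`; the repeated comment from `c.example` is dropped.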

The results of this for my 265k sites are here.

Neatest robots.txt file comments

Obviously a subjective matter :-). I thought these were pretty neat (mostly for the ASCII art).

Really serious robots.txt files - most exclamation marks:

blend-exchange.com: # Robots, get off my lawn!!!!!!
about-windows.ru: #this is the new server!!!!
listennotes.com: # Hello, robots!!!!

  • Mentions of Asimov: 14
  • Mentions of Google: 24 585
  • Mentions of "SEO " (with trailing space): 1 757
  • Mentions of robotstxt.org: 11 282
  • Mentions of spiders: 7 173
  • Mentions of tigers: 0
  • Mentions of link juice: 1
  • Mentions of prohibited: 124
  • Mentions of strictly prohibited: 60
  • Mentions of contact us: 36
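
Counts like these can be reproduced with `grep -c`. The sketch below is one possible way, not necessarily how the numbers above were produced: it counts matching lines (not individual occurrences) and ignores case, both of which are my assumptions. The helper name `count_mentions` and the sample data are made up.

```shell
# Hypothetical helper: count lines on stdin that mention a phrase,
# ignoring case. -F treats the phrase as a literal string, not a regex.
count_mentions() {
  grep -c -i -F -- "$1" || true   # grep exits 1 on zero matches; it still prints 0
}

# Inline sample (made-up domains) instead of the real allrobots.txt
sample='a.example: # See robotstxt.org for details
b.example: # No spiders, please
c.example: # SEO team is hiring
d.example: # Google, hello'

for phrase in "robotstxt.org" "spiders" "tigers" "SEO "; do
  printf '%s: %s\n' "$phrase" "$(printf '%s\n' "$sample" | count_mentions "$phrase")"
done
```

Against the real data you would run something like `count_mentions "tigers" < allrobots.txt`. Quoting the phrase matters for cases like `"SEO "`, where the trailing space is part of the search.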

Honorable mentions:

blend-exchange.com: # I know at some point some weird person is actually going to read this and expect a joke
blend-exchange.com: # knock knock
blend-exchange.com: # who's there?
blend-exchange.com: # blunt pencil
blend-exchange.com: # blunt pencil who?
blend-exchange.com: # there is no point to this joke

ndzperformance.com: #q-why was the android itchy?
ndzperformance.com: #a-roboticks.
ndzperformance.com: #q-what did the robot call its creator?
ndzperformance.com: #a-da-ta
ndzperformance.com: #q-what kind or androids do you find in the arctic?
ndzperformance.com: #a-snobots.
ndzperformance.com: #q-what do you call an android crew team?
ndzperformance.com: #a-rowbots.
ndzperformance.com: #q-why did the robot run away?
ndzperformance.com: #a-it heard an electric can opener.
ndzperformance.com: #q-what kind of salad do androids like?
ndzperformance.com: #a-ones made with ice-borg lettuce.

Job listings everywhere …

glassdoor.com: # Think you have what it takes to join the best white-hat SEO growth hackers on the planet, and help improve the way people everywhere find jobs?
coursehero.com: # Why not apply your inquisitive nature to help students and educators succeed?
tripadvisor.com: # Think you have what it takes to join the best white-hat SEO growth hackers on the planet?
bloomberg.com: # If you can read this then you should apply here https://www.bloomberg.com/careers/
desidime.com: # If you are a growth hacker and technical aspects of SEO makes you excited, you have found a right team. Apply to us at jobs@desidime.com and dont forget to mention that you found us via Robots.txt for bonus points. ;)
clover.com: # If you are human and can read this, you should apply for a job at Clover.
coolblue.nl: # Apply at www.careersatcoolblue.com and mention this comment.
bruceclay.com: # Bruce Clay Inc is hiring elite SEOs to serve our clients. Apply to join the team at https://www.bruceclay.com/employment/
twitchy.com: # If you can read this, you should apply here: https://townhall.com/pages/townhall_jobs/
chartbeat.com: # You seem like the curious type. Care to join us?
meetanshi.com: #  If you are looking at this human, we may work together! Join Us:
iprice.co.id: #    We're always on the lookout for talented people to come join us
yanolja.com: # Come and join us at http://yanolja.in/recruitment
skywebtech.net: # If you're sniffing around this file, and you're not a robot, we're looking to meet curious folks such as yourself.
tnt.com: # We might be looking for you to join our SEO team
bark.com: # We're always looking for clever people.
sportsbookreview.com: # I wish to wish the wish you wish to wish, but if you wish the wish the witch wishes, I wont wish the wish you wish to wish.
sportsbookreview.com: # Are you an SEO that can say this tongue twister 10 times quickly? If yes, then we are looking for you! Send us your CV at webmaster[at]
evitamins.com: # We're looking for people like you to join our team
freehunter.tw: #  But we are looking for some interesting stalker

What else is there to find?

Comments / questions

There's currently no commenting functionality here. If you'd like to comment, please use Mastodon and mention me ( @hi@johnmu.com ) there. Thanks!
