Twitter indexing peculiarities (old, probably outdated)

2007-08-17

Warning: This page is pretty old and was imported from another website. Take any content here with a grain of salt.

This post has one main reason: popular sites don’t always get it right. You can also turn that around: you don’t have to get everything right in order to be popular. Never do something on your site just because a large site does it like that.

Combine web 2.0 with search engines and what do you get? Lots of rel=nofollow links (archive.org) :), heh. You’d assume that they could get a few things right with regards to search engine optimization though.

Think again.

I hope you’re listening, Twitter (archive.org) ;) and all of you who aren’t.

Canonical redirects

A canonical domain redirect is one of the ~~longest~~ first specialized, technical, SEO-type ~~words~~ things that a webmaster learns to do. Google lets you set your preferred domain version in its Webmaster Tools, but you will usually still want to set up a 301 redirect for all the other engines. This is fairly simple to do on Apache (archive.org), but you can also do it incorrectly.

Checking the server headers for http://www.twitter.com/ (archive.org) you see:

Results of the http://oyoy.eu/ server-headers test
Tested at 17.08.2007 21:12:59:

URL=http://www.twitter.com/
**Result code: 302 (Found / Found)**
Location: http://twitter.com/
Server: BIG-IP
Connection: Keep-Alive
Content-Type: text/html
Content-Length: 0
New location: http://twitter.com/

URL=http://twitter.com/
Result code: 200 (OK / OK)
Connection: close
Date: Fri, 17 Aug 2007 21:13:00 GMT
Set-Cookie: _twitter_session=somelongnumber; domain=.twitter.com; path=/
Status: 200 OK
X-Runtime: 0.27641
ETag: "evenlongernumberhehehe"
Cache-Control: private, max-age=0, must-revalidate
Server: Joyent Web
Content-Type: text/html; charset=utf-8
Content-Length: 15000

The canonical redirect on twitter.com is incorrectly set up as a 302 redirect, not 301.
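
If you want to repeat this check without an online tool, here is a minimal Python sketch. It assumes plain http URLs and does not follow the redirect, so you see the first status code and Location header, just like in the listing above. The function names fetch_redirect and classify_redirect are my own, not part of any tool mentioned here.

```python
import http.client
from typing import Optional, Tuple
from urllib.parse import urlsplit


def fetch_redirect(url: str) -> Tuple[int, Optional[str]]:
    """Send a HEAD request over plain http and return (status, Location).

    http.client never follows redirects on its own, so the status code
    returned here is the one a crawler sees on its first request.
    """
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        conn.request("HEAD", parts.path or "/")
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def classify_redirect(status: int) -> str:
    """Group an HTTP status code by how a crawler should treat it."""
    if status in (301, 308):
        return "permanent"   # the new URL replaces the old one
    if status in (302, 303, 307):
        return "temporary"   # the old URL stays the one to index
    return "other"           # not a redirect with a new location
```

Calling classify_redirect(302) returns "temporary", which is the problem here: a canonical redirect should land in the "permanent" group.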

A 302 redirect can make sense if your site’s current main page is not the root URL. In that case, your server can 302 redirect from http://domain.com/ to http://domain.com/pages/cms/page.php?id=1 (or wherever your actual page is located). The search engines will then see the temporary redirect and keep the original URL in the index, albeit with the content of the final URL.

Although Yahoo! was one of the first search engines to provide a strict set of guidelines with regards to 302 redirecting (archive.org), you can see it best in their index. When you check the indexed URLs from www.twitter.com (archive.org) and click on the cache link (archive.org), it shows where the actual content came from.

How bad is this problem? Well… Both Yahoo (archive.org) (approx 6'400 URLs) and MSN (archive.org) (approx 3'100 URLs) have www-versions of twitter.com in their index. Google seems to have interpreted the 302 redirect as a 301, perhaps because “twitter.com” is shorter than (archive.org) “www.twitter.com”.

One of the problems with the 302 redirect is that search engines might assume that they are still on the original URL, www.twitter.com, when it comes to interpreting links. This is problematic when links are relative, as many are on the Twitter pages. A link like <a href="/friends/index/813286" … > can be interpreted as being a link to “http://www.twitter.com/friends/index/813286”. By mixing relative linking with a broken canonical redirect, the site is effectively promoting the incorrect version of its URL.
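
To make the relative-link effect concrete, here is a small Python sketch using the standard library’s urljoin. It only illustrates how a relative href resolves against whatever base URL the crawler believes it is on. It says nothing about how any particular search engine is implemented, and the variable names are my own.

```python
from urllib.parse import urljoin

# The relative link from the example above.
RELATIVE_LINK = "/friends/index/813286"


def resolve(base_url: str, href: str) -> str:
    """Resolve href the way a browser or crawler would, given a base URL."""
    return urljoin(base_url, href)


# If the crawler keeps the www URL as its base (302 = "stay on the original"),
# every relative link points back to the www version.
www_version = resolve("http://www.twitter.com/", RELATIVE_LINK)

# With a 301, the base becomes the canonical host instead.
bare_version = resolve("http://twitter.com/", RELATIVE_LINK)
```

The same href ends up on two different hosts, depending only on which base URL the crawler kept.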

Additional canonicalization problems

What, more canonicalization problems? Some servers are set up to serve the same content through https (the secure connection). If a server does that, it makes sense to block indexing of that content, either through a robots meta tag or (for Google) the new X-Robots-Tag (archive.org) in the HTTP header. Twitter has the https: version of its URLs indexed, at least on Google (though you can’t query these separately). I didn’t spot any on Yahoo or MSN.
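
As a rough illustration of the header approach, here is a minimal Python sketch of the decision a server could make per request. The function name robots_headers is hypothetical, and it assumes the https copy is the duplicate you want kept out of the index, as described above. How you attach the header depends on your server or framework.

```python
from typing import Dict


def robots_headers(scheme: str) -> Dict[str, str]:
    """Return extra response headers for a request on the given scheme.

    Assumption: the https copy is an unwanted duplicate of the http site,
    so only https responses are marked as not indexable.
    """
    if scheme.lower() == "https":
        return {"X-Robots-Tag": "noindex"}
    return {}
```

For pages where you cannot touch the headers, a robots meta tag in the HTML does the same job.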

Even more canonicalization issues - getting your server’s IP address indexed

… is a really bad idea. What happens when you move? What happens to the cookies stored for the site? What happens when you want to expand and add a round-robin DNS system to spread the load over multiple servers?

Google (archive.org) (385'000 URLs), Yahoo (archive.org) (5'700 URLs) and MSN (archive.org) (22 URLs) all have Twitter’s IP address indexed. As far as I can tell, this arose from a glitch in their website some time back: many of the profiles were linked through the IP address instead of the domain name (from the twitter.com domain name). The profiles are now linked with an absolute URL on “twitter.com”, but these used to be linked with a relative URL as well, which further promoted the IP address for Twitter profiles.

On Apache, with a proper .htaccess file, this would be easy to fix (archive.org): just have all accesses 301 redirected to the proper URL.
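
The logic behind such a rule is short. Here is a minimal sketch in Python rather than in Apache configuration, just to show the decision. The names CANONICAL_HOST and canonical_redirect are my own, the IP address in the usage note is a documentation placeholder rather than Twitter’s real address, and IPv6 host literals are not handled.

```python
from typing import Optional, Tuple

# Assumption: the bare domain is the version that should be indexed.
CANONICAL_HOST = "twitter.com"


def canonical_redirect(host: str, path: str) -> Optional[Tuple[int, str]]:
    """Return (301, target URL) for any non-canonical host, else None.

    `host` is the value of the Host request header, possibly with a port.
    """
    hostname = host.partition(":")[0].lower().rstrip(".")
    if hostname == CANONICAL_HOST:
        return None  # already on the canonical host, serve the page
    # Anything else (www, a raw IP address, a stray domain) gets a
    # permanent redirect to the same path on the canonical host.
    return 301, f"http://{CANONICAL_HOST}{path}"
```

With this, canonical_redirect("www.twitter.com", "/friends") and canonical_redirect("192.0.2.1", "/friends") both return (301, "http://twitter.com/friends"), while a request for twitter.com itself is served normally.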

And even more canonicalization problems …..

Since the server responds to all requests with the content of twitter.com, any server name that resolves to Twitter’s IP address can get indexed. One domain that I found was gezwitscher.com (archive.org) (which means something like “twitter” in German). There are also several subdomains on other sites that resolve to that IP address.

Twitter has a canonicalization problem, though some of the engines are guessing more or less correctly about what should be indexed and what shouldn’t.

Use of rel=nofollow microformats on links

The original introduction of this microformat (archive.org) mentions that it can be used to prevent comment spam from gaining value through links. It has since been expanded to usage as a general block for the crawling of a link (though this is a bit controversial, as any link without rel=nofollow would result in crawling of the linked URL anyway).

Twitter has recently added the rel=nofollow microformat to all links that are posted by the users of Twitter (there are no anonymous comments to postings on Twitter). Other, internal links are without this microformat. However, the users’ homepages (from their profiles) also have links without the rel=nofollow microformat (if a user regularly uses the site and has many friends who link to them, it can be assumed that the user’s homepage is not spammy). If the homepage is trusted, why would the other links that the user places not be trustworthy? The links are already locked behind “tinyurl” (it would be nice to see the final URL as a tooltip over the link, by the way), so why add the nofollow?

Adding the nofollow microformat to the posted links does not seem consistent with the handling of the user’s homepage links. However, it’s still better to have just the homepage linked than not have any links at all.

If Twitter had to rely on traffic from search engines, these issues could have a big impact. I imagine Twitter does not have to rely on that traffic source, which makes issues like the above less important. However, since the profiles are indexable, perhaps they do want some sort of traffic from the search engines, at least based on those profiles.

So what …. ?

Twitter can get away with these mistakes because it doesn’t need the search engines. Most other sites are different and need all the help they can get - which includes technicalities like fixing the canonical redirect and reducing the number of URLs in use.

PS I have nothing against Twitter, in fact I really like to use it for quick updates :).