Official Google Webmaster Central Blog

A note on unsupported rules in robots.txt

Tuesday, July 02, 2019

Yesterday we announced that we're open-sourcing Google's production robots.txt parser. It was an exciting moment that paves the road for potential Search open sourcing projects in the future! Feedback is helpful, and we're eagerly collecting questions from developers and webmasters alike. One question stood out, which we'll address in this post:
Why isn't a code handler for other rules like crawl-delay included in the code?
The internet draft we published yesterday provides an extensible architecture for rules that are not part of the standard. This means that if a crawler wanted to support their own line like "unicorns: allowed", they could. To demonstrate how this would look in a parser, we included a very common line, sitemap, in our open-source robots.txt parser.
While open-sourcing our parser library, we analyzed the usage of robots.txt rules. In particular, we focused on rules unsupported by the internet draft, such as crawl-delay, nofollow, and noindex. Since these rules were never documented by Google, naturally, their usage in relation to Googlebot is very low. Digging further, we saw their usage was contradicted by other rules in all but 0.001% of all robots.txt files on the internet. These mistakes hurt websites' presence in Google's search results in ways we don’t think webmasters intended.
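To check whether a robots.txt file still relies on rules like these, a small scan can help. The following is a minimal sketch, not the logic of Google's parser: the function name `find_unsupported_rules` and the `SUPPORTED_RULES` set are assumptions made for illustration, and the set simply treats user-agent, allow, disallow, and sitemap as supported.

```python
# Hypothetical helper, not part of Google's parser: flags rule names that
# fall outside a small set assumed here to be supported.
SUPPORTED_RULES = {"user-agent", "allow", "disallow", "sitemap"}


def find_unsupported_rules(robots_txt: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for each rule outside SUPPORTED_RULES."""
    findings = []
    for number, raw_line in enumerate(robots_txt.splitlines(), start=1):
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue  # blank or malformed line; a real parser may be more lenient
        name = line.split(":", 1)[0].strip().lower()
        if name and name not in SUPPORTED_RULES:
            findings.append((number, name))
    return findings


sample = """\
User-agent: *
Disallow: /private/
Noindex: /drafts/
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
"""

for number, name in find_unsupported_rules(sample):
    print(f"line {number}: unsupported rule '{name}'")
```

For the sample file, the scan reports the noindex line and the crawl-delay line and leaves the other three alone.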
In the interest of maintaining a healthy ecosystem and preparing for potential future open source releases, we're retiring all code that handles unsupported and unpublished rules (such as noindex) on September 1, 2019. For those of you who relied on the noindex indexing directive in the robots.txt file, which controls crawling, there are a number of alternative options:

Noindex in robots meta tags: Supported both in the HTTP response headers and in HTML, the noindex directive is the most effective way to remove URLs from the index when crawling is allowed.
404 and 410 HTTP status codes: Both status codes mean that the page does not exist, which will drop such URLs from Google's index once they're crawled and processed.
Password protection: Unless markup is used to indicate subscription or paywalled content, hiding a page behind a login will generally remove it from Google's index.
Disallow in robots.txt: Search engines can only index pages that they know about, so blocking the page from being crawled usually means its content won’t be indexed. While the search engine may also index a URL based on links from other pages, without seeing the content itself, we aim to make such pages less visible in the future.
Search Console Remove URL tool: The tool is a quick and easy method to remove a URL temporarily from Google's search results.
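The first two options in the list above can be sketched in a few lines. This is an illustration under stated assumptions rather than a recommended implementation: the helper names are hypothetical, and choosing 410 for pages that are permanently gone and 404 otherwise is a convention based on general HTTP semantics, since the list above treats both codes the same way.

```python
from http import HTTPStatus

# The robots meta tag form of noindex, placed in the <head> of an HTML page.
NOINDEX_META_TAG = '<meta name="robots" content="noindex">'


def noindex_response_headers(headers=None):
    """Return a copy of the response headers with an X-Robots-Tag noindex added.

    Useful for non-HTML resources (for example PDFs) that cannot carry a meta tag.
    """
    merged = dict(headers or {})
    merged["X-Robots-Tag"] = "noindex"
    return merged


def removed_page_status(permanently_gone: bool) -> int:
    """Pick a status code for a page that no longer exists (hypothetical helper)."""
    return int(HTTPStatus.GONE if permanently_gone else HTTPStatus.NOT_FOUND)


print(NOINDEX_META_TAG)
print(noindex_response_headers({"Content-Type": "application/pdf"}))
print(removed_page_status(True))
```

Either noindex form only works when the page can be crawled, because the crawler has to fetch the response to see the directive.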

For more guidance about how to remove information from Google's search results, visit our Help Center. If you have questions, you can find us on Twitter and in our Webmaster Community, both offline and online.

Posted by Gary

Google's robots.txt parser is now open source

Monday, July 01, 2019

For 25 years, the Robots Exclusion Protocol (REP) was only a de-facto standard. This had frustrating implications sometimes. On one hand, for webmasters, it meant uncertainty in corner cases, like when their text editor included BOM characters in their robots.txt files. On the other hand, for crawler and tool developers, it also brought uncertainty; for example, how should they deal with robots.txt files that are hundreds of megabytes large?

Today, we announced that we're spearheading the effort to make the REP an internet standard. While this is an important step, it means extra work for developers who parse robots.txt files.
We're here to help: we open sourced the C++ library that our production systems use for parsing and matching rules in robots.txt files. This library has been around for 20 years and it contains pieces of code that were written in the 1990s. Since then, the library has evolved; we learned a lot about how webmasters write robots.txt files and about the corner cases we had to cover, and we added what we learned over the years to the internet draft when it made sense.
We also included a testing tool in the open source package to help you test a few rules. Once built, the usage is very straightforward:
robots_main <robots.txt content> <user_agent> <url>
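For a quick comparison without building the C++ tool, the same three inputs (robots.txt content, user agent, URL) can be fed to Python's standard-library `urllib.robotparser`. This is only a rough sketch: it is a different implementation from Google's parser and may disagree with it in corner cases, and the wrapper name `is_allowed` and the `ExampleBot` user agent are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt content using Python's built-in parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse() takes an iterable of lines
    return parser.can_fetch(user_agent, url)


robots_txt = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(robots_txt, "ExampleBot", "https://example.com/private/page.html"))
print(is_allowed(robots_txt, "ExampleBot", "https://example.com/public/page.html"))
```

The first URL falls under the disallowed prefix and the second does not.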
If you want to check out the library, head over to our GitHub repository for the robots.txt parser. We'd love to see what you can build using it! If you built something using the library, drop us a comment on Twitter, and if you have comments or questions about the library, find us on GitHub.
Posted by Edu Pereda, Lode Vandevenne, and Gary, Search Open Sourcing team

Formalizing the Robots Exclusion Protocol Specification

Monday, July 01, 2019

For 25 years, the Robots Exclusion Protocol (REP) has been one of the most basic and critical components of the web. It allows website owners to exclude automated clients, for example web crawlers, from accessing their sites - either partially or completely.
In 1994, Martijn Koster (a webmaster himself) created the initial standard after crawlers were overwhelming his site. With more input from other webmasters, the REP was born, and it was adopted by search engines to help website owners manage their server resources more easily.
However, the REP was never turned into an official Internet standard, which means that developers have interpreted the protocol somewhat differently over the years. And since its inception, the REP hasn't been updated to cover today's corner cases. This is a challenging problem for website owners because the ambiguous de-facto standard made it difficult to write the rules correctly.
We wanted to help website owners and developers create amazing experiences on the internet instead of worrying about how to control crawlers. Together with the original author of the protocol, webmasters, and other search engines, we've documented how the REP is used on the modern web, and submitted it to the IETF.
The proposed REP draft reflects over 20 years of real-world experience of relying on robots.txt rules, used both by Googlebot and other major crawlers, as well as about half a billion websites that rely on the REP. These fine-grained controls give the publisher the power to decide what they'd like to be crawled on their site and potentially shown to interested users. The draft doesn't change the rules created in 1994, but rather defines essentially all undefined scenarios for robots.txt parsing and matching, and extends the protocol for the modern web. Notably:

Any URI based transfer protocol can use robots.txt. For example, it's not limited to HTTP anymore and can be used for FTP or CoAP as well.
Developers must parse at least the first 500 kibibytes of a robots.txt. Defining a maximum file size ensures that connections are not open for too long, alleviating unnecessary strain on servers.
A new maximum caching time of 24 hours, or the cache directive value if available, gives website owners the flexibility to update their robots.txt whenever they want, while crawlers avoid overloading websites with robots.txt requests. For example, in the case of HTTP, Cache-Control headers could be used for determining caching time.
The specification now provisions that when a previously accessible robots.txt file becomes inaccessible due to server failures, known disallowed pages are not crawled for a reasonably long period of time.
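The size and caching points in the list above can be sketched as follows. This is one possible reading of those points, not the draft's normative text: the function names are hypothetical, and using `max-age` when present and falling back to 24 hours otherwise is an assumption about how a crawler might apply them.

```python
import re

MAX_PARSED_BYTES = 500 * 1024      # 500 kibibytes, the minimum a parser must handle
MAX_CACHE_SECONDS = 24 * 60 * 60   # 24 hours


def truncate_robots_body(body: bytes) -> bytes:
    """Keep only the first 500 KiB of a fetched robots.txt (hypothetical helper)."""
    return body[:MAX_PARSED_BYTES]


def cache_lifetime_seconds(cache_control):
    """Use the Cache-Control max-age value if present, otherwise fall back to 24 hours."""
    if cache_control:
        match = re.search(r"max-age\s*=\s*(\d+)", cache_control, re.IGNORECASE)
        if match:
            return int(match.group(1))
    return MAX_CACHE_SECONDS


print(len(truncate_robots_body(b"a" * 600_000)))
print(cache_lifetime_seconds("public, max-age=3600"))
print(cache_lifetime_seconds(None))
```

A production crawler would also need to handle the unreachable-file case from the last list item, which this sketch leaves out.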

Additionally, we've updated the augmented Backus–Naur form in the internet draft to better define the syntax of robots.txt, which is critical for developers to parse the lines.
RFC stands for Request for Comments, and we mean it: we uploaded the draft to IETF to get feedback from developers who care about the basic building blocks of the internet. As we work to give web creators the controls they need to tell us how much information they want to make available to Googlebot, and by extension, eligible to appear in Search, we have to make sure we get this right.
If you'd like to drop us a comment, ask us questions, or just say hi, you can find us on Twitter and in our Webmaster Community, both offline and online.

Posted by Henner Zeller, Lizzi Harvey, and Gary

Bye Bye Preferred Domain setting

Tuesday, June 18, 2019

As we progress with the migration to the new Search Console experience, we will be saying farewell to one of our settings: preferred domain.

It's common for a website to have the same content on multiple URLs. For example, it might have the same content on http://example.com/ as on https://www.example.com/index.html. To make things easier, when our systems recognize that, we'll pick one URL as the "canonical" for Search. You can still tell us your preference in multiple ways if there's something specific you want us to pick (see paragraph below). But if you don't have a preference, we'll choose the best option we find. Note that with the deprecation we will no longer use any existing Search Console preferred domain configuration.

You can find detailed explanations on how to tell us your preference in the Consolidate duplicate URLs help center article. Here are some of the options available to you:

Use rel="canonical" link tag on HTML pages
Use rel="canonical" HTTP header
Use a sitemap
Use 301 redirects for retired URLs
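The first two options in the list above differ only in where the canonical URL is declared. Here is a minimal sketch of both forms, assuming hypothetical helper names and an example URL; how the output gets into a page or response depends on your server setup.

```python
from html import escape


def canonical_link_tag(url: str) -> str:
    """Build the link tag to place in the <head> of an HTML page."""
    return f'<link rel="canonical" href="{escape(url, quote=True)}">'


def canonical_link_header(url: str) -> tuple[str, str]:
    """Build the HTTP header form, useful for non-HTML files such as PDFs."""
    return ("Link", f'<{url}>; rel="canonical"')


preferred = "https://www.example.com/"
print(canonical_link_tag(preferred))
print(canonical_link_header(preferred))
```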

Send us any feedback either through Twitter or our forum.

Posted by Daniel Waisberg, Search Advocate

Webmaster Conference: an event made for you

Monday, June 10, 2019

Over the years we have attended hundreds of conferences, spoken to thousands of webmasters, and recorded hundreds of hours of videos to help web creators find information about how to perform better in Google Search results. Now we'd like to go further and help those who aren't able to travel internationally access the same information.

Today we're officially announcing the Webmaster Conference, a series of local events around the world. These events are primarily located where it's difficult to access search conferences or information about Google Search, or where there's a specific need for a Search event. For example, if we identify that a region has problems with hacked sites, we may organize an event focusing on that specific topic.

We want web creators to have equal opportunity in Google Search regardless of their language, financial status, gender, location, or any other attribute. The conferences are always free and easily accessible in the region where they're organized, and, based on feedback from the local communities and analyses, they're tailored for the audience that signed up for the events. That means it doesn't matter how much you already know about Google Search; the event you attend will have takeaways tailored to you. The talks will be in the local language, in case of international speakers through interpreters, and we'll do our best to also offer sign language interpretation if requested.

Webmaster Conference Okinawa

The structure of the event varies from region to region. For example, in Okinawa, Japan, we had a wonderful half-day event with novice and advanced web creators where we focused on how to perform better in Google Images. At Webmaster Conference India and Indonesia, that might change and we may focus more on how to create faster websites. We will also host web communities in Europe and North America later this year, so keep an eye out for the announcements!
We will continue attending external events as usual; we are doing these events to complement the existing ones. If you want to learn more about our upcoming events, visit the Webmaster Conference site which we'll update monthly, and follow our blogs and @googlewmc on Twitter!

Posted by Takeaki Kanaya and Gary

A video series on SEO myths for web developers

Thursday, June 06, 2019

We invited members of the SEO and web developer community to join us for a new video series called "SEO mythbusting".

In this series, we discuss various topics around SEO from a developer's perspective, how we can work to make the "SEO black box" more transparent, and what technical SEO might look like as the web keeps evolving. We already published a few episodes:

Web developer's 101
A look at Googlebot
Microformats and structured data
JavaScript and SEO

We have a few more episodes for you and we will launch the next episodes weekly on the Google Webmasters YouTube channel, so don't forget to subscribe to stay in the loop. You can also find all published episodes in this YouTube playlist. We look forward to hearing your feedback, topic suggestions, and guest recommendations in the YouTube comments as well as our Twitter account!

Posted by Martin Splitt, friendly web fairy & series host, WTA team

Mobile-First Indexing by default for new domains

Tuesday, May 28, 2019

Over the years since announcing mobile-first indexing - Google's crawling of the web using a smartphone Googlebot - our analysis has shown that new websites are generally ready for this method of crawling. Accordingly, we're happy to announce that mobile-first indexing will be enabled by default for all new websites (those previously unknown to Google Search) starting July 1, 2019. It's fantastic to see that new websites are now generally showing users - and search engines - the same content on both mobile and desktop devices!

You can continue to check for mobile-first indexing of your website by using the URL Inspection Tool in Search Console. By looking at a URL on your website there, you'll quickly see how it was last crawled and indexed. For older websites, we'll continue monitoring and evaluating pages for their readiness for mobile-first indexing, and will notify their owners through Search Console once the sites are seen as being ready. Since the default state for new websites will be mobile-first indexing, there's no need to send a notification for them.

Using the URL Inspection Tool to check the mobile-first indexing status

Our guidance on making all websites work well for mobile-first indexing continues to be relevant, for new and existing sites. For existing websites we determine their readiness for mobile-first indexing based on parity of content (including text, images, videos, links), structured data, and other meta-data (for example, titles and descriptions, robots meta tags). We recommend double-checking these factors when a website is launched or significantly redesigned.

While we continue to support responsive web design, dynamic serving, and separate mobile URLs for mobile websites, we recommend responsive web design for new websites. Because of issues and confusion we've seen from separate mobile URLs over the years, both from search engines and users, we recommend using a single URL for both desktop and mobile websites.

Mobile-first indexing has come a long way. We're happy to see how the web has evolved from being focused on desktop, to becoming mobile-friendly, and now to being mostly crawlable and indexable with mobile user-agents! We realize it has taken a lot of work from your side to get there, and on behalf of our mostly-mobile users, we appreciate that. We’ll continue to monitor and evaluate this change carefully. If you have any questions, please drop by our Webmaster forums or our public events.

Posted by John Mueller, Developer Advocate, Google Zurich
