This recommendation is officially deprecated as of October 2015.

Making AJAX applications crawlable

If you're running an AJAX application with content that you'd like to appear in search results, we have a new process that, when implemented, can help Google (and potentially other search engines) crawl and index your content. Historically, AJAX applications have been difficult for search engines to process because AJAX content is produced dynamically by the browser and thus not visible to crawlers. While there are existing methods for dealing with this problem, they involve regular manual maintenance to keep the content up-to-date.

What the user sees, what the crawler sees

In recent years, more and more of the web has become populated with AJAX-based applications, replacing static HTML pages. This is a great development for users because it makes applications much faster and richer. But making your application more responsive has come at a cost: crawlers are not able to see any dynamically created content, so the most modern applications are often also the least searchable. For example, a typical AJAX application may result in the following being seen by the crawler:

<html>
  <head>
    <title>MovieInfo</title>
    <script language='javascript' src='getMovieInformation.js'></script>
  </head>
  <body>
  </body>
</html>

But imagine that what the user actually sees in the browser is lots of content relating to movies and information about them. How does this happen? The browser executes the script getMovieInformation.js and creates the HTML that the user sees, for example something like this:

<html>
  <head>
    <title>MovieInfo</title>
  </head>
  <body>
    <div id="browseArea">
    ...
    <div style="font-weight: bold;">Select from below:</div>
    ...
    <div id="browseTable" valign="top">
      ...
      <a href="#%21tab0&q=Walking+on+Frozen+Water" class="menuItem">Walking on Frozen Water</a>
      ...
      <a href="#%21tab0&q=Climbing+Mauna+Kea" class="menuItem">Climbing Mauna Kea</a>
      ...
      <a href="#%21tab0&q=Sea+Turtles" class="menuItem">Sea Turtles</a>
      ...
      <a href="#%21tab0&q=This+Street+Makes+Me+Look+Fat" class="menuItem">This Street Makes Me Look Fat</a>
      ...
      <a href="#%21tab0&q=Octopus+spotting" class="menuItem">Octopus spotting</a>
      ...
      <a href="#%21tab0&q=Falling+in+Love" class="menuItem">Falling in Love</a>
      ...
    </div>
    <div id="load">
     <p>Octopus spotting follows an octopus through an average octopus day. It tells stories of hiding from predators and divers, of the neighborhood the octopus lives in, and the other animals that share its living quarters.</p>
    </div>
    ...
  </body>
</html>
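
For illustration only, a script like getMovieInformation.js might do something along the following lines. This is a hypothetical sketch, not the actual script: the /movieData endpoint, the JSON response shape, and the element it creates are all assumptions.

// Hypothetical sketch of a script like getMovieInformation.js.
// It fetches movie data and builds the content in the browser, which is
// why the crawler's static view of the page stays empty.
function loadMovieInformation(query) {
  var request = new XMLHttpRequest();
  // Hypothetical endpoint and response format.
  request.open('GET', '/movieData?q=' + encodeURIComponent(query), true);
  request.onreadystatechange = function () {
    if (request.readyState === 4 && request.status === 200) {
      var movie = JSON.parse(request.responseText);
      // None of this content exists in the static HTML the crawler sees.
      var loadArea = document.createElement('div');
      loadArea.id = 'load';
      loadArea.innerHTML = '<p>' + movie.description + '</p>';
      document.body.appendChild(loadArea);
    }
  };
  request.send(null);
}

// Run once the static page has finished loading.
window.onload = function () {
  loadMovieInformation('Octopus spotting');
};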

This is an important idea: the browser can execute JavaScript and produce content on the fly, but the crawler cannot. To make the crawler see what a user sees, the server needs to give the crawler an HTML snapshot, that is, the result of executing the JavaScript on your page. Our approach enables exactly that: it allows the site owner's web server to return to the crawler this HTML, created from static content pieces as well as by executing JavaScript, for each of the application's pages. This is what we call an HTML snapshot.

If you're curious about your own application, load it in a browser and then view the source (for example, in Firefox, right-click and select "View Page Source"). In our example, "View Page Source" would not contain the word "octopus". Similarly, if some of your content is created dynamically, the page source will not include all the content you want the crawler to see. In other words, "View Page Source" is exactly what the crawler gets. Why is this important? Because search results are based in part on the words that the crawler finds on the page: if the crawler can't find your content, it's not searchable.

Current practice

Currently, webmasters create a "parallel universe" of content. Users of JavaScript-enabled browsers see content that is created dynamically, whereas users of non-JavaScript-enabled browsers, as well as crawlers, see content that is static and created offline. In current practice, "progressive enhancement" in the form of Hijax links is often used. From the Official Google Webmaster Central Blog:

If you're starting from scratch, one good approach is to build your site's structure and navigation using only HTML. Then, once you have the site's pages, links, and content in place, you can spice up the appearance and interface with AJAX. Googlebot will be happy looking at the HTML, while users with modern browsers can enjoy your AJAX bonuses.

Of course you will likely have links requiring JavaScript for AJAX functionality, so here's a way to help AJAX and static links coexist: When creating your links, format them so they'll offer a static link as well as calling a JavaScript function. That way you'll have the AJAX functionality for JavaScript users, while non-JavaScript users can ignore the script and follow the link. For example:

<a href="ajax.htm?foo=32" >foo 32</a>

Note that the static link's URL has a parameter (?foo=32) instead of a fragment (#foo=32), which is used by the AJAX code. This is important, as search engines understand URL parameters but often ignore fragments. Web developer Jeremy Keith labeled this technique as Hijax. Since you now offer static links, users and search engines can link to the exact content they want to share or reference.

(Source: http://googlewebmastercentral.blogspot.com/2007/11/spiders-view-of-web-20.html)
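
The blog post does not show the navigate function itself. Purely as an illustration, and assuming a content container with id "content" plus a server that returns an HTML fragment for the parameter form of the URL, such a Hijax click handler might look roughly like this:

// Illustrative sketch only; the function name, element ID, and URL shape
// are assumptions, not part of the scheme. The plain href (ajax.htm?foo=32)
// still works for crawlers and users without JavaScript; this handler gives
// JavaScript users the in-page AJAX behavior instead.
function navigate(url) {
  var fragment = url.split('#')[1];                   // e.g. "foo=32"
  window.location.hash = fragment;                    // keep the state in the URL
  var request = new XMLHttpRequest();
  // Assumes the server returns just an HTML fragment for this parameter form.
  request.open('GET', 'ajax.htm?' + fragment, true);
  request.onreadystatechange = function () {
    if (request.readyState === 4 && request.status === 200) {
      document.getElementById('content').innerHTML = request.responseText;
    }
  };
  request.send(null);
}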

This approach will continue to work. If your site is already configured with Hijax, you're good to go. But if your content changes regularly and you don't want to update it manually, or if you want search engines to serve fast AJAX links, or if you have not yet implemented the Hijax scheme, you should consider the new scheme we describe here.

An agreement between crawler and server

In order to make your AJAX application crawlable, your site needs to abide by a new agreement. This agreement rests on the following:

  1. The site adopts the AJAX crawling scheme.
  2. For each URL that has dynamically produced content, your server provides an HTML snapshot, which is the content a user (with a browser) sees. Often, such URLs will be AJAX URLs, that is, URLs containing a hash fragment, for example www.example.com/index.html#key=value, where #key=value is the hash fragment. An HTML snapshot is all the content that appears on the page after the JavaScript has been executed.
  3. The search engine indexes the HTML snapshot and serves your original AJAX URLs in search results.

In order to make this work, the application must use a specific syntax in its AJAX URLs (let's call them "pretty URLs"; you'll see why in the following sections). The search engine crawler will temporarily modify these "pretty URLs" into "ugly URLs" and request those from your server. This request of an "ugly URL" indicates to the server that it should not return the regular web page it would give to a browser, but instead an HTML snapshot. When the crawler has obtained the content for the modified ugly URL, it indexes that content and then displays the original pretty URL in the search results. In other words, end users will always see the pretty URL containing the hash fragment.
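
Concretely, the scheme's "pretty URLs" use the #! hash fragment syntax, and the crawler derives the "ugly URL" by moving that fragment into an _escaped_fragment_ query parameter (the exact rules are given in the Specification). For example:

Pretty URL (shown to users and in search results):
  www.example.com/index.html#!key=value

Ugly URL (temporarily requested from your server by the crawler):
  www.example.com/index.html?_escaped_fragment_=key=value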

For more details on how to implement this agreement on your server, see the Getting started guide and the Specification.
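
To make "providing an HTML snapshot" a bit more concrete, here is a minimal, non-normative sketch of one way a server-side job could generate snapshots by rendering the page in a headless browser. It uses Node.js with the Puppeteer library, which is not part of this scheme and is only one of several possible tools; the function name and URLs are illustrative.

// Non-normative sketch: render a pretty URL in a headless browser and
// capture the resulting DOM as the HTML snapshot to serve for the ugly URL.
const puppeteer = require('puppeteer');

async function htmlSnapshot(prettyUrl) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Wait until network activity settles so the AJAX content has rendered.
  await page.goto(prettyUrl, { waitUntil: 'networkidle0' });
  const snapshot = await page.content();   // serialized DOM after JavaScript ran
  await browser.close();
  return snapshot;
}

// Example: the snapshot for the ugly URL ...?_escaped_fragment_=key=value
// would be produced by rendering the corresponding pretty URL.
htmlSnapshot('http://www.example.com/index.html#!key=value')
    .then(function (html) { console.log(html); });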