Wednesday, April 26, 2006

Google's "Crawl Caching Proxy" Ends My AdSense Advantage in SERPs Obsession

Google's "Crawl Caching Proxy" Ends My AdSense Advantage in SERPs Obsession

by Garrett French and Jon Revill

I spilled millions of pixels covering what I thought was an emerging Google controversy:

AdSense as Route to Google Index + Google's Jugular?

Google's Mediapartner Bot Index Creep Indicate AdSense Publisher SERPs Advantage?

Google's Matt Cutts Confirms AdSense Bot to BigDaddy Connection

And, by necessity born of my non-techtitude, I asked Jon Revill to join the fray:
Jon Revill on Load Balancing Between AdSense Bot and Googlebot

And then Matt Cutts published Crawl caching proxy, causing me to extinguish my torches and put the pitchforks back in the barn. As Jon says below: "[based on Cutts' post] participating in AdSense or any other Google program will not increase your ability to be cached, indexed, or ranked."

Of particular interest as a side note, Jon says, "from a logging perspective, the total number of requests from Google-related bots to a site’s server should decrease significantly, but the amount of utilization of a site’s pages by different Google services should remain the same."

In other words, you may see Google-related bots visiting less frequently overall as they begin to balance the load.

In response to Cutts' post, Jon Revill wrote the following to explain what's happening, and to help me sleep better at night:
Matt’s article makes a lot of sense and is brilliantly simple in terms of load balancing (I thought that IF we could have shown that Googlebot visits did NOT decrease for AdSense publishers, we could catch Google favoring its business partners in the SERPs... -G).

Essentially, Google is using a system very similar to a web browser's Temporary Internet Files, in that it stores a copy of the page for future crawling.

Any bot that initially accesses a page stores it in a central location that is independent of the actual search index or any other service. This location is known as the “Crawl Caching Proxy”.

If another bot needs that page for a different Google service, it will try to pull it from the proxy before requesting it from the server.

To work from Matt’s example, we have http://www.domain.com/page.htm, which publishes AdSense ads.

It is also regularly crawled by both Googlebot and Mediabot.

Given the new crawl process, Mediabot will access page.htm to review the content for serving AdSense ads. As part of this crawl, Mediabot adds the page to the Crawl Caching Proxy.

The file is not added or updated in the search index at this time.

When Googlebot prepares to crawl page.htm for the natural search index, it will first check the Crawl Caching Proxy to see if the file has been recently crawled by a different Google service.

If it has, Googlebot will access the page from the proxy rather than requesting it from domain.com’s server.

From the sound of it, any Google service that utilizes a spider can contribute to the proxy, and any of these services can access it. This means that for however long the page is stored in the proxy, Google will need only a single crawl to serve all Google services.
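Google has not published how the proxy actually works, so here is a minimal Python sketch of the flow Jon describes, purely as an illustration. The class name, the freshness window, and the fetch_from_server helper are all assumptions on my part, not anything from Matt's post:

```python
import time

class CrawlCachingProxy:
    """Hypothetical shared page cache that any Google crawler can read from or write to."""

    def __init__(self, max_age_seconds=86400):
        # How long a cached copy is considered fresh (an assumed value).
        self.max_age_seconds = max_age_seconds
        self.cache = {}  # url -> (timestamp, html)

    def fetch(self, url, fetch_from_server):
        """Return the page, hitting the origin server only on a cache miss."""
        entry = self.cache.get(url)
        if entry and time.time() - entry[0] < self.max_age_seconds:
            return entry[1]  # cache hit: no request reaches the site's server
        html = fetch_from_server(url)  # cache miss: one real request to the server
        self.cache[url] = (time.time(), html)
        return html


# Worked example mirroring Matt's page.htm scenario.
server_requests = 0

def fetch_from_server(url):
    global server_requests
    server_requests += 1
    return "<html>page content</html>"  # stand-in for the real HTTP request

proxy = CrawlCachingProxy()

# Mediabot crawls the page for AdSense and populates the proxy...
mediabot_copy = proxy.fetch("http://www.domain.com/page.htm", fetch_from_server)
# ...then Googlebot asks for the same page for the search index and is served the cached copy.
googlebot_copy = proxy.fetch("http://www.domain.com/page.htm", fetch_from_server)

print(server_requests)  # 1 -- the site's logs record a single request, yet both services got the page
```

If that sketch is roughly right, the final count is exactly the logging effect Jon describes next: fewer requests hitting the server, with no change in how many Google services actually use the page.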

From a logging perspective, the total number of requests from Google-related bots to a site’s server should decrease significantly, but the amount of utilization of a site’s pages by different Google services should remain the same.

Matt also notes that any robots.txt exclusions will still apply. If Googlebot is blocked in the robots.txt from crawling a particular file or directory, it will not attempt to access the file from the proxy. Overall, this should make Google services a lot more efficient for both site owners and Google as a whole.
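Matt's post doesn't spell out the mechanics of that check, but one way to picture it is a per-bot robots.txt guard in front of the proxy lookup. This extends the hypothetical sketch above and leans on Python's standard urllib.robotparser; the fetch_for_service helper and the robots.txt URL are inventions for illustration:

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt_url, user_agent, page_url):
    """Check a bot's user agent against the site's robots.txt rules."""
    parser = RobotFileParser()
    parser.set_url(robots_txt_url)
    parser.read()  # fetch and parse the live robots.txt
    return parser.can_fetch(user_agent, page_url)

def fetch_for_service(proxy, user_agent, page_url, fetch_from_server):
    """Serve a page to a Google service only if that bot is allowed to crawl it."""
    if not is_allowed("http://www.domain.com/robots.txt", user_agent, page_url):
        return None  # a bot blocked by robots.txt never sees the proxy copy either
    return proxy.fetch(page_url, fetch_from_server)
```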

To make it very clear from Matt’s article, participating in AdSense or any other Google program will not increase your ability to be cached, indexed, or ranked.

All restrictions, exclusions and best practices still apply.

This spidering methodology still operates essentially in the same manner as previous practices. Google’s bots are simply accessing a cached/stored version of your page rather than requesting it directly from the server.

Matt has provided very clear diagrams in his article, http://www.mattcutts.com/blog/crawl-caching-proxy/.

This article was written by MarketSmart Interactive's Jon Revill and padded with contextually relevant information by MSI's Garrett French.
