Sunday, 16 February 2003 CST

Web Page Steganography and Googlebot

Google and other search engines crawl the web looking for new stuff to add to their caches. They typically obey robots.txt, which lets a site exclude specified content. This is a great thing, because it allows each site to suggest what not to cache. It does assume, however, that everyone plays by the rules.

It seems to me that a page's last modified time is about the only signal search engines use to decide when to update their caches (I can think of other, less obvious ways, but they don't seem worth implementing). So webcrawlers like Googlebot presumably rescan cached resources on some regular schedule to see whether the content has changed, updating the cache if necessary.
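One standard way a crawler could make this check is HTTP's conditional GET: it sends the mod time it has on record in an If-Modified-Since header, and the server answers 304 Not Modified if the page is no newer. Whether Googlebot actually works this way is an assumption on my part. The sketch below only builds the header; the function name is made up for illustration.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def revalidation_headers(cached_mod_time):
    """Build the header a crawler might send when re-checking a cached page.

    Hypothetical helper: the name and the idea that a crawler keeps one
    mod time per cached URL are assumptions, not documented behaviour.
    """
    # HTTP dates are always expressed in GMT, so the datetime must be UTC-aware.
    return {"If-Modified-Since": format_datetime(cached_mod_time, usegmt=True)}


# Mod time recorded when the page was first cached (00:37 CST is 06:37 UTC).
cached_at = datetime(2003, 2, 16, 6, 37, tzinfo=timezone.utc)
headers = revalidation_headers(cached_at)
```

A server that sees this header and finds the page's mod time is not later than the given date would normally reply 304 with no body, and the crawler would keep its cached copy.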

Say one were to create a web page with a mod time of T and let it get cached by Google. If the content were then changed but the last mod time set to T-n (some time earlier than T), Google would not update its copy. The new content would appear to be older than the cached content, and so would be considered out-of-date.
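Here is a toy simulation of that scenario. It assumes the crawler's only rule is "refresh when the reported mod time is newer than the one on record", which is my guess rather than anything Google has published. The class, function, and URL are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CachedPage:
    content: str
    last_modified: datetime


def crawl(cache, url, content, last_modified):
    """Assumed crawler rule: store the page only if it looks newer.

    Returns True if the cache was updated, False if the visit was ignored.
    """
    cached = cache.get(url)
    if cached is None or last_modified > cached.last_modified:
        cache[url] = CachedPage(content, last_modified)
        return True
    return False


url = "http://example.com/page"   # placeholder URL
T = datetime(2003, 2, 16, 0, 37)  # original mod time
n = timedelta(days=1)             # how far back the mod time is pushed

cache = {}
first_visit = crawl(cache, url, "innocuous text", T)       # page gets cached
second_visit = crawl(cache, url, "changed content", T - n)  # looks older, ignored
```

After the second visit the cache still holds "innocuous text", even though the live page now serves "changed content".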

This would, in effect, be a steganographic means of hiding web content behind already-cached data. The new content is invisible to Google, yet people could still find your page by searching for the cached text and would see the actual content when they visit.

Is this useful? Who knows. Could it be overcome by webcrawlers? Probably. But it is an interesting idea.

kherr @ 00:37 CST | link | tech