The connection quality rouses no cavils Puppethead
Sunday, 16 February 2003 CST
Web Page Steganography and Googlebot
Google and other search engines crawl the web looking for new material to add to their caches. They typically obey robots.txt, which lets a site exclude specified content. This is a great thing, because it allows each site to suggest what not to cache. It does assume, however, that everyone plays by the rules.
It seems to me that checking the last-modified time of a web page is about the only way search engines decide when to update their caches (I can think of other, less obvious ways that don't seem worth implementing). So webcrawlers like Googlebot rescan cached resources on some regular schedule to see whether the content has changed, updating the cache if necessary.
Say one were to create a web page with a last-modified time of T and let it get cached by Google. If the content were then changed, but the last-modified time were set to T-n, Google would not update its copy. The new content would appear to be older than the cached content, and so would be considered out of date.
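Here is a minimal sketch of that scenario in Python. It assumes the crawler refreshes its cache only when the served Last-Modified value is strictly newer than the one it stored, which is my guess above and not documented Googlebot behavior. The function name should_refresh and the values chosen for T and n are made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def should_refresh(cached_last_modified: str, served_last_modified: str) -> bool:
    """Hypothetical crawler rule: refetch only if the page claims to be newer.

    Both arguments are HTTP-date strings as they would appear in a
    Last-Modified header.
    """
    cached = parsedate_to_datetime(cached_last_modified)
    served = parsedate_to_datetime(served_last_modified)
    return served > cached


# T is when the page was first cached; n is an arbitrary offset.
T = datetime(2003, 2, 16, 12, 0, 0, tzinfo=timezone.utc)
n = timedelta(days=1)

cached_header = format_datetime(T, usegmt=True)         # what the crawler stored
honest_header = format_datetime(T + n, usegmt=True)     # normal edit, newer time
backdated_header = format_datetime(T - n, usegmt=True)  # edited, but set to T-n

# An honest edit triggers a refresh; a backdated edit does not.
print(should_refresh(cached_header, honest_header))
print(should_refresh(cached_header, backdated_header))
```

Under that assumed rule, the backdated page is never refetched, so the cache keeps the original content for as long as the server keeps reporting T-n.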
This, in effect, would be a steganographic means of hiding web content behind already-cached data. The new content is invisible to Google, and yet people would still be able to find your page by searching (on the old, cached content) and see the actual content when they visit.
Is this useful? Who knows. Could it be overcome by webcrawlers? Probably. But it is an interesting idea.
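As for how a webcrawler might overcome it, one possibility (purely my assumption, not something any search engine has confirmed) is to ignore the timestamp and compare a digest of the page body against a digest of the cached copy. A rough sketch, with the hypothetical name content_changed:

```python
import hashlib


def content_changed(cached_body: bytes, served_body: bytes) -> bool:
    """Hypothetical check: compare SHA-256 digests instead of timestamps."""
    cached_digest = hashlib.sha256(cached_body).hexdigest()
    served_digest = hashlib.sha256(served_body).hexdigest()
    return cached_digest != served_digest


old_page = b"<html><body>What Google cached at time T</body></html>"
new_page = b"<html><body>What visitors actually see</body></html>"

# The backdated timestamp no longer matters; only the bytes do.
print(content_changed(old_page, new_page))
print(content_changed(old_page, old_page))
```

The cost is that the crawler has to download the whole page on every visit, which is exactly the work a last-modified check is meant to avoid.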

This work is licensed under a Creative Commons License.