Using Google Cache for Copyright infringement and forensics

Google's cache can be used in copyright lawsuits, since its cached content can serve as proof of what a website used to contain, including material that infringed someone else's copyright or was otherwise illegal.

Google also caches images so that it can show them to people using Google Images search.

There is also the Google Cache Continue Redux script, which lets you browse a whole site through Google's cached copies.

What does Google store about your site?

Cache

Whenever Google crawls your site, it also stores a copy of each web page on its servers, which anybody can use to view the archived page or image. This copy is called the cache. Every web page or image is identified by a URL. To see the cached copy of a page, search for "cache:<url>" in Google; if a cached copy exists, Google shows it.

For example, type this into the search box:

cache:wikipedia.org

Google then shows you how the page looked when it last crawled this URL:

Google's cache of Wikipedia home page
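
The lookup above can also be put together in a script. This is only a minimal sketch: the helper names `cache_query` and `cache_search_url` are made up for illustration, and the search address with its `q` parameter is an assumption about how the query is passed, not something documented in this article. Whether Google still answers `cache:` queries may also vary.

```python
from urllib.parse import urlencode


def cache_query(url: str) -> str:
    """Return the text to type into the search box for a cached copy of `url`."""
    return f"cache:{url}"


def cache_search_url(url: str) -> str:
    """Build a search address for the cache query.

    Assumption: the query travels in the `q` parameter of the search page.
    """
    return "https://www.google.com/search?" + urlencode({"q": cache_query(url)})


if __name__ == "__main__":
    print(cache_query("wikipedia.org"))
    print(cache_search_url("wikipedia.org"))
```

Keeping the query text separate from the address makes it easy to paste the same `cache:` string into the search box by hand.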

You can view the cache from the Google search results too. Hover your mouse over the right side of a result line until a double arrow (>>) appears, then click on it:

Hover your mouse and click on the double arrow on the right to see the cache

Google's official page about its cache is here. There may not be any more information available about Google's cache.

The cache, if present, shows the date on which the page was cached, which could be anywhere from a few seconds to 60 days ago.

Site meta-data

As Google crawls each page, it works out the page's targeted keywords, niche category, neighborhood hints, related links and many more such parameters (more than 200). This includes the earlier history of the page. Whenever a visitor types a keyword into Google, Google uses this metadata alone to decide whether or not to show the page and, if it does, at which position to place it.

This metadata is private and highly confidential to Google, since it reflects the internal proprietary algorithms that have made it number one in the search industry.

De-indexing the page

De-indexing is a term used by Google and other search engines. It means completely removing all history and cache of a page from their servers. This most often happens to mala fide pages that use black hat SEO tactics to raise their search engine rankings and attract more traffic. Such sites no longer appear in Google or other search engines.

Whole websites or individual URLs may also be de-indexed by Google under the US DMCA if somebody complains that the content infringes their copyright.

Webmasters can get pages de-indexed by adding the relevant Disallow lines to robots.txt. Google Webmaster Tools also has a place for page removal requests, where webmasters can manually submit the links on their sites that Google should de-index.

Just removing a page from the site does not de-index it from Google's database. If a page stays inaccessible for a long time, Google waits for months before it permanently deletes the page's history.
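
As a rough illustration of the robots.txt lines mentioned above, the sketch below checks which URLs a Disallow entry blocks, using Python's standard `urllib.robotparser` module. The robots.txt content, the `/private/` path, the `example.com` addresses and the helper name `is_crawlable` are all made up for the example. Note that this only shows how a crawler reads the file; it says nothing about how quickly Google acts on it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for page in ("https://example.com/private/old.html",
                 "https://example.com/index.html"):
        print(page, is_crawlable(ROBOTS_TXT, "Googlebot", page))
```

Running a check like this before publishing a robots.txt change is a cheap way to confirm that the Disallow entry covers the pages you meant and nothing else.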

De-indexing the cache

This is not the same as de-indexing the page. Removing the cache only removes the backup copy of the page stored on Google's servers, which may not be of much use to Google anyway.

Indexing the pages

Indexing of a site's pages is entirely under the control of the site's webmasters. If the site has a robots.txt file and that file does not forbid the pages with a Disallow entry, Google and other search engines will sooner or later index those pages.

Individual web pages can also contain this line so that they are not indexed at all:

<meta name="robots" content="noindex" />

Similarly, another directive (noarchive) can be used so that Google indexes the page but does not cache its content. Re-indexing after a deletion request, however, takes Google at least 90 days.

Google does not officially say when it will re-index a page or crawl a new page for the first time. It could take anywhere from a few seconds to 60 days for Google to revisit the page and index it again. This refreshes the cache too.
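
To check which of these directives a page actually carries, a small parser is enough. The sketch below uses Python's standard `html.parser` module; the class `RobotsMetaParser` and the function `robots_directives` are names invented for this example, and `noarchive` is assumed here to be the "do not cache" directive referred to above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here by default.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the set of robots directives declared in the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex" /></head></html>'
    found = robots_directives(page)
    print("noindex" in found)    # page asks not to be indexed
    print("noarchive" in found)  # page asks not to be cached (assumed directive)
```

A page with no robots meta tag simply yields an empty set, which means no restriction is declared in the page itself.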