22 January 2014

What day is it? Why document dates are so important to the Google Search Appliance.

I spend a lot of time working with Google Search Appliance administrators who have set up their own GSA without our help. One very common problem that I encounter when reviewing their implementations is that the GSA can't identify when their documents were published. As a result, their deployments can suffer from several serious problems.

The most serious problem is crawl frequency. The GSA uses document dates to schedule recrawls, with a goal of crawling each document approximately twice as frequently as it changes. Without dates, the GSA can't build this optimal recrawl schedule. It may crawl the content too frequently, causing the content servers to become unresponsive to other users' requests. Or it may not crawl the content often enough, resulting in a stale index. If you find yourself using Freshness Tuning or Host Load Settings to reduce or increase your crawl rate, or always forcing a recrawl, you may have the same problem.
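To make the "twice as frequently as it changes" goal concrete, here is a minimal Python sketch of the idea. It is not the GSA's actual scheduling algorithm, which is not described here; the function name `recrawl_interval` and the simple averaging are assumptions made purely for illustration. Given the dates on which a document is known to have changed, it estimates the average time between changes and halves it. With fewer than two dates there is nothing to estimate from, which mirrors the situation where the GSA cannot find document dates.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def recrawl_interval(modified: Sequence[datetime]) -> Optional[timedelta]:
    """Hypothetical helper: half the average time between known changes.

    Returns None when there are fewer than two dates, because no change
    rate can be estimated from that history.
    """
    if len(modified) < 2:
        return None  # no usable date history, so no schedule can be derived

    ordered = sorted(modified)
    # Average gap between consecutive changes.
    average_change_interval = (ordered[-1] - ordered[0]) / (len(ordered) - 1)
    # Crawl about twice as often as the document changes.
    return average_change_interval / 2


# Illustrative dates only: a document that changed every ten days.
history = [datetime(2014, 1, 1), datetime(2014, 1, 11), datetime(2014, 1, 21)]
print(recrawl_interval(history))  # 5 days, 0:00:00
print(recrawl_interval([]))       # None
```

In this sketch, a document that changes every ten days would be revisited every five days, while a document with no date history gets no schedule at all and would have to fall back on some default.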