22 January 2014

What day is it? Why document dates are so important to the Google Search Appliance.

I spend a lot of time working with Google Search Appliance administrators who have set up their own GSA without our help. One very common problem that I encounter when reviewing their implementations is that the GSA can't identify when their documents were published. This can cause several serious problems. The most serious is crawl frequency. The GSA uses document dates to schedule recrawls, with a goal of crawling each document approximately twice as frequently as it changes. Without dates, the GSA can't build this optimal recrawl schedule. It may crawl content too frequently, causing the content servers to become unresponsive to other users' requests. Or it may not crawl content often enough, resulting in a stale index. If you find yourself using Freshness Tuning or Host Load Settings to reduce or increase your crawl rate, or routinely forcing recrawls, you may have the same problem.



Search users might encounter another problem. Without valid document dates, users can't sort or filter by date. The GSA can't use dates for relevance, and administrators can't use date biasing. This affects the quality of search results, and the ability of users to find what they're looking for.

So, what goes wrong with document dates, and how can we fix it?

The most likely cause of the problem is that your web server doesn't provide the dates. Normally, the web server automatically provides these dates using an HTTP response header called "Last-Modified". The Last-Modified date is read by the web server from the file system. This response header is used by browsers for caching. When your browser saves a URL in its cache, it keeps track of the Last-Modified date. When the browser requests the same URL again, it'll send an HTTP request header called "If-Modified-Since" containing the Last-Modified date. If the document hasn't changed since the previous request, the web server will send back an HTTP 304 Not Modified status code, instead of the normal 200 OK, and will not send the document itself at all. The browser will then use the cached file. To see if your web server is returning Last-Modified response headers, you can use Chrome Developer Tools or the Firebug plugin in Mozilla Firefox.
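If you'd rather check from a script than from a browser, something like the following Python sketch would do it. The function names are my own, and the URL you pass in is whatever page you want to test; this is only one way to look at the header, not a GSA feature.

```python
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def last_modified_from_headers(headers):
    """Return the Last-Modified value as a datetime, or None if it is
    missing or can't be parsed as an HTTP date."""
    value = headers.get("Last-Modified")
    if value is None:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None


def fetch_last_modified(url):
    """Send a HEAD request to url and report its Last-Modified date.
    Requires network access; the URL is supplied by the caller."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=10) as response:
        return last_modified_from_headers(response.headers)
```

If fetch_last_modified returns None for a page, that page is a likely candidate for the problems described above.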


By default, Apache and IIS will automatically return Last-Modified response headers for static content - HTML pages, Office documents, PDFs, and so on. But what if you're using an application server to generate HTML pages: something like Java, ASP.NET or Adobe ColdFusion? If so, the web server won't return a Last-Modified response header. There's a good reason for this. The web server identifies the Last-Modified date from the file system, as described above, and that date isn't especially useful for a web application URL. So, ultimately, it's the responsibility of the web developer to return this response header from within the application logic. Each web application programming language has a way to generate response headers. In ASP.NET 4.x, you might use HttpResponse.AddHeader. In ColdFusion, you'd use CFHEADER. The real problem, of course, is identifying the date you actually want to use. If your database contains a datetime field that's used to track changes, that can be used to generate the response header.

You should also make sure your application responds to If-Modified-Since request headers appropriately. If the content hasn't changed since the date in that header, it should return a 304 status code as described above, and should not return the actual page requested at all. For example, if you're using ColdFusion, you can use CFABORT to stop processing of the page request. This will significantly reduce the amount of work your application server has to do, which will improve performance and reduce bandwidth requirements for all clients, not just the GSA.
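To make the application-side logic concrete, here is a minimal Python sketch of the decision an application has to make on each request. It isn't tied to any particular framework: build_response is a hypothetical helper, and last_changed stands in for whatever datetime field your database uses to track changes (assumed here to be timezone-aware).

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def build_response(last_changed, if_modified_since=None):
    """Return (status, headers, send_body) for a page last changed at
    last_changed, which must be a timezone-aware datetime."""
    # HTTP dates have one-second resolution and are expressed in GMT.
    last_changed = last_changed.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_changed, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # ignore a header we can't parse
        if since is not None and since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        if since is not None and last_changed <= since:
            # Unchanged: send 304 and skip generating the page.
            return 304, headers, False

    return 200, headers, True
```

The important part is the early return: when the status is 304, the application should stop before doing any of the expensive work of building the page.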



Now, what if you can't modify your application code to generate HTTP response headers? If you're using a CMS, modifying response headers might not be an option. In that case, you can provide the date within the page, and tell the GSA to use that date instead of the response header. This is not nearly as good as using the response header, as it will still mean that the GSA - and every other client - will have to fetch the page whether it's changed or not, but it will at least allow the GSA to track the date, and use it to build a crawl schedule for the document. Under Crawl and Index ... Document Dates, the GSA can be instructed to look for document dates in another location. By default, of course, it's just looking for Last-Modified for all URLs.
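If you go the in-page route, the page template needs to emit the date somewhere predictable. As a rough sketch in Python, assuming you choose a meta tag, it might look like this. The tag name "publication-date" is just a placeholder I've made up; use whatever name you then configure on the GSA.

```python
from datetime import date
from html import escape


def date_meta_tag(published, name="publication-date"):
    """Build a meta tag carrying the document date in ISO-8601 form.
    The default tag name is a placeholder, not a GSA requirement."""
    return '<meta name="{}" content="{}">'.format(
        escape(name, quote=True), published.isoformat()
    )


print(date_meta_tag(date(2014, 1, 22)))
```

The output belongs in the head of the generated page, alongside any other meta tags.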



You can add one or more URL patterns, then specify a date location for each. To add a new URL pattern, click on "Add More Lines". This will add blank lines below the default pattern.



Unfortunately, these patterns are evaluated in order for each URL, so you'll need to move the default pattern below the others!
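To see why the order matters, here is a toy Python model of rules being checked in order, with the first match winning. It is only an illustration: matching a pattern as a plain substring of the URL is a simplifying assumption, and the URL and patterns are invented.

```python
def pick_date_rule(url, rules):
    """Return the date location of the first rule whose pattern
    matches the URL. Patterns are treated as plain substrings here,
    which is a simplification for illustration."""
    for pattern, location in rules:
        if pattern in url:
            return location
    return None


url = "http://intranet.example.com/news/story.cfm"

# Default pattern first: it matches every URL, so the rule below it never applies.
default_first = [("/", "Last Modified"), ("/news/", "Meta Tag")]

# Default pattern last: the more specific rule gets a chance to match.
specific_first = [("/news/", "Meta Tag"), ("/", "Last Modified")]

print(pick_date_rule(url, default_first))
print(pick_date_rule(url, specific_first))
```

With the default pattern at the top, every URL falls through to Last Modified; with it at the bottom, the /news/ pages pick up the meta tag rule.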



For a given URL pattern, you can choose one of five possible locations: URL, Meta Tag, Title, Body or Last Modified. If possible, use a meta tag for this - it's the best alternative to Last-Modified. You'll need to specify the name of the meta tag, and the date format used. The GSA uses the POSIX standard strptime library function to identify your format. The optimal format is ISO-8601 date format: four-digit year, two-digit month, and two-digit day. For example, today is the 22nd of January, 2014, so the ISO-8601 date format for that is 2014-01-22. You can also optionally specify a locale and prefix string if needed - usually, they aren't. If you choose to use Body as the location, be warned: you may well have documents with multiple dates in the body!
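Before you type a format string into the Admin Console, it can help to test it against a few real dates from your pages. Python's datetime.strptime uses the same style of directives as POSIX strptime, so a quick check might look like the sketch below. The function name is mine, and "%Y-%m-%d" is the format string I'd expect to match the ISO-8601 dates described above.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"  # four-digit year, two-digit month, two-digit day


def parse_document_date(text, fmt=DATE_FORMAT):
    """Return the parsed date, or None if text doesn't match fmt."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


print(parse_document_date("2014-01-22"))
print(parse_document_date("22/01/2014"))
```

A None result means the date in the page and the format string disagree, which is worth fixing before you save the rule.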

Once you've added one or more document date rules, save your changes. If the GSA has already crawled these documents without the rule, you'll have to wait a while for the change to take effect. Once it does, your document date problems for that URL pattern will be solved!

[Note: cross-posted on the Fig Leaf Software blog]
