13 February 2014

New GSA Specifications and Usage Limits document, and GSA maximum file sizes

Google Search Appliance documentation has, historically, been pretty good at explaining how the GSA works, but not so good about documenting the allowed ranges for a lot of configuration options. Typically, the way you'd find them out was by exceeding one of them, and seeing that things didn't work as expected. You'd then open a Google support ticket, and the support engineer would tell you, "You can't have more than X of those." This has been mildly frustrating, especially for me as a GSA instructor. When a student asked what the maximum value for a given field is, I'd have to either rely on my own experience or say I didn't know.


GSA Specifications and Usage Limits

But that changes today! Google has, for the first time, released a list of specifications for GSA 7.2 that tells us the allowed maximum values for many commonly used configuration options. In most cases, you're unlikely to hit those maximums, but it's still good to know what they are! For example, I didn't know that the maximum file size that can be accepted by the GSA is 2048 MB. I don't think I'll ever have a document that large for the GSA, but now I know I can do that! I don't know if these values are the same for previous GSA software versions, however.

GSA Specifications and Usage Limits document

GSA Maximum File Sizes

On a somewhat related note, I've noticed that there's some confusion about file sizes on the GSA - addressing this is how I learned about this document in the first place. There are two kinds of file size limits that you need to know about. The first is the maximum file size that the GSA will download, and it has two values: one for HTML pages and plain text files (any MIME type that begins with text, such as text/html and text/plain), and one for all other files. If a file exceeds the applicable limit, the GSA will simply ignore the file as if it didn't even exist. The GSA learns the file size from the Content-Length HTTP response header sent by the web server. If the value in that header is too large, the GSA will simply ignore the remainder of the response. Historically, these values were fixed in the GSA and could not be changed, but in the 6.x software versions they could be increased if needed. By default, the maximum size of a text file is 20 MB, and the maximum size of a non-text file is 100 MB. Again, though, those numbers can be increased to 2048 MB if you really need to. These numbers are set on the Host Load Schedule page in the GSA admin console.

Maximum File Sizes under Host Load Schedule
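
To make the download rule above concrete, here's a rough sketch in Python. This is only my model of the behavior described in this post, not anything taken from the GSA itself: the function name, the parameter names, and the choice of 1 MB = 1024 x 1024 bytes are all my own assumptions.

```python
# Illustrative model of the download size check described above.
# Not GSA code: names are hypothetical, and treating 1 MB as 2**20 bytes
# is an assumption (the post does not say which definition applies).
MB = 1024 * 1024

DEFAULT_MAX_TEXT_BYTES = 20 * MB    # default limit for text/* MIME types
DEFAULT_MAX_OTHER_BYTES = 100 * MB  # default limit for everything else
HARD_MAX_BYTES = 2048 * MB          # highest value either limit can be raised to


def would_download(content_type, content_length,
                   max_text=DEFAULT_MAX_TEXT_BYTES,
                   max_other=DEFAULT_MAX_OTHER_BYTES):
    """Return True if a response of this type and size is within the limit."""
    if max_text > HARD_MAX_BYTES or max_other > HARD_MAX_BYTES:
        raise ValueError("limits cannot be raised above 2048 MB")
    # Drop parameters such as "; charset=utf-8" before checking the type.
    mime = content_type.split(";")[0].strip().lower()
    limit = max_text if mime.startswith("text/") else max_other
    # A file that exceeds the limit is ignored; anything else is fetched.
    return content_length <= limit
```

With the defaults, would_download("application/pdf", 150 * MB) comes back False, while passing max_other=2048 * MB makes the same call return True.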

The second relevant number is the amount of text that will actually be indexed. There is only one value here. By default, that value is 2.5 MB, but it can be increased to a maximum of 10 MB. This number is set in the Index Settings page in the GSA admin console. Now, that number is a lot smaller than the maximum file size number! This is for a very good reason, though. After all, 2.5 MB is a lot of text! Most books, no matter how large they are, contain far less than that amount of text. And the GSA is only interested in the text. Images and formatting are not indexed by the GSA.

Index Limits under Index Settings

The last piece of the puzzle is that the GSA only indexes one document format: HTML. If you're using the GSA now to index your PDFs and MS Office documents, you might find this surprising! But the GSA doesn't index those files as-is. It first converts them to HTML, then indexes the HTML. So the important question is, how large is the HTML that you get when you convert your PDF to HTML? The HTML output of the conversion process doesn't contain images or physical formatting, so it will typically be very small even if you start with a 100 MB PDF. As long as that HTML is under 2.5 MB or whatever value you've set under Index Settings, all the text in the document will be indexed. As a result, it's not usually necessary to preprocess large files to break them into chunks, as we had to do in the "old days" of GSA 5.x, when the maximum file size was an unchangeable 20 MB.
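
Here's the same idea as a small Python sketch. Again, this is just my own model of the index limit as described in this post: the function and parameter names are made up, 1 MB = 1024 x 1024 bytes is an assumption, and I'm only checking whether all of the converted text fits, not modeling what the GSA does with text beyond the limit.

```python
# Illustrative model of the index limit described above.
# Not GSA code: names are hypothetical, and 1 MB = 2**20 bytes is an assumption.
MB = 1024 * 1024

DEFAULT_INDEX_LIMIT_BYTES = int(2.5 * MB)  # default under Index Settings
MAX_INDEX_LIMIT_BYTES = 10 * MB            # highest configurable value


def all_text_indexed(converted_html_bytes,
                     index_limit=DEFAULT_INDEX_LIMIT_BYTES):
    """Return True if the HTML produced by conversion fits under the limit.

    The size that matters is the converted HTML, not the original file.
    """
    if index_limit > MAX_INDEX_LIMIT_BYTES:
        raise ValueError("index limit cannot be raised above 10 MB")
    # The post says "under" the limit, so the comparison is strict here.
    return converted_html_bytes < index_limit
```

So a 100 MB PDF whose converted HTML comes out at, say, 1 MB passes this check with the default setting, even though the original file is forty times larger than the index limit.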

These file sizes also apply to documents from content feeds.

Conclusion

I'm very excited by the new GSA 7.2 release, and all of the support - including this new documentation - that Google is putting behind it. Don't forget, Google is offering free webinars on GSA 7.2 - sign up today!

And of course, if you have any GSA questions, please feel free to send them to google@figleaf.com and we'll be happy to respond!

[Note: cross-posted on the Fig Leaf Software blog]
