GSA Specifications and Usage Limits
But that changes today! Google has, for the first time, released a list of specifications for GSA 7.2 that tells us the maximum allowed values for many commonly used configuration settings. In most cases, you're unlikely to hit those maximums, but it's still good to know what they are! For example, I didn't know that the maximum file size the GSA can accept is 2048 MB. I don't think I'll ever have a document that large for the GSA, but now I know I can do that! I don't know whether these values are the same for previous GSA software versions, however.
[Image: GSA Specifications and Usage Limits document]
GSA Maximum File Sizes
On a somewhat related note, I've noticed that there's some confusion about file sizes on the GSA - addressing this is how I learned about this document in the first place. There are two relevant file size values that you need to know about.
First, there is the maximum file size that the GSA will download. There are actually two of these limits: one for HTML pages and plain text files (any MIME type that begins with text, such as text/html and text/plain), and one for all other files. If a file exceeds the applicable limit, the GSA will simply ignore the file as if it didn't even exist. The GSA learns the file size from the Content-Length HTTP response header sent by the web server. If the value in that header is too large, the GSA will simply ignore the remainder of the response. Historically, these values were fixed in the GSA and could not be changed, but in the 6.x software versions they could be increased if needed. By default, the maximum size of a text file is 20 MB, and the maximum size of a non-text file is 100 MB. Again, though, those numbers can be increased to 2048 MB if you really need to. These numbers are set on the Host Load Schedule page in the GSA admin console.
[Image: Maximum File Sizes under Host Load Schedule]
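The download rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as I've described it, not the GSA's actual logic: the function name `would_download`, the constant names, and the choice of 1 MB = 1024 * 1024 bytes are all my own assumptions.

```python
MB = 1024 * 1024  # assumption: the post does not say whether MB means 10^6 or 2^20 bytes

# Defaults quoted in this post; both can be raised as high as 2048 MB.
DEFAULT_TEXT_LIMIT = 20 * MB
DEFAULT_OTHER_LIMIT = 100 * MB
HARD_CAP = 2048 * MB


def would_download(content_type: str, content_length: int,
                   text_limit: int = DEFAULT_TEXT_LIMIT,
                   other_limit: int = DEFAULT_OTHER_LIMIT) -> bool:
    """Return True if a response with this Content-Type and Content-Length
    fits under the applicable download limit (hypothetical helper)."""
    if text_limit > HARD_CAP or other_limit > HARD_CAP:
        raise ValueError("limits cannot be set higher than 2048 MB")
    # Any MIME type that begins with "text" uses the text limit.
    is_text = content_type.strip().lower().startswith("text")
    limit = text_limit if is_text else other_limit
    # A file that exceeds the limit is ignored entirely.
    return content_length <= limit
```

For example, `would_download("application/pdf", 150 * MB)` is false with the default limits, but becomes true if you pass `other_limit=2048 * MB`.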
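Second, there is the maximum amount of each document that the GSA will actually index. This limit is set under Index Settings in the GSA admin console.
[Image: Index Limits under Index Settings]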
The last piece of the puzzle is that the GSA only indexes one document format: HTML. If you're using the GSA now to index your PDFs and MS Office documents, you might find this surprising! But the GSA doesn't index those files as-is. It first converts them to HTML, then indexes the HTML. So the important question is: how large is the HTML that you get when you convert your PDF? The HTML output of the conversion process doesn't contain images or physical formatting, so it will typically be very small even if you start with a 100 MB PDF. As long as that HTML is under 2.5 MB, or whatever value you've set under Index Settings, all the text in the document will be indexed. As a result, it's not usually necessary to preprocess large files to break them into chunks, like we had to do in the "old days" of GSA 5.x, when the maximum file size was an unchangeable 20 MB.
These file sizes also apply to documents from content feeds.
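To make the index limit concrete, here's a minimal sketch that checks whether the converted HTML for a document would be indexed in full. It assumes you already have the converted HTML as a string (the GSA does the conversion internally; nothing here reproduces that step). The name `fully_indexed`, the UTF-8 byte count, and 1 MB = 1024 * 1024 bytes are my assumptions, not anything from Google's documentation.

```python
MB = 1024 * 1024  # assumption: 1 MB = 2^20 bytes

# Index limit quoted in this post; it can be changed under Index Settings.
DEFAULT_INDEX_LIMIT = int(2.5 * MB)


def fully_indexed(converted_html: str,
                  index_limit: int = DEFAULT_INDEX_LIMIT) -> bool:
    """Return True if the converted HTML is under the index limit,
    meaning all of the document's text would be indexed (hypothetical helper)."""
    # Measure the size of the HTML in bytes, assuming UTF-8 encoding.
    size = len(converted_html.encode("utf-8"))
    return size < index_limit
```

The point of the sketch is that the size of the original PDF or Office file never appears in this check; only the size of the converted HTML matters.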
Conclusion
I'm very excited by the new GSA 7.2 release, and all of the support - including this new documentation - that Google is putting behind it. Don't forget, Google is offering free webinars on GSA 7.2 - sign up today! And of course, if you have any GSA questions, please feel free to send them to google@figleaf.com and we'll be happy to respond!
[Note: cross-posted on the Fig Leaf Software blog]