31 July 2015

Screen scraper changes between GSA 7.0 and 7.2+

Background

A few years back, the Google team at Fig Leaf Software built a custom application in .NET to manage Google Search Appliance functionality using a rules engine of sorts, rather than requiring manual interaction with the GSA. I didn't build it; we have a very skilled .NET development team that did almost all of the heavy lifting, and all I had to do was build a very simple prototype.



To interact programmatically with the GSA, Google provides an administrative API that can be accessed from any language, along with client libraries for .NET and Java. Unfortunately, not all admin console functionality is exposed via the admin API. You probably know what that means - screen scraping is needed if you want to access that functionality. Writing screen scrapers is no fun, because the data format is likely to change pretty frequently, and that's exactly what happened when this customer upgraded from GSA 7.0 to 7.2. All the admin API functionality worked, but the ability to upload and delete synonym files no longer worked, because that relied on screen scraping.

Of course, the request/response format for screen scraping is undocumented, so you typically have to figure out what the server is looking for using a recording proxy or packet sniffer. I really like Fiddler for this sort of thing. It's a lot less complicated than something like Wireshark, and it shows you everything you need to see at the HTTP level. That's basically the approach we followed to build the initial application, and it's how we upgraded it to support GSA 7.2/7.4. If you know how to build screen scrapers, there's nothing here you couldn't figure out on your own, but I thought this might save a few valuable hours for someone out there.

Login process

The first problem the customer reported was that the login was failing on 7.2. Foolishly, I thought that would be the only change - what was I thinking? Nevertheless, that obviously had to be resolved first, so I took a look at the login process against a 7.0 vs a 7.2 GSA.

On 7.0, the login process is pretty simple. The client sends a GET request, and the first response from the GSA sets a session cookie. Then, the client sends a POST request with a MIME type of application/x-www-form-urlencoded that contains three parameters, like this (assuming a username "admin" and password "figleaf"):

POST /EnterpriseController HTTP/1.1
Content-Type: application/x-www-form-urlencoded
Host: gsa.figleaf.local:8000
Cookie: S=enterprise=P10OlkGyca0
Content-Length: 60
Expect: 100-continue

actionType=authenticateUser&username=admin&password=figleaf
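
For illustration, here's a minimal Python sketch of that two-step exchange. The original application was written in .NET, so this is not its code. The function names are made up for the example, and the plain-HTTP base URL is an assumption based on the Host header in the capture above.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Assumption: admin console reachable over plain HTTP on port 8000,
# as suggested by the Host header in the captured request.
BASE_URL = "http://gsa.figleaf.local:8000/EnterpriseController"


def make_opener():
    """Return an opener that stores and replays the session cookie."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def build_login_request_70(username, password):
    """Build the GSA 7.0 login POST (hypothetical helper name)."""
    body = urllib.parse.urlencode({
        "actionType": "authenticateUser",
        "username": username,
        "password": password,
    }).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return urllib.request.Request(BASE_URL, data=body, headers=headers, method="POST")


def login_70(opener, username, password):
    """GET first so the GSA sets the session cookie, then POST the credentials."""
    opener.open(BASE_URL).close()
    with opener.open(build_login_request_70(username, password)) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling login_70(make_opener(), "admin", "figleaf") would perform both requests against a live appliance; the same opener then carries the cookie for later requests.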

On 7.2, things are a bit more complicated. The HTML form on the GSA itself creates two parameters, actionType and reqObj. The second parameter is an array, and prior to being URL-encoded contains a value like this:

reqObj=[null,"admin","figleaf",null,1]

I have no idea what the other elements of the array represent, but they don't seem to change, so I don't care! The string containing the parameters must be URL-encoded, so you end up with something like this for the entire POST request:

POST /EnterpriseController?a=1 HTTP/1.1
Content-Type: application/x-www-form-urlencoded
Host: gsa.figleaf.local:8000
Cookie: S=enterprise=P10OlkGyca0
Content-Length: 87
Expect: 100-continue

actionType=authenticateUser&reqObj=%5Bnull%2C%22admin%22%2C%22figleaf%22%2Cnull%2C1%5D

There's another difference, too - the action URL has an extra parameter, a=1. I don't really know what that represents, but you won't be able to log in without it.
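
If it helps, here's a small Python sketch of how the 7.2 request body could be assembled. The helper name is hypothetical, and the meaning of the null and 1 entries is unknown; they are simply copied from the captured request.

```python
import json
import urllib.parse

# The 7.2 login posts to the same controller, with a=1 in the query string.
LOGIN_PATH_72 = "/EnterpriseController?a=1"


def build_login_body_72(username, password):
    """Build the URL-encoded GSA 7.2 login body (hypothetical helper name)."""
    # Compact separators avoid the spaces json.dumps inserts by default,
    # so the array matches the captured value exactly.
    req_obj = json.dumps([None, username, password, None, 1], separators=(",", ":"))
    return urllib.parse.urlencode({
        "actionType": "authenticateUser",
        "reqObj": req_obj,
    })
```

Using json.dumps rather than string concatenation also takes care of escaping if the password contains quotes or backslashes, although I haven't verified how the GSA handles those.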

One last issue, common to most screen scraping operations: you typically search the response for specific bits of text, either to extract values or to learn whether the operation succeeded. Many of these markers changed between 7.0 and 7.2. For 7.2, we can look for the text "login_err" in the response to spot a failed login, which is actually quite nice!
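
As a sketch, the 7.2 check can be a one-liner. This assumes that "login_err" appears only when the login was rejected, which is how we read the responses we saw; the function name is made up for the example.

```python
def login_failed_72(response_html):
    """Return True if a GSA 7.2 login response contains the error marker.

    Assumption: the marker text "login_err" is present only on failure.
    """
    return "login_err" in response_html
```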

Query Settings page

Once that was working, we quickly discovered that the synonym management functionality didn't work. Since that was the entire purpose of this screen scraper, fixing the login wasn't enough.

To upload a new synonym file, you need a POST request with a MIME type of multipart/form-data. Here's what that looks like for GSA 7.0:

POST /EnterpriseController HTTP/1.1
Content-Type: multipart/form-data; boundary=----------8d2993bd51d1414
Host: gsa.figleaf.local:8000
Cookie: S=enterprise=KbhJUSnSvA4
Content-Length: 853
Expect: 100-continue

------------8d2993bd51d1414
Content-Disposition: form-data; name="type";

0
------------8d2993bd51d1414
Content-Disposition: form-data; name="syn_lang_select";

en
------------8d2993bd51d1414
Content-Disposition: form-data; name="sw_lang_select";

all
------------8d2993bd51d1414
Content-Disposition: form-data; name="itemName";

tst_synonyms_en
------------8d2993bd51d1414
Content-Disposition: form-data; name="actionType";

updateQueryExp
------------8d2993bd51d1414
Content-Disposition: form-data; name="security_token";

AJhxEn3QZlHwjh8Idgd_q51Fh8c:1438315574127
------------8d2993bd51d1414
Content-Disposition: form-data; name="upload";

Upload
------------8d2993bd51d1414
Content-Disposition: form-data; name="fileName"; filename="tst_en.txt"
Content-Type: text/plain

{test1,test2}
------------8d2993bd51d1414--

With GSA 7.2, the "type" field is now "qeType", and you need to add "a=1" again, this time as a form field rather than in the URL.

POST /EnterpriseController HTTP/1.1
Content-Type: multipart/form-data; boundary=----------8d2993d49e9990f
Host: gsa.figleaf.local:8000
Cookie: S=enterprise=ccjbLm2KD2Y
Content-Length: 932
Expect: 100-continue

------------8d2993d49e9990f
Content-Disposition: form-data; name="qeType";

0
------------8d2993d49e9990f
Content-Disposition: form-data; name="a";

1
------------8d2993d49e9990f
Content-Disposition: form-data; name="syn_lang_select";

fr
------------8d2993d49e9990f
Content-Disposition: form-data; name="sw_lang_select";

all
------------8d2993d49e9990f
Content-Disposition: form-data; name="itemName";

tst_synonyms_en
------------8d2993d49e9990f
Content-Disposition: form-data; name="actionType";

updateQueryExp
------------8d2993d49e9990f
Content-Disposition: form-data; name="security_token";

bYkoFblS1Nm70sAOrPeGcdFgF04:1438316205640
------------8d2993d49e9990f
Content-Disposition: form-data; name="upload";

Upload
------------8d2993d49e9990f
Content-Disposition: form-data; name="fileName"; filename="tst_en.txt"
Content-Type: text/plain

{test1,test2}
------------8d2993d49e9990f--
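
To make the difference between the two versions concrete, here's a rough Python sketch that builds the field list for either version and encodes it as multipart/form-data by hand. The function names and the gsa_72 flag are hypothetical; the field names and values come straight from the captures above.

```python
import uuid


def synonym_upload_fields(security_token, item_name, language, gsa_72=True):
    """Return the form fields for a synonym upload, in capture order."""
    # 7.2 renamed "type" to "qeType" and added the "a" field.
    fields = [("qeType" if gsa_72 else "type", "0")]
    if gsa_72:
        fields.append(("a", "1"))
    fields += [
        ("syn_lang_select", language),
        ("sw_lang_select", "all"),
        ("itemName", item_name),
        ("actionType", "updateQueryExp"),
        ("security_token", security_token),
        ("upload", "Upload"),
    ]
    return fields


def encode_multipart(fields, file_field, filename, file_text, boundary=None):
    """Encode text fields plus one text file; returns (content_type, body)."""
    boundary = boundary or uuid.uuid4().hex
    lines = []
    for name, value in fields:
        lines += [
            f"--{boundary}",
            f'Content-Disposition: form-data; name="{name}"',
            "",
            value,
        ]
    lines += [
        f"--{boundary}",
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"',
        "Content-Type: text/plain",
        "",
        file_text,
        f"--{boundary}--",
        "",
    ]
    body = "\r\n".join(lines).encode("utf-8")
    return f"multipart/form-data; boundary={boundary}", body
```

For the 7.2 capture above, that would be encode_multipart(synonym_upload_fields(token, "tst_synonyms_en", "fr"), "fileName", "tst_en.txt", "{test1,test2}"), with the returned content type sent as the Content-Type header. The security token has to be a fresh one scraped from the Query Settings page.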

Odds and ends

I didn't need this for synonyms, but you may run into AJAX calls for some changes. The GSA admin console changed quite a bit in GSA 7.2, and now uses AJAX for some functionality. For each data submission, a one-time-use security token is injected into the form or AJAX response, and you have to send it back with the subsequent request. If the page you're scraping uses AJAX, you'll need to extract the security token from the previous AJAX response rather than from the form. In the case of synonyms, the security token is still in a form, but I did notice it in some of the AJAX responses I got while doing other things.
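
For the form case, pulling the token out might look like the Python sketch below. It assumes the token sits in an input element named "security_token" with the token in its value attribute; the field name matches the upload captures above, but the exact markup is an assumption, and the AJAX responses would need their own parsing.

```python
from html.parser import HTMLParser


class _TokenFinder(HTMLParser):
    """Remember the value of the first input named security_token."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        if tag != "input" or self.token is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == "security_token":
            self.token = attributes.get("value")


def extract_security_token(page_html):
    """Return the one-time token from a form page, or None if absent."""
    finder = _TokenFinder()
    finder.feed(page_html)
    return finder.token
```

Because the token is one-time-use, you would fetch the page and extract a new token before every submission rather than caching it.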

Conclusion

Ideally, we should never have to write screen scrapers. If you need something that isn't exposed by the admin API, open a support ticket and submit a feature request - maybe it'll be in the next admin API version! But if you have an existing screen scraper for GSA 7.0 and can't wait for a Google API upgrade, you may find this useful when upgrading to GSA 7.2 or higher.

[Note: cross-posted on the Fig Leaf Software blog]






