Tuesday 5 June 2007

HTTP Compression

I often wonder how Google compresses their pages on every request (do they compress every request?). Some ideas that I have come up with are:

All pages are generated and then compressed once, and subsequent requests are served from a cache - but it troubles me that they would do this. My main issues are that:
a) according to the original white paper from Google, all pages are stored compressed in the gzip format, hence requiring some decompression before processing; and
b) the search results are highlighted with the exact phrase searched.

If this is true then each result must be processed, requiring decompression, then compressed again to send back to the client. This seems like a waste of processing, and given the number of requests they need to serve it just doesn't seem likely.

But can we improve on this? I think so. While I am somewhat familiar with Huffman encoding, I do not know much more about LZ than what I have read on Wikipedia. These two techniques are combined to make up gzip, a likely compression format used by Google. Huffman and LZ are both based on building a key table that maps a common series of bits (or letters) to a shortened binary version,
i.e. <strong> is 8 characters, so 8*8 = 64 bits in ASCII (and UTF-8), however a table may map that sequence to a much smaller series of bits, say 12 bits, due to its relatively high frequency in HTML documents. The basis for this compression is making frequent patterns short while infrequent patterns are longer, and not wasting space on non-existent patterns, resulting in a shorter overall length. In fact UTF-8's variable-width encoding has a vaguely similar flavour, but that's for another post.
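
To make that concrete, here is a toy Huffman coder in Python - my own sketch with made-up token frequencies, not how gzip actually builds its tables - showing how a frequent token like <strong> ends up with a much shorter code than a rare one:

    import heapq

    def huffman_codes(frequencies):
        # Build a prefix-free code: the most frequent symbols get the shortest bit strings.
        heap = [(weight, i, symbol) for i, (symbol, weight) in enumerate(frequencies.items())]
        heapq.heapify(heap)
        next_id = len(heap)
        while len(heap) > 1:
            w1, _, left = heapq.heappop(heap)
            w2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, next_id, (left, right)))
            next_id += 1
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:
                codes[node] = prefix or "0"
        walk(heap[0][2], "")
        return codes

    # Made-up token frequencies, purely for illustration.
    freqs = {"<strong>": 900, "<div>": 700, "href": 400, "zebra": 3}
    for symbol, code in sorted(huffman_codes(freqs).items(), key=lambda kv: len(kv[1])):
        print(symbol, "->", code)

With these numbers <strong> gets a 1-bit code while "zebra" gets 3 bits - the more a pattern occurs, the less it costs.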

This key table generally changes for every block of data encoded, but then let's think about what Google are compressing - HTML. They may have a predetermined table (or set of tables, perhaps indicated by a header) that is used for all documents. I think that makes sense, and fits with Google publishing statistics on common web tags and attributes, as they may use these statistics to generate new encoding tables.
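
As a rough Python sketch of the "predetermined table" idea, zlib lets both ends share a preset dictionary. It is an LZ dictionary rather than a fixed Huffman table, and HTML_DICT below is something I made up, but the principle - baking prior knowledge of HTML into both compressor and decompressor - is the same:

    import zlib

    # Hypothetical preset dictionary of common HTML fragments (entirely made up).
    HTML_DICT = b'<html><head><title></title></head><body><div class="result"><strong></strong></div></body></html>'

    def compress_page(html):
        c = zlib.compressobj(level=9, zdict=HTML_DICT)
        return c.compress(html) + c.flush()

    def decompress_page(data):
        d = zlib.decompressobj(zdict=HTML_DICT)
        return d.decompress(data) + d.flush()

    page = b'<div class="result"><strong>compression</strong> is interesting</div>'
    packed = compress_page(page)
    print(len(page), "->", len(packed), "bytes")
    assert decompress_page(packed) == page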

Now let's assume they don't encode strings with spaces as a combination of characters. For example, " we " is 4 characters including spaces, and is hence not a subset of " welcome". These two strings would have completely unrelated compressed character codes. So if we accept the assumption that spaces are not integrated into a keyed entry in a compression table, we can do some powerful text manipulation while the content is compressed - although only on full words.
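
A toy version of that assumption in Python, with a completely invented code table where whole words are the unit of encoding, so the code for "we" is unrelated to the code for "welcome":

    # Entirely made-up word-level code table: whole words map to codes,
    # so "we" is not a prefix or subset of "welcome".
    CODE_TABLE = {"we": "010", "welcome": "110010", "to": "011", "the": "00"}

    def encode_words(text):
        # Split on spaces so every token maps to exactly one table entry.
        return [CODE_TABLE[word] for word in text.split()]

    print(encode_words("welcome to the we"))   # ['110010', '011', '00', '010']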

If we compress the search text (or tokens/words) then we can easily compare this to the compressed results - they should have the exact same bit pattern if they use the same key table. It may not be quite this simple - a series of other concatenated bit patterns may match your search bit pattern, so some parsing logic may be involved. I can't think of a great way of achieving this yet, because in order to have a unique bit pattern we would have to have a long code. Otherwise we would have to perform some processing on each extended code when generating our key table to make sure that this bit pattern does not repeat. This would adversely affect the compression ratio that could be achieved, although it may be a small trade-off overall. Using fixed-length key codes may help. For example, if every key code had a length that is a multiple of 4 bits then a genuine code always starts on a 4-bit boundary, leaving roughly a quarter as many places where a spurious occurrence of the unique code could appear inside longer codes.
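
Here is a small Python illustration of that alignment problem, using the same made-up word codes as before: a naive search over the raw bit string can match inside another word's code, while a search that respects code boundaries only finds the real hit.

    # Same made-up word codes as above.
    CODE_TABLE = {"we": "010", "welcome": "110010", "to": "011", "the": "00"}

    document = ["welcome", "to", "the", "we"]
    codes = [CODE_TABLE[w] for w in document]
    bitstream = "".join(codes)                 # "11001001100010"
    query_bits = CODE_TABLE["we"]              # "010"

    # Naive search over the raw bit string: also matches inside "welcome"'s code.
    naive_hits = [i for i in range(len(bitstream)) if bitstream.startswith(query_bits, i)]
    print("bit-level hits:", naive_hits)       # [3, 11] - the hit at 3 is a false positive

    # Search that respects code boundaries: only the real "we" matches.
    aligned_hits = [i for i, code in enumerate(codes) if code == query_bits]
    print("token-aligned hits:", aligned_hits) # [3] - the fourth token in the document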

If this process does work then you may now inject some compressed bits into the stream - namely some compressed <strong> tags to highlight the search word(s). Voilà, we have read and manipulated the text while it is still compressed, without fully decompressing or re-compressing it!
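
Continuing the toy scheme, the highlighting step might look like this - splicing pre-encoded <strong> codes around the matching token, with the result text never being decompressed (again, all of these codes are invented):

    # Same made-up table, extended with pre-encoded highlight tags.
    CODE_TABLE = {"we": "010", "welcome": "110010", "to": "011", "the": "00",
                  "<strong>": "1111", "</strong>": "1110"}

    codes = [CODE_TABLE[w] for w in ["welcome", "to", "the", "we"]]
    query = CODE_TABLE["we"]

    highlighted = []
    for code in codes:
        if code == query:
            # Splice the already-compressed tags around the hit - no decompression needed.
            highlighted += [CODE_TABLE["<strong>"], code, CODE_TABLE["</strong>"]]
        else:
            highlighted.append(code)

    print("".join(highlighted))   # this token stream now decodes to: welcome to the <strong>we</strong>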

The other reason I like this idea is that by using a common and constant compression table, each page can be constructed from parts of pre-compressed HTML. If the key table is the same for all requests, then generating the header is simple and blocks can be concatenated, which is not normally possible as they are based on different key tables. For example, a single search result can be concatenated to the previous result, which, taking a step back, makes a good platform for distributed processing and caching. Once they have built this type of compression engine they can use it for pretty much any page, from news to mail. Maybe that is the Google advantage?
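
Interestingly, the real gzip format already has a property along these lines: independently compressed members can simply be glued together and the result is still a valid gzip stream. This works at the level of whole gzip members rather than the shared-key-table scheme I'm speculating about, and the fragments below are just examples, but it shows that stitching independently compressed pieces together is at least compatible with the format:

    import gzip

    # Two results compressed independently, then simply concatenated.
    result_1 = gzip.compress(b'<div class="result">first hit</div>')
    result_2 = gzip.compress(b'<div class="result">second hit</div>')

    page = result_1 + result_2    # no recompression, just byte concatenation
    print(gzip.decompress(page))  # both fragments come back out as one stream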

Maybe they don't do anything like this - I really have no idea - but this sounds like a possible solution and meets all the functional and performance requirements that I can think of right now. I might go and investigate it some more.

Any thoughts?
