Thursday, May 27, 2010

Flate Compression

Dynamic Huffman Compression

The Data Block


The following is a real-world data block encoded using dynamic huffman compression. We are going to apply the decompression rules to this block to understand what happens under the hood.

Here is the memory dump of the compressed stream:
XX00: 48 89 AC 93 5F 4F C2 30-14 C5 DF FB 29 EE A3 FA
XX10: D0 AE 5B D7 76 09 21 61-13 7D 41 A2 A6 C9 1E 08

The way to read this data is to arrange the bytes in reverse order and extract bits from the right-hand end:

93 AC 89 48
10010011 10101100 10001001 01001000

30 C2 4F 5F
00110000 11000010 01001111 01011111

The first two bytes have a special significance and are not strictly a part of the compressed stream itself.
The first byte, 48, must contain an 8 in its least significant nibble, indicating compression method 8 (deflate).
The second byte, 89, must be such that the 16-bit integer formed by the first two bytes (0x4889) is divisible by 0x1F (31). It acts as a checksum on the header.
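As a quick illustration, here is a minimal Python sketch of this header check (the function name and layout are my own, not taken from any library):

def check_zlib_header(data):
    """Validate the two-byte header that precedes the deflate stream."""
    cmf, flg = data[0], data[1]                # 0x48, 0x89 in the example
    assert cmf & 0x0F == 8, "compression method must be 8 (deflate)"
    assert (cmf * 256 + flg) % 31 == 0, "header bytes must be divisible by 31"
    return data[2:]                            # the compressed stream proper

dump = bytes.fromhex("4889AC935F4FC23014C5DFFB29EEA3FA")
stream = check_zlib_header(dump)               # passes: 0x4889 % 31 == 0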

The Compressed Stream

The compressed stream starts from the third byte onwards. Now consider the following bit-buffer available from the next 3 bytes:

5F 93 AC
01011111 10010011 10101100

Extract from the right hand side:
1) 3 bits: 100 - This is the block header
   - the rightmost bit, 0, indicates that the block is not the last one
   - the remaining two bits, 10 (= 2), indicate dynamic huffman compression
2) 5 bits: 10101 - This is the number of literal codes (21 + 257 = 278)
3) 5 bits: 10011 - This is the number of distance codes (19 + 1 = 20)
4) 4 bits: 1100 - This is the number of code lengths (12 + 4 = 16)
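To make the extraction rules concrete, here is a small Python sketch of an LSB-first bit reader applied to the stream bytes (class and variable names are my own); it reproduces the header fields above:

class BitReader:
    """Minimal LSB-first bit reader; a sketch, not production code."""
    def __init__(self, data):
        self.data, self.pos = data, 0          # pos counts bits consumed
    def read_bits(self, n):
        """Read n bits, least significant bit first, as an integer."""
        value = 0
        for i in range(n):
            byte = self.data[self.pos >> 3]
            value |= ((byte >> (self.pos & 7)) & 1) << i
            self.pos += 1
        return value

br = BitReader(bytes.fromhex("AC935F4FC23014C5DFFB29EE"))
bfinal = br.read_bits(1)         # 0  -> not the last block
btype  = br.read_bits(2)         # 2  -> dynamic huffman compression
hlit   = br.read_bits(5) + 257   # 21 + 257 = 278 literal/length codes
hdist  = br.read_bits(5) + 1     # 19 + 1   = 20 distance codes
hclen  = br.read_bits(4) + 4     # 12 + 4   = 16 code lengths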

Code Lengths

The next set of data consists of 3-bit code lengths. There are 16 of these in the stream:

30 C2 4F 5F
00110000 11000010 01001111 01011111

FB DF C5 14
11111011 11011111 11000101 00010100

So the numbers are (going right to left in the sequence above):
7 5 6 3 2 2 0 3 0 3 0 5 0 5 0 7
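Continuing the bit-reader sketch from above, these sixteen 3-bit values can be pulled out with a simple loop (again, purely illustrative):

cl_lengths = [br.read_bits(3) for _ in range(hclen)]
# -> [7, 5, 6, 3, 2, 2, 0, 3, 0, 3, 0, 5, 0, 5, 0, 7]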

What do these numbers represent? They are code lengths for a huffman code. That's the next topic for discussion.

Huffman Coding
Huffman coding tries to reduce the number of bytes required for storing a given set of data. It does this by using fewer bits to represent more frequently occurring symbols and more bits for less frequently occurring symbols. Thereby, the total number of bits required for storing the same data is smaller.

Consider the following character set with the given probability of occurrence of the different symbols:
A: 0.11
B: 0.14
C: 0.12
D: 0.13
E: 0.24
F: 0.26

The huffman code is constructed by creating a binary tree as follows:

A. Arrange the symbols in ascending order of probability.

A: 0.11
C: 0.12
D: 0.13
B: 0.14
E: 0.24
F: 0.26

B. Then select the first two elements (the two with the lowest probabilities) and combine them into a new element. The probability of occurrence of the new element is the sum of its constituent elements' probabilities. Create a new list including the new element.

D: 0.13
B: 0.14
C1: 0.23 (C1 = A + C)
E: 0.24
F: 0.26

Repeat till only two elements remain; combining these two gives the root of the tree.

C1: 0.23
E: 0.24
F: 0.26
C2: 0.27 (C2 = D + B)

Repeat:

F: 0.26
C2: 0.27 (C2 = D + B)
C3: 0.47 (C3 = C1 + E)

Repeat:

C3: 0.47
C4: 0.53 (C4 = F + C2)

Now we get the huffman tree:
            C5
           /  \
         C3    C4
        /  \   / \
      C1    E F   C2
     /  \        /  \
    A    C      D    B

Now, in the above tree, if all left branches are labelled 0 and right branches are labelled 1, then the following codes are obtained for the various symbols:

A: 000
B: 111
C: 001
D: 110
E: 01
F: 10
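For reference, here is a short Python sketch of the same construction using heapq, which implements the same repeated pick-the-two-smallest step. The exact bit patterns depend on how ties between equal sums are broken; with the tie-break used here it happens to reproduce the codes derived above:

import heapq

probs = {'A': 0.11, 'B': 0.14, 'C': 0.12, 'D': 0.13, 'E': 0.24, 'F': 0.26}
heap = [(p, i, sym) for i, (sym, p) in enumerate(sorted(probs.items()))]
heapq.heapify(heap)
count = len(heap)
while len(heap) > 1:
    p1, _, left = heapq.heappop(heap)          # the two least probable nodes
    p2, _, right = heapq.heappop(heap)
    heapq.heappush(heap, (p1 + p2, count, (left, right)))
    count += 1

def codes(node, prefix=''):
    """Walk the tree: left branches are 0, right branches are 1."""
    if isinstance(node, str):                  # leaf: a symbol
        return {node: prefix}
    left, right = node
    return {**codes(left, prefix + '0'), **codes(right, prefix + '1')}

print(codes(heap[0][2]))
# {'A': '000', 'C': '001', 'E': '01', 'F': '10', 'D': '110', 'B': '111'}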

For flate coding, the tree has to be rearranged as follows:
            o
          /   \
        o       o
       / \     / \
      E   F   o   o
             / \ / \
             A B C D

The above tree uses the same number of bits per code but it has two special constraints:
a) shorter codes appear to the left of longer codes
b) codes of the same length are assigned in symbol order, so they are lexicographically sequential

The resultant codes are:
A: 100
B: 101
C: 110
D: 111
E: 00
F: 01

With these constraints, it is sufficient to pass the lengths of the codes to be able to construct the huffman code tree.

So, the sequence 3, 3, 3, 3, 2, 2 corresponding to A, B, C, D, E, F is sufficient to construct the codes. This can be shown simplistically as follows:

Select the codes of length 2. Since there are two of them, these will be 00 and 01. To construct the first code of length 3, take the value one more than the last 2-bit code (01 + 1 = 10) and append a further bit, giving 100. The subsequent codes of length 3 simply count upwards: 101, 110, 111.

Consequently: A: 100, B: 101, C: 110, D: 111, E: 00, F: 01
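This length-to-code construction is exactly the algorithm given in RFC 1951, section 3.2.2. A Python sketch of it (names are my own):

def canonical_codes(lengths):
    """Assign flate-style canonical codes from a list of code lengths,
    one per symbol; a length of 0 means the symbol does not occur.
    Follows the algorithm in RFC 1951, section 3.2.2."""
    max_len = max(lengths)
    bl_count = [0] * (max_len + 1)             # how many codes of each length
    for l in lengths:
        if l:
            bl_count[l] += 1
    next_code = [0] * (max_len + 1)            # first code of each length
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    result = {}
    for symbol, l in enumerate(lengths):       # assign in symbol order
        if l:
            result[symbol] = format(next_code[l], '0{}b'.format(l))
            next_code[l] += 1
    return result

print(canonical_codes([3, 3, 3, 3, 2, 2]))     # lengths for A..F
# {0: '100', 1: '101', 2: '110', 3: '111', 4: '00', 5: '01'}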

Code Lengths (Again)

Coming back to the sequence we had started with:
7, 5, 6, 3, 2, 2, 0, 3, 0, 3, 0, 5, 0, 5, 0, 7

However, this sequence is itself permuted: the first number is the code length for symbol 16, the second for symbol 17, and so on, according to the following fixed order (any symbols not transmitted get a length of 0):
16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15

So, if we rearrange the sequence, we get

00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18
3 0 7 5 5 3 3 2 2 0 0 0 0 0 0 0 7 5 6
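In code, the rearrangement is just an application of this fixed permutation (reusing cl_lengths from the bit-reader sketch above):

# The fixed transmission order defined by the format:
CL_ORDER = [16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15]
cl_code_lengths = [0] * 19                      # untransmitted lengths stay 0
for value, symbol in zip(cl_lengths, CL_ORDER):
    cl_code_lengths[symbol] = value
# -> [3, 0, 7, 5, 5, 3, 3, 2, 2, 0, 0, 0, 0, 0, 0, 0, 7, 5, 6]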

So the sequence 3, 0, 7, 5, 5, 3, 3, 2, 2, 0, 0, 0, 0, 0, 0, 0, 7, 5, 6 represents a huffman code tree. This means that by following an algorithm, we should be able to construct the huffman codes for the symbol set:
 0: 100
 1: -
 2: 1111110
 3: 11100
 4: 11101
 5: 101
 6: 110
 7: 00
 8: 01
 9: -
10: -
11: -
12: -
13: -
14: -
15: -
16: 1111111
17: 11110
18: 111110
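Feeding the rearranged lengths to the canonical_codes sketch from earlier reproduces this table. A simple (if inefficient) way to decode symbols with it is to append stream bits until a code matches, since deflate packs huffman codes most significant bit first:

cl_codes = canonical_codes(cl_code_lengths)
# cl_codes[17] == '11110', cl_codes[0] == '100', cl_codes[16] == '1111111', ...

def read_symbol(br, codes):
    """Decode one huffman symbol by appending stream bits (MSB first)
    until the accumulated bit string matches a code."""
    lookup = {v: k for k, v in codes.items()}
    bits = ''
    while bits not in lookup:
        bits += str(br.read_bits(1))
    return lookup[bits]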

Now the following data stream is available:

EE 29 FB DF C5 14
11101110 00101001 11111011 11011111 11000101 00010100

Unlike the header fields, huffman codes are packed starting from the most significant bit, so they have to be read in a bit-reversed manner. The seven bits left over from DF, read in stream order, are 1111011; the first five of these, 11110, form the code for 17.

Now, 17 is a repeat indicator. It will be followed by 3 bits indicating the repetition count.

What is left in the buffer is just the two bits 11. Filling the buffer with the next byte gives the pattern:
11111011 11
The last 3 bits are 111 = 7, meaning that a code length of 0 is repeated 7 + 3 = 10 times. This means that the first 10 code lengths being decoded are zeroes.

Now that these are consumed, the buffer is:
1111101

This provides the pattern 101, the code for the value 5. This is a literal code length and is appended as such to the sequence being decoded.

Getting the next 2 bytes into the buffer yields:
11101110 00101001 1111

Consuming the next huffman code: the six bits at the right of the buffer, 01 1111 (111110 when read in stream order), correspond to 18, which introduces a long run of zeroes. Consuming it leaves the buffer as:
11101110 001010

The next 7 bits represent the repetition count for the zeroes: 0 001010 => 0x0A => 10, so 0 is repeated 10 + 11 = 21 times.
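Putting the pieces together, here is a sketch of the loop that decodes code lengths, handling the repeat codes 16, 17 and 18 (it reuses read_symbol, cl_codes and br from the sketches above; the dump at the top only shows the first bytes of the stream, so a real run would need the full data):

def read_code_lengths(br, cl_codes, count):
    """Decode `count` code lengths: symbols 0-15 are literal lengths,
    16 repeats the previous length 3-6 times, 17 repeats zero 3-10 times,
    18 repeats zero 11-138 times."""
    lengths = []
    while len(lengths) < count:
        sym = read_symbol(br, cl_codes)
        if sym < 16:
            lengths.append(sym)
        elif sym == 16:
            lengths.extend([lengths[-1]] * (3 + br.read_bits(2)))
        elif sym == 17:
            lengths.extend([0] * (3 + br.read_bits(3)))
        else:                                   # sym == 18
            lengths.extend([0] * (11 + br.read_bits(7)))
    return lengths

# 278 literal/length code lengths followed by 20 distance code lengths,
# decoded as one continuous sequence:
all_lengths = read_code_lengths(br, cl_codes, hlit + hdist)
litlen_lengths, dist_lengths = all_lengths[:hlit], all_lengths[hlit:]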

Literal Table and Distance Tables
How much of this data are we going to consume? The counts extracted at the start of the compressed stream specify the sizes of the literal table and the distance table (as well as the code-lengths table that we have already extracted).

We now have to keep decoding until we have produced as many code lengths as the literal and distance table sizes specify. In the example given, this is 278 and 20 respectively, i.e. 298 code lengths in total (note that a single repeat code can produce many of them).

What is the data that is decoded as a result? Two more huffman code trees - one with 278 entries and the other with 20 entries. In other words, the 278 code lengths that we have decoded are the sequence from which the literal/length huffman tree can be reconstructed.

So, now, we have 278 entries with varied code lengths. We use the algorithm to generate the huffman codes and we get 278 codes. These are distributed as follows:
0 - 255: the corresponding codes represent literals
256: this is a special code indicating the end of the block
257 - 277: these are the length codes of the LZ77 compression

Likewise for the 20 distance values.

Now we can read the remaining data in the compressed stream using the codes corresponding to the 278 entries.
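As a final sketch, using the helpers introduced above and the base/extra-bit tables from RFC 1951, section 3.2.5, the remaining decoding loop looks roughly like this (again assuming br holds the full stream):

# Base values and extra-bit counts from RFC 1951, section 3.2.5:
LENGTH_BASE  = [3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
                35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258]
LENGTH_EXTRA = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
                3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 0]
DIST_BASE    = [1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
                257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145,
                8193, 12289, 16385, 24577]
DIST_EXTRA   = [0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6,
                7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13]

def inflate_block(br, litlen_codes, dist_codes, out):
    """Decode one dynamic-huffman block into `out` (a bytearray)."""
    while True:
        sym = read_symbol(br, litlen_codes)
        if sym < 256:                           # a literal byte
            out.append(sym)
        elif sym == 256:                        # end of block
            return
        else:                                   # an LZ77 length/distance pair
            length = LENGTH_BASE[sym - 257] + br.read_bits(LENGTH_EXTRA[sym - 257])
            dsym = read_symbol(br, dist_codes)
            dist = DIST_BASE[dsym] + br.read_bits(DIST_EXTRA[dsym])
            for _ in range(length):             # the copy may overlap itself
                out.append(out[-dist])

litlen_codes = canonical_codes(litlen_lengths)
dist_codes = canonical_codes(dist_lengths)
output = bytearray()
inflate_block(br, litlen_codes, dist_codes, output)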

And that gives us the data we were looking for.

