Puffin is a deterministic deflate recompressor. It is mainly used for patching deflate compressed images (zip, gzip, etc.) because patches created from deflate files/streams are usually large (deflate has a bit-aligned format, hence, changing one byte in the raw data can cause the entire deflate stream to change drastically.)
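The effect of the bit-aligned format is easy to reproduce with Python's standard zlib module. The sketch below is only an illustration and is not part of Puffin; the sample data and the helper names count_differing_bytes and make_sample are made up for the demonstration. It compresses two inputs that differ in exactly one byte and counts how many bytes differ between the two deflate outputs.

```python
import random
import zlib


def count_differing_bytes(a: bytes, b: bytes) -> int:
    """Count positions where a and b differ, plus any length difference."""
    common = min(len(a), len(b))
    mismatches = sum(1 for i in range(common) if a[i] != b[i])
    return mismatches + abs(len(a) - len(b))


def make_sample(seed: int = 0, size: int = 20000) -> bytes:
    """Build compressible pseudo-text from a small vocabulary (demo data only)."""
    rng = random.Random(seed)
    words = [b"puff", b"huff", b"deflate", b"stream", b"patch", b"block"]
    out = bytearray()
    while len(out) < size:
        out += rng.choice(words) + b" "
    return bytes(out[:size])


original = make_sample()
changed = bytearray(original)
changed[100] ^= 0xFF  # change exactly one byte of the raw data
modified = bytes(changed)

raw_diff = count_differing_bytes(original, modified)
deflate_diff = count_differing_bytes(zlib.compress(original), zlib.compress(modified))

print("bytes differing in raw data:      ", raw_diff)
print("bytes differing in deflate stream:", deflate_diff)
```

The exact count for the compressed streams depends on the zlib version and the input, so no particular number is claimed here; the point is to compare it against the single-byte change in the raw data.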
Puffin has two tools (operations): puffdiff and puffpatch (shown here). The purpose of the puffdiff operation is to create a patch between a source and a target file, one or both of which contain deflate streams. This patch is used by the puffpatch operation to generate the target file from the source file deterministically. The patch itself is created by the bsdiff library (which can be replaced by any other diffing mechanism). Before bsdiff creates the patch, however, both the source and target files have to be transformed into a specific format so that bsdiff can produce smaller patches. We define the puff operation to perform this transformation. The resulting stream is called a puff stream. The reverse transformation (from a puff stream to a deflate stream) is called the huff operation. huff is used on the client to transform the puff stream back into its original deflate stream deterministically.
For information about deflate format see RFC 1951.
puff is an operation that decompresses only the Huffman part of the deflate stream and keeps the structure of the LZ77 coding unchanged. This is roughly equivalent to decompressing ‘half way’.
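To make ‘half way’ concrete, the sketch below models what remains once the Huffman layer is removed: a sequence of literal runs and LZ77 length/distance copies. The classes Literals and Copy and the function expand are hypothetical illustrations, not Puffin's actual data structures or its on-disk puff format. A full decompressor would go on to resolve the copies, as expand does; puff stops before that step and keeps the tokens.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Literals:
    data: bytes      # bytes that appear verbatim in the output


@dataclass
class Copy:
    length: int      # number of bytes to copy
    distance: int    # how far back in the output the copy starts


Token = Union[Literals, Copy]


def expand(tokens: List[Token]) -> bytes:
    """Resolve LZ77 tokens into raw bytes (the step that puff skips)."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, Literals):
            out += tok.data
        else:
            if not 0 < tok.distance <= len(out):
                raise ValueError("copy distance points outside the output")
            start = len(out) - tok.distance
            # Copy byte by byte: source and destination may overlap.
            for i in range(tok.length):
                out.append(out[start + i])
    return bytes(out)


tokens = [Literals(b"abc"), Copy(length=6, distance=3), Literals(b"!")]
print(expand(tokens))  # b'abcabcabc!'
```

Because the token sequence is kept as-is, the reverse direction never has to search for matches again; it only has to re-encode the same tokens.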
huff is the exact opposite of puff: it deterministically converts the puff stream back to its original deflate stream. This deterministic conversion is based on two facts. First, puff leaves the structure of the LZ77 coding unchanged, so huff does not have to redo the LZ77 step. Second, the Huffman tables are stored in the puff stream. This means the deflate stream can be reconstructed deterministically using the Huffman tables.
The inclusion of the Huffman tables in the puff stream adds only a minuscule space burden (on average, at most 320 bytes for each block; since there is roughly one block per 32KB of uncompressed data, this is about 1% of the uncompressed size).
A bsdiff patch between two puffed streams is much smaller than the patch between their deflate streams, but larger than the patch between the fully uncompressed streams.
The major benefits
puff and huff are deterministic operations.
huff is on the order of 10X faster than full recompression. It is even faster than the Huffman algorithm itself because it already has the Huffman tables and does not need to reconstruct them.
The drawbacks
A deflate-compressed file (gzip, zip, etc.) contains multiple deflate streams at different locations. When this file is puffed, the resulting puff stream contains both the puffs and the raw data that existed between the deflates in the compressed file. When performing the huff operation, the locations of the puffs in the puff stream and of the deflates in the deflate stream must be passed in, because huff has to know exactly where both the puffs and the deflates are. (See the following image)
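The location bookkeeping can be sketched as follows. This is an illustration under stated assumptions: the Extent type, the byte-granular offsets, and the function puff_extents are all hypothetical and are not Puffin's API. Given where each deflate sits in the compressed file and how large each one becomes once puffed, the sketch computes where the corresponding puffs land in the puff stream; the raw data between deflates keeps its size and simply shifts.

```python
from typing import List, NamedTuple


class Extent(NamedTuple):
    offset: int
    length: int


def puff_extents(deflates: List[Extent], puff_sizes: List[int]) -> List[Extent]:
    """Map deflate locations to puff locations.

    deflates must be sorted by offset and non-overlapping; puff_sizes[i] is
    the size of deflates[i] after puffing.
    """
    if len(deflates) != len(puff_sizes):
        raise ValueError("need exactly one puff size per deflate")
    puffs = []
    shift = 0      # total growth contributed by the deflates seen so far
    prev_end = 0
    for deflate, puff_size in zip(deflates, puff_sizes):
        if deflate.offset < prev_end:
            raise ValueError("deflates must be sorted and non-overlapping")
        puffs.append(Extent(deflate.offset + shift, puff_size))
        shift += puff_size - deflate.length
        prev_end = deflate.offset + deflate.length
    return puffs


deflates = [Extent(10, 100), Extent(150, 50)]
puffs = puff_extents(deflates, [300, 120])
print(puffs)  # [Extent(offset=10, length=300), Extent(offset=350, length=120)]
```

In this example the first deflate grows by 200 bytes when puffed, so everything after it, including the second puff, moves 200 bytes further into the puff stream.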
Similarly, puffpatch requires the locations of the deflates and puffs in both the source and target streams. This location information is saved using Protobufs in the patch generated by bsdiff. One requirement for these operations is that puffpatch has to be efficient on the client. Client devices have limited memory and CPU bandwidth, so each of these operations must be performed as efficiently as possible. To achieve this efficiency, a new operation can be added to bspatch that reads from and writes to puff streams using special interfaces for puffing and huffing on the fly.
bsdiff program.
Depending on the scheme for storing the Huffman tables, the payload size can change. We discovered that the following scheme does not produce the smallest payload, but it is the most deterministic one. In a deflate stream, the Huffman tables for literals/lengths and distances are themselves Huffman coded. In this format, we also puff the Huffman tables into the puff stream instead of completely decompressing them.
There are three tables stored in this structure, which is very similar to the one defined in RFC 1951. A Huffman table can be defined as an array of unsigned integer code length values. The three puffed Huffman tables are laid out as follows. The first table (codes) is the Huffman table used to decode the next two Huffman tables. The second Huffman table is used to decode literals and lengths, and the third Huffman table is used to decode distances.
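An array of code lengths is enough to describe a Huffman table because deflate uses canonical Huffman codes: RFC 1951 (section 3.2.2) gives a procedure that derives every code from the lengths alone. The sketch below follows that procedure. The function name codes_from_lengths is our own, and the code illustrates the RFC's algorithm rather than Puffin's implementation.

```python
from typing import Dict, List, Tuple


def codes_from_lengths(lengths: List[int]) -> Dict[int, Tuple[int, int]]:
    """Return {symbol: (code, bit_length)} for symbols with nonzero length."""
    max_len = max(lengths, default=0)
    # Step 1: count how many codes exist for each bit length.
    bl_count = [0] * (max_len + 1)
    for n in lengths:
        if n:
            bl_count[n] += 1
    # Step 2: find the smallest code value for each bit length.
    next_code = [0] * (max_len + 1)
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    # Step 3: hand out consecutive codes to symbols in symbol order.
    result = {}
    for symbol, n in enumerate(lengths):
        if n:
            result[symbol] = (next_code[n], n)
            next_code[n] += 1
    return result


# Example from RFC 1951: symbols A..H with lengths (3, 3, 3, 3, 3, 2, 4, 4).
table = codes_from_lengths([3, 3, 3, 3, 3, 2, 4, 4])
for symbol, (code, n) in table.items():
    print("ABCDEFGH"[symbol], format(code, "0{}b".format(n)))
```

For the RFC's example this yields 010, 011, 100, 101, 110 for A through E, 00 for F, and 1110 and 1111 for G and H, matching the table in the RFC.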
Literals lists are constructed from a “length” value followed by “length” bytes of literals. The puffer should try to merge adjacent literals lists as much as possible into one literals list in the puff stream. A literals list is a length value followed by that many bytes of literals (even if there is only one literal).
An LZ77 back-reference is a length value followed by a distance value.
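The merging rule for literals lists can be sketched as below. The tuple-based token representation, the function merge_literals, and the one-byte length prefix in encode_literals are hypothetical and chosen only for illustration; they are not Puffin's actual encoding. Adjacent literal runs are concatenated into one list, while length/distance pairs pass through unchanged and act as separators.

```python
from typing import List, Tuple, Union

# A token is either ("lit", bytes) or ("copy", length, distance).
Token = Union[Tuple[str, bytes], Tuple[str, int, int]]


def merge_literals(tokens: List[Token]) -> List[Token]:
    """Merge runs of adjacent literal tokens into a single literals list."""
    merged: List[Token] = []
    for tok in tokens:
        if tok[0] == "lit" and merged and merged[-1][0] == "lit":
            merged[-1] = ("lit", merged[-1][1] + tok[1])
        else:
            merged.append(tok)
    return merged


def encode_literals(data: bytes) -> bytes:
    """Toy encoding: one length byte followed by that many literal bytes."""
    if not 1 <= len(data) <= 255:
        raise ValueError("toy encoding handles 1..255 literals per list")
    return bytes([len(data)]) + data


tokens = [("lit", b"a"), ("lit", b"bc"), ("copy", 3, 3), ("lit", b"d")]
merged = merge_literals(tokens)
print(merged)                          # [('lit', b'abc'), ('copy', 3, 3), ('lit', b'd')]
print(encode_literals(merged[0][1]))   # b'\x03abc'
```

Merging matters for determinism as well as size: two puffers that split the same literal bytes differently would otherwise produce different puff streams for identical input.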
Currently Puffin is being used in both Android and Chrome OS and is built differently in each of them. There is also a Makefile build, but it is not comprehensive.
Puffin builds an executable, puffin, which can be used to perform diffing and patching using the Puffin format. To get the list of available options, run:
puffin --help
It can also be used as a library (currently used by update_engine) that provides different APIs.
To compute the diff between our current state and the original changes: git diff 4180a65119ef2c333c4d33c9e39869da89a8faea -- .