Chunking
Chunking is a technique for data deduplication, i.e. the elimination of duplicate copies of repeating data.
In short, when a build is processed, its files are cut into multiple smaller pieces, called chunks. If a build is changed, in general only a small number of chunks are touched and changed.
Chunks represent parts of a file's content. The following illustrates a file that was chunked into 6 chunks: A, B, C, D, E and F.
A chunk is typically referred to by its hash. The hash is obtained by applying a cryptographic hash function, like SHA-1, to the content of the chunk; an example of such a hash is 2fd4e1c67a2d28fced849ee1bb76e7391b93eb12. It is generally assumed that a hash uniquely identifies its content. So if two chunks have the same hash, they are assumed to contain the same data.
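As a small illustration, the following sketch computes such a chunk hash (Java, assuming JDK 17+ for java.util.HexFormat). Incidentally, the example hash above is exactly the SHA-1 of the ASCII text "The quick brown fox jumps over the lazy dog":

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

public class ChunkHash {
    // Computes the SHA-1 hash of a chunk's content as a lowercase hex string.
    static String hashChunk(byte[] chunkContent) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        return HexFormat.of().formatHex(sha1.digest(chunkContent));
    }

    public static void main(String[] args) throws Exception {
        byte[] chunk = "The quick brown fox jumps over the lazy dog"
                .getBytes(StandardCharsets.UTF_8);
        // Prints 2fd4e1c67a2d28fced849ee1bb76e7391b93eb12
        System.out.println(hashChunk(chunk));
    }
}
```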
When a build is changed, the majority of chunks remain the same. As an example, consult the following illustration.
In this scenario, a specific part of the file is changed: the text "foo" was replaced by "bar". The affected chunk is chunk C; its content after the modification differs from its original content. However, all other chunks (A, B, D, E and F) stay untouched. As a result, only the new chunk C* has to be stored additionally.
Chunking systems are primarily used for data deduplication. In a sufficiently large build it is highly likely that multiple small regions in its files contain the exact same content. For example, the word "hello" may be present in multiple files or even multiple times in the same file. Repeating data does not have to be stored multiple times; one copy is enough. Data deduplication aims at detecting such duplicates and eliminating them.
Hence, chunking all files in a build is likely to produce multiple chunks that have the same content, i.e. the same hash. When storing chunks, only one of those duplicates has to be kept and, by that, the size of the build, as stored, is reduced.
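To make this concrete, a minimal, hypothetical chunk store could key chunks by their hash, so a duplicate chunk is kept only once (class and method names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Deduplicating chunk store: chunks are keyed by their hash, so storing
// the same content twice keeps only one copy.
class ChunkStore {
    private final Map<String, byte[]> chunksByHash = new HashMap<>();

    // Returns true if the chunk was new, false if it was a duplicate.
    boolean store(String hash, byte[] content) {
        return chunksByHash.putIfAbsent(hash, content) == null;
    }

    byte[] load(String hash) {
        return chunksByHash.get(hash);
    }
}
```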
The size of a chunk plays a major role. Generally speaking, the smaller the chunks, the higher the data deduplication and the smaller the stored build size. The majority of deduplication systems prefer a chunk size between 1 KB and 500 KB.
However, in the context of up- and downloads, choosing a very small chunk size also leads to an increased number of connections, as each chunk has to be transferred individually. There exist mitigation techniques where multiple chunks are bundled and uploaded together as a single entity, but choosing an effective bundle strategy is not trivial either.
In the following, different strategies for chunking are presented.
- Fixed-Size-Chunking (FSC)
- Content-Defined-Chunking (CDC)
  - FastCDC
FSC is the simplest approach to chunking. Files are simply cut into chunks of fixed size.
As an example, suppose a file of 5 MB and a chunk size of 1 MB. FSC will create 5 chunks, each with a size of 1 MB. If the file size is not divisible by the chunk size, the last chunk consists of the remaining data. For example, a file with 2.3 MB of data results in 3 chunks of 1 MB, 1 MB and 0.3 MB.
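A minimal FSC implementation could look like the following sketch (names are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FixedSizeChunking {
    // Cuts data into chunks of chunkSize bytes; the last chunk holds
    // whatever remains and may be smaller.
    static List<byte[]> chunk(byte[] data, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += chunkSize) {
            int end = Math.min(offset + chunkSize, data.length);
            chunks.add(Arrays.copyOfRange(data, offset, end));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // A 2.3 MB file with a 1 MB chunk size yields chunks of
        // 1 MB, 1 MB and 0.3 MB, as in the example above.
        byte[] file = new byte[2_300_000];
        for (byte[] c : chunk(file, 1_000_000)) {
            System.out.println(c.length);
        }
    }
}
```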
An advantage of this technique is that it is computationally cheap; a build can thus be chunked very fast, which contributes to a fast processing time. The fact that the vast majority of chunks have the exact same size also simplifies logic that has to work with those chunks.
FSC suffers heavily from what is known as the Boundary-Shift-Problem. The technique is naive and cuts a file sequentially into fixed-sized chunks; it does not consider the actual data that is subject to be chunked.
Consider the following scenario.
Here, new content is inserted into a file. As a result, all subsequent data in the file is shifted. In the illustration, dotted lines indicate the boundaries of the previous chunks with color representing the respective previous chunk content.
FSC cannot detect this; it will simply cut the file each time the chunk size has been reached, resulting in all subsequent chunks being different, so C*, D*, E*, F* and G have to be stored additionally. The chunks C, D, E and F have to be thrown away, although the actual content of the build barely changed.
In the worst case this means that inserting a single byte at the beginning of a 50 GB file results in a patch size of 50 GB, as all chunks are different.
The only ways to avoid the Boundary-Shift-Problem with FSC are to either replace content without changing its size or to append changes at the end of a file.
CDC is a chunking technique that addresses the Boundary-Shift-Problem. Instead of cutting after a fixed size, it interprets the actual data that is subject to be chunked and only cuts when the data meets a certain criterion.
As an example, suppose the simple, but poor, strategy of defining a cut-point each time the binary pattern 011100 is detected.
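The following toy sketch implements that strategy: it scans the bit stream through a 6-bit window and, for simplicity, places the cut-point after the byte in which the pattern completes (all names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class PatternChunking {
    private static final int PATTERN = 0b011100; // cut whenever these 6 bits appear
    private static final int WINDOW_MASK = 0b111111;

    // Returns the byte offsets at which chunks end. Boundaries depend only
    // on the content itself, not on absolute file offsets.
    static List<Integer> cutPoints(byte[] data) {
        List<Integer> cuts = new ArrayList<>();
        int window = 0;
        for (int i = 0; i < data.length; i++) {
            for (int bit = 7; bit >= 0; bit--) {
                window = ((window << 1) | ((data[i] >> bit) & 1)) & WINDOW_MASK;
                if (window == PATTERN) {
                    if (cuts.isEmpty() || cuts.get(cuts.size() - 1) != i + 1) {
                        cuts.add(i + 1); // cut after the current byte
                    }
                    window = 0; // start fresh for the next chunk
                }
            }
        }
        return cuts;
    }
}
```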
Since cut-points are defined based on the actual content, inserting new data will preserve cut-points as the content is only shifted, not changed.
Considering the previous example: although data was inserted and the remaining content shifted, only chunk C* is modified. Chunks D, E and F remain untouched since cut-points are preserved.
A disadvantage of this poor strategy is that the chunk size is extremely variable and, in the worst case, a chunk could even span the whole file if the pattern is not present at all.
Fast-Content-Defined-Chunking (see publication) is an efficient content-defined chunking algorithm based on rolling hashes. It was developed as a faster alternative to existing CDC algorithms based on Rabin fingerprints.
The algorithm uses a sliding window of 48 bits, also called the fingerprint, implemented as a rolling hash. The algorithm reads the file byte-wise. To address poor slicing characteristics due to bit patterns that occur more frequently in practice, like 0000... or 1111..., each byte is mapped to another byte before it is taken into account for the fingerprint calculation. The exact mapping, known as the gear, is random but deterministic and just acts as noise. The resulting byte is then added to the fingerprint, while shifting it as the file is read:
fingerprint = (fingerprint << 1) + gear[byteFromFile];
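As a sketch, the gear table and the update rule could be implemented as follows; the table contents and the seed are illustrative, only the random-but-deterministic property matters:

```java
import java.util.Random;

public class Gear {
    // The gear: a fixed, pseudo-random mapping from each of the 256 possible
    // byte values to another byte. Generated with a fixed seed, so it is
    // random but deterministic across runs.
    static final int[] GEAR = new int[256];
    static {
        Random rng = new Random(1337); // illustrative seed
        for (int i = 0; i < GEAR.length; i++) {
            GEAR[i] = rng.nextInt(256);
        }
    }

    // The update rule from above: shift the fingerprint and add the
    // gear-mapped input byte, keeping the lower 48 bits (the window size).
    static long update(long fingerprint, byte byteFromFile) {
        return ((fingerprint << 1) + GEAR[byteFromFile & 0xFF]) & 0xFFFFFFFFFFFFL;
    }
}
```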
The fingerprint is then checked against a mask and if they match, a cut-point is created.
if ((fingerprint & mask) == 0) ...
The mask has a size of 48 bits and is generated randomly but deterministically. The numbers of 1-bits and 0-bits are chosen carefully based on the expected chunk size: the number of 1-bits is given by the binary logarithm (log_2) of the expected chunk size.
To reach an expected chunk size of 1 MB, ceil(log_2(1000000)) = 20 1-bits have to be used. The exact distribution can be chosen randomly or derived empirically. As an example, the mask
110110010000001111000011010100110011010001100000
could be used.
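A mask with the required number of 1-bits could be derived as in the following sketch, which picks the bit positions randomly but deterministically (the seed is illustrative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class MaskGeneration {
    // Builds a 48-bit mask with ceil(log_2(expectedChunkSize)) 1-bits at
    // randomly chosen, but deterministic, positions.
    static long buildMask(int expectedChunkSize) {
        int oneBits = (int) Math.ceil(Math.log(expectedChunkSize) / Math.log(2));
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < 48; i++) positions.add(i);
        Collections.shuffle(positions, new Random(42)); // fixed seed: deterministic
        long mask = 0L;
        for (int i = 0; i < oneBits; i++) {
            mask |= 1L << positions.get(i);
        }
        return mask; // for 1 MB: 20 one-bits out of 48
    }
}
```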
To achieve an average chunk size that is closer to the expected chunk size, the algorithm uses two different masks that make cut-point creation less or more likely, respectively. This is called normalization. MaskSmall, used to make cut-point creation less likely, has two 1-bits more, and MaskLarge, which makes cut-point creation more likely, has two 1-bits less. The algorithm then uses MaskSmall as long as the currently collected data is still less than the expected chunk size and MaskLarge as soon as the data exceeds the expected chunk size. Example values for an expected chunk size of 1 MB could be:
MaskSmall = 110110010000001111100011010110110011010001100000
MaskLarge = 110110010000001011000011010000110011010001100000
The number of 1-bits by which the masks differ can be altered; the more they differ, the closer the average chunk size is to the expected chunk size, but the lower the data deduplication. This is referred to as the normalization level; the default is 2. If all bits are set to 1, FastCDC degrades to FSC.
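The mask selection itself then reduces to a comparison against the expected chunk size, sketched here:

```java
public class Normalization {
    // Uses the stricter MaskSmall while the current chunk is still shorter
    // than the expected size, the looser MaskLarge once it has grown past it.
    static long pickMask(int currentChunkLength, int expectedChunkSize,
                         long maskSmall, long maskLarge) {
        return currentChunkLength < expectedChunkSize ? maskSmall : maskLarge;
    }
}
```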
In order to avoid extremely small or extremely large chunks, the algorithm further employs a technique called cut-point skipping.
Therefore, a minimal and a maximal chunk size are introduced next to the expected chunk size. Before the cut-point computation is started, the algorithm always first consumes the minimal amount of data, thereby avoiding small chunks. Additionally, whenever the currently collected data exceeds the maximal chunk size, a cut-point is forced, avoiding excessively large chunks.
In practice, the minimal and maximal chunk sizes are typically set to 0.25 and 8 times the expected chunk size, respectively. So for an expected chunk size of 1 MB, the minimal chunk size is 0.25 MB and the maximal chunk size is 8 MB.
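Putting the pieces together, a complete FastCDC-style loop with normalization and cut-point skipping could look like the following sketch. The masks are the example values from above; the gear table and its seed are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class FastCdcSketch {
    // Parameters for an expected chunk size of 1 MB.
    static final int MIN_SIZE = 250_000;       // 0.25x expected
    static final int EXPECTED_SIZE = 1_000_000;
    static final int MAX_SIZE = 8_000_000;     // 8x expected
    static final long MASK_SMALL = 0b110110010000001111100011010110110011010001100000L; // 22 one-bits
    static final long MASK_LARGE = 0b110110010000001011000011010000110011010001100000L; // 18 one-bits

    static final int[] GEAR = new int[256];    // deterministic byte mapping
    static {
        Random rng = new Random(1337);
        for (int i = 0; i < GEAR.length; i++) GEAR[i] = rng.nextInt(256);
    }

    // Returns the length of the next chunk starting at 'offset'.
    static int nextChunkLength(byte[] data, int offset) {
        int remaining = data.length - offset;
        if (remaining <= MIN_SIZE) return remaining; // tail: emit as final chunk
        long fingerprint = 0;
        int limit = Math.min(remaining, MAX_SIZE);
        // Cut-point skipping: the first MIN_SIZE bytes are consumed without
        // any cut-point test, so no chunk becomes smaller than MIN_SIZE.
        for (int i = MIN_SIZE; i < limit; i++) {
            fingerprint = ((fingerprint << 1) + GEAR[data[offset + i] & 0xFF]) & 0xFFFFFFFFFFFFL;
            // Normalization: stricter mask below the expected size, looser above.
            long mask = (i < EXPECTED_SIZE) ? MASK_SMALL : MASK_LARGE;
            if ((fingerprint & mask) == 0) return i + 1; // cut-point found
        }
        return limit; // no cut-point: forced cut at MAX_SIZE (or end of data)
    }

    static List<Integer> chunkLengths(byte[] data) {
        List<Integer> lengths = new ArrayList<>();
        int offset = 0;
        while (offset < data.length) {
            int len = nextChunkLength(data, offset);
            lengths.add(len);
            offset += len;
        }
        return lengths;
    }
}
```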