Home
FastCDC4J is a fast and efficient content-defined chunking solution for data deduplication, implementing the FastCDC algorithm and offering the functionality as a simple library.
It is able to split files into chunks based on their content. Chunks are created deterministically and will likely be preserved even if the file is modified or data is moved, hence it can be used for data deduplication. It offers chunking of:
- `InputStream`
- `byte[]`
- `Path`, including directory traversal
- `Stream<Path>`
By utilizing the following built-in chunkers:
- FastCDC - Wen Xia et al. (publication)
- modified FastCDC - Nathan Fiedler (source)
- Fixed-Size-Chunking
And providing a high degree of customizability by offering ways to manipulate the algorithm.
The main interface of the chunkers, `Chunker`, provides the following methods (a usage sketch follows the list):
- `Iterable<Chunk> chunk(InputStream stream, long size)`
- `Iterable<Chunk> chunk(final byte[] data)`
- `Iterable<Chunk> chunk(final Path path)`
- `Iterable<Chunk> chunk(final Stream<? extends Path> paths)`
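A minimal sketch of how these methods might be used. The library imports are omitted for brevity (`Chunker`, `Chunk` and `ChunkerBuilder` come from the FastCDC4J library), and the file and directory names are placeholders:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public final class ChunkingExample {
	public static void main(String[] args) throws IOException {
		// A chunker with the default configuration
		Chunker chunker = new ChunkerBuilder().build();

		// Chunk an in-memory byte array
		byte[] data = Files.readAllBytes(Path.of("example.bin"));
		for (Chunk chunk : chunker.chunk(data)) {
			System.out.println(chunk.getHexHash() + " - " + chunk.getData().length + " bytes");
		}

		// Chunk a path; directories are traversed as well
		for (Chunk chunk : chunker.chunk(Path.of("someDirectory"))) {
			System.out.println(chunk.getHexHash());
		}
	}
}
```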
- Requires at least Java 14
- Integrate FastCDC4J into your project. Download the jar from the release section.
- Create a chunker using `ChunkerBuilder`.
- Chunk files using the methods offered by `Chunker`.
Suppose you have a directory filled with lots of files that is frequently modified, and the results have to be uploaded to a server. However, you want to skip uploading data that was already uploaded in the past.
Hence you chunk your files and set up a local chunk file cache. If a chunk is already contained in the cache from a previous upload, its upload can be skipped.
```java
var buildPath = ...
var cachePath = ...

var chunker = new ChunkerBuilder().build();
var chunks = chunker.chunk(buildPath);

for (Chunk chunk : chunks) {
	var chunkPath = cachePath.resolve(chunk.getHexHash());

	if (!Files.exists(chunkPath)) {
		Files.write(chunkPath, chunk.getData());
		// Upload chunk ...
	}
}
```
Even if files in the build are modified or data is shifted around, chunks will likely be preserved, resulting in efficient data deduplication.
The chunker builder `ChunkerBuilder` offers highly customizable algorithms. The built-in chunkers offered are:
- FastCDC
- Nlfiedler Rust, a modified variant of FastCDC
- Fixed Size Chunking
It is also possible to add custom chunkers, either by implementing the interface `Chunker` or by implementing the simplified interface `IterativeStreamChunkerCore`.
A chunker can be set by using `setChunkerOption(ChunkerOption)`, `setChunkerCore(IterativeStreamChunkerCore)` and `setChunker(Chunker)`.
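As a rough sketch of how the chunker could be selected (the `FAST_CDC` constant appears in the defaults listed below; the custom implementation `myChunker` is hypothetical):

```java
// Select a built-in chunker by its enum constant
ChunkerBuilder builder = new ChunkerBuilder();
builder.setChunkerOption(ChunkerOption.FAST_CDC);
Chunker chunker = builder.build();

// Or plug in a custom implementation of the Chunker interface
Chunker myChunker = ...; // hypothetical custom implementation
ChunkerBuilder customBuilder = new ChunkerBuilder();
customBuilder.setChunker(myChunker);
Chunker customChunker = customBuilder.build();
```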
The chunkers strive for an expected chunk size, settable by `setExpectedChunkSize(int)`.
Most of the chunkers internally use a hash table as a source of predicted noise to steer the algorithm. A custom table can be provided by `setHashTable(long[])`.
Alternatively, `setHashTableOption(HashTableOption)` can be used to choose from predefined tables (a short sketch follows the list):
- RTPal
- Nlfiedler Rust
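For example, a small sketch that combines an expected chunk size with a predefined hash table (`RTPAL` is taken from the defaults below; the exact constant name for the Nlfiedler Rust table may differ):

```java
// Aim for chunks of roughly 16 KiB and steer the algorithm with the RTPal noise table
ChunkerBuilder builder = new ChunkerBuilder();
builder.setExpectedChunkSize(16 * 1024);
builder.setHashTableOption(HashTableOption.RTPAL);
Chunker chunker = builder.build();
```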
After a chunk has been read, a hash is generated based on its content. The algorithm used for this can be set by `setHashMethod(String)`; it has to be supported and accepted by `java.security.MessageDigest`.
Finally, a chunker using the selected properties can be created using `build()`.
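For instance, a sketch switching the chunk hash to SHA-256 (any algorithm name accepted by `java.security.MessageDigest` works):

```java
// Hash each chunk with SHA-256 instead of the default SHA-1, then create the chunker
ChunkerBuilder builder = new ChunkerBuilder();
builder.setHashMethod("SHA-256");
Chunker chunker = builder.build();
```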
The default configuration of the builder is:
- Chunker: `ChunkerOption#FAST_CDC`
- Expected chunk size: `8 * 1024`
- Hash table: `HashTableOption#RTPAL`
- Hash method: `SHA-1`
The methods `fastCdc()`, `nlFiedlerRust()` and `fsc()` can be used to get a configuration that uses the given algorithms as originally proposed.
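A sketch of how one of these presets might be used, assuming they are configuration methods called on the builder before `build()`:

```java
// Configure the builder to use FastCDC exactly as originally proposed, then build
ChunkerBuilder builder = new ChunkerBuilder();
builder.fastCdc();
Chunker originalFastCdc = builder.build();
```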