
ByteBuffer or InputStream support for VCDiffEncoder Dictionary #6

Open
Omkar-Shetkar opened this issue Jan 17, 2020 · 3 comments

@Omkar-Shetkar

A VCDiffEncoder can be created using VCDiffEncoderBuilder. Here, the source content must be passed to withDictionary() as a byte[]:


    public synchronized VCDiffEncoderBuilder withDictionary(byte[] dictionary) {
        this.dictionary = dictionary;
        return this;
    }

Source content can be larger than 1GB. For better performance with large files, I think the dictionary could be accepted as either a ByteBuffer or an InputStream. This seems like the most common use case when using this library for large files.
Will you please consider this change in your next release?
Thanks.
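
For context, a minimal, self-contained sketch of what encoding looks like with the current API. The builder(), buildSimple(), and encode(...) entry points and the com.davidehrmann.vcdiff package are assumptions based on the project README, and the file names are placeholders:

    import com.davidehrmann.vcdiff.VCDiffEncoder;
    import com.davidehrmann.vcdiff.VCDiffEncoderBuilder;

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class EncodeWithByteArrayDictionary {
        public static void main(String[] args) throws IOException {
            // The whole dictionary is materialized on the heap before encoding
            // starts -- for a >1GB source file, that's >1GB of byte[].
            byte[] dictionary = Files.readAllBytes(Paths.get("source-v1.bin"));
            byte[] target = Files.readAllBytes(Paths.get("source-v2.bin"));

            VCDiffEncoder<OutputStream> encoder = VCDiffEncoderBuilder.builder()
                    .withDictionary(dictionary)
                    .buildSimple();

            ByteArrayOutputStream delta = new ByteArrayOutputStream();
            encoder.encode(target, delta);
            System.out.println("delta size: " + delta.size() + " bytes");
        }
    }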

@ehrmann
Owner

ehrmann commented Jan 18, 2020

A ByteBuffer would be pretty straightforward since the backing code already uses one. Assuming it's a MappedByteBuffer, you'd see a performance hit while encoding because the dictionary isn't in memory.

Are you just looking for better initialization performance? The dictionary needs to be loaded into memory as soon as the encoder is created/used, so the only benefit to a ByteBuffer or InputStream would be pipelining one of the load steps. The byte[] that's passed in is used internally.
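
From the caller's side, a ByteBuffer dictionary would most likely mean memory-mapping the file with standard NIO, as in this sketch; the withDictionary(ByteBuffer) overload in the comment is hypothetical and doesn't exist today:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class MappedDictionarySketch {
        public static void main(String[] args) throws IOException {
            try (FileChannel ch = FileChannel.open(Paths.get("dictionary.bin"),
                    StandardOpenOption.READ)) {
                // Pages are faulted in lazily rather than copied to the heap --
                // which is exactly why encoding against a MappedByteBuffer
                // would take the performance hit described above.
                MappedByteBuffer dictionary =
                        ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());

                // Hypothetical overload -- vcdiff-java only accepts byte[] today:
                // VCDiffEncoderBuilder.builder().withDictionary(dictionary). ...
            }
        }
    }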

@Omkar-Shetkar
Author

If I understood correctly, for encoding we need to have the whole dictionary content in memory. If so, I think for large files and high-traffic applications this could cause out-of-memory issues. I was wondering whether there is any way to provide the dictionary in chunks, similar to VCDiffStreamingDecoder.decodeChunk() during decoding.
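
For reference, the chunked decoding pattern mentioned here looks roughly like the sketch below. The buildStreaming(), startDecoding(), decodeChunk(), and finishDecoding() signatures are as I recall them from the README, so treat them as assumptions; note that even here the dictionary is a single byte[], and only the delta arrives in chunks:

    import com.davidehrmann.vcdiff.VCDiffDecoderBuilder;
    import com.davidehrmann.vcdiff.VCDiffStreamingDecoder;

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class ChunkedDecodeSketch {
        static void decode(byte[] dictionary, InputStream delta, OutputStream out)
                throws IOException {
            VCDiffStreamingDecoder decoder =
                    VCDiffDecoderBuilder.builder().buildStreaming();
            decoder.startDecoding(dictionary);
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = delta.read(buf)) != -1) {
                // Each delta chunk may still reference any offset in the
                // (fully loaded) dictionary.
                decoder.decodeChunk(buf, 0, n, out);
            }
            decoder.finishDecoding();
        }
    }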

@ehrmann
Owner

ehrmann commented Jan 19, 2020

for encoding we need to have the whole dictionary content in memory

More or less (ignoring memory-mapped files and swapping). The next chunk of data could reference any part of the dictionary, and you'd have to check it.

I was wondering whether there is any way to provide the dictionary in chunks

Both encoding and decoding can work on chunks of data because each chunk can be compressed by looking at the dictionary (or the previous output), and the output can be written out as a chunk. This doesn't work for dictionaries because any part of the dictionary can be referenced during encoding and decoding.
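
To make the random-access point concrete: in VCDIFF (RFC 3284), a COPY instruction can address any offset in the dictionary, so applying one is essentially the following (a toy illustration, not the library's internals):

    public class CopySketch {
        // A COPY whose address falls in the dictionary can start at ANY addr,
        // so the whole dictionary has to be addressable while decoding.
        static void applyCopy(byte[] dictionary, int addr, int len,
                              byte[] target, int targetPos) {
            System.arraycopy(dictionary, addr, target, targetPos, len);
        }
    }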

for large files and high-traffic applications this could cause out-of-memory issues

You can share the same dictionary byte[] between requests (vcdiff-java doesn't modify it), but yes, you could see issues. Using a mapped ByteBuffer would also cause a lot of IO. The dictionary gets turned into a BlockHash for fast lookups during encoding. This also has memory overhead.
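
Sharing in practice could look like this sketch: load the dictionary once at startup and reuse the same byte[] for every request-scoped encoder (builder names assumed, as above):

    import com.davidehrmann.vcdiff.VCDiffEncoder;
    import com.davidehrmann.vcdiff.VCDiffEncoderBuilder;

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class SharedDictionary {
        // One copy of the dictionary for the whole process.
        private static final byte[] DICTIONARY = load();

        private static byte[] load() {
            try {
                return Files.readAllBytes(Paths.get("dictionary.bin"));
            } catch (IOException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        // Safe to share because vcdiff-java doesn't modify the array; the
        // BlockHash built from it still adds its own memory overhead.
        static VCDiffEncoder<OutputStream> newEncoder() {
            return VCDiffEncoderBuilder.builder()
                    .withDictionary(DICTIONARY)
                    .buildSimple();
        }
    }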

Adding support for a ByteBuffer dictionary is pretty straightforward, but I'm not sure it's what you really want. It sounds like a 1GB dictionary is too big for the environment you're running in. You might want to gzip the compressed data; vcdiff doesn't do any Huffman coding. This is a little like using xz: depending on the settings, it's easy for it to use more memory than your system has.
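
The gzip suggestion is just standard JDK plumbing around whatever stream the delta is written to, e.g.:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.zip.GZIPOutputStream;

    public class GzipDeltaSketch {
        // Wrap the delta output in gzip to add the entropy coding vcdiff skips.
        // Remember to close() the stream so the gzip trailer gets written.
        static OutputStream deltaOutput(String path) throws IOException {
            return new GZIPOutputStream(new FileOutputStream(path));
        }
    }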
