FemtoZipCompressionModel.encodeSubstring encodes a four-byte offset #5
Hi Dave! Thanks for the question. It is encoding the 2-byte offset as 4 symbols, one per nibble (half byte). The encoder creates a Huffman coding table for each nibble, and it turns out that scheme is more efficient (at least as experimentally determined by me) than encoding it as 2 single-byte values, each with its own Huffman table. I believe the reason for this is that most of the length values fit in a nibble, so separating out that high nibble, which is often zero, is high value. So in short, the values are not written out directly; they are written as symbols compressed with Huffman coding, and you have to make a decision about how to map an integer into a Huffman encoding (do you have 1 Huffman table for the entire 4 bytes? Break it into 2 bytes and have 2 tables, 1 for each part? Break it into 4 parts and have 4 tables?). Please let me know if that doesn't make sense and I can explain further.
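The nibble split described above can be sketched like this (a hypothetical illustration of the idea, not the actual FemtoZip code; the `toNibbles` helper name is made up). Each of the four symbols would then be looked up in its own per-position Huffman table:

```java
public class NibbleSplit {
    // Split a 2-byte offset (0..65535) into four 4-bit symbols,
    // lowest nibble first. Each position gets its own Huffman table,
    // so the often-zero high nibbles compress to very few bits.
    static int[] toNibbles(int offset) {
        return new int[] {
            offset & 0xF,
            (offset >> 4) & 0xF,
            (offset >> 8) & 0xF,
            (offset >> 12) & 0xF
        };
    }

    public static void main(String[] args) {
        for (int nibble : toNibbles(0x0123)) {
            System.out.println(nibble);
        }
    }
}
```

For a small offset like 0x0123, the two high nibbles are both zero, which is exactly the skewed distribution that makes separate per-nibble tables pay off.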
I think that mostly makes sense, but I'm still a little surprised that Huffman encoding beats a varint. Have you gotten a chance to look at some of my pull requests? I think I had a few performance tweaks, a build fix, and a bug fix for the Java version.
I assume when you say varint you are referring to an ad hoc encoding that tries to encode a small number as a single byte, and then uses a sentinel value to indicate that the subsequent bytes represent the full number? Is there a specific implementation you have in mind that I could contemplate? One thing to keep in mind is that the numbers being encoded are often very small (like < 16). So I guess the basic question is: what is the best variable-length encoding for numbers that are often very, very small but can range up to 65,535? Huffman encoding should always beat it, since it's a similar encoding but optimized for the data set at hand. In fact, assuming you have suitable sample data to generate the model, the Huffman coding is provably optimal for a symbol-to-symbol encoding (vs., say, arithmetic coding, which I found empirically to be not much better than Huffman coding, and which comes with significant performance cost and lots of IP FUD). Perhaps what's more surprising is that I found breaking the short into 4 encoded nibbles was best.

Regarding your pull requests: all the changes look good to me, although I was wondering about the introduction of the Match object (which I commented on). I'm inclined to believe you, but for such a change I want to know there was something empirical backing it.
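For comparison, the kind of varint being discussed can be sketched as a LEB128-style encoding (a hypothetical example for illustration, not something from FemtoZip). Note that even the smallest values cost a full byte here, whereas a Huffman table tuned to the data can spend fewer than 4 bits on the most common nibble values:

```java
import java.io.ByteArrayOutputStream;

public class Varint {
    // LEB128-style varint: 7 data bits per byte, high bit set on all
    // bytes except the last. Values < 128 take 1 byte, values < 16384
    // take 2 bytes, and so on.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // more bytes follow
            value >>>= 7;
        }
        out.write(value); // final byte, high bit clear
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encode(5).length);   // tiny value: 1 byte
        System.out.println(encode(300).length); // larger value: 2 bytes
    }
}
```

The varint's floor of 8 bits per value is what a per-nibble Huffman code gets under: with most offsets below 16, the common symbols can cost only a few bits each.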
Why does this method encode a four-byte offset when offset is only two bytes?