question: FSE doesn't always compress better than huf - expected? #103
This can happen. The difference cannot be "huge", as it is essentially concentrated in the header format: after that, both implementations should provide a similar compression ratio. In this case, I see a file with a very poor compression ratio.
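A rough back-of-the-envelope check of the header explanation, using the byte counts reported below. The 32 KB block size here is purely an assumption for illustration; the actual block size used by the fse CLI may differ:

```python
# Sizes from the benchmark run in this thread.
original = 9_108_363_519
fse_out = 8_992_064_047   # ./fse -e output size
huf_out = 8_943_423_537   # ./fse -h output size

block = 32 * 1024         # ASSUMED block size, for illustration only
blocks = original // block + 1
extra = fse_out - huf_out # total size difference between FSE and Huffman

# If the gap is mostly header overhead, this is the per-block cost.
print(f"{extra} extra bytes over {blocks} blocks "
      f"~= {extra / blocks:.0f} bytes/block")
```

Under that assumed block size, the entire gap between the two outputs amounts to a modest per-block constant, which is consistent with a header-format difference rather than a coding-efficiency difference.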
Thanks again for the explanation, that makes sense! I didn't realize the FSE header was nontrivial.
Because, ignoring header and code-length limitations, a Huffman code is already optimal when all symbol probabilities are powers of two. There is simply no room for arithmetic coding or FSE to improve on it in such a case. The restriction to powers of two comes from the fact that Huffman assigns code words of integral bit lengths to symbols: it cannot encode a symbol with 1.74 bits on average. Arithmetic coding and FSE can.
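The power-of-two point above can be checked numerically: when every probability is a power of two, the Huffman code's expected length equals the Shannon entropy exactly, and otherwise it is strictly larger. A minimal sketch (my own helper functions, not part of the FSE codebase):

```python
import heapq
import math

def huffman_avg_length(probs):
    """Build a Huffman code and return its expected length in bits/symbol."""
    # Heap entries: (probability, unique tiebreaker, {symbol: depth}).
    heap = [(p, i, {i: 0}) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in d1.items()}
        merged.update({s: d + 1 for s, d in d2.items()})
        heapq.heappush(heap, (p1 + p2, tie, merged))
        tie += 1
    depths = heap[0][2]
    return sum(probs[s] * d for s, d in depths.items())

def entropy(probs):
    """Shannon entropy in bits/symbol: the lower bound for any entropy coder."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Power-of-two probabilities: Huffman matches the entropy exactly (1.75 bits).
p1 = [0.5, 0.25, 0.125, 0.125]
print(entropy(p1), huffman_avg_length(p1))

# Non-power-of-two probabilities: Huffman pays for integral code lengths;
# arithmetic coding or FSE can get closer to the entropy.
p2 = [0.6, 0.3, 0.1]
print(entropy(p2), huffman_avg_length(p2))
```

For `p2` the entropy is about 1.295 bits/symbol but Huffman cannot do better than 1.4 bits/symbol, which is exactly the fractional-bit gap that FSE closes.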
I can't share the data, but in short, it's 9108363519 bytes (~9GB) of almost-uncompressible data (IIRC it's the 9GB tail of a larger already-compressed stream).
```
% ./fse -e ./almost-uncompressable
Compressed 9108363519 bytes into 8992064047 bytes ==> 98.72%
% ./fse -h ./almost-uncompressable
Compressed 9108363519 bytes into 8943423537 bytes ==> 98.19%
% ./fse -z ./almost-uncompressable
Compressed 9108363519 bytes into 8944678105 bytes ==> 98.20%
```
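The percentages above follow directly from the raw byte counts. (Mapping the flags to coder names is my reading of the transcript: `-e` is FSE, while `-h` and `-z` are the two Huffman implementations the question refers to.)

```python
original = 9_108_363_519
outputs = {
    "fse -e": 8_992_064_047,  # FSE
    "fse -h": 8_943_423_537,  # Huffman variant
    "fse -z": 8_944_678_105,  # Huffman variant
}
for mode, size in outputs.items():
    print(f"{mode}: {100 * size / original:.2f}%")
```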
Granted, I don't know the intimate details of FSE, and this is a near-pathological case, but I'd have expected the two Huffman implementations to fare rather worse than FSE on this almost-but-not-quite-uniform distribution of data.
Am I wrong?
Cheers.