Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

question: FSE doesn't always compress better than huf - expected? #103

Open
adamdmoss opened this issue Jul 17, 2020 · 4 comments
Open

question: FSE doesn't always compress better than huf - expected? #103

adamdmoss opened this issue Jul 17, 2020 · 4 comments
Labels

Comments

@adamdmoss
Copy link

I can't share the data, but in short, it's 9108363519 bytes (~9GB) of almost-uncompressible data (IIRC it's the 9GB tail of a larger already-compressed stream).

% ./fse -e ./almost-uncompressable
Compressed 9108363519 bytes into 8992064047 bytes ==> 98.72%
% ./fse -h ./almost-uncompressable
Compressed 9108363519 bytes into 8943423537 bytes ==> 98.19%
% ./fse -z ./almost-uncompressable
Compressed 9108363519 bytes into 8944678105 bytes ==> 98.20%

Granted that I don't know the intimate details of FSE and this is a near-pathological case, but I'd have expected the two huffman implementations to fare rather worse than FSE on these almost-but-not-quite-uniform distributions of data.

Am I wrong?

Cheers.

@adamdmoss adamdmoss changed the title question: FSE doesn't always compress better - expected? question: FSE doesn't always compress better than huf - expected? Jul 17, 2020
@Cyan4973
Copy link
Owner

This can happen.

The difference cannot be "huge", as it's essentially concentrated into the header format : fse needs more information than huffman, so its header is larger.

After that, both implementation should provide similar compression ratio,
except when one or more symbols start to feature a "dominant" probability (>10 %),
in which case, fse starts to produce some benefits,
larger as the most dominant symbols takes a larger share
(though huffman nullify this advantage if dominant symbols represent exactly 25% or 50%).

In this case, I see a file with very poor compression ratio.
It follows that none of the symbols (byte values) is dominant.
In which case, huffman and fse will produce similar compression ratios,
with fse being slightly disadvantage by the heavier header.

@adamdmoss
Copy link
Author

Thanks again for the explanation - makes sense! Didn't realize FSE header was nontrivial.

@ColdCodeCool
Copy link

This can happen.

The difference cannot be "huge", as it's essentially concentrated into the header format : fse needs more information than huffman, so its header is larger.

After that, both implementation should provide similar compression ratio, except when one or more symbols start to feature a "dominant" probability (>10 %), in which case, fse starts to produce some benefits, larger as the most dominant symbols takes a larger share (though huffman nullify this advantage if dominant symbols represent exactly 25% or 50%).

In this case, I see a file with very poor compression ratio. It follows that none of the symbols (byte values) is dominant. In which case, huffman and fse will produce similar compression ratios, with fse being slightly disadvantage by the heavier header.

"(though huffman nullify this advantage if dominant symbols represent exactly 25% or 50%)" do you mean dominant symbols appear with exact 25% or 50%, huffman and FSE issue the same compression rate? but if it is other percent like 30% or 40%, then FSE shall be better than huffman? why is that? Thank you for advance!

@mwgamera
Copy link

why is that?

Because, ignoring header and code length limitations, Huffman code is already optimal when all symbol probabilities are powers of two. There is simply no room for arithmetic coding or FSE to improve upon it in such case. The limitation to powers of two comes from the fact that it assigns code words of integral bit lengths to symbols. It cannot have a symbol encoded with 1.74 bits on average. Arithmetic coding and FSE can.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants