Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enriching warning and handling when compression and decompression is done in the different numcodecs version #654

Open
ehgus opened this issue Nov 21, 2024 · 4 comments

Comments

@ehgus
Copy link
Contributor

ehgus commented Nov 21, 2024

I noticed that data compressed with Zstd with numcodecs=0.14.0 cannot decompressed with numcodecs=0.12.1 even though they look compatible except for the checksum. I successfully decompress the data by manually erasing the checksum argument.

The current error message does not tell that the error is due to the use of different version of numcodecs:

  File "numcodecs\\zstd.pyx", line 210, in numcodecs.zstd.Zstd.__init__
TypeError: __init__() got an unexpected keyword argument 'checksum'

I want them to guarantee compatibility for using lower versions or show a more informative error message.
My idea is to ignore such additional arguments with a warning and give a clue to the error message when decompression fails.
Thank you in advance for your feedback.

@jakirkham
Copy link
Member

Looks like this was introduced in PR: #519

@normanrz do you have thoughts on this issue?

@normanrz
Copy link
Contributor

If I understand correctly, 0.14 adds the checksum metadata which means that 0.12 cannot read it anymore.

I think it could be acceptable that data compressed with a newer version would not be compatible with an older version. Since, we are still in 0.x releases, minor versions can be breaking.

However, we could also write out the checksum metadata only if it deviates from the previous default. Are there other cases like this in numcodecs?

@dstansby
Copy link
Contributor

So I think there's a couple of issues to disentangle here:

  1. API backwards compatibility in numcodecs

It looks like your error message is about trying to using the numcodecs 0.12 API the same way as the numcodecs 0.14 API. Since we're < version 1.0, we still allow ourselves to make breaking changes in this API every minor version. I think this is what your orignal issue is about?

  1. Decompression backwards compatibility
    Again, since we're < version 1.0, we don't guarentee that data compressed with a new version of numcodecs will be able to be decompressed with an older version.

  2. Decompression fowards compatibility
    Again (again 😄), since we're < version 1.0, we don't have to guarentee that data compressed with an old version of numcodecs will be able to be decompressed with a newer version. In practice I think we really should and have to be forwards compatible, because so much data in the wild depends on numcodecs.

So I think that unfortunately your error is not something we can fix or improve (because we cannot go back to modify older releases), and unfortunately a consequence of numcodecs not having a stable release yet.

Note that the points above are my interpretation of our compatibility guarentees, so might not be what others agree on. We should agree and write these down in the docs. And numcodecs really needs a stable release, after which we're careful about breaking changes, since it's so widely used now.

@jakirkham
Copy link
Member

Thanks Norman and David for your feedback! 🙏

Am guessing the focus of the question is on reading data that was previously written consistently.

To that end, think Norman's question above is the one we should focus on here

However, we could also write out the checksum metadata only if it deviates from the previous default. Are there other cases like this in numcodecs?

It wouldn't surprise me if this issue had happened before. Though am not remembering one like this one atm

There have been issues between different libraries/languages before (like zlib/gzip compression: #87 & #169 ). This in turn motivated the following body of work to test reading/writing data between different implementations ( https://github.com/zarr-developers/zarr_implementations ), which in turn led to spec work

That said, think Norman is asking the right question. If we simply leave out the config flag when it is unneeded, it should remain readable with old versions. What are the tradeoffs of doing or not doing this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants