Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support pcodec v0.3 #639

Open
wants to merge 14 commits into
base: main
Choose a base branch
from
Open

Conversation

slevang
Copy link
Contributor

@slevang slevang commented Nov 14, 2024

Compatibility for the new delta_spec argument and modes in pcodec.ChunkConfig. Maintains backwards compatibility if only delta_encoding_order is passed. Also adds the paging_spec arg.

Closes #623

TODO:

  • Unit tests and/or doctests in docstrings
  • Tests pass locally
  • Docstrings and API docs for any new/modified user-facing classes and functions
  • Changes documented in docs/release.rst
  • Docs build locally
  • GitHub Actions CI passes
  • Test coverage to 100% (Codecov passes)

Copy link

codecov bot commented Nov 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.92%. Comparing base (f43449b) to head (94db6a8).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #639   +/-   ##
=======================================
  Coverage   99.92%   99.92%           
=======================================
  Files          62       62           
  Lines        2721     2750   +29     
=======================================
+ Hits         2719     2748   +29     
  Misses          2        2           
Files with missing lines Coverage Δ
numcodecs/pcodec.py 100.00% <100.00%> (ø)
numcodecs/tests/test_pcodec.py 100.00% <100.00%> (ø)

@slevang
Copy link
Contributor Author

slevang commented Nov 14, 2024

Maybe we should expand the numcodecs API to support future specs?

delta_spec: Literal["auto", "none", "try_consecutive", "try_lookback"] = "auto" # ignored if <v0.3
delta_encoding_order: Optional[int] = None # ignored if >=v0.3 and spec != try_consecutive?

Or accept actual ModeSpec/DeltaSpec objects, with sensible auto defaults?

cc @rabernat @mwlon

@rabernat
Copy link
Contributor

Maybe we should expand the numcodecs API to support future specs?

Could you say more about what specifically you mean here? I don't quite get the context for this comment.

@slevang
Copy link
Contributor Author

slevang commented Nov 14, 2024

@mwlon presumably understands the underlying changes much better than I do. But the gist is pcodec now takes a more flexible DeltaSpec object, rather than a simple int that routes to auto/consecutive modes.

This broke the numcodecs interface on the latest release, so I'm just trying to figure out the right path to support any future additions without breaking existing numcodecs users.

@mwlon
Copy link
Contributor

mwlon commented Nov 14, 2024

Can we just bump pcodec version to >=0.3,<0.4 so we don't need to support both code paths?

You're correct: I added a new type of delta encoding, now handled by delta spec. This should be harder to break API wise in the future.

@slevang
Copy link
Contributor Author

slevang commented Nov 14, 2024

Yeah that makes a lot more sense 👍

@rabernat
Copy link
Contributor

A key consideration here, and an important priority for numcodecs, is backwards compatibility. Ideally any data written by Zarr will be readable for a long time into the future. This means that breaking changes in the codec parameters, which would cause decoding of existing data to fail, should be avoided.

This could be relaxed if we had some sort of versioning system for codecs. But unfortunately we don't.

For any of the proposals above, I would ask these questions:

  • Can we still read old data written with earlier versions of the codec?
  • Is the change future-proof to additional future evolution of the codec parameters?

@slevang
Copy link
Contributor Author

slevang commented Nov 14, 2024

I believe these are purely API changes and feature additions, since the ChunkConfig is only involved in setting up the encoding step. Presumably everything that's needed for decoding is self contained.

Would be good for @mwlon to confirm though. And I do wonder if the new modes (e.g. DeltaSpec.try_lookback, ModeSpec.try_float_quant) can be decoded by older versions of pcodec, although that's much less important.

Is the change future-proof to additional future evolution of the codec parameters?

Seems all of the config API has evolved to these spec objects, so maybe now is a good time to make the numcodecs API match fully:

level: int = 8,
delta_spec: DeltaSpec | None = None,
mode_spec: ModeSpec | None = None,
paging_spec: PagingSpec | None = None,

and the user is responsible for passing through an explicit pcodec object if they want custom behavior.

@mwlon
Copy link
Contributor

mwlon commented Nov 14, 2024

Can we still read old data written with earlier versions of the codec?

Yes! Pcodec will always be able to decode older format versions.

And I think I agree with @slevang 's last comment: I'm in favor of passing the pcodec specs straight through to keep everything simple. Presumably we can convert non-spec kwargs into these (based on type) for numcodec's API backward compatibility. @rabernat do you agree?

@slevang
Copy link
Contributor Author

slevang commented Nov 14, 2024

Presumably we can convert non-spec kwargs into these (based on type) for numcodec's API backward compatibility.

It's entirely possible Ryan and I are the only ones actually using numcodecs.pcodec, and I've never used any options besides level. So API backwards compat is probably not super important here 🙂

@slevang
Copy link
Contributor Author

slevang commented Nov 14, 2024

and the user is responsible for passing through an explicit pcodec object if they want custom behavior

Ah nevermind I think, because numcodecs expects the codec config to be json-serializable which is a requirement of zarr. I guess we'll have to just pile on the kwargs to support additional features.

@rabernat
Copy link
Contributor

It's entirely possible Ryan and I are the only ones actually using numcodecs.pcodec,

Definitely not the case. We have customers with petabytes of Zarr data (now a little bit less! 🙌 ) encoded with pcodec.

@slevang
Copy link
Contributor Author

slevang commented Nov 14, 2024

Ok new approach, keep the string style args but also support delta_spec and paging_spec. It would be possible to add support for the try mode specs now, but wasn't straightforward to integrate with the existing tests since these are dtype specific. So I skipped it.

pyproject.toml Outdated Show resolved Hide resolved
numcodecs/pcodec.py Outdated Show resolved Hide resolved
numcodecs/pcodec.py Outdated Show resolved Hide resolved
numcodecs/pcodec.py Outdated Show resolved Hide resolved
Copy link
Contributor

@mwlon mwlon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, and I think these configurations will be pretty robust to future changes.

@slevang slevang mentioned this pull request Nov 18, 2024
@mwlon
Copy link
Contributor

mwlon commented Nov 28, 2024

@slevang @rabernat should this be merged?

@rabernat
Copy link
Contributor

Thanks for the ping @mwlon!

This looks good, but I'm a little concerned by the drop in coverage. We strive for 100% coverage in Numcodecs. I realize it's tedious to test all of these different error states. @slevang do you feel comfortable with the uncovered blocks?

@slevang
Copy link
Contributor Author

slevang commented Nov 28, 2024

Pretty sure numcodecs/pcodec.py and this patch are 100% covered. The previous coverage failure was on an unrelated part of the repo. Now the coverage action seems to have stalled completely 🤷

Copy link
Contributor

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a bunch for this PR, and sorry for taking so long to review.

This looks great overall. I've left some questions inline, and in addition to those please could you add a changelog that outlines:

  • That pcodec 0.3 is now supported
  • What this means for forward and backward comaptibility with data compressed using older versions of pcodec and numcodecs
  • That the order of arguments in PCodec.__init__ has changed

numcodecs/pcodec.py Show resolved Hide resolved
numcodecs/pcodec.py Outdated Show resolved Hide resolved
numcodecs/tests/test_pcodec.py Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Compatibility with pcodec 0.3
4 participants