Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding the python bitinfo to numcodecs #257

Open
thodson-usgs opened this issue Jan 30, 2024 · 15 comments
Open

Adding the python bitinfo to numcodecs #257

thodson-usgs opened this issue Jan 30, 2024 · 15 comments

Comments

@thodson-usgs
Copy link

thodson-usgs commented Jan 30, 2024

I've been looking into modifying the numcodecs bitround codec to accept a user defined function to determine the number of bits to round. namely xbitinfo.bitinformation. This would streamline the process of chunk-wise bitrounding to something like

from numcodecs import Blosc, BitRound
compressor = Blosc(cname="zstd", clevel=3)

filters = [BitRound(custom_function())]

encoding = {"precip": {"compressor": compressor, "filters": filters}}
ds.to_zarr(<file_name>, encoding=encoding)

where custom_function would essentially wrap get_bitinformation and get_keepbits. All in all, that may not offer much over your current chunk-wise approach, except enabling us to use bitinformation in pangeo-forge compression and rechunking pipelines.

Alternatively, we could add some stripped down bitinfo implementation to numcodec and avoid the need for custom_function(). I'd be happy to help with that, but I don't want to advance that without permission from the xbitinfo team. Both projects have MIT license.

@milankl
Copy link
Collaborator

milankl commented Jan 30, 2024

Sounds great to me, I let Hauke comment though.

@thodson-usgs thodson-usgs changed the title Adding the python bitinfo to numcodec Adding the python bitinfo to numcodecs Jan 30, 2024
@observingClouds
Copy link
Owner

Hi @thodson-usgs,
Thanks for reaching out! I like your proposal. The challenges I can see and which also emerge from your issue at numcodecs is that we would like to have as few implementations as possible to keep the algorithms maintainable as we develop them further. I therefore favour your custom_function suggestion over a new implementation into numcodecs. It should also be noted also that the custom_function needs only to be applied in the writing process and not in the reading process. The filters argument in the output file could therefore remain the same. In this case the user would not need to have xbitinfo or any other dependencies installed that provide custom_function.

@thodson-usgs
Copy link
Author

thodson-usgs commented Jan 30, 2024

Exactly right, of course. I'll proceed with custom_function for now and see where we get.

Along that tack, a possible outcome might be to implement such a function in xbitinfo, or at the very least add an example construction to numcodecs doc.

filters = [BitRound(xbitinfo.helper_function)]

@thodson-usgs
Copy link
Author

zarr-developers/numcodecs#503 (comment)
Unfortunately, I'm not sure this will work.

I'm by no means the expert, but as I look through the respective code bases, a reimplementation might be the simplest option, and could refer users to xbitinfo for the lastest and greatest version. Even if passing custom_function did work, xbitinfo's whole implementation is really built to leverage xarray. In general, I like that design choice. However, for this purpose, I'm finding myself trying to unravel some of that into a simpler function that operates on arrays.
I didn't double check, but the only dependency might be numpy.

@thodson-usgs
Copy link
Author

thodson-usgs commented Jan 31, 2024

...true, this might reduce direct traffic to xbitinfo, but I also think it's a valuable step into mainstream adoption.

...or numcodecs might accept xbitinfo as an optional dependency, allowing us to import xbitinfo within the bitround codec. I haven't advanced that idea yet.

@thodson-usgs
Copy link
Author

@martindurant, I'm bringing you into this discussion with the xbitinfo team before continuing on the numcodecs side.

The two options I'd like both camps to mull over are:

  1. Reimplementing a basic bitinfo algorithm in numcodecs. I think the only dependency is numpy.
  2. Having xbitinfo as an optional dependency in the bitround codec.

@thodson-usgs
Copy link
Author

thodson-usgs commented Feb 2, 2024

Rather than including this in bitround, lets create a bitinfo codec that calls bitround.

Here's my draft proposal:
https://github.com/thodson-usgs/numcodecs/blob/adaptive-bitrounding/numcodecs/bitinfo.py

(ah, still a couple bugs here, but you'll get the gist)

@thodson-usgs
Copy link
Author

Okay works now. Going for multithread.

@martindurant
Copy link

Going for multithread.

Zarr is commonly used with dask, so it should be enough to have your algorithms not hold the GIL

@observingClouds
Copy link
Owner

Thanks @thodson-usgs for all your work! If I understand it correctly you copy/reimplement part of the xbitinfo package to add to a numcodec algorithm. Can't we find a different solution, where e.g. the xbitinfo package provides an entry point to numcodecs? This will require xbitinfo to get installed to provide the codec but as mentioned above this should only be necessary for users who write data and not for those that read it and we make it easier for new features to be implemented across the board.

@thodson-usgs
Copy link
Author

thodson-usgs commented Feb 2, 2024

I sense that could get complicated, but I defer to the respective maintainers for guidance.

Currently,

import xarray as xr
ds = xr.tutorial.open_dataset("air_temperature")

from numcodecs import Blosc, BitInfo
compressor = Blosc(cname="zstd", clevel=3)
filters = [BitInfo(info_level=0.99)]

encoding = {"air": {"compressor": compressor, "filters": filters}}

ds.to_zarr('xbit.zarr', mode="w", encoding=encoding)

@thodson-usgs
Copy link
Author

thodson-usgs commented Feb 3, 2024

@observingClouds,
I'm also new to numcodecs, but I think it's designed to prohibit our initial proposal (probably a sign of a good codec).
But you too can join the ranks of zarr-devs, if you can fix this or any other typo :)

def _cdf_from_info_per_bit(info_per_bit):
    """Convert info_per_bit to cumulative distribution function"""
    # I suspect something is wrong with tol
    #tol = info_per_bit[-4:].max() * 1.5
    #info_per_bit[info_per_bit < tol] = 0
    cdf = info_per_bit.cumsum()
    return cdf / cdf[-1]

I think the first objective is to write a good codec. Once that's done, we can assess how and to what extent this gets rolled back into xbitinfo.

@thodson-usgs
Copy link
Author

Maybe register_codec is the route to adding the codec to xbitinfo.
For example, https://github.com/cgohlke/tifffile/blob/master/tifffile/numcodecs.py

@observingClouds
Copy link
Owner

observingClouds commented Feb 5, 2024

That looks indeed promising. Thanks for the pointer! I actually have played with an external BitRounding codec before it got implemented directly in numcodecs. This should be relatively straight forward to add to xbitinfo.

@thodson-usgs
Copy link
Author

thodson-usgs commented Feb 5, 2024

I take that back, register_codec might be the wrong approach here. I believe the plan is:

  1. implement a numpy-based codec in numcodec (basically done)
  2. that codec could be migrated to xbitinfo such that the bitinfo codec imports xbitinfo, but we can assess the practicality of that after the codec is setup.

Please let me know if I've misjudged the necessity of a numpy-based codec, otherwise we can jump to 2 and import the dask-xarray version in numcodecs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants