
packing & compression #28

Open
mo-marqh opened this issue May 29, 2024 · 6 comments

mo-marqh commented May 29, 2024

Investigate opportunities and costs of lossy packing and lossless compression of data.

  • Packing: reducing the precision of the data to quantity-specific levels of precision
    • lossy - by definition
  • Compression: algorithmically processing the data to store a compressed representation
    • lossless - by expectation

This ticket targets exploring the capabilities of XIOS + netCDF + HDF5 to pack and compress data.

  • Packing Techniques:
    • type coercion to integer with scale_factor & add_offset
    • quantization
  • Compression Algorithms
    • deflate compression with gzip / zlib
    • ZStandard compression with zstd
  • Parallelisation
    • client
    • server (needs a fairly recent version of parallel netCDF+HDF)
@mo-marqh mo-marqh self-assigned this May 29, 2024
mo-marqh commented:

XIOS & netCDF support lossless gzip compression.

However, compressed parallel write is only available in more recent netCDF + HDF5 stacks, so some older platforms with old versions of netCDF will not be able to provide gzip compression and parallel writing at the same time.

Platforms using netCDF 4.8.x and above can support parallel compression.
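As a sketch of how this could look in an XIOS configuration (the `compression_level` attribute appears in the XIOS reference documentation; the file and field names here are purely illustrative):

```xml
<!-- Illustrative iodef.xml fragment: request gzip (deflate) output.
     compression_level ranges from 0 (off) to 9; writing compressed data
     in parallel requires a sufficiently recent netCDF + HDF5 stack. -->
<file id="output" name="packed_output" output_freq="1d" compression_level="2">
  <field field_ref="temperature" />
</file>
```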

mo-marqh commented:

Chunking of HDF5 data
https://portal.hdfgroup.org/documentation/hdf5-docs/advanced_topics/chunking_in_hdf5.html
is important to consider alongside compression: compression algorithms run on individual chunks of data, so how data is partitioned within a file has a significant influence on both the computational cost and the storage reduction achieved.
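The chunk-size effect can be demonstrated with a stdlib-only sketch (illustrative, not HDF5 itself): compressing the same bytes in independent chunks of different sizes yields different totals, because each chunk carries its own stream overhead and smaller chunks give the compressor less context.

```python
# Per-chunk compression: chunk layout changes the bytes stored.
import struct
import zlib

# A smooth synthetic field, highly compressible when neighbouring
# values stay together in one chunk.
data = [i * 0.001 for i in range(10_000)]
raw = struct.pack(f"{len(data)}d", *data)

def chunked_size(raw_bytes, chunk_bytes):
    """Total size after compressing each chunk independently."""
    chunks = [raw_bytes[i:i + chunk_bytes]
              for i in range(0, len(raw_bytes), chunk_bytes)]
    return sum(len(zlib.compress(c)) for c in chunks)

# Same data, different chunkings -> different compressed totals.
for chunk_bytes in (512, 8192, len(raw)):
    print(chunk_bytes, chunked_size(raw, chunk_bytes))
```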

mo-marqh commented:

There is a demonstration programme using scale_factor & add_offset to pack data to a defined precision using XIOS:
https://github.com/MetOffice/tcd-XIOS-demonstration/tree/main/xios_examples/packing_scale_offset

Deciding on the precision control values is a significant effort, requiring both computer science and atmospheric science input in order to obtain a computationally and scientifically valid packing.

Mistakes can lead to numerical errors, which are problematic within downstream libraries.
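The CF-style scale_factor / add_offset scheme can be sketched as below (illustrative control values; the real demonstration lives in the linked XIOS example). The choice of `scale_factor` bounds the quantisation error at `scale_factor / 2` per value, while the pair together must keep every packed integer inside the target integer type — which is exactly why choosing them needs both computational and scientific input.

```python
# CF-style packing: p = round((v - add_offset) / scale_factor),
# recovered as v ~ p * scale_factor + add_offset.
def pack(values, scale_factor, add_offset):
    return [round((v - add_offset) / scale_factor) for v in values]

def unpack(packed, scale_factor, add_offset):
    return [p * scale_factor + add_offset for p in packed]

temps = [271.35, 285.02, 299.87]   # Kelvin, illustrative
scale, offset = 0.01, 273.15       # illustrative precision controls
packed = pack(temps, scale, offset)
restored = unpack(packed, scale, offset)

# Quantisation error is bounded by half the scale factor.
max_err = max(abs(a - b) for a, b in zip(temps, restored))
print(packed, max_err)
```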

mo-marqh commented:

#27 explores an issue with handling missing data, which is unresolved at this time.

mo-marqh commented:

https://www.researchgate.net/publication/365006139_NetCDF_Compression_Improvements
https://docs.unidata.ucar.edu/netcdf-c/current/md__media_psf_Home_Desktop_netcdf_releases_v4_9_2_release_netcdf_c_docs_quantize.html
https://www.hdfgroup.org/wp-content/uploads/2021/10/HDF5-Users-Workshop-2021-Additional-Compression-Methods-for-NetCDF.pdf

NetCDF quantization is an interesting approach to packing, with subtle implementation differences compared to the more traditional type-changing approach.

Work is required to explore how this facility could interact with XIOS.
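The idea behind quantization can be sketched in stdlib Python: zero the trailing mantissa bits of a float32 so the value stays a float of the same type but its bytes compress far better. This sketch truncates rather than rounds, so it only illustrates the principle — netCDF's actual modes (BitGroom, BitRound, GranularBitRound) differ in detail.

```python
# BitRound-like quantization sketch: keep `keep_bits` mantissa bits of a
# float32 (23-bit mantissa) and zero the rest. Truncation only.
import struct

def quantize(value, keep_bits):
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    mask = ~((1 << (23 - keep_bits)) - 1) & 0xFFFFFFFF
    (out,) = struct.unpack("<f", struct.pack("<I", bits & mask))
    return out

x = 0.1
q = quantize(x, 7)   # ~2 significant decimal digits retained
print(q)
# Relative error is bounded by the number of kept mantissa bits.
assert abs(q - x) / x < 2 ** -7
```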

mo-marqh commented:

#27 has now been updated and merged.

There is functionality within XIOS2 & XIOS3 that enables XIOS to handle the type conversion of data, and to intercept out-of-range values and set them to a missing-data value.
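The interception behaviour described above can be sketched as follows (illustrative Python, not XIOS code, with an assumed int16 target and a common `_FillValue` convention): during conversion to the packed integer type, values whose packed result falls outside the representable range are set to a fill value rather than silently overflowing.

```python
# Intercept out-of-range values during packed integer conversion.
INT16_MIN, INT16_MAX = -32768, 32767
FILL_VALUE = INT16_MIN   # assumed _FillValue convention for int16

def pack_with_fill(values, scale_factor, add_offset):
    out = []
    for v in values:
        p = round((v - add_offset) / scale_factor)
        # Out-of-range packed results become the fill value; note the
        # fill value itself is excluded from the valid range.
        if p <= INT16_MIN or p > INT16_MAX:
            p = FILL_VALUE
        out.append(p)
    return out

# 9999.0 packs to 972585, far outside int16, so it is intercepted.
packed = pack_with_fill([273.15, 9999.0, -9999.0], 0.01, 273.15)
print(packed)
```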
