Compression of ADAPT serialized datasets #127

Closed
knelson-farmbeltnorth opened this issue Nov 1, 2023 · 6 comments

Comments

@knelson-farmbeltnorth
Contributor

Initial discussion in the 1 Nov 2023 meeting about whether and how ADAPT datasets should be compressed.

Agreement in the meeting that at no time should an ADAPT dataset contain compressed archives within compressed archives, or have an uncompressed adapt.json file with compressed sub files.
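
A rough sketch of how that rule could be checked against a list of file names in a dataset. The suffix list is an assumption made for illustration, not something agreed in the meeting, and a name-based check cannot see inside files.

```python
from pathlib import PurePosixPath

# Hypothetical set of suffixes treated as "compressed archive" for this check.
ARCHIVE_SUFFIXES = {".zip", ".gz", ".tgz", ".bz2", ".tbz2", ".xz", ".7z", ".rar"}


def find_nested_archives(member_names):
    """Return the names whose final suffix looks like a compressed archive."""
    return [
        name
        for name in member_names
        if PurePosixPath(name).suffix.lower() in ARCHIVE_SUFFIXES
    ]


members = ["adapt.json", "geospatial/field1.parquet", "geospatial/old_export.zip"]
flagged = find_nested_archives(members)
print(flagged)  # ['geospatial/old_export.zip']
```

The same function works whether the names come from an archive listing or from a plain directory holding an uncompressed adapt.json.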

The question of how to compress the adapt.json and its constituent geospatial files was not resolved, however.

Some participants were in favor of ADAPT making no requirement about how entire datasets should be compressed (or not compressed). Other participants suggested we find a compression standard that has wide support and require that data be compressed with that standard and no other.

@knelson-farmbeltnorth changed the title from "Compression in ADAPT serialized datasets" to "Compression of ADAPT serialized datasets" on Nov 1, 2023
@crutt
Collaborator

crutt commented Nov 13, 2023

How about .tar.bz2, as it is an open format with wide usage and support, and doesn't support encryption? GZip is also a good candidate instead of BZip2 if it is viewed as more widely available.

Here's my crack at outlining archive/compression support.

Single File Archive / Compression

Any system that supports the creation of an archive of ADAPT standard data SHALL conform to the Archive Structure, and MUST support the Standard Archive Format (tarball bzip2) as an option.

Systems that generate archives SHALL NOT require encryption or password protection.

Archive Structure

  • ./adapt.json
    • The adapt.json file is REQUIRED and MUST be at the root of the archive.
  • ./**/*
    • Additional files are OPTIONAL, and SHALL only be included if referenced in the adapt.json file (e.g. geospatial rasters/parquets).
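
As a minimal sketch, the two structure rules above could be checked like this with Python's standard tarfile module. The function name and the referenced_paths parameter are hypothetical: how file references are read out of adapt.json is not defined in this thread, so the caller is assumed to supply them.

```python
import io
import tarfile
import tempfile
from pathlib import Path, PurePosixPath


def check_archive_structure(archive_path, referenced_paths):
    """Return a list of problems; an empty list means the layout looks conformant.

    referenced_paths is assumed to be the set of archive-relative paths that
    adapt.json refers to (extracting them from the JSON is left to the caller).
    """
    referenced = {str(PurePosixPath(p)) for p in referenced_paths}
    with tarfile.open(archive_path, "r:*") as tar:
        # PurePosixPath normalizes "./adapt.json" to "adapt.json".
        files = {str(PurePosixPath(m.name)) for m in tar.getmembers() if m.isfile()}
    problems = []
    if "adapt.json" not in files:
        problems.append("adapt.json is missing from the archive root")
    for name in sorted(files - {"adapt.json"}):
        if name not in referenced:
            problems.append(f"{name} is not referenced in adapt.json")
    return problems


def _add_bytes(tar, name, data):
    """Add an in-memory file to an open tar archive (demo helper)."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


# Demo with placeholder contents: one referenced file and one stray file.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "dataset.tar.bz2"
    with tarfile.open(archive, "w:bz2") as tar:
        _add_bytes(tar, "./adapt.json", b"{}")
        _add_bytes(tar, "./geospatial/field1.parquet", b"placeholder")
        _add_bytes(tar, "./geospatial/stray.tif", b"placeholder")
    problems = check_archive_structure(archive, {"geospatial/field1.parquet"})

print(problems)  # ['geospatial/stray.tif is not referenced in adapt.json']
```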

Standard Archive Format

The Standard Archive Format is a bzip2-compressed tarball with a .tar.bz2 extension.

Tarballs are an open standard for archiving multiple files into a single file, with broad support across operating systems and programming languages.

BZip2 compression is also an open standard with similar support and generally better compression than GZip.
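
If anyone wants to compare the two on their own data, a quick measurement along these lines would do. The sample payload is synthetic and made up for illustration, so its numbers say nothing about real ADAPT datasets.

```python
import bz2
import gzip


def compressed_sizes(data: bytes) -> dict:
    """Return the byte length of data uncompressed, gzip-compressed and bzip2-compressed."""
    return {
        "raw": len(data),
        "gzip": len(gzip.compress(data)),
        "bz2": len(bz2.compress(data)),
    }


# Synthetic, highly repetitive sample; substitute real file contents to compare.
sample = b'{"type": "Feature", "properties": {"yield": 42.0}}\n' * 2000
sizes = compressed_sizes(sample)
print(sizes)
```

Results depend heavily on the input, and files that are already compressed internally may shrink very little with either codec.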

Tar/bz2 support is widely available, and installed by default on many operating systems including macOS and many Linux distributions. On Windows, additional software may be required, such as 7-Zip or WSL.

Creating an archive

tar -cjf archive.tar.bz2 adapt.json ./geospatial/

-c: create a new archive
-j: use bzip2 compression
-f: specify the output file name
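
For systems that build the archive in code rather than by shelling out, a rough equivalent using Python's standard tarfile module might look like this. The function name and directory layout are illustrative assumptions; note that it packs every file under the dataset directory and does not check whether each one is referenced in adapt.json.

```python
import tarfile
import tempfile
from pathlib import Path


def create_adapt_archive(dataset_dir, archive_path):
    """Pack dataset_dir into a .tar.bz2 with adapt.json at the archive root."""
    dataset_dir = Path(dataset_dir)
    if not (dataset_dir / "adapt.json").is_file():
        raise FileNotFoundError("adapt.json must exist at the dataset root")
    with tarfile.open(archive_path, "w:bz2") as tar:
        for path in sorted(dataset_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the dataset root, with "/" separators.
                tar.add(path, arcname=path.relative_to(dataset_dir).as_posix())


# Demo with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "dataset"
    (dataset / "geospatial").mkdir(parents=True)
    (dataset / "adapt.json").write_text("{}")
    (dataset / "geospatial" / "field1.parquet").write_bytes(b"placeholder")
    archive = Path(tmp) / "archive.tar.bz2"
    create_adapt_archive(dataset, archive)
    with tarfile.open(archive, "r:bz2") as tar:
        names = sorted(tar.getnames())

print(names)  # ['adapt.json', 'geospatial/field1.parquet']
```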

Extracting an archive

tar -xjf archive.tar.bz2

-x: extract files from an archive
-j: decompress the archive with bzip2
-f: specify the archive file to read
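
A comparable sketch for reading an archive from code, again with Python's standard tarfile module. The function name is hypothetical, and the checks on member paths and types are a precaution added here, not something this proposal requires.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def extract_adapt_archive(archive_path, dest_dir):
    """Extract a .tar.bz2 into dest_dir and return the path to adapt.json."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:bz2") as tar:
        for member in tar.getmembers():
            target = (dest / member.name).resolve()
            # Refuse entries that would land outside the destination directory.
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
            # Refuse links and special files; only plain files and directories.
            if not (member.isfile() or member.isdir()):
                raise ValueError(f"unsupported entry type: {member.name}")
        tar.extractall(dest)
    adapt_json = dest / "adapt.json"
    if not adapt_json.is_file():
        raise FileNotFoundError("archive has no adapt.json at its root")
    return adapt_json


# Demo: build a tiny archive with placeholder content, then extract it.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "archive.tar.bz2"
    with tarfile.open(archive, "w:bz2") as tar:
        data = b'{"example": true}'
        info = tarfile.TarInfo("adapt.json")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    adapt_json_path = extract_adapt_archive(archive, Path(tmp) / "out")
    contents = adapt_json_path.read_text()

print(contents)
```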

@knelson-farmbeltnorth
Contributor Author

Agreement in the 29 November 2023 meeting to adopt the approach above as a recommendation rather than a requirement.

@Andreasox
Collaborator

Andreasox commented Mar 21, 2024 via email

@knelson-farmbeltnorth
Contributor Author

@Andreasox I suspect your mention of it is the first many of us have heard of it. To date, our decisions have just been that vector data should be stored as GeoParquet, and, for common use cases mapping field coverage, all geometries should be polygons. The definition of all other columns is handled in the JSON header data, which maps to the GeoParquet via column index.

Are you suggesting that we require the bbox column?

@Andreasox
Collaborator

Andreasox commented Mar 23, 2024 via email

@knelson-farmbeltnorth
Contributor Author

We discussed this in the 27 March 2024 meeting and are not going to require the bounding box data.
