Compression of ADAPT serialized datasets #127

knelson-farmbeltnorth · 2023-11-01T16:31:24Z

Initial discussion in 1 Nov 2023 meeting about if/how ADAPT datasets should be compressed.

Agreement in the meeting that at no time should an ADAPT dataset contain compressed archives within compressed archives, or have an uncompressed adapt.json file with compressed sub files.
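
As a rough illustration of the first rule, the sketch below lists members of a tar archive that look like compressed archives themselves. It is only a sketch: the function name `find_nested_archives` and the suffix list are assumptions made for this example, and detection by file extension is a heuristic rather than anything agreed in the meeting.

```python
import tarfile

# Assumed set of suffixes treated as "compressed archive"; not defined by ADAPT.
COMPRESSED_SUFFIXES = (".zip", ".gz", ".tgz", ".bz2", ".tbz2", ".xz", ".7z")


def find_nested_archives(archive_path):
    """Return names of regular-file members whose suffix suggests a nested archive."""
    # "r:*" lets tarfile detect the outer compression (gzip, bzip2, xz, or none).
    with tarfile.open(archive_path, "r:*") as tar:
        return [
            member.name
            for member in tar.getmembers()
            if member.isfile()
            and member.name.lower().endswith(COMPRESSED_SUFFIXES)
        ]
```

An empty list means no member matched the suffix list; it does not prove that no member is compressed internally.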

The question of how to compress the adapt.json and its constituent geospatial files was not resolved, however.

Some participants were in favor of ADAPT making no requirement of how entire datasets should be compressed (or not compressed). Other participants suggested we find a compression standard that has wide support and require data be compressed by that and only that.

crutt · 2023-11-13T15:43:46Z

How about .tar.bz2, as it is an open format with wide usage and support, and doesn't support encryption? GZip is also a good candidate instead of BZip, if it is viewed as more available/accessible.

Here's my crack at outlining archive/compression support.

Single File Archive / Compression

Any system that supports the creation of an archive of ADAPT standard data SHALL conform to the Archive Structure, and MUST support the Standard Archive Format (tarball bzip2) as an option.

Systems that generate archives SHALL NOT require encryption or password protection.

Archive Structure

./adapt.json
- The adapt.json file is REQUIRED and MUST be at the root of the archive.
./**/*
- Additional files are OPTIONAL, and SHALL only be included if referenced in the adapt.json file (e.g., geospatial rasters/parquets).
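
A minimal sketch of how a consumer might check the first rule of the Archive Structure, using Python's standard `tarfile` module. The function name `has_root_adapt_json` is hypothetical. The second rule (additional files must be referenced in adapt.json) is deliberately not checked, because this thread does not describe how adapt.json references its sub-files.

```python
import posixpath
import tarfile


def has_root_adapt_json(archive_path):
    """Return True if a regular file named adapt.json sits at the archive root."""
    with tarfile.open(archive_path, "r:bz2") as tar:
        for member in tar.getmembers():
            # normpath turns "./adapt.json" into "adapt.json", so both
            # spellings of the root path are accepted.
            if member.isfile() and posixpath.normpath(member.name) == "adapt.json":
                return True
    return False
```

A file such as `data/adapt.json` does not satisfy the check, since it is not at the root of the archive.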

Standard Archive Format

The Standard Archive Format is a tarball bzip2 file with a .tar.bz2 extension.

Tarballs are an open standard for archiving multiple files into a single file, with broad support across operating systems and programming languages.

BZip2 compression is also an open standard with similar support and generally better compression than GZip.

Tar/bz2 support is widely available, and installed by default on many operating systems including macOS and many Linux distributions. On Windows, additional software may be required, such as 7-Zip or WSL.

Creating an archive

tar -cjf archive.tar.bz2 adapt.json ./geospatial/

-c: create a new archive
-j: use bzip2 compression
-f: specify the output file name

Extracting an archive

tar -xjf archive.tar.bz2

-x: extract files from an archive
-j: decompress with bzip2
-f: specify the archive file to read
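
For platforms without a tar command line (the Windows case mentioned above), the same two operations can be sketched with Python's standard `tarfile` module. The function names `create_archive` and `extract_archive` and their parameters are assumptions made for this example, not part of any ADAPT tooling.

```python
import os
import tarfile


def create_archive(archive_path, source_dir, names):
    """Pack the named files/directories under source_dir into a .tar.bz2 file."""
    with tarfile.open(archive_path, "w:bz2") as tar:
        for name in names:
            # arcname keeps paths relative, so adapt.json lands at the archive root.
            # Directories are added recursively.
            tar.add(os.path.join(source_dir, name), arcname=name)


def extract_archive(archive_path, destination):
    """Unpack a .tar.bz2 file into the destination directory."""
    with tarfile.open(archive_path, "r:bz2") as tar:
        # For archives from untrusted sources, recent Python versions accept
        # extractall(destination, filter="data") to reject unsafe members.
        tar.extractall(destination)
```

Calling `create_archive("archive.tar.bz2", ".", ["adapt.json", "geospatial"])` corresponds roughly to the `tar -cjf` command above, and `extract_archive("archive.tar.bz2", ".")` to the `tar -xjf` command.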

knelson-farmbeltnorth · 2023-11-29T17:01:37Z

Agreement in the 29 November 2023 meeting to adopt the approach above as a recommendation rather than a requirement.

Andreasox · 2024-03-21T08:11:06Z

Hi,

As I am only an interested reader of ADAPT, I am hijacking an earlier thread instead of creating a new one. I note that GDAL has implemented GeoParquet spatial sorting functionality in OSGeo/gdal#9185, which should substantially enhance the read speed of large files. Is this being considered in ADAPT?

Best regards,
Andreas Oxenstierna

knelson-farmbeltnorth · 2024-03-22T15:10:50Z

@Andreasox I suspect your mention of it is the first many of us have heard of it. To date, our decisions have just been that vector data should be stored as GeoParquet, and, for common use cases mapping field coverage, all geometries should be polygons. The definition of all other columns is handled in the json header data, which map to the GeoParquet via column index.

Are you suggesting that we require the bbox column?

Andreasox · 2024-03-23T11:32:43Z

Hi,

If the data is used purely as a transfer format, then a spatial index is not necessary. But if it is to be displayed visually in, for example, QGIS, a bbox should substantially enhance the read speed of large files. I would recommend bbox, but not make it mandatory, as I assume it is only GDAL/OGR that will make use of it. Note that GDAL/OGR is used "everywhere" in the GIS sector (including by QGIS), so its GeoParquet support may be widely used over time. If relevant, I can test different performance scenarios if given relevant files.

Regards,
Andreas Oxenstierna

knelson-farmbeltnorth · 2024-03-27T19:37:53Z

We discussed in the 27 March 2024 meeting and are not going to require the bounding box data.

knelson-farmbeltnorth changed the title ~~Compression in ADAPT serialized datasets~~ Compression of ADAPT serialized datasets Nov 1, 2023

jim-wilson-kt added this to ADAPT Standard Issue Management Jun 19, 2024

jim-wilson-kt moved this to Implemented in ADAPT Standard Issue Management Jun 19, 2024

knelson-farmbeltnorth moved this from Implemented to Released in ADAPT Standard Issue Management Dec 4, 2024

knelson-farmbeltnorth closed this as completed by moving to Released in ADAPT Standard Issue Management Dec 4, 2024
