Idea: Support zstd computed dictionaries #323

jameswestman · 2024-01-10T03:05:44Z


PMTiles already supports regular zstd as one of its compression algorithms. However, zstd also supports custom dictionaries, which can help factor out common data when there are a lot of small, separately compressed files, like the tiles in a PMTiles archive. From the zstd README:

https://github.com/facebook/zstd#the-case-for-small-data-compression

The smaller the amount of data to compress, the more difficult it is to compress. This problem is common to all compression algorithms, and reason is, compression algorithms learn from past data how to compress future data. But at the beginning of a new data set, there is no "past" to build upon.

To solve this situation, Zstd offers a training mode, which can be used to tune the algorithm for a selected type of data. Training Zstandard is achieved by providing it with a few samples (one file per sample). The result of this training is stored in a file called "dictionary", which must be loaded before compression and decompression. Using this dictionary, the compression ratio achievable on small data improves dramatically.

I tested this using a set of ~2,000 random z12-14 tiles. The tile schema is OpenMapTiles, with a few additional tags that shouldn't significantly affect the results. I was able to achieve a 13% size reduction over regular zstd with a dictionary of only 8 KiB (8192 bytes). The dictionary ends up containing mostly layer names, tag names, and common tag values.

Table of compression ratios for each dictionary size

Compression	Dictionary size (bytes)	Size (bytes)	Ratio (vs. uncompressed)	Ratio (vs. regular zstd)
raw		1229249	1.000
gzip		957791	0.779
zstd		952206	0.775
zstd	256	945116	0.769	0.993
zstd	512	926480	0.754	0.973
zstd	1024	861471	0.701	0.905
zstd	2048	846154	0.688	0.889
zstd	4096	833134	0.678	0.875
zstd	8192	825730	0.672	0.867
zstd	16384	828602	0.674	0.870
zstd	32768	828161	0.674	0.870
zstd	65536	824510	0.671	0.866
zstd	131072	816892	0.665	0.858
zstd	262144	820655	0.668	0.862

Python script

import requests
import os
import gzip
import zstandard as zstd
import hashlib
import random

session = requests.Session()

os.makedirs('training', exist_ok=True)
os.makedirs('testing', exist_ok=True)

valid_zoom_levels = [12, 13, 14]

# Download random tiles
total_tiles = 1000

i = 0
while True:
    z = random.choice(valid_zoom_levels)
    x = random.randint(0, 2 ** z - 1)
    y = random.randint(0, 2 ** z - 1)
    url = f'https://tiles.maps.jwestman.net/data/streets_v3/{z}/{x}/{y}.pbf'

    response = session.get(url)

    content = response.content
    # Use hashes so we don't count gains from deduplication of identical tiles
    content_hash = hashlib.sha256(content).hexdigest()

    if response.status_code == 200:
        # Save the tile in either training or testing directory
        path = 'training' if i % 2 == 0 else 'testing'

        if not (os.path.exists(f'training/{content_hash}.pbf') or os.path.exists(f'testing/{content_hash}.pbf')):
            with open(f'{path}/{content_hash}.pbf', 'wb') as file:
                file.write(content)

            print(f"Downloaded tile {i + 1}/{total_tiles}")
            i += 1
            if i >= total_tiles:
                break
    elif response.status_code == 204:
        # Skip empty tiles
        print(f"Received 204 response for tile {i + 1}")
    else:
        print(f"Failed to download tile {i + 1}. Error: {response.status_code} - {response.reason}")

# Find the total size of the testing files
testing_total_size = 0
for filename in os.listdir('testing'):
    testing_total_size += os.path.getsize(f'testing/{filename}')
# print(f'raw         : {testing_total_size:10} ({testing_total_size / testing_total_size:.3f})')
print(f'<tr><td>raw</td><td></td><td>{testing_total_size}</td><td>{testing_total_size / testing_total_size:.3f}</td><td></td></tr>')

# Compress the testing files using gzip
gzip_total_size = 0
for filename in os.listdir('testing'):
    with open(f'testing/{filename}', 'rb') as file_in:
        content = gzip.compress(file_in.read())
        gzip_total_size += len(content)
# print(f'gzip        : {gzip_total_size:10} ({gzip_total_size / testing_total_size:.3f})')
print(f'<tr><td>gzip</td><td></td><td>{gzip_total_size}</td><td>{gzip_total_size / testing_total_size:.3f}</td><td></td></tr>')

# Compress the testing files using zstd (default level, no dictionary)
zstd_total_size = 0
cctx = zstd.ZstdCompressor()  # one compressor can be reused for every tile
for filename in os.listdir('testing'):
    with open(f'testing/{filename}', 'rb') as file_in:
        compressed = cctx.compress(file_in.read())
        zstd_total_size += len(compressed)
# print(f'zstd        : {zstd_total_size:10} ({zstd_total_size / testing_total_size:.3f})')
print(f'<tr><td>zstd</td><td></td><td>{zstd_total_size}</td><td>{zstd_total_size / testing_total_size:.3f}</td><td></td></tr>')

training_samples = []

for filename in os.listdir('training'):
    with open(f'training/{filename}', 'rb') as file_in:
        training_samples.append(file_in.read())

dictionaries = [zstd.train_dictionary(2**i, training_samples) for i in range(8, 20)]

# Compress the testing files using zstd and each trained dictionary
for dictionary in dictionaries:
    zstd_dict_total_size = 0
    # Build the compressor once per dictionary rather than once per tile
    cctx = zstd.ZstdCompressor(dict_data=dictionary)
    for filename in os.listdir('testing'):
        with open(f'testing/{filename}', 'rb') as file_in:
            compressed = cctx.compress(file_in.read())
            zstd_dict_total_size += len(compressed)
    # print(f'zstd({len(dictionary.as_bytes()):6}): {zstd_dict_total_size:10} ({zstd_dict_total_size / testing_total_size:.3f}) ({zstd_dict_total_size / zstd_total_size:.3f})')
    print(f'<tr><td>zstd</td><td>{len(dictionary.as_bytes())}</td><td>{zstd_dict_total_size}</td><td>{zstd_dict_total_size / testing_total_size:.3f}</td><td>{zstd_dict_total_size / zstd_total_size:.3f}</td></tr>')

This would require changes to the PMTiles spec, of course. There could be additional header fields that give the offset and length of the custom dictionary, which would be used to decompress tiles. It would be up to the software producing the file to use an appropriate dictionary (it could take one as a command line argument, for example, or train one on a subset of its output).
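
To make the header idea concrete, here is a minimal sketch of how a reader might locate the dictionary from an offset/length pair. The layout is purely hypothetical: the two little-endian uint64 fields, their position at the start of the file, and the names `DICT_FIELDS`, `write_archive`, and `read_dictionary` are all assumptions for illustration, not part of the PMTiles spec.

```python
import io
import struct
from typing import BinaryIO

# Hypothetical layout, NOT part of the PMTiles spec: two little-endian
# uint64 fields (dictionary offset, dictionary length) at the start of the file.
DICT_FIELDS = struct.Struct("<QQ")


def write_archive(dictionary: bytes, payload: bytes) -> bytes:
    """Build a toy archive: [dict offset, dict length][dictionary][payload]."""
    dict_offset = DICT_FIELDS.size  # dictionary starts right after the two fields
    header = DICT_FIELDS.pack(dict_offset, len(dictionary))
    return header + dictionary + payload


def read_dictionary(archive: BinaryIO) -> bytes:
    """Read the two header fields, then fetch the dictionary they point to."""
    archive.seek(0)
    dict_offset, dict_length = DICT_FIELDS.unpack(archive.read(DICT_FIELDS.size))
    if dict_length == 0:
        return b""  # no dictionary: tiles would use plain zstd
    archive.seek(dict_offset)
    data = archive.read(dict_length)
    if len(data) != dict_length:
        raise ValueError("archive is truncated inside the dictionary section")
    return data


archive_bytes = write_archive(b"example-dictionary-bytes", b"tile data ...")
dictionary = read_dictionary(io.BytesIO(archive_bytes))
```

A zero length doubling as "no dictionary" keeps existing plain-zstd archives readable without a separate flag. With the `zstandard` package used in the script above, the returned bytes could then be wrapped in `zstd.ZstdCompressionDict` and passed to `zstd.ZstdDecompressor(dict_data=...)`.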

bdon · 2024-01-10T03:29:31Z

Maintainer

(moving to discussions)

It's a great application of zstd dictionaries, thanks for adding the script and results. Right now the killer feature of PMTiles is being able to decode tiles in the browser, which is why we need to use gzip: gzip has JS implementations like fflate that we can use as polyfills in browsers that don't yet have the new DecompressionStream API. There is some discussion among browser vendors of bringing zstd to DecompressionStream, but it is unclear whether that would support dictionaries.

If you control both the source archive and the client you could implement this already by storing the zstd dictionary base64-encoded in the PMTiles JSON metadata section, then setting the header TileType to 0/unknown and constructing the zstd decoder yourself.
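
As a rough sketch of the metadata half of that workaround, the dictionary can be round-tripped through the JSON metadata like this. The key name `zstd_dictionary_base64` and both helper functions are hypothetical; the PMTiles spec does not define them, and a browser client would do the equivalent decoding in JavaScript.

```python
import base64
import json

# Hypothetical metadata key; the PMTiles spec does not define one.
DICT_KEY = "zstd_dictionary_base64"


def embed_dictionary(metadata: dict, dictionary: bytes) -> str:
    """Return the metadata JSON text with the dictionary added as base64."""
    enriched = dict(metadata)  # shallow copy so the caller's dict is untouched
    enriched[DICT_KEY] = base64.b64encode(dictionary).decode("ascii")
    return json.dumps(enriched)


def extract_dictionary(metadata_json: str) -> bytes:
    """Recover the raw dictionary bytes from the metadata JSON text."""
    metadata = json.loads(metadata_json)
    return base64.b64decode(metadata[DICT_KEY], validate=True)


metadata_json = embed_dictionary({"name": "streets"}, b"\x00\x01example dictionary")
recovered = extract_dictionary(metadata_json)
```

Base64 grows the payload by about a third, so an 8192-byte dictionary adds roughly 11 KB of metadata text. The client would hand the recovered bytes to whatever zstd decoder it uses before decompressing any tile.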
