random access example #1
Comments
Before we write the wrong example code: could you explain what you mean by a random access example? If you have an array of FSST-compressed strings, or rather, a big blob of data with an integer array of offsets into it (or an array of pointers to FSST-compressed strings), then you can clearly randomly access this integer array of offsets (or array of pointers).
Is this the Random Access you talked about in the presentation? Because wouldn't that be the same for any other encoding or compression method? FWIW I came to the repo expecting an example of random access, just like @ChinaXing :)
In a block-based compression method like LZ4, snappy, zstd or gzip, the only way to get decompressed data out is to decompress a whole block. These blocks also need to be relatively large, 4KB or preferably more, because these schemes achieve compression by exploiting repetition in the string. If a block were just 20 bytes long, like a single string in a database table column could be, none of these schemes would achieve any compression, because within those 20 bytes there is typically little or no useful repetition.

In FSST, many short strings compress efficiently, so you can have very many small blocks. In a database file format, each compressed string would be a 'block' by way of comparison with LZ4, snappy, zstd or gzip. These FSST 'blocks' can be as short as a single byte, the compression ratio remains good, and you can pick out any block and decompress it; that is, you can pick out individual strings and not touch the rest.

Of course, you need some way to find the starts of these blocks. In database file formats, strings are typically stored in two areas: one holding byte-offsets (i.e. pointers to each string), and one holding all the strings (all their characters/bytes concatenated, one after the other). The byte-offsets are typically further compressed: delta-encoded (i.e. you store string lengths) and bit-packed, such that the average space consumption is log2(avg-string-length) bits. Vectorized bit-unpacking and delta-decompression then allows decompressing a sequence of 256 offsets in less than 100 cycles. So if you only need 4% of the strings, say 10 out of the 256, you would get the 10 relevant string offsets this way (in less than 100 cycles, so amortized 10 cycles per piece) and then use these offsets to randomly fetch the 10 FSST-compressed strings you need, leaving the other 246 strings untouched.

When using LZ4, snappy, zstd or gzip in a database format, you have the same offset (or 'length', given delta compression it is the same) decompression cost, but you also have to decompress 25x more data to get at the string characters (25x because in this example you need only 4%). Moreover, a smart data system will not even immediately decompress the 10 relevant FSST strings we pick out: leaving the strings compressed in a hash table further down the query pipeline is a good idea, because it gives you a smaller hash table.

If you are not asking this question from a data(base) system context, and you don't have this problem of selective access to billions of compressed small strings, but instead need selective access to much fewer large/huge compressed strings, that is a somewhat different problem. Not sure what you would be doing. Maybe you have an information retrieval or text search application, and only want to search** part of these compressed documents while decompressing as little as possible. In that case you need to create 'entry points', e.g. chop the long strings up into smaller blocks, and keep a pointer (byte-offset) to the start of each block. FSST already chops up uncompressed long strings into 511-byte chunks (and then compresses each chunk into a compressed byte-sequence that is hopefully significantly shorter than 511 bytes), in order to make its AVX512 compression fast (but it does so always, regardless of the platform). So you could create offsets that point to the start of each 511-byte chunk (in terms of the decompressed string). You can then use this offset metadata to jump into the middle of your long compressed string and decompress only part of it (i.e. deliver 511 uncompressed bytes). That way you avoid having to decompress larger blocks of 4KB or more, as you would with LZ4, snappy, zstd or gzip.

**We have some ongoing work on regular-expression matching directly on FSST-compressed strings that has not been published yet.

The way you keep the byte-offsets/pointers to the compressed strings/blocks is system-dependent; this is why there is no example code. In DuckDB, we will be delta-compressing and bit-packing these pointers to also keep them small. That code would be a bit complicated and low-level, so it would not necessarily help understanding. Hopefully this explanation helps.
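To make the offsets-plus-blob layout concrete, below is a minimal sketch against the public C API declared in fsst.h (fsst_create, fsst_compress, fsst_decoder, fsst_decompress, fsst_destroy). The flat offset array and all variable names are illustrative choices, not something the library prescribes.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include "fsst.h"

int main(void) {
    /* A tiny string column; a real system would train on a sample of it. */
    unsigned char *strIn[4] = {
        (unsigned char*)"http://example.com/aa",
        (unsigned char*)"http://example.com/bbb",
        (unsigned char*)"http://example.org/cc",
        (unsigned char*)"http://example.org/ddd"
    };
    size_t lenIn[4];
    for (int i = 0; i < 4; i++) lenIn[i] = strlen((char*)strIn[i]);

    /* One symbol table for all strings. */
    fsst_encoder_t *enc = fsst_create(4, lenIn, strIn, 0 /* not zero-terminated */);

    /* Compress the whole batch into one blob; strOut[i]/lenOut[i] say where
       compressed string i lives inside that blob. */
    unsigned char blob[4096];
    size_t lenOut[4];
    unsigned char *strOut[4];
    if (fsst_compress(enc, 4, lenIn, strIn, sizeof(blob), blob, lenOut, strOut) < 4)
        return 1; /* output buffer too small */

    /* Random-access metadata: one byte offset per compressed string. */
    uint64_t offset[4];
    for (int i = 0; i < 4; i++) offset[i] = (uint64_t)(strOut[i] - blob);

    /* Random access: decompress only string 2, leaving the rest untouched. */
    fsst_decoder_t dec = fsst_decoder(enc);
    unsigned char out[512];
    size_t n = fsst_decompress(&dec, lenOut[2], blob + offset[2], sizeof(out), out);
    printf("%.*s\n", (int)n, (char*)out);

    fsst_destroy(enc);
    return 0;
}
```

In a real file format the offset array would itself be delta-encoded and bit-packed, as described above.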
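For that part, here is a simplified scalar sketch of storing the per-string lengths (the deltas between consecutive offsets) bit-packed; the helper names are made up, and the vectorized version that unpacks 256 offsets in under 100 cycles is considerably more involved.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack n small values ('bits' bits each) into a byte buffer. */
static void pack(uint8_t *buf, const uint32_t *val, int n, int bits) {
    memset(buf, 0, (size_t)(n * bits + 7) / 8);
    for (int i = 0; i < n; i++)
        for (int b = 0; b < bits; b++)
            if (val[i] & (1u << b)) {
                int pos = i * bits + b;
                buf[pos >> 3] |= (uint8_t)(1u << (pos & 7));
            }
}

/* Unpack value i again. */
static uint32_t unpack_one(const uint8_t *buf, int i, int bits) {
    uint32_t v = 0;
    for (int b = 0; b < bits; b++) {
        int pos = i * bits + b;
        if (buf[pos >> 3] & (1u << (pos & 7)))
            v |= 1u << b;
    }
    return v;
}

/* Offset of string i = prefix sum of the lengths 0..i-1. */
static uint64_t offset_of(const uint8_t *buf, int i, int bits) {
    uint64_t off = 0;
    for (int j = 0; j < i; j++) off += unpack_one(buf, j, bits);
    return off;
}

int main(void) {
    uint32_t len[4] = {20, 21, 20, 21};  /* compressed string lengths (deltas) */
    uint8_t buf[16];
    pack(buf, len, 4, 5);                /* 5 bits per value suffices for values < 32 */
    assert(unpack_one(buf, 2, 5) == 20);
    assert(offset_of(buf, 3, 5) == 61);  /* 20 + 21 + 20 */
    return 0;
}
```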
I'd like to implement FSST decompression on some 8-bit targets such as the 6502 & Z80. The format seems particularly well suited to decompression on such hardware, where bit-packed Huffman coding is inefficient. It's not apparent, after compiling the binary, how I would take many short strings and compress them using the same dictionary; or rather, how, given a collection of strings, to compress them and then know where each string begins in FSST's output. In my case, I need to separate each line of text (40 chars) into its own string so that when scrolling a document, I can decode just the line needed. I obviously want to share the symbol-table/dictionary across the whole document and not reproduce it for every string, no? If I can't figure this out, I may look at reimplementing the FSST compression algorithm in a language I know such as Python, Lua or whatever. Speed is not a huge concern since I will be dealing with only ~100KB of text total anyhow, but it would be nice to avoid the extra work.
I do not understand your comment; the API is specifically targeted at compressing many short strings using a single dictionary. Please repeat your question, quoting the API you are considering to use and asking about the pieces you do not understand.
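For illustration, here is a sketch of that batch usage, assuming the C API in fsst.h (fsst_create, fsst_compress, fsst_export, fsst_import, fsst_decompress): train one symbol table on the lines of a document, compress them all, remember where each compressed line starts, serialize the table once, and later decode a single line. The file layout (table, offsets, blob) is just one possible choice, not something the library prescribes.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include "fsst.h"

int main(void) {
    /* The "document": each line becomes its own string. */
    unsigned char *line[3] = {
        (unsigned char*)"the quick brown fox jumps over the lazy dog",
        (unsigned char*)"the quick brown fox jumps over the lazy cat",
        (unsigned char*)"the lazy dog sleeps while the quick fox runs"
    };
    size_t lineLen[3];
    for (int i = 0; i < 3; i++) lineLen[i] = strlen((char*)line[i]);

    /* 1. Train ONE symbol table on (a sample of) the whole document. */
    fsst_encoder_t *enc = fsst_create(3, lineLen, line, 0);

    /* 2. Compress all lines in one call; strOut[i]/lenOut[i] record where
          each compressed line starts and how long it is. */
    unsigned char blob[8192];
    size_t lenOut[3];
    unsigned char *strOut[3];
    if (fsst_compress(enc, 3, lineLen, line, sizeof(blob), blob, lenOut, strOut) < 3)
        return 1; /* output buffer too small */

    /* 3. Serialize the symbol table once. A build script would write 'table',
          the per-line offsets (strOut[i] - blob) and 'blob' to its output file.
          4096 bytes is assumed to be plenty; fsst.h documents the exact bound. */
    unsigned char table[4096];
    unsigned int tableLen = fsst_export(enc, table);
    printf("symbol table: %u bytes, %d compressed lines\n", tableLen, 3);

    /* 4. Decoder side: rebuild the decoder from the serialized table and
          decompress just line 1, without touching the other lines. */
    fsst_decoder_t dec;
    fsst_import(&dec, table);
    unsigned char out[256];
    size_t n = fsst_decompress(&dec, lenOut[1], strOut[1], sizeof(out), out);
    printf("%.*s\n", (int)n, (char*)out);

    fsst_destroy(enc);
    return 0;
}
```

The decoder side only needs the serialized symbol table and the per-line offsets; that is the part an 8-bit reimplementation would have to mirror.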
Apologies, I do not know C/C++; I am not referring to the API but rather the command-line tool. It can only take one input. I do not have the knowledge to use the C API; I wish to use the command-line tool as part of a build script.
The tool only works with single files. Because you do not know C/C++, and your application does not care about performance, I recommend not using FSST. |
Hope you can add some example code about random access.
I haven't seen any code snippet about random access or its usage
:)