Avoiding metadata bloat caused by many long URLs #13
Note that the current implementation in fsspec allows for a fallback default URL for any entry where a URL is not specified, i.e., `[null, start, size]` (or …).
@martindurant, I did not know that. How does one specify the fallback default URL?
The […] argument.
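For later readers, a minimal sketch of how this might look, assuming the fallback is supplied via ReferenceFileSystem's `target` argument and that the chunk keys shown are placeholders (worth double-checking against the fsspec docs):

```python
import fsspec

# References whose URL is null fall back to a single default target,
# so the long URL never needs to be repeated per chunk.
refs = {
    "zeta/0.0": [None, 0, 50000],        # no URL -> use the fallback
    "zeta/0.1": [None, 50000, 50000],
}
fs = fsspec.filesystem(
    "reference",
    fo=refs,
    target="s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d.nc",  # assumed fallback URL
    remote_protocol="s3",
)
```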
In our example metadata file (s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d_offsets.json), replacing the explicit URLs below with "$U" reduces the size by 50%.
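As an illustration of the kind of rewrite meant here (a sketch only: the flat `{key: [url, offset, size]}` layout, the `"refs"` wrapper key, and the repeated URL are assumptions):

```python
import json

URL = "s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d.nc"  # the repeated URL (assumed)

def compact(refs, url=URL, token="$U"):
    # Swap the shared URL in every [url, offset, size] entry for the short token.
    return {
        k: ([token] + v[1:]) if isinstance(v, list) and v[0] == url else v
        for k, v in refs.items()
    }

with open("adcirc_01d_offsets.json") as f:
    refs = json.load(f)
small = {"$U": URL, "refs": compact(refs)}
print(len(json.dumps(refs)), "->", len(json.dumps(small)))
```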
Indeed, we might not use JSON as the only storage format in the long run. A binary format in which we can store the columns separately would give very nice compression for the case of repeated values (in the case of parquet, there is a specific encoding for this, although it is rarely used and poorly supported). Note that the key names also have a lot of repetition. All that being said, I think we always want to support JSON too, so it's still worthwhile to come up with a way to declare global variables for the purpose of brevity and compactness. I slightly prefer "{U}" to "$U", since "$" is more likely to be a genuine character in a path, and {..} represents formatting in both python and JS (as opposed to this thing called "bash"). It would also potentially allow for more complicated formatting, so long as it is generalisable beyond python.
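For example, a `{u}` placeholder lines up directly with Python's built-in formatting (a toy sketch; the template key and paths are hypothetical):

```python
templates = {"u": "s3://pangeo-data-uswest2/esip/adcirc"}

# str.format handles the substitution, and could also carry richer format
# specs later, as long as they stay generalisable beyond Python.
print("{u}/adcirc_01d.nc".format(**templates))
# s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d.nc
```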
@martindurant I agree with everything you say: […]
I'm curious whether Arrow would be of interest / a suitable binary format. Mostly asking because I work on zarr-related things in the browser, where support for something like parquet is lacking.
What do you mean? Arrow is not a storage format. In the context of this repo's use case, zarr would maybe be a nice option (it does have a JS implementation).
I don't actually!
Apologies, I should be more clear. It is my understanding that Arrow defines a "language-independent columnar memory format for flat and hierarchical data". In an ecosystem like Python, this usually means:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"offset": [0, 1000], "size": [1000, 1000]})  # example data
table = pa.Table.from_pandas(df)
with pa.RecordBatchFileWriter('dataset.arrow', table.schema) as writer:
    writer.write(table)  # writes the in-memory table to disk

# EDIT: as of Apache Arrow 0.17.0, the same data is written to disk using
# pyarrow.feather, since uncompressed Feather V2 is the Arrow representation on disk.
from pyarrow import feather
feather.write_feather(df, 'dataset.feather', compression='uncompressed')
```

There is good support for reading Arrow tables in the browser, whether dynamically from a server or as a static asset from an object store or web server. If you look at the documentation for the JavaScript implementation of Arrow, you will see this exact use case: https://arrow.apache.org/docs/js/. In JS we usually deal with our data in JSON, but Arrow is becoming increasingly popular because: […]
TL;DR: outside the Python ecosystem, Arrow is used as a format and not just a transport layer. Its multi-language support makes it one of the most compelling binary targets for storing this type of metadata that is meant to be accessed from many clients. In addition, if you think about it, JSON is just a common transport layer, and we should probably be looking for a binary equivalent. (What I was proposing was essentially an Arrow Table, probably compressed with something like gzip, as a binary reference format.)
I would refer to this as the Arrow IPC representation, i.e., meant for on-the-wire data transfer, but I see what you mean. I don't know much about its performance characteristics, but "faster than JSON" isn't an amazing recommendation (JSON is for readability, not speed). The main draw would be the ability to get a tabular representation in memory, rather than objects. This sort of suits the current spec, but not that well... Your actual comparisons in this space would be msgpack, thrift, avro, protobuf..., which are all compact, binary, and support complex (but well-defined) structures like this one. Indeed, they come with schema definitions directly, so our discussion on the spec could happen in code. Most of them are not normally considered for on-disk storage (except avro?), even though all have good multi-language support.
For the one example at "s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d_offsets.json", the JSON is 437,597 bytes and the equivalent msgpack 369,478, so not much difference (compressing with gzip to 68 kB and 60 kB, respectively) because of all of the repeated strings. I did not try the others, which would require me to craft a schema.
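For anyone wanting to reproduce this kind of comparison, a sketch (the file name is as above; `msgpack` is the third-party msgpack package):

```python
import gzip
import json

import msgpack

with open("adcirc_01d_offsets.json") as f:
    refs = json.load(f)

as_json = json.dumps(refs).encode()
as_msgpack = msgpack.packb(refs)
print(len(as_json), len(as_msgpack))                                 # raw sizes
print(len(gzip.compress(as_json)), len(gzip.compress(as_msgpack)))   # gzipped sizes
```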
Thanks for the clarification. I guess I'm fairly naive here, and speaking mostly from how I've observed the adoption of Arrow on the web for columnar data (where it often feels like it's used as a storage format).
Agreed. I brought up Arrow because of the mention of parquet, and how the current spec could probably be expressed in columnar form.
Fair. I primarily wanted to voice consideration for a binary target that has multi-language support. (Not to say it wouldn't be a consideration!) Context: I work on the js implementation of zarr.
To be sure, I have a bias against arrow, in that the installation in python is enormous, and in many zarr/xarray or other array-based workloads it would only be doing this one job.
So exactly, zarr could be the target; I suspect it's much smaller in code than arrow and/or parquet, but less well known, of course. You can also store arrow in zarr: zarr-developers/zarr-python#515
I think this is a really sensible concern and something I was not aware of. Just for the sake of completeness of this thread, and because @kylebarron pointed it out to me: apparently the Feather V2 storage format is the Arrow IPC representation serialized to disk: https://arrow.apache.org/docs/python/feather.html
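For completeness, a round trip through Feather V2 looks like this (the column names are made up for illustration):

```python
import pandas as pd
from pyarrow import feather

df = pd.DataFrame({"key": ["a/0", "a/1"], "offset": [0, 1000], "size": [1000, 1000]})
feather.write_feather(df, "refs.feather", compression="uncompressed")  # Arrow IPC on disk
table = feather.read_table("refs.feather")  # read back as a pyarrow.Table
```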
I think the concern with adding […]
I think the potential of this ecosystem of fsspec / kerchunk / zarr for handling large datasets in their original format is pretty great. I have been using tiff_to_zarr on large datasets of thousands of TIFF files, and the memory issue here is pretty significant. I have been circumventing it for now with a custom implementation of ReferenceFileSystem in which the array chunk ranges are stored in memory in a pandas DataFrame; this significantly reduces the memory footprint (a sketch of the idea follows below). This is sufficient for the POC I am doing, but a more thought-out solution would be welcome. Some specific elements to consider: […]
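For reference, a stripped-down sketch of the DataFrame-backed idea described above (the names, paths, and layout are illustrative, not the actual implementation):

```python
import numpy as np
import pandas as pd

# One row per chunk; the repeated URL is deduplicated via a categorical column.
raw_refs = {
    "var/0.0": ["s3://bucket/long/prefix/file_0001.tif", 0, 65536],
    "var/0.1": ["s3://bucket/long/prefix/file_0001.tif", 65536, 65536],
}
df = pd.DataFrame(
    [(k, u, o, s) for k, (u, o, s) in raw_refs.items()],
    columns=["key", "url", "offset", "size"],
).set_index("key")
df["url"] = df["url"].astype("category")      # store each distinct URL only once
df["offset"] = df["offset"].astype(np.int64)  # fixed-width ints instead of Python objects
df["size"] = df["size"].astype(np.int64)

def lookup(key):
    """Resolve one chunk key to (url, offset, size)."""
    row = df.loc[key]
    return str(row["url"]), int(row["offset"]), int(row["size"])
```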
ReferenceFileSystem allows you an optional […] argument. Finally, we have found that compressing the JSON with zstd does a really nice job of compacting the repeated strings on disk, so that the methods above become unnecessary. I would be interested to know what your "tree" implementation is like. In an upcoming version of kerchunk/referenceFS, we will be implementing parquet storage for references, which should help with both the on-disk and in-memory footprint, but might still leave a lot of string duplication after load (this remains to be seen).
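A sketch of the zstd approach, using the third-party `zstandard` package (the file name is assumed, as before):

```python
import json

import zstandard as zstd

with open("adcirc_01d_offsets.json", "rb") as f:
    raw = f.read()

compressed = zstd.ZstdCompressor(level=9).compress(raw)
print(len(raw), "->", len(compressed))  # the repeated URL strings compress very well

# Reading back:
refs = json.loads(zstd.ZstdDecompressor().decompress(compressed))
```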
I woke up this morning wondering whether it would be possible to allow a variable to be defined in the reference file spec. I'm worried about long s3 (or other) URLs bloating the metadata.
If we could do something like the sketch below, we could make the bloat a lot smaller.
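Something like the following, say (the `$U` key and chunk names are hypothetical, just to show the shape):

```python
import json

doc = {
    "$U": "s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d.nc",
    "refs": {
        "zeta/0.0": ["$U", 0, 1048576],       # "$U" stands in for the long URL
        "zeta/0.1": ["$U", 1048576, 1048576],
    },
}
print(json.dumps(doc, indent=2))
```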