Out-source storage backends to fsspec drivers #541

Open
observingClouds opened this issue Nov 22, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@observingClouds

Is your feature request related to a problem? Please describe.

The backends developed within this package seem useful also outside of earthkit. Who doesn't like to have easier access to data and focus more on the fun parts?! 😄 Because it is not always necessary to have e.g. plotting capabilities and install all the additional earthkit dependencies, it would be great to extract these backends into separate packages. This would also be in line with the ECMWF Software Strategy and Roadmap for 2023–2027, which earthkit's development follows, and would help the community in

improving reusability and componentisation of software.

Describe the solution you'd like

I propose to develop these backends (ECFS, FDB, MARS, ...) as fsspec drivers. The benefits would be:

  • use of a widely adopted driver API
  • interoperability with other Python packages
  • support for chaining protocols, e.g. zip and s3
  • contributing to a large open-source community

Further, earthkit could have a general entry point for fsspec (as e.g. xarray.open_dataset() is) and get support for all sorts of other data sources for free, e.g. S3, zip, tar, webdav, ftp, Databricks, DVC, git, memory, cache and many, many more. Although the package name includes "filesystem", it also fully supports object stores and, with the memory and cache drivers, in-memory streaming.
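To illustrate the protocol chaining mentioned above, here is a minimal sketch that uses only fsspec's built-in memory and zip drivers, so it needs no network access. The archive and member names are made up for the example; with a remote store, the `memory://` part would be replaced by something like `s3://`.

```python
import zipfile

import fsspec

# Build a small zip archive inside fsspec's in-memory filesystem.
# "earthkit_demo.zip" and "hello.txt" are hypothetical names for this sketch.
with fsspec.open("memory://earthkit_demo.zip", "wb") as raw:
    with zipfile.ZipFile(raw, "w") as zf:
        zf.writestr("hello.txt", "hello from inside the archive")

# Chain two protocols with "::": read hello.txt out of the zip archive
# that itself lives in the memory filesystem.
with fsspec.open("zip://hello.txt::memory://earthkit_demo.zip", "rb") as f:
    content = f.read()

print(content.decode())
```

Any library that accepts fsspec URLs can consume such chained URLs without knowing about the individual drivers, which is the interoperability argument made above.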

I have already implemented a driver for ECFS that abstracts the system commands and allows accessing ECFS resources directly via a URI.

Some of the current issues strengthen this idea as they would benefit from an fsspec centered implementation:

Having also the ECMWF related protocols implemented as fsspec drivers would be amazing.

@tlmquintino as we were recently touching this topic

Describe alternatives you've considered

No response

Additional context

No response

Organisation

DMI

@observingClouds observingClouds added the enhancement New feature or request label Nov 22, 2024
@tlmquintino
Member

Because it is not always necessary to have e.g. plotting capabilities and install all the additional earthkit dependencies, it would be great to extract these backends into separate packages

Regarding this point, that is the reason why earthkit is broken into components. Loading and converting data is the responsibility of earthkit-data. You should not need to install the plotting components if you don't want plotting. We are trying to minimise the dependencies.

A lot of the earthkit-data dependencies are related to its data sources and its data conversions.

@tlmquintino
Member

tlmquintino commented Nov 22, 2024

The backends developed within this package seem useful also outside of earthkit.

Regarding this point, please note that most of the backends for loading from data sources already exist as lower-level packages, which bring their own clients and protocols (FDB, MARS, CDS, etc.).

@tlmquintino
Member

Irrespective of the two comments above, a backend based on fsspec that can co-exist with the current and other future backends is very much welcome.

@tlmquintino
Member

I would suggest that starting with an implementation of the ECFS backend could be a good first step?

@observingClouds
Author

observingClouds commented Nov 22, 2024

I would suggest that starting with an implementation of the ECFS backend could be a good first step?

I have already developed that at https://github.com/observingClouds/ecmwfspec. It would be great to see more drivers like this for FDB and MARS 😉

@observingClouds
Author

Regarding this point, please note that most of the backends for loading from data sources already exist as lower-level packages, which bring their own clients and protocols (FDB, MARS, CDS, etc.).

I guess what I am saying is that it would be great to implement an fsspec entrypoint in those lower level packages (or earthkit-data if the lower packages miss some fundamental functionality).

In case of FDB for example, the entry point could be implemented based on https://github.com/ecmwf/earthkit-data/blob/develop/src/earthkit/data/sources/fdb.py or https://github.com/ecmwf/pyfdb such that the following works:

import xarray as xr
ds = xr.open_dataset("fdb://domain=g&stream=oper&levtype=pl&levelist=300&date=20191110&time=0000&step=0&param=138&class=rd&type=an&expver=xxxx")
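As a rough sketch of what such an entry point would have to do first, the URI above can be split into a key/value request using only the standard library. The helper name `fdb_uri_to_request` is hypothetical and not part of pyfdb or earthkit-data; how the resulting dictionary would then be handed to the FDB client is left open here.

```python
from urllib.parse import parse_qsl


def fdb_uri_to_request(uri: str) -> dict[str, str]:
    """Split an 'fdb://key=value&...' URI into a request dictionary.

    Hypothetical helper for illustration only; the URI layout is the one
    proposed in this issue, not an established standard.
    """
    prefix = "fdb://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an fdb:// URI: {uri!r}")
    # strict_parsing=True raises ValueError on malformed 'key=value' pairs.
    return dict(parse_qsl(uri[len(prefix):], strict_parsing=True))


request = fdb_uri_to_request(
    "fdb://domain=g&stream=oper&levtype=pl&levelist=300&date=20191110"
    "&time=0000&step=0&param=138&class=rd&type=an&expver=xxxx"
)
print(request)
```

All values stay strings in this sketch; any type conversion or validation of keys would be up to the driver.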

@tlmquintino
Member

Regarding this point, please note that most of the backends for loading from data sources already exist as lower-level packages, which bring their own clients and protocols (FDB, MARS, CDS, etc.).

I guess what I am saying is that it would be great to implement an fsspec entrypoint in those lower level packages (or earthkit-data if the lower packages miss some fundamental functionality).

In case of FDB for example, the entry point could be implemented based on https://github.com/ecmwf/earthkit-data/blob/develop/src/earthkit/data/sources/fdb.py or https://github.com/ecmwf/pyfdb such that the following works:

import xarray as xr
ds = xr.open_dataset("fdb://domain=g&stream=oper&levtype=pl&levelist=300&date=20191110&time=0000&step=0&param=138&class=rd&type=an&expver=xxxx")

This is already possible with earthkit-data by reading from the FDB and then calling the to_xarray() method, so I guess it doesn't make sense to implement it there, since earthkit-data's concepts are not those of a filesystem.

If you want to avoid using earthkit-data and go via fsspec, then I suppose the place to implement it would be at the lower level, likely in pyfdb. Mapping the FDB to a filesystem path-like structure will be an interesting challenge.
Feel welcome to open an issue and provide a PR there.
