
Add option to load different EUREC4A intake catalogs #1

Open
observingClouds opened this issue Mar 27, 2022 · 1 comment

@observingClouds

In some cases it would be helpful to be able to load a particular intake catalog of EUREC4A data. get_intake_catalog currently allows selecting different catalogs by their CID when they are served via IPFS. I'd like to propose a similar option for the file-system-referenced catalogs. Currently only the catalog in the master branch of the eurec4a repository can be loaded.

Possibly this could be written as a new function, e.g. open_intake_catalog, or integrated as an argument to get_intake_catalog. This would make it possible to switch to other EUREC4A intake catalogs (different fork/branch/filesystem) which might be under development or contain e.g. references to a local HPC file system.

open_intake_catalog could be as simple as:

import intake

def open_intake_catalog(catalog):
    return intake.open_catalog(catalog)

get_intake_catalog could be rewritten to

def get_intake_catalog(use_ipfs=False):
    """
    Open the intake data catalog.
    The catalog provides access to public EUREC4A datasets without the need to
    manually specify URLs to the individual datasets.
    """
    if use_ipfs:
        if isinstance(use_ipfs, str):
            cid = use_ipfs
        else:
            cid = get_cids()['intake']['latest']
        return open_intake_catalog(f"ipfs://{cid}/catalog.yml")
    else:
        return open_intake_catalog("https://raw.githubusercontent.com/eurec4a/eurec4a-intake/master/catalog.yml")

to reduce redundancy.

Of course it would also be an option to just load different catalogs directly via intake without using this package in those cases.
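
As a rough sketch of the fork/branch selection mentioned above, the catalog location could be built by a small helper and then passed to intake.open_catalog. The function name catalog_url and its owner, repo, branch and cid parameters are hypothetical and not part of the eurec4a package; the URL pattern is just the master-branch URL from above with the variable parts filled in.

```python
def catalog_url(owner="eurec4a", repo="eurec4a-intake", branch="master", cid=None):
    """Build the location of a catalog.yml (hypothetical helper, not in eurec4a).

    If a CID is given, point at IPFS; otherwise point at the raw file
    of the given fork (owner/repo) and branch on GitHub.
    """
    if cid is not None:
        return f"ipfs://{cid}/catalog.yml"
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/catalog.yml"


# Example: location of the catalog on a development branch (branch name is made up).
print(catalog_url(branch="my-feature"))
# The result could then be opened with intake.open_catalog(catalog_url(...)).
```

With the defaults, this returns the same master-branch URL that get_intake_catalog uses today, so the existing behaviour would stay unchanged.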

@d70-t
Contributor

d70-t commented Mar 28, 2022

I have to admit that I'm a bit reserved about this proposal. Maybe others should chime in and add more opinions. Here's a bit about how we arrived at the current state:

The initial idea of get_intake_catalog was to have no arguments at all (it should just return the "best available catalog"). It also started out as kind of a work-around to have the hard-coded URL to GitHub in a central place, with the option to update it if needed.

This got a little washed out by adding the option use_ipfs which initially was only False or True, but that's maybe still reasonable to do. Even in this case, the function does a bit of (non-trivial) work, namely to fetch the latest CID from github.

The latest update (the possibility to give an actual CID) is arguably too much: we could and maybe should just advise people to do intake.open_catalog(f"ipfs://{cid}/catalog.yml") themselves 🤷‍♂️. In particular, as it's possible to put in arbitrary CIDs, it's now possible to open non-eurec4a intake catalogs with eurec4a.get_intake_catalog, which probably should then be called intake.open_catalog instead... However, if you think of a CID as a version instead of as a path, it might still be somewhat reasonable 🤔 .

Anyway, my main concern is that I don't really see an advantage of using eurec4a.open_intake_catalog instead of intake.open_catalog, whereas I do see an advantage of using eurec4a.get_intake_catalog(True|False). CIDs are somewhere in between.

I'm also a bit concerned about referencing data which is local to an HPC system in something which is somehow labeled a "eurec4a" intake catalog, as this obviously goes against the purpose of having a globally accessible catalog.

To move this forward: what could potential future implementations of eurec4a.open_intake_catalog look like that would be a reason to establish this method now (instead of recommending the use of intake.open_catalog)?
