Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

esmvalcore.local._path2facets error when drs path contains elements which are not keys #2502

Open
k-a-webb opened this issue Aug 2, 2024 · 4 comments

Comments

@k-a-webb
Copy link

k-a-webb commented Aug 2, 2024

When drs is a path with both configurable (e.g., {user}) and not configurable (e.g., nc_output) keys,  _path2facets indexing of keys and paths fails.

Example:

drs = "{user}/{subdir}/{runid}/data/nc_output/CMIP6/CCCma/CCCma/{dataset}-{runid}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}"
path = Path( "/space/hall5/sitestore/eccc/crd/ccrn/users/rvs001/canesm_runs/sv-canam-001/data/nc_output/CMIP6/CCCma/CCCma/CanESM5-1-sv-canam-001/amip/r1i1p1f1/Amon/tas/gn/v20190429/tas_Amon_CanESM5-1-sv-canam-001_amip_r1i1p1f1_gn_200301-200812.nc" )

_path2facets( path, drs ) 

Gives the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[40], [line 24](vscode-notebook-cell:?execution_count=40&line=24)
     [21](vscode-notebook-cell:?execution_count=40&line=21)                 facet1, facet2 = key.split("}-{")
     [22](vscode-notebook-cell:?execution_count=40&line=22)                 facets[facet2] = values[idx].replace(f'{facets[facet1]}-', '')
---> [24](vscode-notebook-cell:?execution_count=40&line=24) _path2facets( path, drs )                

Cell In[40], [line 22](vscode-notebook-cell:?execution_count=40&line=22)
     [20](vscode-notebook-cell:?execution_count=40&line=20) if key not in facets:
     [21](vscode-notebook-cell:?execution_count=40&line=21)     facet1, facet2 = key.split("}-{")
---> [22](vscode-notebook-cell:?execution_count=40&line=22)     facets[facet2] = values[idx].replace(f'{facets[facet1]}-', '')

KeyError: 'dataset'

It incorrectly pairs keys to facets:

> facets {'user': 'CMIP6', 'subdir': 'CCCma', 'runid': 'CCCma', 'exp': 'amip', 'ensemble': 'r1i1p1f1', 'mip': 'Amon', 'short_name': 'tas', 'grid': 'gn', 'version': 'v20190429'}

and then fails when the indexing does not match.

A solution is to record the index of path.split:


def _path2facets(path: Path, drs: str) -> dict[str, str]:
    """Extract facets from a path using a DRS like '{facet1}/{facet2}'."""

    keys = []
    for key in re.findall(r'{(.*?)}[^-]', f'{drs} '):
        key = key.split('.')[0]  # Remove trailing .lower and .upper
        keys.append(key)

    key_indices = {}
    for key in keys:
        for i,part in enumerate( drs.split('/')[::-1] ): # iterate backwards
            if '{{{}}}'.format(key) == part:
                key_indices[key] = -(i+1)

    dirpath = path.parents[0] # get directory path

    facets = {}
    for key in key_indices:
        index = key_indices[key]
        if '{' not in dirpath.parts[index]: 
            facets[key] = dirpath.parts[index] # while multiple instances of key could appear, only need it once
     
    if len(facets) != len(keys):
        # Extract hyphen separated facet: {facet1}-{facet2},
        # where facet1 is already known.
        for idx, key in enumerate(keys):
            if key not in facets:
                facet1, facet2 = key.split("}-{")
                facets[facet2] = values[idx].replace(f'{facets[facet1]}-', '')

    return facets

 

@k-a-webb
Copy link
Author

k-a-webb commented Aug 2, 2024

Noting the my proposed solution fails when keys are intermixed with text... e.g., Tier{tier}:

This works though:

def _path2facets(path: Path, drs: str) -> dict[str, str]:
    """Extract facets from a path using a DRS like '{facet1}/{facet2}'."""

    keys = []
    for key in re.findall(r'{(.*?)}[^-]', f'{drs} '):
        key = key.split('.')[0]  # Remove trailing .lower and .upper
        keys.append(key)

    key_indices = {}
    for key in keys:
        for i,part in enumerate( drs.split('/')[::-1] ): 
            if "}-{" in part and "}-{" not in key:  continue
            if '{{{}}}'.format(key) in part:
                key_indices[key] = -(i+1)

    # check that all keys found
    assert all([ key in key_indices for key in keys ]), "Error: Missing keys"

    dirpath = path.parents[0]

    facets = {}
    for key in key_indices:
        index = key_indices[key]
        if '{' not in dirpath.parts[index]: # deal with multi-key parts below
            facets[key] = dirpath.parts[index] 
     
    if len(facets) != len(keys):
        # Extract hyphen separated facet: {facet1}-{facet2},
        # where facet1 is already known.
        for idx, key in enumerate(keys):
            if key not in facets:
                facet1, facet2 = key.split("}-{")
                facets[facet2] = values[idx].replace(f'{facets[facet1]}-', '')

    return facets

@valeriupredoi
Copy link
Contributor

hi @k-a-webb and many thanks for raising this! Correct me if I'm wrong, but you are trying to map

{user}/{subdir}/{runid}/data

to a path like

/space/hall5/sitestore/eccc/crd/ccrn/users/rvs001/canesm_runs/sv-canam-001/data

but that'll never work since {user} can not be /space/hall5/sitestore/eccc/crd/ccrn/users/rvs001 - the number of placeholders needs to match the number of elements delimited by the path delimiter /

@k-a-webb
Copy link
Author

k-a-webb commented Aug 6, 2024

Hi!

the number of placeholders needs to match the number of elements delimited by the path delimiter /

I can see how this is necessary for _path2facets after investigating the code, but I did not find instructions in the documentation that it was a requirement when setting up config-developer.yml -- although I might missed it!

The code I suggested does not require every element in the path to be a placeholder. This is helpful when there is a set organization to your data (which in this case includes many fixed subdirectory names), but you want to limit the number of parameters required to define a dataset.

@bouweandela
Copy link
Member

@k-a-webb You're welcome to make a pull request to improve this, it looks like you already have some good ideas.

You could probably set rootpath to /space/hall5/sitestore/eccc/crd/ccrn/users to avoid the issue mentioned in #2502 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants