Problem building dataset from NetCDF files #7

Open
jbanomedina opened this issue Jul 16, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@jbanomedina

What happened?

My goal is to build a dataset from NetCDF files using the anemoi-datasets library. However, I get an error when using NetCDF files as the source. I have tried both version 0.4.0 (installed using pip) and the develop branch (installed by cloning the repository). I was able to build a dataset from a GRIB file successfully; however, for my project the data is in NetCDF format.

What are the steps to reproduce the bug?

The following code reproduces the error.
1) First, I download a sample NetCDF file from the CDS using a Python script.

import cdsapi

# Request parameters: 10 m wind components for a single year.
variables = ['10m_u_component_of_wind', '10m_v_component_of_wind']
year = 2013

client = cdsapi.Client()
client.retrieve(
    'reanalysis-era5-single-levels',
    {
        'product_type': 'reanalysis',
        'format': 'netcdf',
        'variable': variables,
        'year': year,
        'month': ['01'],
        'day': ['01', '02'],
        'time': ['00:00', '06:00', '12:00', '18:00'],
    },
    './sample.nc',  # output file referenced by the recipe
)

2) Second, I point to this sample in the recipe.yaml file.

dates:
  start: 2013-01-01T00:00:00
  end: 2013-01-01T06:00:00
  frequency: 6h
input:
  netcdf:
    path: ./sample.nc
    param: [u10, v10]  # I also tried [10u, 10v]
    levtype: sfc
3) Third, I run this on the command line:
anemoi-datasets create recipe.yaml dataset.zarr
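
For reference, the `dates` block above asks for every datetime from `start` to `end` at the given `frequency`. A minimal sketch of that expansion in plain Python, assuming both endpoints are inclusive (consistent with "Found 2 datetimes" in the log below); `expand_dates` is a hypothetical helper, not part of anemoi-datasets:

```python
from datetime import datetime, timedelta


def expand_dates(start, end, frequency_hours):
    """List datetimes from start to end (both inclusive) at a fixed step in hours."""
    step = timedelta(hours=frequency_hours)
    dates = []
    current = start
    while current <= end:
        dates.append(current)
        current += step
    return dates


# Values taken from the recipe: start 00:00, end 06:00, frequency 6h.
dates = expand_dates(datetime(2013, 1, 1, 0), datetime(2013, 1, 1, 6), 6)
print([d.isoformat() for d in dates])
```

The sample file requested in step 1 covers 1 and 2 January at 6-hour steps, so both datetimes fall inside its time range.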

Version

v0.4.0

Platform (OS and architecture)

Linux exp-18-17 4.18.0-513.24.1.el8_9.x86_64 #1 SMP Thu Apr 4 18:13:02 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Relevant log output

Setting flatten_grid=True in config
Setting ensemble_dimension=2 in config
Setting flatten_grid=True in config
Setting ensemble_dimension=2 in config
2024-07-16 14:42:59 INFO {'start': datetime.datetime(2013, 1, 1, 0, 0, tzinfo=datetime.timezone.utc), 'end': datetime.datetime(2013, 1, 1, 6, 0, tzinfo=datetime.timezone.utc), 'frequency': '6h', 'group_by': 'monthly'}
2024-07-16 14:42:59 INFO <anemoi.datasets.dates.groups.Groups object at 0x155147fbcee0>
2024-07-16 14:42:59 INFO ✅ INPUT_BUILDER
2024-07-16 14:42:59 INFO FunctionAction: path=./sample.nc param=['u10', 'v10'] levtype=sfc 
2024-07-16 14:42:59 INFO FunctionAction: path=./sample.nc param=['u10', 'v10'] levtype=sfc 
2024-07-16 14:42:59 INFO Minimal input (using only the first date) :
2024-07-16 14:42:59 INFO netcdf(['2013-01-01T00:00:00'])
Config loaded ok:
2024-07-16 14:42:59 INFO {'config_path': '/expanse/nfs/cw3e/cwp167/projects/test-attribution/recipe.yaml', 'dates': {'start': datetime.datetime(2013, 1, 1, 0, 0, tzinfo=datetime.timezone.utc), 'end': datetime.datetime(2013, 1, 1, 6, 0, tzinfo=datetime.timezone.utc), 'frequency': '6h', 'group_by': 'monthly'}, 'input': {'netcdf': {'path': './sample.nc', 'param': ['u10', 'v10'], 'levtype': 'sfc'}}, 'dataset_status': 'experimental', 'description': 'No description provided.', 'licence': 'unknown', 'attribution': 'unknown', 'build': {'group_by': 'monthly', 'use_grib_paramid': False, 'variable_naming': 'default'}, 'output': {'order_by': {'valid_datetime': 'ascending', 'param_level': 'ascending', 'number': 'ascending'}, 'remapping': {'param_level': '{param}_{levelist}'}, 'statistics': 'param_level', 'chunking': {'dates': 1, 'ensembles': 1}, 'dtype': 'float32', 'flatten_grid': True, 'ensemble_dimension': 2}, 'statistics': {}, 'reading_chunks': None}
Found 2 datetimes.
2024-07-16 14:42:59 INFO Dates: Found 2 datetimes, in 1 groups: 
2024-07-16 14:42:59 INFO Missing dates: 0
Found 2 datetimes 2.
2024-07-16 14:43:00 INFO Note: detected 128 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2024-07-16 14:43:00 INFO Note: NumExpr detected 128 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-07-16 14:43:00 INFO NumExpr defaulting to 8 threads.
2024-07-16 14:43:00 ERROR Error in execute
Traceback (most recent call last):
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/input.py", line 433, in datasource
    return self.action.function(FunctionContext(self), self.dates, *args, **kwargs)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/functions/sources/netcdf.py", line 72, in execute
    return load_netcdfs("📁", "path", context, dates, path, *args, **kwargs)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/functions/sources/netcdf.py", line 66, in load_netcdfs
    check(what, ds, given_paths, valid_datetime=dates, **kwargs)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/functions/sources/netcdf.py", line 40, in check
    raise ValueError(f"Expected {count} fields, got {len(ds)} (kwargs={kwargs}, {what}s={paths})")
ValueError: Expected 2 fields, got 0 (kwargs={'valid_datetime': ['2013-01-01T00:00:00'], 'param': ['u10', 'v10'], 'levtype': 'sfc'}, paths=['./sample.nc'])
Traceback (most recent call last):
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/utils/cli.py", line 128, in cli_main
    cmd.run(args)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/commands/create.py", line 30, in run
    c.create()
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/__init__.py", line 153, in create
    self.init()
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/__init__.py", line 50, in init
    obj.initialise(check_name=check_name)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/loaders.py", line 271, in initialise
    variables = self.minimal_input.variables
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/functools.py", line 981, in __get__
    val = self.func(instance)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/input.py", line 227, in variables
    return self._coords.variables
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/functools.py", line 981, in __get__
    val = self.func(instance)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/input.py", line 190, in variables
    self._build_coords
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/functools.py", line 981, in __get__
    val = self.func(instance)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/input.py", line 143, in _build_coords
    from_data = self.owner.get_cube().user_coords
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/input.py", line 350, in get_cube
    ds = self.datasource
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/functools.py", line 981, in __get__
    val = self.func(instance)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/input.py", line 81, in wrapper
    result = method(self, *args, **kwargs)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/template.py", line 82, in wrapper
    result = method(self, *args, **kwargs)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/template.py", line 42, in wrapper
    result = method(self, *args, **kwargs)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/input.py", line 433, in datasource
    return self.action.function(FunctionContext(self), self.dates, *args, **kwargs)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/functions/sources/netcdf.py", line 72, in execute
    return load_netcdfs("📁", "path", context, dates, path, *args, **kwargs)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/functions/sources/netcdf.py", line 66, in load_netcdfs
    check(what, ds, given_paths, valid_datetime=dates, **kwargs)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/functions/sources/netcdf.py", line 40, in check
    raise ValueError(f"Expected {count} fields, got {len(ds)} (kwargs={kwargs}, {what}s={paths})")
ValueError: Expected 2 fields, got 0 (kwargs={'valid_datetime': ['2013-01-01T00:00:00'], 'param': ['u10', 'v10'], 'levtype': 'sfc'}, paths=['./sample.nc'])
2024-07-16 14:43:00 ERROR 
💣 Expected 2 fields, got 0 (kwargs={'valid_datetime': ['2013-01-01T00:00:00'], 'param': ['u10', 'v10'], 'levtype': 'sfc'}, paths=['./sample.nc'])
2024-07-16 14:43:00 ERROR 💣 Exiting
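
The counts in the error message can be read as follows: two parameters times one datetime (the "minimal input" uses only the first date) gives 2 expected fields, and none were selected from the file. A small sketch of that arithmetic, assuming the expected count is the product of the lengths of the list-valued selection keys; this is an illustration consistent with the message, not the library's actual code, and `expected_field_count` is a hypothetical name:

```python
from math import prod


def expected_field_count(**selection):
    """Multiply the lengths of all list- or tuple-valued selection keys."""
    return prod(len(v) for v in selection.values() if isinstance(v, (list, tuple)))


# Selection copied from the kwargs printed in the error message.
count = expected_field_count(
    valid_datetime=['2013-01-01T00:00:00'],
    param=['u10', 'v10'],
    levtype='sfc',
)
print(count)
```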

Accompanying data

No response

Organisation

No response

@jbanomedina jbanomedina added the bug Something isn't working label Jul 16, 2024
@b8raoult
Collaborator

b8raoult commented Oct 4, 2024

Sorry about the delay. We have done a lot of work on NetCDF. Can you try again with the latest version? Also, if you plan to use data from the CDS, I suggest downloading it in GRIB: that avoids an unnecessary conversion and will be faster.
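
A sketch of the suggestion above: the same CDS request as in the original report, with the output format switched to GRIB. It assumes the CDS accepts `'format': 'grib'` for this dataset in the same way it accepts `'netcdf'`; `build_request`, `download` and the target file name are hypothetical:

```python
def build_request(output_format='grib'):
    """Return the ERA5 request from the report with a configurable output format."""
    return {
        'product_type': 'reanalysis',
        'format': output_format,
        'variable': ['10m_u_component_of_wind', '10m_v_component_of_wind'],
        'year': 2013,
        'month': ['01'],
        'day': ['01', '02'],
        'time': ['00:00', '06:00', '12:00', '18:00'],
    }


def download(target='./sample.grib'):
    """Retrieve the data; needs the cdsapi package and valid CDS credentials."""
    import cdsapi  # imported here so the request can be built without cdsapi installed

    client = cdsapi.Client()
    client.retrieve('reanalysis-era5-single-levels', build_request(), target)


request = build_request()
print(request['format'])
```

Calling `download()` performs the actual retrieval; building the request separately makes it easy to inspect before submitting it.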

@jbanomedina
Author

Thank you very much for working on this, and for developing this amazing tool. I tried again with the latest version, and the previous problem is solved. I now get the error below with ERA5, but it does not seem critical: the resulting .zarr file looks fine, and I can open it in Python with the anemoi-datasets library. Could this error come from the fact that ERA5 is not a forecast and therefore does not contain the attribute "forecast_reference_time"?

anemoi-datasets create recipe-era5-test.yaml ${workdir}/data/era5/era5_${yearInit}-01-01.zarr
2024-10-14 13:56:02 INFO Task init((),{}) starting
2024-10-14 13:56:08 INFO Setting flatten_grid=True in config
2024-10-14 13:56:08 INFO Setting ensemble_dimension=2 in config
2024-10-14 13:56:08 INFO Setting flatten_grid=True in config
2024-10-14 13:56:08 INFO Setting ensemble_dimension=2 in config
2024-10-14 13:56:08 INFO {'start': datetime.datetime(2013, 1, 1, 0, 0), 'end': datetime.datetime(2013, 1, 1, 18, 0), 'frequency': '6h', 'group_by': 'monthly'}
2024-10-14 13:56:08 INFO Groups(dates=1)
2024-10-14 13:56:08 INFO FunctionAction: path=./era5_2013-01-01.nc param=['10u'] 
2024-10-14 13:56:11 INFO Minimal input for 'init' step (using only the first date) :
2024-10-14 13:56:11 INFO netcdf(['2013-01-01T00:00:00'])
2024-10-14 13:56:11 INFO Config loaded ok:
2024-10-14 13:56:11 INFO Found 4 datetimes.
2024-10-14 13:56:11 INFO Dates: Found 4 datetimes, in 1 groups: 
2024-10-14 13:56:11 INFO Missing dates: 0
2024-10-14 13:57:22 INFO Found 1 variables : 10u.
2024-10-14 13:57:22 INFO Found 1 ensembles : 0.
2024-10-14 13:57:22 INFO gridpoints size: [1038240, 1038240]
2024-10-14 13:57:22 INFO resolution=None
2024-10-14 13:57:22 INFO total_shape = [4, 1, 1, 1038240]
2024-10-14 13:57:22 INFO chunks=(1, 1, 1, 1038240)
2024-10-14 13:57:22 INFO Creating Dataset './era5_2013-01-01.zarr', with total_shape=[4, 1, 1, 1038240], chunks=(1, 1, 1, 1038240) and dtype='float32'
2024-10-14 13:57:22 ERROR Error in retrieving metadata (cannot build data request info) for XArrayMetadata({'variable': '10u', 'time': '0000', 'date': '20130101', 'step': 0, 'valid_datetime': '2013-01-01T00:00:00'})
Traceback (most recent call last):
  File "./envs/nwm-anemoi/lib/python3.12/site-packages/anemoi/datasets/create/input.py", line 111, in _data_request
    date = field.datetime()["valid_time"]
           ^^^^^^^^^^^^^^^^
  File "./envs/nwm-anemoi/lib/python3.12/site-packages/earthkit/data/core/fieldlist.py", line 512, in datetime
    return self._metadata.datetime()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./envs/nwm-anemoi/lib/python3.12/site-packages/earthkit/data/core/metadata.py", line 312, in datetime
    "base_time": self._base_datetime(),
                 ^^^^^^^^^^^^^^^^^^^^^
  File "./envs/nwm-anemoi/lib/python3.12/site-packages/anemoi/datasets/create/functions/sources/xarray/metadata.py", line 84, in _base_datetime
    return self._field.forecast_reference_time
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./envs/nwm-anemoi/lib/python3.12/site-packages/anemoi/datasets/create/functions/sources/xarray/field.py", line 106, in forecast_reference_time
    return self.owner.forecast_reference_time
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Variable' object has no attribute 'forecast_reference_time'
2024-10-14 13:57:22 WARNING Dataset name error: the dataset name 'era5_2013-01-01' does not follow naming convention. Does not match ^(\w+)-([\w-]+)-(\w+)-(\w+)-(\d\d\d\d)-(\d\d\d\d)-(\d+h)-v(\d+)-?([a-zA-Z0-9-]+)?$
2024-10-14 13:57:24 INFO Number of years 0 < 10, leaving out 20%. end=np.datetime64('2013-01-01T12:00:00')
2024-10-14 13:57:24 INFO Will compute statistics from 2013-01-01T00:00:00 to 2013-01-01T12:00:00
2024-10-14 13:57:24 INFO Task load((),{}) starting
2024-10-14 13:57:24 INFO {'end': '2013-01-01T18:00:00', 'frequency': '6h', 'group_by': 'monthly', 'start': '2013-01-01T00:00:00'}
2024-10-14 13:57:24 INFO Groups(dates=1)
2024-10-14 13:57:24 INFO FunctionAction: param=['10u'] path=./era5_2013-01-01.nc 
Loading 3/4: 100%|████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  5.68it/s]
2024-10-14 13:57:28 INFO Name               : /data
Type               : zarr.core.Array
Data type          : float32
Shape              : (4, 1, 1, 1038240)
Chunk shape        : (1, 1, 1, 1038240)
Order              : C
Read-only          : True
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : zarr.storage.DirectoryStore
No. bytes          : 16611840 (15.8M)
No. bytes stored   : 13288828 (12.7M)
Storage ratio      : 1.3
Chunks initialized : 4/4

2024-10-14 13:57:28 INFO Task finalise((),{}) starting
2024-10-14 13:57:28 INFO Variables minimum maximum mean stdev has_nans
10u -21.56 22.45 -0.37 5.77 0.00
2024-10-14 13:57:28 INFO Wrote statistics in ./era5_2013-01-01.zarr
Computing size of ./era5_2013-01-01.zarr: 16it [00:00, 4772.02it/s]
2024-10-14 13:57:28 INFO Total size: 12.7 MiB
2024-10-14 13:57:28 INFO Total number of files: 62
2024-10-14 13:57:28 INFO Task patch((),{}) starting
2024-10-14 13:57:28 INFO ✅ Remove _create_yaml_config
2024-10-14 13:57:28 INFO Dataset changed by patch
2024-10-14 13:57:28 INFO Task init_additions((),{}) starting
2024-10-14 13:57:28 WARNING No delta found in kwargs, no addtions will be computed.
2024-10-14 13:57:28 INFO Task run_additions((),{}) starting
2024-10-14 13:57:28 WARNING No delta found in kwargs, no addtions will be computed.
2024-10-14 13:57:28 INFO Task finalise_additions((),{}) starting
2024-10-14 13:57:28 WARNING No delta found in kwargs, no addtions will be computed.
Computing size of ./era5_2013-01-01.zarr: 16it [00:00, 10111.32it/s]
2024-10-14 13:57:28 INFO Total size: 12.7 MiB
2024-10-14 13:57:28 INFO Total number of files: 62
2024-10-14 13:57:28 INFO Task cleanup((),{}) starting
2024-10-14 13:57:28 INFO Task verify((),{}) starting
2024-10-14 13:57:28 INFO Verifying dataset at ./era5_2013-01-01.zarr
2024-10-14 13:57:28 INFO ./era5_2013-01-01.zarr
2024-10-14 13:57:28 INFO Create completed in 1 minute 25 seconds
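
The naming-convention warning in the log above is separate from the `forecast_reference_time` error and can be checked in isolation. A minimal sketch using the pattern copied from the warning; `follows_naming_convention` is a hypothetical helper, and the second name is made up only to satisfy the pattern, not a real dataset:

```python
import re

# Pattern copied verbatim from the "Dataset name error" warning.
NAME_PATTERN = re.compile(
    r"^(\w+)-([\w-]+)-(\w+)-(\w+)-(\d\d\d\d)-(\d\d\d\d)-(\d+h)-v(\d+)-?([a-zA-Z0-9-]+)?$"
)


def follows_naming_convention(name):
    """Return True if the dataset name (without .zarr) matches the pattern."""
    return NAME_PATTERN.match(name) is not None


print(follows_naming_convention("era5_2013-01-01"))
print(follows_naming_convention("test-era5-an-sfc-2013-2013-6h-v1"))
```

The first call uses the name from the log and does not match; the second, made-up name does.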
