Transform annual tmaxXF to COG and publish STAC metadata for a single NEX-GDDP model #90

Closed
3 of 5 tasks
anayeaye opened this issue Nov 3, 2023 · 7 comments
@anayeaye
Contributor

anayeaye commented Nov 3, 2023

What

The annual number of days with a maximum temperature greater than 90F has been selected as the pilot Climdex NEX-GDDP dataset for VEDA. This metric is one of the 5 thresholds included in the tmaxXF netCDFs. We have a version of this index for each of the 35 NEX-GDDP CMIP6 models, with multiple SSPs each. This pilot is to transform and ingest tmaxXF for a single model, not yet all 35.

Details

  • Raw data in protected veda-uah bucket: s3://cmip6-staging/climdex/tmaxXF/ACCESS-CM2/*.nc
  • Destination pattern: s3://veda-data-store-staging/climdex/tmaxXF/ACCESS-CM2/*.tif (if we do ingest all 35 models we will want this key structure to compare model usage and for browsability) EDIT see update
  • Pilot model: ACCESS-CM2
  • Pilot scenarios (based on shared socioeconomic pathways): SSP126, SSP245, SSP370, SSP585, (& historical)
  • Climdex.org

Transformation notes

  • explode the netCDF variables to a single COG per threshold. The 90F threshold is the priority for this Climdex; start with that and add the others, time permitting.
  • these data have already been aligned to -180 to 180
  • pixel reference has been corrected to pixel as area (the expected upper left reference cell description for a tif versus referring to the center of the grid cell)
  • these data do need to be flipped when transformed to COG, though
  • choose rasterio's COG deflate profile and use a predictor (check that output file size is properly compressed)
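The last note asks to check that the output file is properly compressed. One rough way to do that, sketched here with only the standard library, is to compare the file size on disk with the raw pixel payload. The function names, the 4-byte pixel size, and the 0.9 cut-off are assumptions for illustration, not values taken from the source data.

```python
import os


def compression_ratio(path, width, height, bands=1, bytes_per_pixel=4):
    """Return on-disk size divided by the raw, uncompressed pixel payload."""
    raw_bytes = width * height * bands * bytes_per_pixel
    return os.path.getsize(path) / raw_bytes


def looks_compressed(path, width, height, bands=1, bytes_per_pixel=4, max_ratio=0.9):
    """True when the file is clearly smaller than its raw pixel payload.

    max_ratio is an arbitrary illustrative threshold; COG overviews and
    headers add some bytes, so a ratio near or above 1.0 suggests the
    compression option was not applied.
    """
    return compression_ratio(path, width, height, bands, bytes_per_pixel) < max_ratio
```

Width, height, and pixel size would come from the raster's own metadata; they are passed in explicitly here so the check does not depend on any raster library.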

STAC notes

  • Refer to the CMIP6 STAC extension and implement it where straightforward. This does not need to be perfect or complete; we can learn from our choices before ingesting more Climdex indices and models.
  • Titles, variable names, and units are all described in the NetCDF
  • Consider 4 collections for this pilot. While this flat structure will not scale for nex-gddp it could make it easier for us to fast track Climdex for the dashboard and provide an experience similar to the old CMIP6 dashboard
  • Proposed 4 Collections
    • tmaxxf-access-cm2-ssp126
      • -ssp245, -ssp370, -ssp585
  • Proposed items will have one asset per each of the 5 thresholds
  • Consider padding each collection with duplicate item records for the 65 years of historical data for the model (this would make a time series of 1950 to 2100 possible)
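As a minimal sketch of the proposal above, the collection ids and the padded year range could be generated like this. The helper names are hypothetical, and the year boundaries (historical 1950 to 2014, projections 2015 to 2100) are inferred from the "65 years of historical data" and "1950 to 2100" figures in this issue rather than read from the netCDFs.

```python
MODEL = "ACCESS-CM2"
SCENARIOS = ["ssp126", "ssp245", "ssp370", "ssp585"]

# Assumed boundaries: 65 historical years followed by the projection years.
HISTORICAL_YEARS = range(1950, 2015)
PROJECTION_YEARS = range(2015, 2101)


def collection_id(index, model, scenario):
    """Build a lower-case collection id such as 'tmaxxf-access-cm2-ssp126'."""
    return f"{index}-{model}-{scenario}".lower()


def padded_years(include_historical=True):
    """Years covered by one collection, optionally padded with historical years."""
    years = list(PROJECTION_YEARS)
    if include_historical:
        years = list(HISTORICAL_YEARS) + years
    return years


collection_ids = [collection_id("tmaxXF", MODEL, s) for s in SCENARIOS]
```

With padding enabled, each of the four collections would hold one item per year from 1950 through 2100, with the historical items duplicated across collections.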

AC

  • netCDFs 'exploded' to single band yearly COGs (1 per threshold EDIT: starting with 90F threshold only)
  • stac metadata generated and ingested
  • collection definition(s) stored in veda-data
  • transformation code and metadata generation stored in veda-data or stactools (a branch on either project with a notebook is fine)
    - [ ] BONUS if this is cake, consider repeating for one more model for the pilot
  • Coordinate veda-config (usually data services end with STAC metadata but for this rush delivery we should ensure that the data make it to the dashboard so update the mdx or make sure there is clear information for a hand off)
@anayeaye
Contributor Author

anayeaye commented Nov 7, 2023

UPDATE: Streamlined Plan

4 collections

One collection for each SSP

  • climdex-tmaxxf-access-cm2-ssp126
  • climdex-tmaxxf-access-cm2-ssp245
  • climdex-tmaxxf-access-cm2-ssp370
  • climdex-tmaxxf-access-cm2-ssp585
  • historical was not requested so we will not use it for stage 1

Items within these collections:

  • each item will have a single asset named tmax_above_90 (later we may add other thresholds as assets but not for Nov 17)
  • example asset href: s3://veda-data-store-staging/climdex-tmaxxf-access-cm2-ssp126/tmaxXF-ACCESS-CM2-ssp126_tmax_above_90_<year>.tif (_compressed.nc is replaced with _tmax_above_90_<year>.tif)
  • 86 items will be created per collection, one for each year in the SSP
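The item layout above can be sketched as a small href generator. This follows the example href pattern from this comment (variable name before the year); the function names are hypothetical, and the 2015 to 2100 year range is an assumption inferred from the count of 86 items.

```python
BUCKET = "veda-data-store-staging"


def asset_href(collection, basename, variable, year):
    """Build the S3 href for one yearly COG, following the example pattern."""
    return f"s3://{BUCKET}/{collection}/{basename}_{variable}_{year}.tif"


def item_hrefs(scenario, variable="tmax_above_90", years=range(2015, 2101)):
    """Map each year to the href of its single asset for one SSP collection."""
    collection = f"climdex-tmaxxf-access-cm2-{scenario}"
    basename = f"tmaxXF-ACCESS-CM2-{scenario}"
    return {year: asset_href(collection, basename, variable, year) for year in years}


hrefs = item_hrefs("ssp126")
```

Each entry in `hrefs` corresponds to one STAC item with a single `tmax_above_90` asset.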

Ingest plan

  1. Publish transformed COGs to `veda-data-store-staging//
  2. Publish the 4 collections
  3. If we set things up this way we should be able to use airflow pipelines to generate items and insert metadata. Confirm that we can use start/end datetime as expected, along with any other common properties we need
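For the start/end datetime check in step 3, the properties of an annual item might look like the sketch below. The function name is hypothetical, and treating each item as spanning the full calendar year in UTC is an assumption; the STAC Item spec allows `datetime` to be null when `start_datetime` and `end_datetime` are both set.

```python
from datetime import datetime, timezone

STAC_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def annual_datetime_properties(year):
    """Return STAC datetime properties covering one full calendar year (UTC)."""
    start = datetime(year, 1, 1, tzinfo=timezone.utc)
    end = datetime(year, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
    return {
        # A null datetime is valid only because the range fields are provided.
        "datetime": None,
        "start_datetime": start.strftime(STAC_TIME_FORMAT),
        "end_datetime": end.strftime(STAC_TIME_FORMAT),
    }


props = annual_datetime_properties(2015)
```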

@SwordSaintLancelot SwordSaintLancelot self-assigned this Nov 8, 2023
@anayeaye
Contributor Author

anayeaye commented Nov 8, 2023

@SwordSaintLancelot I had a look at the first outputs in s3://climatedashboard-data/climdex/tmaxXF/ACCESS-CM2/ and they look good. I have a couple of requests for the files before we publish the objects in veda-data-store-staging.

Suggested changes

  1. Use DEFLATE instead of LZW compression (as in: da.rio.to_raster("<outname>.tif", driver="COG", compress=compress))
  2. Filename adjustment: instead of tmaxXF-ACCESS-CM2-ssp126_tmax_above_90_2015.tif, use tmaxXF-ACCESS-CM2-ssp126_2015_tmax_above_90.tif. That is, put the year before the netCDF variable name: <netcdf-basename>_<YYYY>_<VARIABLE_NAME>.tif. I think this will make it easier to generate multi-asset STAC items: for the 86 years in the source file with basename tmaxXF-ACCESS-CM2-ssp126_compressed.nc we will want to generate STAC items with ids 'tmaxXF-ACCESS-CM2-ssp126_<YYYY>'
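The filename adjustment in item 2 can be sketched as a small renaming helper. The regular expression and function name are hypothetical and assume every old filename ends in `_tmax_above_<N>_<YYYY>.tif`.

```python
import re

# Old pattern: <netcdf-basename>_<VARIABLE_NAME>_<YYYY>.tif
OLD_PATTERN = re.compile(
    r"^(?P<basename>.+?)_(?P<variable>tmax_above_\d+)_(?P<year>\d{4})\.tif$"
)


def reorder_filename(old_name):
    """Move the year in front of the variable name in a COG filename."""
    match = OLD_PATTERN.match(old_name)
    if match is None:
        raise ValueError(f"unexpected filename: {old_name}")
    return "{basename}_{year}_{variable}.tif".format(**match.groupdict())


new_name = reorder_filename("tmaxXF-ACCESS-CM2-ssp126_tmax_above_90_2015.tif")
```

With the year ahead of the variable name, everything before the variable is the item id, so all thresholds for one year share a common filename prefix.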

Object publication

After those adjustments I think we are good to publish the objects for the 4 collections to veda-data-store-staging as s3://veda-data-store-staging/<collection-id>/<filename.tif>. For this pilot work I think we should just use a simple collection-id/files path instead of copying the complex storage structure that was in the original request (for the sake of making airflow ingests easy--does that sound right @ividito?). As in:

s3://veda-data-store-staging/climdex-tmaxxf-access-cm2-ssp126/
     tmaxXF-ACCESS-CM2-ssp126_2015_tmax_above_90.tif
     tmaxXF-ACCESS-CM2-ssp126_2016_tmax_above_90.tif

Sample nc2cog transformation code

import rioxarray  # noqa: F401 (importing registers the .rio accessor on xarray objects)
import s3fs
import xarray as xr

# Open NetCDF with s3fs and read to xarray using h5netcdf engine
fs = s3fs.S3FileSystem()

VARIABLE_NAME = "tmax_above_90"
aws_url = "s3://cmip6-staging/climdex/tmaxXF/ACCESS-CM2/tmaxXF-ACCESS-CM2-ssp126_compressed.nc"

fileObj = fs.open(aws_url)
ds = xr.open_dataset(fileObj, engine="h5netcdf")
# Select the first time step (one year) of the chosen threshold variable
da = ds[VARIABLE_NAME].isel(time=0)

# Add crs and set spatial dims if needed
if not da.rio.crs:
    da.rio.write_crs("epsg:4326", inplace=True)

# Flip and set spatial dimensions
da = da.reindex(lat=list(reversed(da.lat)))
da.rio.set_spatial_dims("lon", "lat")

# Cloud optimize and generate raster
driver = "COG"
compress = "DEFLATE"
da.rio.to_raster("test_compressed.tif", driver=driver, compress=compress)

@slesaad
Member

slesaad commented Nov 10, 2023

The four collections have been published to staging stac catalog.

Each item has 5 assets: above 86, above 90, above 100, above 110, and above 115.

  • tmax_above_86
  • tmax_above_90
  • tmax_above_100
  • tmax_above_110
  • tmax_above_115
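To illustrate how five COGs per year might be grouped into one multi-asset item, here is a minimal sketch. It assumes the reordered filename pattern <netcdf-basename>_<YYYY>_<VARIABLE_NAME>.tif suggested earlier in this thread was adopted; the regular expression and function name are hypothetical.

```python
import re
from collections import defaultdict

# Assumed pattern: <netcdf-basename>_<YYYY>_<VARIABLE_NAME>.tif
NEW_PATTERN = re.compile(r"^(?P<item_id>.+_\d{4})_(?P<asset>tmax_above_\d+)\.tif$")


def group_assets(hrefs):
    """Group COG hrefs into {item_id: {asset_name: href}}."""
    items = defaultdict(dict)
    for href in hrefs:
        filename = href.rsplit("/", 1)[-1]
        match = NEW_PATTERN.match(filename)
        if match is None:
            continue  # skip objects that do not follow the assumed pattern
        items[match["item_id"]][match["asset"]] = href
    return dict(items)


prefix = "s3://veda-data-store-staging/climdex-tmaxxf-access-cm2-ssp126"
example_hrefs = [
    f"{prefix}/tmaxXF-ACCESS-CM2-ssp126_2015_tmax_above_{t}.tif"
    for t in (86, 90, 100, 110, 115)
]
items = group_assets(example_hrefs)
```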

@anayeaye
Contributor Author

Config notes (wip)

@anayeaye
Contributor Author

anayeaye commented Nov 15, 2023

@j08lue
Contributor

j08lue commented Nov 29, 2023

This is complete, right? 🎉

@slesaad
Member

slesaad commented Jan 4, 2024

PR for the collection configs - #97
Should now be complete!

@slesaad slesaad closed this as completed Jan 4, 2024