Automate Geoglam & NO2 dataset ingestion #155

Closed
5 tasks done
smohiudd opened this issue Jul 22, 2024 · 8 comments

@smohiudd
Contributor

smohiudd commented Jul 22, 2024

Description

The NO2 (#89) and Geoglam (#167, #173) datasets require monthly ingestion as new assets are created. This is currently a manual process but should be automated. veda-data-airflow has a feature that allows scheduled ingestion by creating dataset-specific DAGs. The files must still be transferred to the collection S3 bucket, and a JSON file must be uploaded to the Airflow event bucket. Here is an example JSON:

{
    "collection": "emit-ch4plume-v1",
    "bucket": "lp-prod-protected",
    "prefix": "EMITL2BCH4PLM.001/",
    "filename_regex": ".*.tif$",
    "schedule": "00 05 * * *",
    "assets": {
        "ch4-plume-emissions": {
            "title": "EMIT Methane Point Source Plume Complexes",
            "description": "Methane plume complexes from point source emitters.",
            "regex": ".*.tif$"
        }
    }
}
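
As a rough sanity check before uploading a config like the one above, something along these lines could be run locally. This is only a sketch: the list of required keys is an assumption inferred from the example, and `validate_scheduled_config` is a hypothetical helper, not part of veda-data-airflow.

```python
import json
import re

# Assumption: these keys are inferred from the example config above; the
# actual DAG may require more or fewer fields.
REQUIRED_KEYS = ("collection", "bucket", "prefix", "filename_regex", "schedule", "assets")


def validate_scheduled_config(config: dict) -> list:
    """Return a list of problems found; an empty list means the basic checks passed."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append(f"missing key: {key}")

    # Every regex in the config should at least compile.
    patterns = [config.get("filename_regex")]
    patterns += [asset.get("regex") for asset in config.get("assets", {}).values()]
    for pattern in patterns:
        if pattern is None:
            continue
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append(f"invalid regex {pattern!r}: {exc}")

    # A standard cron expression has five whitespace-separated fields.
    schedule = config.get("schedule", "")
    if len(schedule.split()) != 5:
        problems.append(f"schedule is not a five-field cron expression: {schedule!r}")
    return problems


example = json.loads("""
{
    "collection": "emit-ch4plume-v1",
    "bucket": "lp-prod-protected",
    "prefix": "EMITL2BCH4PLM.001/",
    "filename_regex": ".*.tif$",
    "schedule": "00 05 * * *",
    "assets": {
        "ch4-plume-emissions": {
            "title": "EMIT Methane Point Source Plume Complexes",
            "description": "Methane plume complexes from point source emitters.",
            "regex": ".*.tif$"
        }
    }
}
""")

print(validate_scheduled_config(example))  # prints []
```

Note that the check only confirms the regexes compile. The `.` before `tif` in the example is unescaped, so it matches any character, not just a literal dot.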

Acceptance Criteria

  • Transfer no2-monthly and no2-monthly-diff from the s3://covid-eo-data bucket to s3://veda-data-store-staging and s3://veda-data-store using the MWAA transfer DAG.
  • Scheduled ingestion (bi-weekly) JSON files are created and uploaded for NO2 (no2-monthly, no2-monthly-diff) to the MWAA event bucket for staging (UAH) and production (MCP).
  • Scheduled ingestion (bi-weekly) JSON files are created and uploaded for Geoglam (geoglam) to the MWAA event bucket for staging (UAH) and production (MCP).
  • Validate that new ingestions are initiated in staging and production. Add the new Geoglam files from GEOGLAM August 2024 #167 and GEOGLAM September & October 2024 #173 to the veda-data-store-staging bucket.
  • Transfer and automated ingestion JSON configs are in veda-data.
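
To make the transfer step in the first criterion concrete, here is a minimal sketch of how source keys might map to destination keys. It is an illustration only: the `plan_transfer` helper, the sample filenames, the `<collection>/<filename>` destination layout, and the assumption that the `OMNO2d_HRM/` prefix corresponds to no2-monthly are all hypothetical, and the MWAA transfer DAG may lay keys out differently.

```python
import re
from posixpath import basename


def plan_transfer(source_keys, filename_regex, collection):
    """Map source object keys to destination keys, skipping non-matching files."""
    pattern = re.compile(filename_regex)
    plan = {}
    for key in source_keys:
        name = basename(key)
        if pattern.search(name):
            # Assumed destination layout: <collection>/<filename>
            plan[key] = f"{collection}/{name}"
    return plan


# Hypothetical listing of s3://covid-eo-data/OMNO2d_HRM/
source_keys = [
    "OMNO2d_HRM/no2_202407.tif",
    "OMNO2d_HRM/README.txt",
]

plan = plan_transfer(source_keys, r".*\.tif$", "no2-monthly")
for src, dst in plan.items():
    print(f"s3://covid-eo-data/{src} -> s3://veda-data-store-staging/{dst}")
```

Only the `.tif` file ends up in the plan; the README is skipped because it does not match the filename regex.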
@slesaad
Member

slesaad commented Jul 24, 2024

Putting the discovery-items config in s3://<EVENT_BUCKET>/collections/, in the format shown at https://github.com/US-GHG-Center/ghgc-data/blob/add/lpdaac-dataset-scheduled-config/ingestion-data/discovery-items/scheduled/emit-ch4plume-v1-items.json, will trigger discovery and subsequent ingestion of the collection items based on the schedule attribute.
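
For reference, uploading such a config could look roughly like the sketch below. The `upload_scheduled_config` helper is hypothetical, and the `<collection>-items.json` object name is an assumption modeled on the linked example file. The only real API used is boto3's S3 `put_object`.

```python
import json


def upload_scheduled_config(s3_client, event_bucket, config):
    """Write a scheduled discovery config under collections/ in the event bucket.

    Returns the object key that was written.
    """
    # Assumption: the object name mirrors the linked example,
    # emit-ch4plume-v1-items.json.
    key = f"collections/{config['collection']}-items.json"
    s3_client.put_object(
        Bucket=event_bucket,
        Key=key,
        Body=json.dumps(config, indent=4).encode("utf-8"),
        ContentType="application/json",
    )
    return key


# Usage (requires boto3 and AWS credentials; the bucket name is a placeholder):
# import boto3
# upload_scheduled_config(boto3.client("s3"), "<EVENT_BUCKET>", config)
```

Passing the client in as an argument keeps the helper easy to exercise against a stub before pointing it at the staging or production event bucket.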

@smohiudd
Contributor Author

smohiudd commented Aug 1, 2024

mcp-prod will need a new release of airflow to include automated ingestion

@smohiudd
Contributor Author

smohiudd commented Nov 6, 2024

aws s3 ls s3://covid-eo-data/OMNO2d_HRMDifference/

Image

aws s3 ls s3://covid-eo-data/OMNO2d_HRM/

Image

@sandrahoang686
Contributor

Update: We have decided to run these weekly instead of bi-weekly

@anayeaye
Contributor

I added the scheduled collection configs from veda-data #177 to mcp-test and mcp-production
Image

@anayeaye
Contributor

anayeaye commented Dec 4, 2024

It looks like the uah-staging DAG has run for geoglam and discovered no files (expected). The DAGs for the NO2 collections are visible but have not run (I need to revisit the configs to see whether this is also expected). In mcp-prod, the DAGs are present but have not yet run.

Question: Do we expect the scheduled ingest setup for SM2A to be the same as it was for MWAA?
2 files

Image
Image

@anayeaye
Contributor

anayeaye commented Dec 9, 2024

UPDATE: Let's keep this open until I get a chance to add the config to the SM2A /collections bucket because we will be deprecating MWAA

@anayeaye anayeaye mentioned this issue Dec 12, 2024
1 task
@anayeaye
Contributor

Configs now in SM2A. We may later update the scheduled job regex to be more restrictive to address a recurring filename pattern change for the geoglam collection #213
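
As an illustration of what a more restrictive regex could buy us, here is a small comparison. The filenames and the `CropMonitor_YYYY_MM_DD.tif` pattern are hypothetical stand-ins, not the confirmed geoglam naming, and the sketch assumes the pattern is applied with `re.search`.

```python
import re

# Permissive pattern in the style of the example config; the '.' before
# 'tif' is unescaped, so it matches any character.
PERMISSIVE = r".*.tif$"

# Hypothetical stricter pattern: anchored at both ends, escaped dot, and an
# explicit date layout.
RESTRICTIVE = r"^CropMonitor_\d{4}_\d{2}_\d{2}\.tif$"


def matching(pattern, names):
    """Return the names that the pattern matches anywhere (re.search)."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.search(name)]


filenames = [
    "CropMonitor_2024_08_28.tif",  # hypothetical expected pattern
    "CropMonitor_202409.tif",      # hypothetical changed pattern
    "notes_tif",                   # not a TIFF at all
]

print(matching(PERMISSIVE, filenames))   # all three names
print(matching(RESTRICTIVE, filenames))  # only the first name
```

With the stricter pattern, a file that arrives under a changed naming scheme is skipped by discovery instead of being ingested with a misparsed date.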

4 participants