
Feature/ntl-metadata #84

Merged
merged 37 commits into main from feature/ntl-metadata on Nov 26, 2024

Conversation


@Gabe-Levin Gabe-Levin commented Nov 4, 2024

Add a script for linking new Items to the existing STAC collection, and perform the first NTL data upload.

What I Changed:

  • Created link_new_item.py to add a new Item and link it to the existing collection
  • Updated the ingest CLI command to include the name of the item to check, specifically in the verify_columns function within load_parquet_to_db (see the sketch after this list)
  • Added new rows for the 2013 NTL data to the Metadata Excel
  • Added a new column to the feature catalog of the Metadata Excel, to help accommodate multiple STAC files in test_stac_columns.py (and perhaps other tests later on)
  • Changed the "TABLE NAME" on line 10 of METADATA/main.py to create a new NTL table within the existing local database, without overwriting the existing Pop data
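
For reference, a minimal sketch of what a column check like verify_columns could look like. The names verify_columns and load_parquet_to_db come from this PR description, but the body below is illustrative only; it assumes pyarrow, pystac, and a STAC Item that declares its columns via the table extension's table:columns property.

# Illustrative sketch only, not the repo's implementation: compare a
# Parquet file's columns against the columns declared in a STAC Item.
import pyarrow.parquet as pq
import pystac

def verify_columns(parquet_file: str, stac_item_path: str) -> None:
    """Raise if the Parquet columns don't match the STAC Item's columns."""
    parquet_cols = set(pq.read_schema(parquet_file).names)
    item = pystac.Item.from_file(stac_item_path)
    stac_cols = {c["name"] for c in item.properties.get("table:columns", [])}
    missing = stac_cols - parquet_cols
    extra = parquet_cols - stac_cols
    if missing or extra:
        raise ValueError(f"Column mismatch: missing={missing}, extra={extra}")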

How the new Item was created:

  • Download the parquet file from the following link: "s3://wbg-geography01/Space2Stats/parquet/GLOBAL/NTL_VIIRS_LEN_2013_combined.parquet"
  • In link_new_item.py, set the "Paths and metadata setup" section in the main function to point to the corresponding locally saved parquet file
  • Navigate to the METADATA sub-directory and run the following commands in order:
    1. get_types.py
    2. link_new_item.py
  • Once the new Item is created and linked, ingest the data using the updated "load" CLI command.

Example:

poetry run space2stats-ingest load \
   "postgresql://username:password@localhost:5439/postgis" \
   "space2stats_ingest/METADATA/stac/catalog.json" \
   "space2stats_ntl_2013.json" \
   --parquet-file "ntl2013.parquet"

@Gabe-Levin Gabe-Levin force-pushed the feature/ntl-metadata branch from ccbe8cb to c86f4ac on November 4, 2024 15:22
@Gabe-Levin Gabe-Levin commented Nov 7, 2024

I am running into an issue with the test_ingest.py test, "test_load_parquet_to_db".

When the test adds the two rows of dummy data to the database

data = {
    "hex_id": ["hex_1", "hex_2"],
    "sum_pop_f_10_2020": [100, 200],
    "sum_pop_m_10_2020": [150, 250],
}

the hex_id values are being converted by the database to the following:

 Rows in space2stats:
('862a1070fffffff', 100, 200)
('862a10767ffffff', 150, 250)
('862a1073fffffff', 120, 220)
('867a74817ffffff', 125, 225)
('867a74807ffffff', 125, 225)

I can't seem to find where this conversion is taking place, nor can I explain why 2 rows were inserted but 5 are visible when querying the entire database.

Here is the snippet I used to display the contents of the database from within the test:

with psycopg.connect(connection_string) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM space2stats;")
        rows = cur.fetchall()
        print("Rows in space2stats:", rows)

For context:
I currently have a table (space2stats) in my local database which has the hex_ids converted in the same way. This makes me think that PostGIS is conducting this conversion on the database side, but I'm not certain.
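
(If it helps narrow this down: one way to check for a database-side rewrite is to list the triggers defined on the table. A hypothetical diagnostic sketch follows; the catalog query is standard PostgreSQL, but it is not something from this repo, and the connection string is a placeholder.)

# Hypothetical diagnostic: list non-internal triggers on space2stats to see
# whether something on the database side rewrites hex_id on insert.
import psycopg

connection_string = "postgresql://username:password@localhost:5439/postgis"  # placeholder

with psycopg.connect(connection_string) as conn:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT tgname, pg_get_triggerdef(oid)
            FROM pg_trigger
            WHERE tgrelid = 'space2stats'::regclass AND NOT tgisinternal;
            """
        )
        print("Triggers on space2stats:", cur.fetchall())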

[screenshot: space2stats table in the local database showing converted hex_id values]

@Gabe-Levin Gabe-Levin temporarily deployed to Space2Stats API Dev November 8, 2024 21:03 — with GitHub Actions

github-actions bot commented Nov 8, 2024

PR Deployment Details:
🚀 PR deployed to https://qalsmdlfa4.execute-api.us-east-1.amazonaws.com/

@Gabe-Levin Gabe-Levin commented

Snippet from space2stats_api/src/README.md:

Adding New STAC Item Files

To add new STAC Items, follow these steps:

  1. Update Paths and Metadata:

    • In get_types.py, update the parquet_file variable in the main() function to point to your local Parquet file.
    • In link_new_item.py, set the variables in the section labeled Paths and Metadata Setup within the main() function.
  2. Update Metadata File:

    • Add a new entry in the Source tab of the METADATA/Space2Stats Metadata Content.xlsx file if it doesn’t already exist.
  3. Run Metadata Scripts:

    • Navigate to the METADATA sub-directory and execute the following commands in order:
      1. python get_types.py
      2. python link_new_item.py
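
For a sense of what the linking step does, here is a hedged sketch of creating an Item and attaching it to the existing collection with pystac. The item id, datetime, and file paths below are assumptions based on this PR, not the actual contents of link_new_item.py.

# Illustrative sketch, not the repo's script: create a STAC Item and link
# it into the existing Space2Stats collection, then re-save the catalog.
from datetime import datetime, timezone

import pystac

collection = pystac.Collection.from_file(
    "space2stats_ingest/METADATA/stac/space2stats-collection/collection.json"  # assumed path
)

item = pystac.Item(
    id="space2stats_ntl_2013",  # hypothetical id
    geometry=None,
    bbox=None,
    datetime=datetime(2013, 1, 1, tzinfo=timezone.utc),
    properties={},  # column metadata (e.g. table:columns) would go here
)

collection.add_item(item)
collection.normalize_hrefs("space2stats_ingest/METADATA/stac/space2stats-collection")
collection.save(catalog_type=pystac.CatalogType.SELF_CONTAINED)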

@Gabe-Levin Gabe-Levin changed the title from [WIP] Feature/ntl-metadata to Feature/ntl-metadata on Nov 8, 2024
@Gabe-Levin Gabe-Levin temporarily deployed to Space2Stats API Dev November 11, 2024 10:41 — with GitHub Actions
@Gabe-Levin Gabe-Levin temporarily deployed to Space2Stats API Dev November 12, 2024 13:47 — with GitHub Actions
@Gabe-Levin Gabe-Levin temporarily deployed to Space2Stats API Dev November 14, 2024 09:38 — with GitHub Actions
@Gabe-Levin Gabe-Levin temporarily deployed to Space2Stats API Dev November 15, 2024 14:35 — with GitHub Actions
@Gabe-Levin Gabe-Levin temporarily deployed to Space2Stats API Dev November 20, 2024 16:00 — with GitHub Actions
@zacdezgeo commented

Alright, we can now merge this branch! Do you need any review from our side here? We should merge main into this branch to get the update-table logic for ingestion. When ready, you should try ingesting into the development database. If things go smoothly, we can ingest into the prod database and merge. Let me know if I can help!

@Gabe-Levin Gabe-Levin temporarily deployed to Space2Stats API Dev November 22, 2024 11:24 — with GitHub Actions
@Gabe-Levin Gabe-Levin commented

The merge was successful, but I don't have the dev environment credentials, so I was unable to test the ingest.

@Gabe-Levin Gabe-Levin commented

While running poetry install I got the following error:

  Package https://temp-wheels-cdk.s3.us-east-1.amazonaws.com/h3ronpy-0.21.1-cp39-abi3-linux_x86_64.whl cannot be installed in the current environment {'implementation_name': 'cpython', 'implementation_version': '3.12.7', 'os_name': 'posix', 'platform_machine': 'arm64', 'platform_release': '23.4.0', 'platform_system': 'Darwin', 'platform_version': 'Darwin Kernel Version 23.4.0: Wed Feb 21 21:44:54 PST 2024; root:xnu-10063.101.15~2/RELEASE_ARM64_T6030', 'python_full_version': '3.12.7', 'platform_python_implementation': 'CPython', 'python_version': '3.12', 'sys_platform': 'darwin', 'version_info': [3, 12, 7, 'final', 0], 'interpreter_name': 'cp', 'interpreter_version': '3_12'}

  at /opt/homebrew/Caskroom/miniconda/base/envs/wb/lib/python3.12/site-packages/poetry/installation/executor.py:773 in _download_link
      769│             # Since we previously downloaded an archive, we now should have
      770│             # something cached that we can use here. The only case in which
      771│             # archive is None is if the original archive is not valid for the
      772│             # current environment.
    → 773│             raise RuntimeError(
      774│                 f"Package {link.url} cannot be installed in the current environment"
      775│                 f" {self._env.marker_env}"
      776│             )
      777│

Cannot install h3ronpy.

I think this might be related to your issue with using a pre-built wheel in S3, but I want to be sure.

And then, when I try to use the following command from the Space2Stats/space2stats_api/src directory:

poetry run space2stats-ingest load \
    "{connection string}" \
    "./space2stats_ingest/METADATA/stac/space2stats-collection/nighttime_lights_2013/nighttime_lights_2013.json" \
    "local_ntl13.parquet"

I get the following error:

(wb) ➜  src git:(feature/ntl-metadata) ✗ poetry run space2stats-ingest load \
    "postgresql://postgres:[email protected]:5432/postgres" \
    "./space2stats_ingest/METADATA/stac/space2stats-collection/nighttime_lights_2013/nighttime_lights_2013.json" \
    "local_ntl13.parquet"
Warning: 'space2stats-ingest' is an entry point defined in pyproject.toml, but it's not installed as a script. You may get improper `sys.argv[0]`.

The support to run uninstalled scripts will be removed in a future release.

Run `poetry install` to resolve and get rid of this message.

Usage: space2stats-ingest [OPTIONS] CONNECTION_STRING STAC_ITEM_PATH
                          PARQUET_FILE
Try 'space2stats-ingest --help' for help.
Error: Got unexpected extra argument (local_ntl13.parquet)

@zacdezgeo commented

@Gabe-Levin: it's a hassle right now because h3ronpy does not have a new release available, and we must use pre-built wheels. Hopefully, you can replace the h3ronpy URL in the pyproject.toml with this:

h3ronpy = { url = "https://temp-wheels-cdk.s3.us-east-1.amazonaws.com/h3ronpy-0.21.1-cp39-abi3-macosx_11_0_arm64.whl" }

It worked on my macOS. The Linux version in main works for the CD pipeline.

We will change this whenever possible, as noted in #93. Could you let me know if this helps solve your problem? If not, we'll need to build a wheel specifically for your environment.
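
(Side note on why the Linux wheel fails: pip and Poetry only install a wheel whose tags match the running interpreter. A quick way to see which tags your interpreter accepts, sketched with the packaging library; this diagnostic is an assumption on my part, not part of this repo.)

# Diagnostic sketch: print the platform/ABI tags this interpreter accepts.
# A wheel installs only if one of its tags matches, so an
# h3ronpy-...-linux_x86_64 wheel can never match a Darwin/arm64 interpreter.
from packaging.tags import sys_tags

for tag in list(sys_tags())[:5]:
    print(tag)  # e.g. cp312-cp312-macosx_14_0_arm64 on an M-series Mac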

@andresfchamorro andresfchamorro merged commit 7839577 into main Nov 26, 2024
3 checks passed
@andresfchamorro andresfchamorro deleted the feature/ntl-metadata branch November 26, 2024 18:47