Commit
Make bytes->Socrata distribution a little more friendly (#1337)
* Fix oti_data_dictionary bug.

This was preventing generation of XLSX OTI data dictionaries,
since we have `type: oti_data_dictionary` coded into all our templates.

* Enable distribution by dataset name, type, ID, and tag

Formerly we distributed by tag only. This adds the ability to also
specify a destination ID, a destination type, and a dataset name.

* README updates

* Align GHA name and emoji

* post: use typer list for datasets filter

And fix the docs

* post: delete trailing whitespace in readme

---------

Co-authored-by: Alex Richey <[email protected]>
alexrichey and Alex Richey authored Dec 31, 2024
1 parent 13366ee commit 89f8722
Showing 9 changed files with 159 additions and 48 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/socrata_publish_dataset.yml
@@ -1,5 +1,5 @@
-name: 🚀 Socrata - Publish a Dataset
-run-name: "🚀 Socrata - Publish a Dataset: ${{ inputs.PRODUCT_NAME }}-${{ inputs.DATASET_VERSION }}-${{ inputs.DATASET }}-${{ inputs.DESTINATION_ID }}"
+name: 📬 Socrata - Publish a Dataset
+run-name: "📬 Socrata - Publish a Dataset: ${{ inputs.PRODUCT_NAME }}-${{ inputs.DATASET_VERSION }}-${{ inputs.DATASET }}-${{ inputs.DESTINATION_ID }}"

on:
workflow_dispatch:
4 changes: 2 additions & 2 deletions .github/workflows/distribute_socrata_from_bytes.yml
@@ -1,5 +1,5 @@
-name: 🚀 Distribute Socrata From Bytes
-run-name: "🚀 Distribute Socrata From Bytes: ${{ inputs.PRODUCT_NAME }}-${{ inputs.DATASET_VERSION }}-${{ inputs.DATASET }}-${{ inputs.DESTINATION_ID }}"
+name: 📬 Distribute Socrata From Bytes
+run-name: "📬 Distribute Socrata From Bytes: ${{ inputs.PRODUCT_NAME }}-${{ inputs.DATASET_VERSION }}-${{ inputs.DATASET }}-${{ inputs.DESTINATION_ID }}"

on:
workflow_dispatch:
55 changes: 43 additions & 12 deletions dcpy/lifecycle/distribute/README.md
@@ -3,10 +3,10 @@
## Terms
- `Dataset`: Abstractly, a single table of data and metadata about it.
- `Product`: A suite of datasets. E.g. LION, which contains multiple `datasets`, such as "2010 Census Blocks" or "Community Districts (water areas included)".
-- `Dataset File`: Actual file export for a dataset, e.g. a csv, or a shapefile. Note: there may be slight variations between `dataset files` for the same `dataset`. e.g. columns in the shapefile for PLUTO will have slightly different columns than the csv. 
+- `Dataset File`: Actual file export for a dataset, e.g. a csv, or a shapefile. Note: there may be slight variations between `dataset files` for the same `dataset`. e.g. columns in the shapefile for PLUTO will have slightly different columns than the csv.
- `Attachment`: README's, data dictionaries, etc.
-- `Dataset Package`: Instance of a `Dataset`, meaning metadata, attachments, and dataset files. 
-- `Product Package`: Versioned collection of Dataset Packages. 
+- `Dataset Package`: Instance of a `Dataset`, meaning metadata, attachments, and dataset files.
+- `Product Package`: Versioned collection of Dataset Packages.

## How to distribute to Socrata
Socrata datasets consist of, effectively, one table. In the past, a single Socrata dataset could contain multiple shapefile layers, but that's no longer the case. So for us:
@@ -27,7 +27,7 @@ To push from S3, the Dataset Package must exist in `edm-publishing` in the `prod
```
edm-publishing / product_dataset / {product_name} / package / {version} / {dataset_name} / [package files here]
```
-Note: In many cases the product has only one dataset, e.g. Facilities. In that case, the convention is to just use the same product_name and dataset_name. E.g 
+Note: In many cases the product has only one dataset, e.g. Facilities. In that case, the convention is to just use the same product_name and dataset_name. E.g
```
edm-publishing / product_dataset / facilities / package / 24v2 / facilities / [package files here]
```
@@ -37,7 +37,7 @@ python -m dcpy.cli lifecycle distribute socrata from_s3 [args]
```
The `--help` flag will list the required args.

-Note that unless you explicitly specify `--publish`, only a draft will be created on Socrata, and you'll need to manually apply it via the socrata GUI. The output of the command will tell you the revision number, which you can find via search on Socrata. There you can review data changes, metadata changes, etc, before actually publishing. 
+Note that unless you explicitly specify `--publish`, only a draft will be created on Socrata, and you'll need to manually apply it via the socrata GUI. The output of the command will tell you the revision number, which you can find via search on Socrata. There you can review data changes, metadata changes, etc, before actually publishing.

##### Quick Tip
If package validation fails, you can debug by running the validation commands locally:
@@ -48,15 +48,34 @@ python -m dcpy.cli lifecycle package validate [your package path here]
#### Pushing from S3 via Github Action:
[Relevant Action](https://github.com/NYCPlanning/data-engineering/actions/workflows/socrata_publish_dataset.yml). The same options as above apply. You'll need to supply the product name, the dataset name, and the version.

-Note: this is a low-risk operation when you don't tick the box to publish the dataset. 
+Note: this is a low-risk operation when you don't tick the box to publish the dataset.


#### Assembling and Pushing from Bytes (local flow)
1. Ensure you have the product-metadata repo cloned locally, and have set the `PRODUCT_METADATA_REPO_PATH` env var.
2. Run the lifecycle package-and-distribute command in scripts. This lets you target specific destinations for a given dataset. For each dataset specified, the command pulls the relevant files from `bytes`, generates the OTI XLSX when specified in the files, then distributes the data to Socrata. You can filter by dataset name, destination ID, destination type, and destination tag (note: the filters are combined with AND logic).

For example:
```sh
python -m dcpy.cli lifecycle scripts package_and_dist from_bytes_to_socrata \
lion \
24d \
-y \
-e socrata \
-d atomic_polygons \
-d other_dataset \
-t socrata_unpublished
```
This packages and distributes to all Socrata-typed destinations for the `lion` product, version `24d`, where the dataset is either `atomic_polygons` or `other_dataset` and the destination is tagged `socrata_unpublished` (`-y` skips dataset file validation).


## The Socrata Publish Flow
-In our publishing connector, the flow for distributing is as follows: 
+In our publishing connector, the flow for distributing is as follows:

-1. create a new revision for the dataset, and discard other open revisions. 
-2. upload attachments, and update dataset-level metadata. (e.g. the dataset description or tags) 
-3. Upload the dataset itself. Currently only shapefiles are supported. 
-4. _Attempt_ to update column metadata for the uploaded dataset. This step is placed last because it's the most finicky at the moment, as it entails reconciling our uploaded columns, Socrata's existing columns, and our metadata. However, should this step fail, you can still go manually apply the revision in the Socrata GUI. 
+1. create a new revision for the dataset, and discard other open revisions.
+2. upload attachments, and update dataset-level metadata. (e.g. the dataset description or tags)
+3. Upload the dataset itself. Currently only shapefiles are supported.
+4. _Attempt_ to update column metadata for the uploaded dataset. This step is placed last because it's the most finicky at the moment, as it entails reconciling our uploaded columns, Socrata's existing columns, and our metadata. However, should this step fail, you can still go manually apply the revision in the Socrata GUI.

## Applying Revisions in Socrata

@@ -72,9 +91,12 @@ INFO:dcpy:Finished syncing product to Socrata, but did not publish. Find revisio
INFO:dcpy: here https://data.cityofnewyork.us/d/b7pm-uzu7/revisions/32
```

-Follow the provided link. Here you can review the modified data and metadata. Hit `Update` in the top right to apply the revision. 
+Follow the provided link. Here you can review the modified data and metadata. Hit `Update` in the top right to apply the revision.
![template_db_socrata](https://github.com/NYCPlanning/data-engineering/assets/11164730/b0c24251-00e3-4be1-99a6-6cf015240cc6)

Before publishing:
- Check the row count.
- Review the "Metadata Changes" (hit the Details dropdown) and make sure everything looks right (e.g. you haven't removed fields or completely dropped an attachment).

## Generating Metadata

@@ -120,3 +142,12 @@ In this case, make sure you're logged in, and then:
- Delete the old attachment(s)
- Upload the new attachment(s)

#### Potential Issues

##### I've pushed to Socrata, but when I visit the revision page, the `Update` button is greyed out.
Hover over the `Update` button and it should point you towards the cause. Usually it's a metadata problem.

*If it's a metadata problem but nothing seems wrong (i.e. nothing is bright red in the metadata modal):* you can usually fix this by adding a space and removing it, or some similar non-change. Hit `Save`, and you'll likely be able to update.
2 changes: 1 addition & 1 deletion dcpy/lifecycle/package/xlsx_writer.py
@@ -19,7 +19,7 @@

# TODO: Move template to Product Metadata Repo. Rename to be non-OTI specific
DEFAULT_TEMPLATE_PATH = RESOURCES_PATH / "oti_data_dictionary_template.xlsx"
EXCEL_DATA_DICT_METADATA_FILE_TYPE = "excel_data_dictionary"
EXCEL_DATA_DICT_METADATA_FILE_TYPE = "oti_data_dictionary"
DEFAULT_FONT = "Arial"


57 changes: 46 additions & 11 deletions dcpy/lifecycle/scripts/package_and_distribute.py
@@ -3,26 +3,38 @@
from typing import Unpack

from dcpy.configuration import PRODUCT_METADATA_REPO_PATH

from dcpy.models.product import metadata as product_metadata
from dcpy.lifecycle.distribute import socrata as soc_dist
from dcpy.lifecycle.package import assemble
from dcpy.utils.logging import logger


-def from_bytes_to_tagged_socrata(
+def package_and_dist_from_bytes(
org_metadata_path: Path,
product: str,
version: str,
destination_tag: str,
+destination_id: str | None,
+datasets: set[str],
+destination_type: str,
**publish_kwargs: Unpack[soc_dist.PublishKwargs],
):
+logger.info(
+f"Packaging and Distributing with filters: tag={destination_tag}, datasets: {datasets}, destination_type: {destination_type} "
+)
"""Package tagged datsets from bytes, and distribute to Socrata."""
org_md = product_metadata.OrgMetadata.from_path(
path=org_metadata_path,
template_vars={"version": version},
)
product_md = org_md.product(product)
-dests = product_md.get_tagged_destinations(destination_tag)
+dests = product_md.query_destinations(
+tag=destination_tag,
+datasets=set(datasets),
+destination_type=destination_type,
+destination_id=destination_id,
+)

logger.info(f"Packaging {product_md.metadata.id}. Datasets: {list(dests.keys())}")
package_paths = {}
@@ -52,21 +64,41 @@ def from_bytes_to_tagged_socrata(
app = typer.Typer()


@app.command("from_bytes_to_tagged_socrata")
@app.command("from_bytes_to_socrata")
def from_bytes_to_tagged_socrata_cli(
product: str,
version: str,
-org_metadata_path: Path = typer.Option(
-PRODUCT_METADATA_REPO_PATH,
-"-o",
-"--metadata-path",
-help="Path to metadata repo. Optionally, set in your env.",
+# Filters
+datasets: list[str] = typer.Option(
+None,
+"-d",
+"--datasets",
+help="List of dataset names to include.",
+),
destination_tag: str = typer.Option(
None,
"-t",
"--dest-tag",
-help="Destination tag to package and distribute",
+help="Destination tag to filter for.",
),
+destination_id: str = typer.Option(
+None,
+"-a",
+"--dest-id",
+help="Destination ID.",
+),
+destination_type: str = typer.Option(
+None,
+"-e",
+"--dest-type",
+help="Destination type to filter for, e.g. 'socrata'.",
+),
+# Overrides
+org_metadata_path: Path = typer.Option(
+PRODUCT_METADATA_REPO_PATH,
+"-o",
+"--metadata-path",
+help="Path to metadata repo. Optionally, set in your env.",
+),
publish: bool = typer.Option(
False,
@@ -93,11 +125,14 @@ def from_bytes_to_tagged_socrata_cli(
help="Only push metadata (including attachments).",
),
):
-results = from_bytes_to_tagged_socrata(
+results = package_and_dist_from_bytes(
org_metadata_path,
product,
version,
-destination_tag,
+datasets=set(datasets or []),
+destination_id=destination_id,
+destination_type=destination_type,
+destination_tag=destination_tag,
publish=publish,
ignore_validation_errors=ignore_validation_errors,
skip_validation=skip_validation,
33 changes: 26 additions & 7 deletions dcpy/models/product/metadata.py
@@ -84,14 +84,33 @@ def get_datasets_by_id(self) -> dict[str, DatasetMetadata]:
dataset_mds = [self.dataset(ds_id) for ds_id in self.metadata.datasets]
return {m.id: m for m in dataset_mds}

-def get_tagged_destinations(self, tag) -> dict[str, dict[str, DatasetMetadata]]:
-datasets = self.get_datasets_by_id()
-found_tagged_dests: dict[str, dict[str, DatasetMetadata]] = defaultdict(dict)
-for ds in datasets.values():
+def query_destinations(
+self,
+*,
+datasets: set[str] | None = None,
+destination_id: str | None = None,
+destination_type: str | None = None,
+tag: str | None = None,
+) -> dict[str, dict[str, DatasetMetadata]]:
+"""Retrieve a map[map] of dataset -> destination -> DatasetMetadata, filtered by
+- destination type (e.g. socrata)
+- destination id
+- dataset name
+- tag
+e.g. for LION: {"2020_census_blocks": {"socrata_water_included": [Fully rendered metadata for this destination]}}
+"""
+filtered_datasets = self.get_datasets_by_id()
+found_dests: dict[str, dict[str, DatasetMetadata]] = defaultdict(dict)
+for ds in filtered_datasets.values():
for dest in ds.destinations:
-if tag in dest.tags:
-found_tagged_dests[ds.id][dest.id] = ds
-return found_tagged_dests
+if (
+(not destination_type or dest.type == destination_type)
+and (not destination_id or dest.id == destination_id)
+and (not datasets or ds.id in datasets)
+and (not tag or tag in dest.tags)
+):
+found_dests[ds.id][dest.id] = ds
+return found_dests

def validate_dataset_metadata(self) -> dict[str, list[str]]:
product_errors = {}
42 changes: 29 additions & 13 deletions dcpy/test/models/product/test_metadata.py
@@ -55,15 +55,41 @@ def test_org_md_overrides(test_metadata_repo: Path):
), "The field `agency` should use the org-level default"


def test_query_destinations_by_type(lion_md_path: Path):
lion_product = md.ProductMetadata.from_path(root_path=lion_md_path)

DEST_TYPE = "socrata"
datasets = lion_product.query_destinations(destination_type=DEST_TYPE)

assert 2 == len(datasets.keys())
assert "school_districts" in datasets
assert datasets["school_districts"].keys() == {"socrata", "socrata_2"}


def test_get_tagged_destinations(lion_md_path: Path):
product_folder = md.ProductMetadata.from_path(root_path=lion_md_path)

TAG = "prod_tag"
datasets = product_folder.get_tagged_destinations(TAG)
TAG = "school_districts_tag"
datasets = product_folder.query_destinations(tag=TAG)

assert 1 == len(datasets.keys())
assert "school_districts" in datasets
assert datasets["school_districts"].keys() == {"socrata"}
assert datasets["school_districts"].keys() == {"socrata_2"}


def test_query_multiple_filters_destinations(lion_md_path: Path):
product_folder = md.ProductMetadata.from_path(root_path=lion_md_path)

TAG = "prod_tag"
DEST_TYPE = "socrata"
DATASET_NAMES = {"pseudo_lots", "school_districts"}
datasets = product_folder.query_destinations(
tag=TAG, destination_type=DEST_TYPE, datasets=DATASET_NAMES
)

assert DATASET_NAMES == datasets.keys(), "The correct datasets should be returned"
for ds in DATASET_NAMES:
assert datasets[ds].keys() == {"socrata"}


def test_product_metadata_validation(lion_md_path: Path):
@@ -112,16 +138,6 @@ def test_product_validation_with_error_product(test_metadata_repo: Path):
assert reference_error.startswith(ds_md.ERROR_MISSING_COLUMN)


-def test_query_product_dataset_tags(test_metadata_repo: Path):
-TAG = "prod_tag"
-repo = md.OrgMetadata.from_path(test_metadata_repo)
-assert [
-md.ProductDatasetDestinationKey(
-product="lion", dataset="school_districts", destination="socrata"
-)
-] == repo.query_dataset_destinations(TAG)


@pytest.fixture
def test_metadata_repo_snippets(resources_path: Path):
yield resources_path / "test_product_metadata_repo_with_snippets"
@@ -16,3 +16,8 @@ attributes:
- building identification number
- addresses
- cscl

destinations:
- id: socrata
type: socrata
tags: [prod_tag, pseudo_lots_tag]
@@ -19,3 +19,8 @@ destinations:
- id: socrata
type: socrata
tags: [prod_tag]
- id: socrata_2
type: socrata
tags: [school_districts_tag]
- id: other
type: bytes
