Commit
Make bytes->Socrata distribution a little more friendly (#1337)
* Fix oti_data_dictionary bug.

This was preventing generation of XLSX OTI data dictionaries,
since we have `type: oti_data_dictionary` coded into all our templates.

* Enable distribution by dataset name, type, ID, and tag

Formerly we distributed by tag only. This adds the ability to also
specify a destination ID, a destination type, and a dataset name.

* README updates

* Align GHA name and emoji

* post: use typer list for datasets filter

And fix the docs

* post: delete trailing whitespace in readme

---------

Co-authored-by: Alex Richey <[email protected]>
alexrichey and Alex Richey authored Dec 31, 2024
1 parent 13366ee commit 89f8722
Showing 9 changed files with 159 additions and 48 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/socrata_publish_dataset.yml
@@ -1,5 +1,5 @@
-name: 🚀 Socrata - Publish a Dataset
-run-name: "🚀 Socrata - Publish a Dataset: ${{ inputs.PRODUCT_NAME }}-${{ inputs.DATASET_VERSION }}-${{ inputs.DATASET }}-${{ inputs.DESTINATION_ID }}"
+name: 📬 Socrata - Publish a Dataset
+run-name: "📬 Socrata - Publish a Dataset: ${{ inputs.PRODUCT_NAME }}-${{ inputs.DATASET_VERSION }}-${{ inputs.DATASET }}-${{ inputs.DESTINATION_ID }}"

on:
workflow_dispatch:
4 changes: 2 additions & 2 deletions .github/workflows/distribute_socrata_from_bytes.yml
@@ -1,5 +1,5 @@
-name: 🚀 Distribute Socrata From Bytes
-run-name: "🚀 Distribute Socrata From Bytes: ${{ inputs.PRODUCT_NAME }}-${{ inputs.DATASET_VERSION }}-${{ inputs.DATASET }}-${{ inputs.DESTINATION_ID }}"
+name: 📬 Distribute Socrata From Bytes
+run-name: "📬 Distribute Socrata From Bytes: ${{ inputs.PRODUCT_NAME }}-${{ inputs.DATASET_VERSION }}-${{ inputs.DATASET }}-${{ inputs.DESTINATION_ID }}"

on:
workflow_dispatch:
55 changes: 43 additions & 12 deletions dcpy/lifecycle/distribute/README.md
@@ -3,10 +3,10 @@
## Terms
- `Dataset`: Abstractly, a single table of data and metadata about it.
- `Product`: A suite of datasets. E.g. LION, which contains multiple `datasets`, such as "2010 Census Blocks" or "Community Districts (water areas included)".
-- `Dataset File`: Actual file export for a dataset, e.g. a csv, or a shapefile. Note: there may be slight variations between `dataset files` for the same `dataset`. e.g. columns in the shapefile for PLUTO will have slightly different columns than the csv. 
+- `Dataset File`: Actual file export for a dataset, e.g. a csv, or a shapefile. Note: there may be slight variations between `dataset files` for the same `dataset`. e.g. columns in the shapefile for PLUTO will have slightly different columns than the csv.
- `Attachment`: README's, data dictionaries, etc.
-- `Dataset Package`: Instance of a `Dataset`, meaning metadata, attachments, and dataset files. 
-- `Product Package`: Versioned collection of Dataset Packages. 
+- `Dataset Package`: Instance of a `Dataset`, meaning metadata, attachments, and dataset files.
+- `Product Package`: Versioned collection of Dataset Packages.

## How to distribute to Socrata
Socrata datasets consist of, effectively, one table. In the past, a single Socrata dataset could contain multiple shapefile layers, but that's no longer the case. So for us:
@@ -27,7 +27,7 @@ To push from S3, the Dataset Package must exist in `edm-publishing` in the `prod
```
edm-publishing / product_dataset / {product_name} / package / {version} / {dataset_name} / [package files here]
```
-Note: In many cases the product has only one dataset, e.g. Facilities. In that case, the convention is to just use the same product_name and dataset_name. E.g 
+Note: In many cases the product has only one dataset, e.g. Facilities. In that case, the convention is to just use the same product_name and dataset_name. E.g
```
edm-publishing / product_dataset / facilities / package / 24v2 / facilities / [package files here]
```
@@ -37,7 +37,7 @@ python -m dcpy.cli lifecycle distribute socrata from_s3 [args]
```
The `--help` flag will list the required args.

-Note that unless you explicitly specify `--publish`, only a draft will be created on Socrata, and you'll need to manually apply it via the socrata GUI. The output of the command will tell you the revision number, which you can find via search on Socrata. There you can review data changes, metadata changes, etc, before actually publishing. 
+Note that unless you explicitly specify `--publish`, only a draft will be created on Socrata, and you'll need to manually apply it via the socrata GUI. The output of the command will tell you the revision number, which you can find via search on Socrata. There you can review data changes, metadata changes, etc, before actually publishing.

##### Quick Tip
If package validation fails, you can debug by running the validation commands locally:
@@ -48,15 +48,34 @@ python -m dcpy.cli lifecycle package validate [your package path here]
#### Pushing from S3 via Github Action:
[Relevant Action](https://github.com/NYCPlanning/data-engineering/actions/workflows/socrata_publish_dataset.yml). The same options as above apply. You'll need to supply the product name, the dataset name, and the version.

-Note: this is a low-risk operation when you don't tick the box to publish the dataset. 
+Note: this is a low-risk operation when you don't tick the box to publish the dataset.


#### Assembling and Pushing from Bytes (local flow)
1. Ensure you have the product-metadata repo cloned locally, and have set the `PRODUCT_METADATA_REPO_PATH` env var.
2. Run the lifecycle package-and-distribute command in scripts. This lets you target specific destinations for a given dataset. For each dataset specified, the command pulls the relevant files from `bytes`, generates the OTI XLSX when specified in the files, then distributes the data to Socrata. You can filter by dataset name, destination ID, destination type, and destination tag (note: the filters are combined with AND logic).

For example:
```sh
python -m dcpy.cli lifecycle scripts package_and_dist from_bytes_to_socrata \
lion \
24d \
-y \
-e socrata \
-d atomic_polygons \
-d other_dataset \
-t socrata_unpublished
```
This packages and distributes to all Socrata-typed destinations for the `lion` product, version `24d`, where the dataset is either `atomic_polygons` or `other_dataset` and the destination is tagged `socrata_unpublished` (`-y` skips dataset file validation).


## The Socrata Publish Flow
-In our publishing connector, the flow for distributing is as follows: 
+In our publishing connector, the flow for distributing is as follows:

-1. create a new revision for the dataset, and discard other open revisions. 
-2. upload attachments, and update dataset-level metadata. (e.g. the dataset description or tags) 
-3. Upload the dataset itself. Currently only shapefiles are supported. 
-4. _Attempt_ to update column metadata for the uploaded dataset. This step is placed last because it's the most finicky at the moment, as it entails reconciling our uploaded columns, Socrata's existing columns, and our metadata. However, should this step fail, you can still go manually apply the revision in the Socrata GUI. 
+1. create a new revision for the dataset, and discard other open revisions.
+2. upload attachments, and update dataset-level metadata. (e.g. the dataset description or tags)
+3. Upload the dataset itself. Currently only shapefiles are supported.
+4. _Attempt_ to update column metadata for the uploaded dataset. This step is placed last because it's the most finicky at the moment, as it entails reconciling our uploaded columns, Socrata's existing columns, and our metadata. However, should this step fail, you can still go manually apply the revision in the Socrata GUI.

## Applying Revisions in Socrata

@@ -72,9 +91,12 @@ INFO:dcpy:Finished syncing product to Socrata, but did not publish. Find revisio
INFO:dcpy: here https://data.cityofnewyork.us/d/b7pm-uzu7/revisions/32
```

-Follow the provided link. Here you can review the modified data and metadata. Hit `Update` in the top right to apply the revision. 
+Follow the provided link. Here you can review the modified data and metadata. Hit `Update` in the top right to apply the revision.
![template_db_socrata](https://github.com/NYCPlanning/data-engineering/assets/11164730/b0c24251-00e3-4be1-99a6-6cf015240cc6)

Before publishing:
- Check the row count.
- Review the "Metadata Changes" (hit the Details dropdown) and make sure everything looks right (e.g. you haven't removed fields or completely dropped an attachment).

## Generating Metadata

@@ -120,3 +142,12 @@ In this case, make sure you're logged in, and then:
- Delete the old attachment(s)
- Upload the new attachment(s)

#### Potential Issues

##### I've pushed to Socrata, but when I visit the revision page, the `Update` button is greyed out.
Hover over the `Update` button and it should point you towards the cause. Usually it's a metadata problem.

*If it's a metadata problem but nothing seems wrong (i.e. nothing is bright red in the metadata modal):* you can usually fix this by adding a space and removing it, or some similar non-change. Hit `Save`, and you'll likely be able to update.
2 changes: 1 addition & 1 deletion dcpy/lifecycle/package/xlsx_writer.py
@@ -19,7 +19,7 @@

# TODO: Move template to Product Metadata Repo. Rename to be non-OTI specific
DEFAULT_TEMPLATE_PATH = RESOURCES_PATH / "oti_data_dictionary_template.xlsx"
EXCEL_DATA_DICT_METADATA_FILE_TYPE = "excel_data_dictionary"
EXCEL_DATA_DICT_METADATA_FILE_TYPE = "oti_data_dictionary"
DEFAULT_FONT = "Arial"


57 changes: 46 additions & 11 deletions dcpy/lifecycle/scripts/package_and_distribute.py
@@ -3,26 +3,38 @@
from typing import Unpack

from dcpy.configuration import PRODUCT_METADATA_REPO_PATH

from dcpy.models.product import metadata as product_metadata
from dcpy.lifecycle.distribute import socrata as soc_dist
from dcpy.lifecycle.package import assemble
from dcpy.utils.logging import logger


-def from_bytes_to_tagged_socrata(
+def package_and_dist_from_bytes(
org_metadata_path: Path,
product: str,
version: str,
destination_tag: str,
+destination_id: str | None,
+datasets: set[str],
+destination_type: str,
**publish_kwargs: Unpack[soc_dist.PublishKwargs],
):
+logger.info(
+f"Packaging and Distributing with filters: tag={destination_tag}, datasets: {datasets}, destination_type: {destination_type} "
+)
"""Package tagged datsets from bytes, and distribute to Socrata."""
org_md = product_metadata.OrgMetadata.from_path(
path=org_metadata_path,
template_vars={"version": version},
)
product_md = org_md.product(product)
-dests = product_md.get_tagged_destinations(destination_tag)
+dests = product_md.query_destinations(
+tag=destination_tag,
+datasets=set(datasets),
+destination_type=destination_type,
+destination_id=destination_id,
+)

logger.info(f"Packaging {product_md.metadata.id}. Datasets: {list(dests.keys())}")
package_paths = {}
@@ -52,21 +64,41 @@ def from_bytes_to_tagged_socrata(
app = typer.Typer()


@app.command("from_bytes_to_tagged_socrata")
@app.command("from_bytes_to_socrata")
def from_bytes_to_tagged_socrata_cli(
product: str,
version: str,
-org_metadata_path: Path = typer.Option(
-PRODUCT_METADATA_REPO_PATH,
-"-o",
-"--metadata-path",
-help="Path to metadata repo. Optionally, set in your env.",
+# Filters
+datasets: list[str] = typer.Option(
+None,
+"-d",
+"--datasets",
+help="List of dataset names to include.",
+),
destination_tag: str = typer.Option(
None,
"-t",
"--dest-tag",
-help="Destination tag to package and distribute",
+help="Destination tag to filter for.",
),
+destination_id: str = typer.Option(
+None,
+"-a",
+"--dest-id",
+help="Destination ID.",
+),
+destination_type: str = typer.Option(
+None,
+"-e",
+"--dest-type",
+help="Destination type to filter for, e.g. 'socrata'.",
+),
+# Overrides
+org_metadata_path: Path = typer.Option(
+PRODUCT_METADATA_REPO_PATH,
+"-o",
+"--metadata-path",
+help="Path to metadata repo. Optionally, set in your env.",
+),
publish: bool = typer.Option(
False,
@@ -93,11 +125,14 @@ def from_bytes_to_tagged_socrata_cli(
help="Only push metadata (including attachments).",
),
):
-results = from_bytes_to_tagged_socrata(
+results = package_and_dist_from_bytes(
org_metadata_path,
product,
version,
-destination_tag,
+datasets=set(datasets or []),
+destination_id=destination_id,
+destination_type=destination_type,
+destination_tag=destination_tag,
publish=publish,
ignore_validation_errors=ignore_validation_errors,
skip_validation=skip_validation,
33 changes: 26 additions & 7 deletions dcpy/models/product/metadata.py
@@ -84,14 +84,33 @@ def get_datasets_by_id(self) -> dict[str, DatasetMetadata]:
dataset_mds = [self.dataset(ds_id) for ds_id in self.metadata.datasets]
return {m.id: m for m in dataset_mds}

-def get_tagged_destinations(self, tag) -> dict[str, dict[str, DatasetMetadata]]:
-datasets = self.get_datasets_by_id()
-found_tagged_dests: dict[str, dict[str, DatasetMetadata]] = defaultdict(dict)
-for ds in datasets.values():
+def query_destinations(
+self,
+*,
+datasets: set[str] | None = None,
+destination_id: str | None = None,
+destination_type: str | None = None,
+tag: str | None = None,
+) -> dict[str, dict[str, DatasetMetadata]]:
+"""Retrieve a map[map] of dataset -> destination -> DatasetMetadata, filtered by
+- destination type (e.g. socrata)
+- destination id
+- dataset name
+- tag
+e.g. for LION: {"2020_census_blocks": {"socrata_water_included": [Fully rendered metadata for this destination]}}
+"""
+filtered_datasets = self.get_datasets_by_id()
+found_dests: dict[str, dict[str, DatasetMetadata]] = defaultdict(dict)
+for ds in filtered_datasets.values():
for dest in ds.destinations:
-if tag in dest.tags:
-found_tagged_dests[ds.id][dest.id] = ds
-return found_tagged_dests
+if (
+(not destination_type or dest.type == destination_type)
+and (not destination_id or dest.id == destination_id)
+and (not datasets or ds.id in datasets)
+and (not tag or tag in dest.tags)
+):
+found_dests[ds.id][dest.id] = ds
+return found_dests

def validate_dataset_metadata(self) -> dict[str, list[str]]:
product_errors = {}
42 changes: 29 additions & 13 deletions dcpy/test/models/product/test_metadata.py
@@ -55,15 +55,41 @@ def test_org_md_overrides(test_metadata_repo: Path):
), "The field `agency` should use the org-level default"


def test_query_destinations_by_type(lion_md_path: Path):
lion_product = md.ProductMetadata.from_path(root_path=lion_md_path)

DEST_TYPE = "socrata"
datasets = lion_product.query_destinations(destination_type=DEST_TYPE)

assert 2 == len(datasets.keys())
assert "school_districts" in datasets
assert datasets["school_districts"].keys() == {"socrata", "socrata_2"}


def test_get_tagged_destinations(lion_md_path: Path):
product_folder = md.ProductMetadata.from_path(root_path=lion_md_path)

TAG = "prod_tag"
datasets = product_folder.get_tagged_destinations(TAG)
TAG = "school_districts_tag"
datasets = product_folder.query_destinations(tag=TAG)

assert 1 == len(datasets.keys())
assert "school_districts" in datasets
assert datasets["school_districts"].keys() == {"socrata"}
assert datasets["school_districts"].keys() == {"socrata_2"}


def test_query_multiple_filters_destinations(lion_md_path: Path):
product_folder = md.ProductMetadata.from_path(root_path=lion_md_path)

TAG = "prod_tag"
DEST_TYPE = "socrata"
DATASET_NAMES = {"pseudo_lots", "school_districts"}
datasets = product_folder.query_destinations(
tag=TAG, destination_type=DEST_TYPE, datasets=DATASET_NAMES
)

assert DATASET_NAMES == datasets.keys(), "The correct datasets should be returned"
for ds in DATASET_NAMES:
assert datasets[ds].keys() == {"socrata"}


def test_product_metadata_validation(lion_md_path: Path):
@@ -112,16 +138,6 @@ def test_product_validation_with_error_product(test_metadata_repo: Path):
assert reference_error.startswith(ds_md.ERROR_MISSING_COLUMN)


-def test_query_product_dataset_tags(test_metadata_repo: Path):
-TAG = "prod_tag"
-repo = md.OrgMetadata.from_path(test_metadata_repo)
-assert [
-md.ProductDatasetDestinationKey(
-product="lion", dataset="school_districts", destination="socrata"
-)
-] == repo.query_dataset_destinations(TAG)


@pytest.fixture
def test_metadata_repo_snippets(resources_path: Path):
yield resources_path / "test_product_metadata_repo_with_snippets"
@@ -16,3 +16,8 @@ attributes:
- building identification number
- addresses
- cscl

destinations:
- id: socrata
type: socrata
tags: [prod_tag, pseudo_lots_tag]
@@ -19,3 +19,8 @@ destinations:
- id: socrata
type: socrata
tags: [prod_tag]
- id: socrata_2
type: socrata
tags: [school_districts_tag]
- id: other
type: bytes
