basic Annex A pipeline #68
base: main
Conversation
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main      #68      +/-   ##
==========================================
- Coverage   74.60%   74.22%   -0.38%
==========================================
  Files          60       66       +6
  Lines        3410     3577     +167
==========================================
+ Hits         2544     2655     +111
- Misses        866      922      +56
```

☔ View full report in Codecov by Sentry.
Force-pushed from d32dc58 to 5c90b39.
A few minor comments
```python
    Load the pipeline config file

    :return: Parsed pipeline config file
    """
    with open(SCHEMA_DIR / "pipeline.json", "rt") as FILE:
```
Minor nit-pick, but all-caps variables are typically constants defined elsewhere (like you did with SCHEMA_DIR). Seeing FILE in all caps makes it look like one of them, but it isn't. Ideally this variable should be lower case to match its function. See here for more info: https://peps.python.org/pep-0008/#constants
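To make the suggestion concrete, a minimal sketch of the rename; the function name and the json.load call are assumptions, since the diff only shows the open line, and SCHEMA_DIR is the constant from the surrounding module:

```python
import json

def load_pipeline_config():
    """
    Load the pipeline config file

    :return: Parsed pipeline config file
    """
    # lower-case: an ordinary local variable, not a module-level constant
    with open(SCHEMA_DIR / "pipeline.json", "rt") as file:
        return json.load(file)
```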
```python
dataset = dataset_holder.value
errors = error_holder.value

logger.info(
```
Should there be any other logging statements above here to demarcate the different stages, instead of just one saying it's complete? Kind of like what you have in comments, but make them log statements instead, so that it records that the stream has been configured, cleaned, etc.
What I'm keen to ensure is that if things go wrong we have a nice trail showing what has successfully completed and what hasn't. One of the difficulties with Dagster is that things can be recorded out of order, so statements declaring "made it here" can be helpful when debugging.
```python
import xmlschema

import yaml
```
If we're using ruamel.yaml, should we be consistent and stick to only that library where possible? That would make it easier to maintain going forward, especially in light of what we now know about the pyyaml library.
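For reference, a minimal sketch of what standardising on ruamel.yaml would look like for a simple load; the filename is illustrative:

```python
from ruamel.yaml import YAML

yaml = YAML(typ="safe")  # safe loading, analogous to pyyaml's yaml.safe_load
with open("annex_a.yml", "rt") as f:
    config = yaml.load(f)
```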
Technically this is not something I have added here - it's come up as a change thanks to Git hook reformatting. I did think we'd probably want to change this, but that should happen in a separate issue, as this one just relates to the Annex A pipeline?
```python
schema = DataSchema(
    column_map={
        "list_1": {
            "Child Unique ID": Column(header_regex=["/.*child.*id.*/i"]),
```
Is this regex too catch-all? I think it's fine for the most part, but I worry that if a new, unrelated column name is added, this could cause unexpected issues: for example, "Childminder ID" would also match this pattern. I'm not sure how likely that is given the dataset, but thought I'd flag it just in case.
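A quick demonstration of the false positive, using Python's re module directly rather than the pipeline's slash-delimited pattern syntax; the tightened pattern is one hypothetical fix, not necessarily the right one for this dataset:

```python
import re

loose = re.compile(r".*child.*id.*", re.IGNORECASE)
print(bool(loose.match("Child Unique ID")))  # True, as intended
print(bool(loose.match("Childminder ID")))   # True -- the false positive

# Hypothetical tightening: also require the word "unique"
tight = re.compile(r".*child.*unique.*id.*", re.IGNORECASE)
print(bool(tight.match("Child Unique ID")))  # True
print(bool(tight.match("Childminder ID")))   # False
```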
Why have changes like this happened, where actual variables are being changed? Has this been done by a person or by a formatting tool? If the latter, are we confident that this isn't having any unintended knock-on effect?
Why is this here and what are the changes?
```python
from liiatools.common.spec.__data_schema import DataSchema
from liiatools.common.stream_pipeline import to_dataframe

from . import stream_filters
```
Really nit-picky, but I would personally rather name the package explicitly than use . here. However, I'm happy to be overruled on this by others.
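For comparison, the explicit form might look like this; the package path liiatools.annex_a is a guess at where this module lives, not confirmed by the diff:

```python
from liiatools.annex_a import stream_filters
```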
Again, probably very insignificant, but why has this happened? The docstring has changed the param names and description, but the function itself remains unchanged.
```
@@ -208,6 +214,9 @@ def move_concat_sensor(context):

    if run_records:  # Ensure there is at least one run record
        context.log.info(f"Run records found for reports job in move concat sensor")
        if "annex_a" in allowed_datasets:
            allowed_datasets.remove("annex_a")
            context.log.info(f"Annex A removed from reports job for move concat sensor")
```
I think we should move this up before the log statement, otherwise the log will say we'll move Annex A but then we won't.
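If I've read the concern correctly, the reordering would look roughly like this sketch; the "Moving datasets" log line is hypothetical, standing in for whichever statement announces what will be moved:

```python
# Update allowed_datasets before logging, so the log reflects
# what will actually be moved
if "annex_a" in allowed_datasets:
    allowed_datasets.remove("annex_a")
    context.log.info("Annex A removed from reports job for move concat sensor")
context.log.info(f"Moving datasets: {allowed_datasets}")
```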
- produces a cleanfile output of one CSV per Annex A list
- creates a reports output for each list for the region
- makes the usual logs and outputs (clean, concat, reports) available in the standard places in the infrastructure
- does not allow "current" and "aggregated" datasets to flow into final outputs