basic Annex A pipeline #68

Open · wants to merge 66 commits into main from 62-pipeline-development

Conversation

@amynickolls commented Dec 18, 2024

- produces a cleanfile output of one CSV per Annex A list
- creates a reports output for each list for the region
- makes the usual logs and outputs (clean, concat, reports) available in the standard places in the infrastructure
- does not allow "current" and "aggregated" datasets to flow into final outputs

@amynickolls marked this pull request as draft December 18, 2024 17:47

codecov bot commented Dec 23, 2024

Codecov Report

Attention: Patch coverage is 72.46964% with 68 lines in your changes missing coverage. Please review.

Project coverage is 74.22%. Comparing base (3ca4f7b) to head (fcbcac8).
Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| `liiatools/common/stream_filters.py` | 48.48% | 17 Missing ⚠️ |
| `liiatools/annex_a_pipeline/stream_pipeline.py` | 36.00% | 16 Missing ⚠️ |
| `liiatools/common/pipeline.py` | 33.33% | 10 Missing ⚠️ |
| `liiatools/common/converters.py` | 30.76% | 9 Missing ⚠️ |
| `liiatools/common/stream_pipeline.py` | 20.00% | 4 Missing ⚠️ |
| `liiatools/annex_a_pipeline/spec/__init__.py` | 85.71% | 3 Missing ⚠️ |
| `...iiatools/ssda903_pipeline/sufficiency_transform.py` | 0.00% | 3 Missing ⚠️ |
| `liiatools/common/_transform_functions.py` | 60.00% | 2 Missing ⚠️ |
| `liiatools/common/spec/__data_schema.py` | 50.00% | 2 Missing ⚠️ |
| `liiatools/ssda903_pipeline/pipeline.py` | 60.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #68      +/-   ##
==========================================
- Coverage   74.60%   74.22%   -0.38%     
==========================================
  Files          60       66       +6     
  Lines        3410     3577     +167     
==========================================
+ Hits         2544     2655     +111     
- Misses        866      922      +56     

☔ View full report in Codecov by Sentry.

@patrick-troy force-pushed the 62-pipeline-development branch from d32dc58 to 5c90b39 December 24, 2024 07:04
@amynickolls changed the base branch from main to annex-a-dagster January 8, 2025 14:00
@amynickolls changed the base branch from annex-a-dagster to main January 8, 2025 17:21
@amynickolls marked this pull request as ready for review January 9, 2025 11:35
@cyramic left a comment

A few minor comments

    Load the pipeline config file
    :return: Parsed pipeline config file
    """
    with open(SCHEMA_DIR / "pipeline.json", "rt") as FILE:

Minor nit-pick, but all-caps variables are typically constants defined elsewhere (like you did with SCHEMA_DIR). Seeing FILE in all caps makes me think it's one of those constants, but it isn't. Ideally this variable should be lower case to match its function. See PEP 8 for more info: https://peps.python.org/pep-0008/#constants
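
A minimal sketch of the suggested rename, assuming the surrounding function parses the JSON config with json.load (the rest of the function body is not shown in the diff):

```python
import json
from pathlib import Path

SCHEMA_DIR = Path(__file__).parent  # stand-in: the real module already defines this constant

def load_pipeline_config():
    """Load the pipeline config file
    :return: Parsed pipeline config file
    """
    # lower case: per PEP 8, ALL_CAPS names are reserved for constants such as SCHEMA_DIR
    with open(SCHEMA_DIR / "pipeline.json", "rt") as file:
        return json.load(file)
```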

dataset = dataset_holder.value
errors = error_holder.value

logger.info(

Should there be other logging statements above here to demarcate the different stages, instead of only logging that the run is complete? Essentially what you have in comments, but as log statements, so that it records when the stream has been configured, cleaned, etc.

What I'm keen to ensure is that if things go wrong we have a nice trail showing what has completed successfully and what hasn't. One of the difficulties with dagster is that things can be recorded out of order, so statements declaring "made it here" can be helpful when debugging.
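
A hypothetical sketch of that breadcrumb pattern — the stage names and wiring are illustrative, not the pipeline's actual API:

```python
import logging

logger = logging.getLogger(__name__)

def run_with_breadcrumbs(stream, stages):
    """Run each pipeline stage and log a marker once it completes.

    `stages` is a list of (name, callable) pairs -- hypothetical wiring,
    not the pipeline's real structure.
    """
    for name, stage in stages:
        stream = stage(stream)
        # ordered "made it here" trail, useful when dagster records things out of order
        logger.info("Stage complete: %s", name)
    return stream
```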


import xmlschema

import yaml

If we're using ruamel.yaml, should we be consistent and stick to that library wherever possible? That makes the codebase easier to maintain going forward, especially in light of what we now know about the pyyaml library.
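
For illustration, ruamel.yaml's safe loader is a near drop-in replacement for pyyaml's yaml.safe_load (the file name here is hypothetical):

```python
from ruamel.yaml import YAML

yaml = YAML(typ="safe")  # safe loader, comparable to pyyaml's yaml.safe_load

with open("pipeline.yml", "rt") as f:  # hypothetical file name
    config = yaml.load(f)
```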

@amynickolls (Author)

Technically this is not something I have added here - it's come up as a change thanks to Git hook reformatting. I did think we probably want to change this, but that should happen in a separate issue, as this one just relates to the Annex A pipeline?

schema = DataSchema(
    column_map={
        "list_1": {
            "Child Unique ID": Column(header_regex=["/.*child.*id.*/i"]),

Is this regex too catch-all? I think it's fine for the most part, but I worry that if a new, unrelated column name such as "Childminder ID" were added, this regex would catch it too and cause unexpected issues. I'm not sure how likely that is given the dataset, but thought I'd flag it just in case.
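
To make the concern concrete, here is the pattern from the spec tried against both headers (a quick sketch using Python's re module):

```python
import re

# the header regex from the spec above, /.*child.*id.*/i, as a Python pattern
pattern = re.compile(r".*child.*id.*", re.IGNORECASE)

print(bool(pattern.match("Child Unique ID")))  # True - the intended match
print(bool(pattern.match("Childminder ID")))   # True - the unintended match
```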


Why have changes like this happened, where actual variables are being changed? Was this done by a person or by a formatting tool? If the latter, are we confident it isn't having any unintended knock-on effects?


Why is this here and what are the changes?

from liiatools.common.spec.__data_schema import DataSchema
from liiatools.common.stream_pipeline import to_dataframe

from . import stream_filters


Really nit-picky: I would personally rather name the folder explicitly than use `.` here. However, I am happy to be overruled on this by others.
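
For comparison, a sketch of the two forms, assuming the importing module lives in liiatools/annex_a_pipeline/ (the package path is inferred from the file list, not confirmed by the diff):

```python
# relative form, as in the diff
from . import stream_filters

# explicit form the comment suggests (package path assumed)
from liiatools.annex_a_pipeline import stream_filters
```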


Again, probably very insignificant, but why has this happened? The docstring's param names and descriptions have changed, but the function itself is unchanged.

@@ -208,6 +214,9 @@ def move_concat_sensor(context):

    if run_records:  # Ensure there is at least one run record
        context.log.info(f"Run records found for reports job in move concat sensor")
        if "annex_a" in allowed_datasets:
            allowed_datasets.remove("annex_a")
            context.log.info(f"Annex A removed from reports job for move concat sensor")


I think we should move this up before the log statement; otherwise the log will say we'll move Annex A but then we won't.
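
One reading of that suggestion, as a sketch (only the ordering changes; the surrounding sensor code is as in the diff):

```python
if run_records:  # Ensure there is at least one run record
    # do the Annex A removal first, so the log reflects what will actually run
    if "annex_a" in allowed_datasets:
        allowed_datasets.remove("annex_a")
        context.log.info("Annex A removed from reports job for move concat sensor")
    context.log.info("Run records found for reports job in move concat sensor")
```

(The f prefixes on the original strings have no placeholders, so they are dropped here.)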
