basic Annex A pipeline #68
base: main
Conversation
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main      #68      +/-   ##
==========================================
- Coverage   74.60%   74.22%   -0.38%
==========================================
  Files          60       66       +6
  Lines        3410     3577     +167
==========================================
+ Hits         2544     2655     +111
- Misses        866      922      +56
```

☔ View full report in Codecov by Sentry.
Force-pushed from d32dc58 to 5c90b39.
A few minor comments
```python
    Load the pipeline config file

    :return: Parsed pipeline config file
    """
    with open(SCHEMA_DIR / "pipeline.json", "rt") as FILE:
```
Minor nit-pick, but all-caps variables are typically constants defined elsewhere (like you did with SCHEMA_DIR). Seeing FILE in all caps makes it look like one of them, but it isn't. Ideally this variable should be lower case to match its function. See here for more info: https://peps.python.org/pep-0008/#constants
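To make the suggestion concrete, a minimal sketch of the rename; the function name and the json.load call are assumptions, since the diff only shows the open line, and SCHEMA_DIR is the constant from the surrounding module:

```python
import json

def load_pipeline_config():
    """
    Load the pipeline config file

    :return: Parsed pipeline config file
    """
    # lower-case: an ordinary local variable, not a module-level constant
    with open(SCHEMA_DIR / "pipeline.json", "rt") as file:
        return json.load(file)
```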
```python
dataset = dataset_holder.value
errors = error_holder.value

logger.info(
```
Should there be any other logging statements above here to demarcate the different stages, instead of just one saying it's complete? Kind of like what you have in comments, but make them log statements instead, so that it records that the stream has been configured, cleaned, etc.
What I'm keen to ensure is that if things go wrong we have a nice trail showing what has successfully completed and what hasn't. One of the difficulties with Dagster is that things can be recorded out of order, so statements declaring "made it here" can be helpful when debugging.
```python
import xmlschema

import yaml
```
If we're using ruamel.yaml, should we be consistent and stick to only that library where possible? That would make it easier to maintain going forward, especially in light of what we now know about the pyyaml library.
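For reference, a minimal sketch of what standardising on ruamel.yaml would look like for a simple load; the filename is illustrative:

```python
from ruamel.yaml import YAML

yaml = YAML(typ="safe")  # safe loading, analogous to pyyaml's yaml.safe_load
with open("annex_a.yml", "rt") as f:
    config = yaml.load(f)
```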
Technically this is not something I have added here - it's come up as a change thanks to Git hook reformatting. I did think we'd probably want to change this, but that should happen in a separate issue, as this one just relates to the Annex A pipeline?
```python
schema = DataSchema(
    column_map={
        "list_1": {
            "Child Unique ID": Column(header_regex=["/.*child.*id.*/i"]),
```
Is this regex too catch-all? I think it's fine for the most part, but I worry that if a new, unrelated column name is added, this could cause unexpected issues: for example, "Childminder ID" would also match this pattern. I'm not sure how likely that is given the dataset, but thought I'd flag it just in case.
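A quick demonstration of the false positive, using Python's re module directly rather than the pipeline's slash-delimited pattern syntax; the tightened pattern is one hypothetical fix, not necessarily the right one for this dataset:

```python
import re

loose = re.compile(r".*child.*id.*", re.IGNORECASE)
print(bool(loose.match("Child Unique ID")))  # True, as intended
print(bool(loose.match("Childminder ID")))   # True -- the false positive

# Hypothetical tightening: also require the word "unique"
tight = re.compile(r".*child.*unique.*id.*", re.IGNORECASE)
print(bool(tight.match("Child Unique ID")))  # True
print(bool(tight.match("Childminder ID")))   # False
```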
Why have changes like this happened, where actual variables are being changed? Has this been done by a person or by a formatting tool? If the latter, are we confident that this isn't having any unintended knock-on effect?
Why is this here and what are the changes?
```python
from liiatools.common.spec.__data_schema import DataSchema
from liiatools.common.stream_pipeline import to_dataframe

from . import stream_filters
```
Really nit-picky, but I would personally rather name the package explicitly than use . here. However, I'm happy to be overruled on this by others.
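For comparison, the explicit form might look like this; the package path liiatools.annex_a is a guess at where this module lives, not confirmed by the diff:

```python
from liiatools.annex_a import stream_filters
```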
Again, probably very insignificant, but why has this happened? The docstring has changed the param names and description, but the function itself remains unchanged.
```
@@ -208,6 +214,9 @@ def move_concat_sensor(context):

    if run_records:  # Ensure there is at least one run record
        context.log.info(f"Run records found for reports job in move concat sensor")
        if "annex_a" in allowed_datasets:
            allowed_datasets.remove("annex_a")
            context.log.info(f"Annex A removed from reports job for move concat sensor")
```
I think we should move this up before the log statement, otherwise the log will say we'll move Annex A but then we won't.
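If I've read the concern correctly, the reordering would look roughly like this sketch; the "Moving datasets" log line is hypothetical, standing in for whichever statement announces what will be moved:

```python
# Update allowed_datasets before logging, so the log reflects
# what will actually be moved
if "annex_a" in allowed_datasets:
    allowed_datasets.remove("annex_a")
    context.log.info("Annex A removed from reports job for move concat sensor")
context.log.info(f"Moving datasets: {allowed_datasets}")
```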
- produces a cleanfile output of one CSV per Annex A list
- creates a reports output for each list for the region
- makes the usual logs and outputs (clean, concat, reports) available in the standard places in the infrastructure
- does not allow "current" and "aggregated" datasets to flow into final outputs