improperly formatted dates can cause models to fail #165
Comments
🎉 This issue has been resolved in version 1.2.2 🎉 The release is available on GitHub release Your semantic-release bot 📦🚀 |
This is still failing for MoH KE's data. |
🎉 This issue has been resolved in version 1.2.3 🎉 The release is available on GitHub release Your semantic-release bot 📦🚀 |
Failed again. This time it is because there are some genuinely invalid dates whose format still matches the regex,
e.g. '1980-00-21' |
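To illustrate the failure described above, here is a minimal Python sketch of why a shape-only check is not enough. The pattern `SHAPE_ONLY` is an assumption made for illustration (the regex used in the release is not shown in this thread), and Python is used only so the example runs without a database.

```python
import re
from datetime import date

# Hypothetical shape-only pattern: four digits, two digits, two digits.
# This is NOT the regex from the release; it is an assumption for illustration.
SHAPE_ONLY = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def matches_shape(value: str) -> bool:
    """Return True when the string merely looks like YYYY-MM-DD."""
    return SHAPE_ONLY.fullmatch(value) is not None


def is_real_date(value: str) -> bool:
    """Return True only when the string is an actual calendar date."""
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


sample = "1980-00-21"
# The value passes the shape check but is rejected as a calendar date.
print(matches_shape(sample), is_real_date(sample))
```

A cast guarded only by the shape check would still raise on this value, which matches the behaviour reported here.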
@njuguna-n @witash, what do you think about creating a custom generic test to assert a valid date? We could then use it on every column with that data type.
I will create a PR and work on it. |
@njuguna-n I don't know if you had any other thoughts on this, but I tried to come up with a solution that avoids all edge cases and it was surprisingly difficult... the only way I could think of was to add the month and leap year logic as CASE statements to the query. We could move it to a dbt macro to make the models less messy, but it still adds a lot of mess to the query. So maybe this PR is actually what we want long term: a simple regex to weed out anything too weird, and then if someone really wants to add February 31st, it's going to break the model, and someone has to fix it.
@lorerod tests for this would be great, and would also help them fix the data; for this issue we're mainly concerned with just getting the model to run, but the improperly formatted dates that it's able to handle are just going to be NULL in the model, not fixed.
Also, for MoH KE, I checked and they don't have any invalid dates apart from the one that was breaking the previous fix, so it should be ok:
cht_sync_db=# SELECT
cht_sync_db-# count(*),
cht_sync_db-# CASE WHEN couchdb.doc->>'date_of_birth' ~ '^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$'
cht_sync_db-# THEN
cht_sync_db-# (CASE
cht_sync_db(# WHEN -- Extract year, month, and day using substring
cht_sync_db(# -- Check for invalid months or days
cht_sync_db(# ((substring(couchdb.doc->>'date_of_birth', 6, 2))::int NOT BETWEEN 1 AND 12)
cht_sync_db(# OR ((substring(couchdb.doc->>'date_of_birth', 9, 2))::int < 1)
cht_sync_db(# THEN 'invalid_month'
cht_sync_db(#
cht_sync_db(# WHEN -- Step 3: Check for valid day of month based on the month
cht_sync_db(# (substring(couchdb.doc->>'date_of_birth', 6, 2) IN ('01', '03', '05', '07', '08', '10', '12')
cht_sync_db(# AND (substring(couchdb.doc->>'date_of_birth', 9, 2))::int <= 31)
cht_sync_db(# OR (substring(couchdb.doc->>'date_of_birth', 6, 2) IN ('04', '06', '09', '11')
cht_sync_db(# AND (substring(couchdb.doc->>'date_of_birth', 9, 2))::int <= 30)
cht_sync_db(# THEN 'valid'
cht_sync_db(#
cht_sync_db(# WHEN -- Step 4: Check for February (leap year or not)
cht_sync_db(# (substring(couchdb.doc->>'date_of_birth', 6, 2) = '02')
cht_sync_db(# THEN
cht_sync_db(# CASE
cht_sync_db(# -- Handle leap years
cht_sync_db(# WHEN (substring(couchdb.doc->>'date_of_birth', 1, 4))::int % 4 = 0
cht_sync_db(# AND ((substring(couchdb.doc->>'date_of_birth', 1, 4))::int % 100 != 0
cht_sync_db(# OR (substring(couchdb.doc->>'date_of_birth', 1, 4))::int % 400 = 0)
cht_sync_db(# AND (substring(couchdb.doc->>'date_of_birth', 9, 2))::int <= 29
cht_sync_db(# THEN 'valid'
cht_sync_db(#
cht_sync_db(# -- Not a leap year
cht_sync_db(# WHEN (substring(couchdb.doc->>'date_of_birth', 9, 2))::int <= 28
cht_sync_db(# THEN 'valid'
cht_sync_db(#
cht_sync_db(# ELSE 'invalid_date_leap_year'
cht_sync_db(# END
cht_sync_db(# ELSE 'invalid_date'
cht_sync_db(# END)
cht_sync_db-# ELSE 'invalid_format'
cht_sync_db-# END as valid_date
cht_sync_db-# from v1.contact
cht_sync_db-# inner join v1.medic couchdb on _id = uuid
cht_sync_db-# where
cht_sync_db-# couchdb.doc->>'date_of_birth' IS NOT NULL
cht_sync_db-# AND couchdb.doc->>'date_of_birth' <> ''
cht_sync_db-# group by valid_date
cht_sync_db-# ;
count | valid_date
----------+----------------
1 | invalid_date
4206 | invalid_format
26311747 | valid
|
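For readers who want to try the same bucketing without a database, the logic of the query above can be sketched in Python. This is a simplified sketch, not the query itself: it reuses the regex from the query, collapses the month and leap-year buckets into a single `invalid_date` label, and the sample values are made up for illustration.

```python
import calendar
import re
from collections import Counter

# Same pattern as in the query: YYYY-MM-DD with month 01-12 and day 01-31.
FORMAT = re.compile(r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$", re.ASCII)

# Days per month in a non-leap year, January first.
DAYS_IN_MONTH = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)


def classify(value: str) -> str:
    """Label a string as 'invalid_format', 'invalid_date' or 'valid'."""
    if FORMAT.fullmatch(value) is None:
        return "invalid_format"
    year, month, day = (int(part) for part in value.split("-"))
    limit = DAYS_IN_MONTH[month - 1]
    # February gets a 29th day in leap years (divisible by 4, except
    # century years that are not divisible by 400).
    if month == 2 and calendar.isleap(year):
        limit = 29
    return "valid" if day <= limit else "invalid_date"


# Made-up sample values, not taken from the MoH KE data.
samples = ["1990-05-17", "2023-02-31", "2000-02-29", "1900-02-29", "17/05/1990"]
counts = Counter(classify(value) for value in samples)
print(counts)
```

The regex alone accepts `2023-02-31`, so the per-month limit is what separates `valid` from `invalid_date`.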
@witash yes, I have also been looking into it and was also surprised that it is not as straightforward to implement. Having invalid dates break the models is not ideal but I agree that the goal right now is to get the models building and the dashboards up to date. |
I think we should not overcomplicate the base model. I'm thinking about the long term here, where some project "breaks" their data structure in such a way that models start failing. Another question I have about detecting date-like strings is whether macros can be overloaded. Say we add a default date transform macro to the pipeline, but Kenya decides they really have four types of date formats and want to accommodate that. Can they write their own macro that gets used instead of the default macro? |
I was wondering about this, too. I will implement some test validations for the type of date format we are handling, but others may have different formats. Thinking about the long-term solution. The ideal should be to clean or fix the data early in the pipeline. Setting fields to NULL is only hiding the problem. But do we want the entire build to fail? |
This is not really possible. If there is a need to change or extend the base models, or a macro in cht-pipeline, it can't be changed directly; instead it has to be copied to a new model or macro. |
I think we should not be doing any date casting in the base models and should leave that to the project specific models. Date formats are too varied and imposing one or a few in the base models would be cumbersome to override as Tom pointed out above. Right now we are only doing it in the |
Yeah, ok, that makes sense. Ideally something as simple and universal as date of birth should be in the base models, and the idea was also to align the person and place models with the API as much as possible, but I agree that it's going to be too error prone and complicated long term. |
Removing the column is not a breaking change for MoH KE models the The macro below seems to work well for the
|
The macro worked well for the patient_f_client model. I am waiting for the latest run to complete without errors to confirm that removing the date_of_birth column from the person column did not break anything in the MoH KE models (I tested locally but just want to be sure) and then I can close this issue.
|
Ok, I will cancel the work on the test for these. I was implementing something similar to the macro shared by @njuguna-n. I'm confident we can develop a way of validating dates and configuring them depending on date formats. Implementing this in base models may be important so it can be used as a reference for other projects. |
All models are building successfully after this PR was merged. |
improperly formatted dates can cause models to fail
e.g. for the person table
Date fields should be as permissive as possible, and if a truly unformattable date is found, it should default to NULL instead of raising an error, which breaks the entire table.
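The behaviour requested here can be sketched in Python as a parser that tries several formats and falls back to `None` (the analogue of NULL) instead of raising. The function name `parse_date_or_none` and the `FORMATS` list are hypothetical; this is an illustration of the idea, not the fix that was merged.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical list of accepted formats; a real project would choose its own.
FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def parse_date_or_none(raw: Optional[str]) -> Optional[date]:
    """Try each known format in turn; return None when none of them fits."""
    if raw is None:
        return None
    text = raw.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            # Wrong shape or impossible calendar date: try the next format.
            continue
    return None


# One bad value no longer stops the rest of the rows from being processed.
rows = ["1980-01-21", "21/01/1980", "1980-00-21", ""]
print([parse_date_or_none(value) for value in rows])
```

As noted in the comments, values that come back as `None` are hidden rather than fixed, so a separate data test is still useful for surfacing them.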