Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improperly formatted dates can cause models to fail #165

Closed
witash opened this issue Sep 24, 2024 · 18 comments · Fixed by #167
Closed

improperly formatted dates can cause models to fail #165

witash opened this issue Sep 24, 2024 · 18 comments · Fixed by #167
Assignees
Labels
Type: Bug Fix something that isn't working as intended

Comments

@witash
Copy link
Contributor

witash commented Sep 24, 2024

improperly formatted dates can cause models to fail

e.g. for the person table

06:59:47    Database Error in model person (models/contacts/person.sql)
  date/time field value out of range: "19/11/1982"
  HINT:  Perhaps you need a different "datestyle" setting.

date fields should be as permissible as possible, and if a truly unformattable date is found, it should default to NULL instead of raising an error, which will break the entire table.

@witash witash added the Type: Bug Fix something that isn't working as intended label Sep 24, 2024
@njuguna-n
Copy link
Contributor

@witash I just opened an issue about the same thing 😃. I'll close mine.

@medic-ci
Copy link

🎉 This issue has been resolved in version 1.2.2 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@witash
Copy link
Contributor Author

witash commented Sep 24, 2024

this is still failing for moh ke's data

@witash witash reopened this Sep 24, 2024
medic-ci pushed a commit that referenced this issue Sep 24, 2024
## [1.2.3](v1.2.2...v1.2.3) (2024-09-24)

### Bug Fixes

* **#165:** removing potentially ambiguous date format ([26f4da7](26f4da7)), closes [#165](#165)
@medic-ci
Copy link

🎉 This issue has been resolved in version 1.2.3 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@witash
Copy link
Contributor Author

witash commented Sep 24, 2024

failed again. this time because there are indeed some invalid dates with a format that matches the regex

#170 (comment)

This condition still isn't perfect, because there's a possibility for errors when something is formatted like 1234-56-78 but is not a valid date

e.g. '1980-00-21'
so this regex thing isn't going work, we need a new approach

@witash witash reopened this Sep 24, 2024
@andrablaj andrablaj moved this from Done to In Progress in Product Team Activities Sep 24, 2024
@njuguna-n
Copy link
Contributor

@witash I have a PR that handles this new error.

@lorerod
Copy link
Contributor

lorerod commented Sep 24, 2024

@njuguna-n @witash, what do you think about creating a custom generic test to assert a valid date? And we can use it in every column with the data type.

  1. Ensure the date matches the format YYYY-MM-DD
  2. Invalid when TO_DATE conversion doesn't match
  3. Invalid for a future date? (this will depend on the column)
  4. Special case for leap years: ensure Feb 29 only exists on leap years
  5. Validate the day range based on the month (to avoid dates like "2023-02-31")

I will create a PR and work on it.

@witash
Copy link
Contributor Author

witash commented Sep 25, 2024

@njuguna-n I don't know if you had any other thoughts on this, but I tried to come up with a solution that avoids all edge cases and it was surprisingly difficult...the only way I could think of was to do add the month and leap year logic as cases statements to the query. We could move it to a dbt macro to make the models less messy, but it still just adds a lot of mess to the query.

So maybe this PR is actually what we want long term; a simple regex to weed out anything too weird, and then if someone really wants to add February 31st, its going to break the model, and someone has to fix it.

@lorerod tests for this would be great, and would also help them fix the data; for this issue we're mainly concerned with just getting the modle to run, but for those improperly formatted dates that its able to handle, they're just going to be NULL in the model, not fixed.

Also, for MoH KE, I checked and they don't have any invalid dates apart from the one that was breaking the previous fix, so it should be ok

cht_sync_db=# SELECT
cht_sync_db-#   count(*),
cht_sync_db-#   CASE WHEN couchdb.doc->>'date_of_birth' ~ '^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$'
cht_sync_db-#   THEN
cht_sync_db-#     (CASE
cht_sync_db(#         WHEN -- Extract year, month, and day using substring
cht_sync_db(#             -- Check for invalid months or days
cht_sync_db(#             ((substring(couchdb.doc->>'date_of_birth', 6, 2))::int NOT BETWEEN 1 AND 12)
cht_sync_db(#             OR ((substring(couchdb.doc->>'date_of_birth', 9, 2))::int < 1)
cht_sync_db(#         THEN 'invalid_month'
cht_sync_db(#
cht_sync_db(#         WHEN -- Step 3: Check for valid day of month based on the month
cht_sync_db(#             (substring(couchdb.doc->>'date_of_birth', 6, 2) IN ('01', '03', '05', '07', '08', '10', '12')
cht_sync_db(#             AND (substring(couchdb.doc->>'date_of_birth', 9, 2))::int <= 31)
cht_sync_db(#             OR (substring(couchdb.doc->>'date_of_birth', 6, 2) IN ('04', '06', '09', '11')
cht_sync_db(#             AND (substring(couchdb.doc->>'date_of_birth', 9, 2))::int <= 30)
cht_sync_db(#         THEN 'valid'
cht_sync_db(#
cht_sync_db(#         WHEN -- Step 4: Check for February (leap year or not)
cht_sync_db(#             (substring(couchdb.doc->>'date_of_birth', 6, 2) = '02')
cht_sync_db(#             THEN
cht_sync_db(#                 CASE
cht_sync_db(#                     -- Handle leap years
cht_sync_db(#                     WHEN (substring(couchdb.doc->>'date_of_birth', 1, 4))::int % 4 = 0
cht_sync_db(#                         AND ((substring(couchdb.doc->>'date_of_birth', 1, 4))::int % 100 != 0
cht_sync_db(#                         OR (substring(couchdb.doc->>'date_of_birth', 1, 4))::int % 400 = 0)
cht_sync_db(#                         AND (substring(couchdb.doc->>'date_of_birth', 9, 2))::int <= 29
cht_sync_db(#                     THEN 'valid'
cht_sync_db(#
cht_sync_db(#                     -- Not a leap year
cht_sync_db(#                     WHEN (substring(couchdb.doc->>'date_of_birth', 9, 2))::int <= 28
cht_sync_db(#                     THEN 'valid'
cht_sync_db(#
cht_sync_db(#                     ELSE 'invalid_date_leap_year'
cht_sync_db(#                 END
cht_sync_db(#         ELSE 'invalid_date'
cht_sync_db(#     END)
cht_sync_db-#   ELSE 'invalid_format'
cht_sync_db-#   END as valid_date
cht_sync_db-# from v1.contact
cht_sync_db-# inner join v1.medic couchdb on _id = uuid
cht_sync_db-# where
cht_sync_db-#   couchdb.doc->>'date_of_birth' IS NOT NULL
cht_sync_db-#   AND couchdb.doc->>'date_of_birth' <> ''
cht_sync_db-# group by valid_date
cht_sync_db-# ;

  count   |   valid_date
----------+----------------
        1 | invalid_date
     4206 | invalid_format
 26311747 | valid

@njuguna-n
Copy link
Contributor

@witash yes, I have also been looking into it and was also surprised that it is not as straightforward to implement. Having invalid dates break the models is not ideal but I agree that the goal right now is to get the models building and the dashboards up to date.

@dianabarsan
Copy link
Member

I think we should not overcomplicate the base model.
Is there a way to alert on models not getting built or failing?

I'm thinking on the long term here, where some project "breaks" their data structure in such a way that models start failing.
If we silently ignore errors (and set fields as NULL), seeing that the model "broke" needs someone with a keen eye noticing data inconsistencies in the dashboards.

Another question I have about detecting date-like strings is whether macros can get overloaded. Say we add a default date transform macro to pipeline, but kenya decides they really have four types of date formats and want to accommodate to that. Can they write their own macro that gets used instead of the default macro?

@lorerod
Copy link
Contributor

lorerod commented Sep 25, 2024

Can they write their own macro that gets used instead of the default macro?

I was wondering about this, too. I will implement some test validations for the type of date format we are handling, but others may have different formats.

Thinking about the long-term solution. The ideal should be to clean or fix the data early in the pipeline. Setting fields to NULL is only hiding the problem. But do we want the entire build to fail?
An idea could be to redirect invalid data to a separate model, like a quarantine model, for review and cleaning up.
But this has the same problem with overcomplicating the model with validations to redirect the data or not. It also adds maintenance costs to the quarantine model to periodically clean it.

@witash
Copy link
Contributor Author

witash commented Sep 26, 2024

Can they write their own macro that gets used instead of the default macro?

This is not really possible. If there is a need to change or extend the base models, or a macro in cht-pipeline, it can't be changed directly, instead copied to a new model or macro.

@njuguna-n
Copy link
Contributor

I think we should not be doing any date casting in the base models and should leave that to the project specific models. Date formats are too varied and imposing one or a few in the base models would be cumbersome to override as Tom pointed out above.

Right now we are only doing it in the person model for date_of_birth column and removing it would be a breaking change for other MoH KE models. I am exploring where that column is being used and possibilities of a refactor

@witash
Copy link
Contributor Author

witash commented Sep 26, 2024

Yea ok that makes sense. Ideally something as simple and universal as date of birth should be in base models, and the idea also was to align person and place models with the API as much as possible, but I agree that its going to be too error prone and complicated long term.

@njuguna-n
Copy link
Contributor

Removing the column is not a breaking change for MoH KE models the date_of_birth column used is one defined in the patient_f_client model so I have created a PR removing the column.

The macro below seems to work well for the YYYY-MM-DD format expected in the patient_f_client model 🤞

{% macro validate_date(date_column) %}
  (
    CASE
      -- First check if the date_column is in the correct format 'YYYY-MM-DD'
      WHEN {{ date_column }} ~ '^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$'
      THEN
        CASE
          -- Handle months with 30 days (April, June, September, November)
          WHEN SUBSTRING({{ date_column }} FROM 6 FOR 2) IN ('04', '06', '09', '11') 
            AND SUBSTRING({{ date_column }} FROM 9 FOR 2)::int <= 30
          THEN to_date({{ date_column }}, 'YYYY-MM-DD')
          
          -- Handle February, leap year and non-leap year cases
          WHEN SUBSTRING({{ date_column }} FROM 6 FOR 2) = '02'
          THEN
            CASE
              -- Check for leap year: divisible by 4, but not by 100 unless also divisible by 400
              WHEN (SUBSTRING({{ date_column }} FROM 1 FOR 4)::int % 4 = 0 
                    AND (SUBSTRING({{ date_column }} FROM 1 FOR 4)::int % 100 != 0 
                         OR SUBSTRING({{ date_column }} FROM 1 FOR 4)::int % 400 = 0))
                    AND SUBSTRING({{ date_column }} FROM 9 FOR 2)::int <= 29
              THEN to_date({{ date_column }}, 'YYYY-MM-DD')
              -- Non-leap year case: February can have only up to 28 days
              WHEN SUBSTRING({{ date_column }} FROM 9 FOR 2)::int <= 28
              THEN to_date({{ date_column }}, 'YYYY-MM-DD')
              ELSE NULL  -- Invalid day in February
            END
          
          -- Handle months with 31 days (January, March, May, July, August, October, December)
          WHEN SUBSTRING({{ date_column }} FROM 6 FOR 2) IN ('01', '03', '05', '07', '08', '10', '12') 
            AND SUBSTRING({{ date_column }} FROM 9 FOR 2)::int <= 31
          THEN to_date({{ date_column }}, 'YYYY-MM-DD')
          ELSE NULL
        END
      ELSE
        NULL
    END
  )
{% endmacro %}

@njuguna-n
Copy link
Contributor

The macro worked well for the patient_f_client model. I am waiting for the latest run to complete without errors to confirm that removing the date_of_birth column from the person column did not break anything in the MoH KE models (I tested locally but just want to be sure) and then I can close this issue.

11:58:47  1 of 1 START sql incremental model v1.patient_f_client ......................... [RUN]
12:34:11  1 of 1 OK created sql incremental model v1.patient_f_client .................... [SELECT 26137682 in 2124.47s]

@lorerod
Copy link
Contributor

lorerod commented Sep 26, 2024

Ok, I will cancel the work on the test for these. I was implementing something similar to the macro shared by @njuguna-n. I'm confident we can develop a way of validating dates and configuring them depending on date formats. Implementing this in base models may be important so it can be used as a reference for other projects.

@njuguna-n
Copy link
Contributor

All models being built successfully after this PR merge.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Product Team Activities Sep 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Bug Fix something that isn't working as intended
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

6 participants