Update Location feature for better accuracy #288

Merged
merged 26 commits into from
Mar 27, 2024


ajbarnes
Contributor

@ajbarnes ajbarnes commented Mar 4, 2024

Users were observing that some VA locations were either not being matched or, worse, being matched to an incorrect location. To investigate, the location logic was pulled into a sandbox and run with test data. With the baseline logic, comparing each VA's raw province, district, and hospital against the location assigned to that VA gave:

17771 matches (28.34%)
34850 partials (55.58%)
1480 mismatches (2.36%)
4934 wrong (7.87%)
3663 invalid (5.84%)

where

  • match = everything correct
  • partial = ONLY hospital or (less likely) ONLY district/province is wrong
  • mismatch = BOTH district and hospital are wrong or (less likely) province/district or province/hospital are wrong
  • wrong = everything is totally off
  • invalid = at least one value was missing/None/NaN so a comparison could not be made
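
The categories above can be read as a count of how many of the three fields disagree. The following is a minimal sketch of that reading, not the sandbox code itself: the `classify` function, the dict-based record shape, and the field names are assumptions made for illustration.

```python
import math

FIELDS = ("province", "district", "hospital")

# Number of disagreeing fields -> category, per the definitions above.
LABELS = {0: "match", 1: "partial", 2: "mismatch", 3: "wrong"}


def _is_missing(value):
    """True for None or a float NaN (the 'missing/None/NaN' cases)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def classify(raw, assigned):
    """Compare a VA's raw location fields against its assigned location.

    Both arguments are assumed to be dicts keyed by FIELDS; a missing key
    is treated the same as None.
    """
    values = [raw.get(f) for f in FIELDS] + [assigned.get(f) for f in FIELDS]
    if any(_is_missing(v) for v in values):
        return "invalid"
    wrong_count = sum(raw[f] != assigned[f] for f in FIELDS)
    return LABELS[wrong_count]
```

Under this reading, a record whose hospital alone differs is a partial, while a record with a NaN district is invalid regardless of the other fields.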

Looking into the causes, it was decided that the hospital list used to initialize locations within VA Explorer was outdated; that the location matching logic should be updated to use exact matching instead of fuzzy matching now that processes have matured; and that location ingest needed new features to handle hospitals that previously submitted VAs but now exist only historically. Additionally, a closer look at the underlying data revealed a few data quality issues with historical VAs that prevented proper assignment.

After updating the logic for location matching and re-implementing load_locations to handle new hospital lists (e.g., an active vs. inactive column indicating whether a facility should be providing new VAs), re-running the sandbox reporting revealed significant improvements:

69797 matches (93.67%)
584 partials (0.78%)
0 mismatches (0.00%)
4135 wrong (5.55%)
0 invalid (0.00%)

Further, simulating the proposed fixes to the underlying test data and then using that fixed data instead of the raw test data in this sandbox comparison gave the expected 100% match:

74516 matches (100.00%)
0 partials (0.00%)
0 mismatches (0.00%)
0 wrong (0.00%)
0 invalid (0.00%)

At a high level, in this PR:

  • load_locations now takes a simple list of hospitals with supporting metadata (i.e., ["province", "district", "key", "name", "status"])
  • load_locations automatically parses all the hospitals out into an anytree tree data structure and then converts that tree into the django-treebeard style tree that we store in our database.
  • Adding a new hospital row to the csv and running load_locations with it adds a new hospital; this creation handles the edge case of duplicate rows by ignoring the duplicate
  • Updating an existing hospital (ex. changing from active to inactive) is as easy as updating the relevant csv row and running load_locations again
  • If a row is deleted from the csv, the hospital associated with it will also be deleted from the database (should we keep deletion?)
  • During VA ingest, at the location assignment step, we no longer use build_location_mapper (and thus no longer use fuzzy_match). Instead, we use the new Location.key property to create an exact mapping of the hospital XML value to the actual name straight from the database, and the Location.path_string property to handle duplicates (the same hospital name in different geographies; for example, an 'other' hospital existing in multiple places)
  • Location assignment now handles ambiguity by searching for a match via the VA.province and VA.area substrings within the new Location.path_string
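
The exact-match flow described in the list above can be sketched in plain Python, without Django, anytree, or django-treebeard. This is an illustration under stated assumptions, not the code in this PR: `build_index` and `assign_location` are hypothetical helpers, the sample rows are invented, and the `province/district/name` path format is only a stand-in for whatever Location.path_string actually contains.

```python
import csv
import io
from collections import defaultdict

# Invented sample hospital list using the column names from the description.
HOSPITAL_CSV = """province,district,key,name,status
North,Alpha,hosp_a,Hospital A,active
North,Alpha,hosp_a,Hospital A,active
North,Beta,other,Other,active
South,Gamma,other,Other,inactive
"""


def build_index(csv_text):
    """Map each hospital key to {path_string: status} for every place it appears."""
    index = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumed path format; the real Location.path_string may differ.
        path = "/".join([row["province"], row["district"], row["name"]])
        # Writing the same path twice is a no-op, so duplicate rows are ignored.
        index[row["key"]][path] = row["status"]
    return index


def assign_location(index, hospital_key, province, area):
    """Return the single matching path string, or None if none or ambiguous."""
    candidates = list(index.get(hospital_key, {}))
    if len(candidates) > 1:
        # Same key in several geographies: narrow by province/area substrings.
        candidates = [
            path
            for path in candidates
            if (province or "") in path and (area or "") in path
        ]
    return candidates[0] if len(candidates) == 1 else None
```

With the sample rows, the key `other` exists in two geographies, so it resolves only when the province and area narrow the candidates to one path. A key that appears once resolves directly from the exact key lookup.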

ajbarnes and others added 25 commits February 9, 2024 14:36
…erties. Remove support for name based export filtering regarding locations since names no longer likely to be unique
…ts in the data model, mark it as inactive instead of deleting
@ajbarnes ajbarnes merged commit 071a7b2 into main Mar 27, 2024
6 checks passed
2 participants