Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhancements to CONSTRUCTION_AGE_BAND, not allowing for unknowns in UPRN and fix heating features #74

Merged
merged 4 commits into from
Jun 7, 2023

Conversation

sofiapinto
Copy link
Contributor

@sofiapinto sofiapinto commented May 26, 2023

Description

This PR fixes the following issues in the EPC processing pipeline:

  • UPRN with value "unknown" #71: with the changes we're not allowing for UPRN column to take the “unknown” value when missing (as it can be misleading since this is an identifier). We also fixed an existing bug when dropping duplicates (when creating EPC_processed_and_deduplicated.csv): previously, all EPC entries with UPRN missing where considered to be the same property when dropping duplicates, although they in fact represent multiple properties. I tried a temporary fix for this bug, so that when UPRN is missing, we now use ADDRESS1, ADDRESS2 and POSTCODE to drop duplicates (there might be a better way to this!). It still needs to be fixed further as per issue Create unique identifier for properties #72, but that’ll be for a next PR (ideally, we would have a true unique identifier for properties and use it when dropping duplicates). Changes live in feature_engineering.py and epc_data.py
  • Enhance CONSTRUCTION_AGE_BAND #68: currently this variable has approximately 10% missing values. We enhance it by filling the missing values with inspection year if transaction type is “new dwelling”. With this enhancement, missing values decreases to 1%. Changes live in feature_engineering.py and data_cleaning.py.
    • Maybe instead of having the enhance_construction_age_band() call in clean_epc_data() in line 520 maybe it could be called in clean_CONSTRUCTION_AGE_BAND() in line 158 – any strong feelings?
    • Can you double check the else if’s in enhance_construction_age_band() make sense? I am only wondering about the last ones which depend on COUNTRY (I don’t think the year needs to be in the if statement at that point, but I left it there since it helps with readability).
  • Issue in pipeline/preprocessing/feature_engineering.py #70: the get_heating_features() function in feature_engineering.py had a bug. We fix this bug (by allowing for None’s to be treated as NaNs) and improve the function by removing the for loop and using .apply() and np.where() – this speeds up the processing pipeline by a few minutes and improves code readability. I double checked the processed data before and after this change and a few values in the variables created by this function are different. The reason is that sometimes MAINHEAT_DESCRIPTION contains multiple applicable values (e.g. “gas” and “electric” for HEATING_FUEL) and depending on the order by which we check the presence of these in MAINHEAT_DESCRIPTION, we either return one or the other. This is something we should fix in the future (issue Deal with multiple heating systems and fuel types for the same EPC record #73) by:
    • Considering all the applicable values (e.g. HEATING_FUEL=”gas and electric”, HEATING_SYSTEM=”boiler and radiators and underfloor heating“);
    • Or by only considering the main value (e.g. main heating fuel for the property is gas, so we consider gas, although the house also has electric heaters);

closes #68
closes #70
closes #71

Instructions for Reviewer(s)

Review

Dear @ch-williamson ,

It would be great if you could review the changes to the following scripts:

  • asf_core_data/pipeline/preprocessing/feature_engineering.py
  • asf_core_data/pipeline/preprocessing/data_cleaning.py
  • asf_core_data/getters/epc/epc_data.py

@sqr00t / @Jack-Vines - tagging you FYI

Setup

In case you want/need to run anything:

  • clone this repo: git clone [email protected]:nestauk/asf_core_data.git
  • checkout to the correct branch: git checkout 70_issue_feature_engineeringpy
  • Run make install;
  • Run direnv allow;
  • Activate the conda enviroment: conda activate asf_core_data;

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained the feature in this PR or (better) in output/reports/
  • I have requested a code review

Copy link
Contributor

@ch-williamson ch-williamson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Sofia, looks good - have just suggested some little optimisations. Thanks for fixing this!

On your comment re where cleaning the construction age band should live - it might make more sense for it to be part of clean_CONSTRUCTION_AGE_BAND, but I don't feel too strongly about it :)

asf_core_data/getters/epc/epc_data.py Outdated Show resolved Hide resolved
asf_core_data/getters/epc/epc_data.py Outdated Show resolved Hide resolved
asf_core_data/getters/epc/epc_data.py Outdated Show resolved Hide resolved
@sofiapinto sofiapinto merged commit 4ddee1b into dev Jun 7, 2023
@sofiapinto sofiapinto deleted the 70_issue_feature_engineeringpy branch June 7, 2023 20:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants