Enhancements to CONSTRUCTION_AGE_BAND, not allowing for unknowns in UPRN and fix heating features #74
Description
This PR fixes the following issues in the EPC processing pipeline:
UPRN with value "unknown" #71: with these changes, the UPRN column no longer takes the "unknown" value when it is missing (that value can be misleading, since UPRN is an identifier). We also fixed an existing bug when dropping duplicates (when creating EPC_processed_and_deduplicated.csv): previously, all EPC entries with a missing UPRN were considered to be the same property when dropping duplicates, although they in fact represent multiple properties. I tried a temporary fix for this bug: when UPRN is missing, we now use ADDRESS1, ADDRESS2 and POSTCODE to drop duplicates (there might be a better way to do this!). It still needs to be fixed further as per issue Create unique identifier for properties #72, but that will be for a future PR (ideally, we would have a true unique identifier for properties and use it when dropping duplicates). Changes live in feature_engineering.py and epc_data.py.
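As an illustration of the duplicate-dropping behaviour described above, here is a minimal pandas sketch. It is not the code from this PR: the helper name drop_duplicate_properties, the sample data, and the choice of keep="last" (which assumes rows are already sorted so that the most recent record comes last) are all assumptions made for the example.

```python
import pandas as pd


def drop_duplicate_properties(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per property (hypothetical helper, not the PR's function).

    Rows with a UPRN are deduplicated on UPRN alone; rows without one fall
    back to ADDRESS1, ADDRESS2 and POSTCODE.
    """
    has_uprn = df["UPRN"].notna()
    # keep="last" is an assumption: it presumes the most recent record is last.
    with_uprn = df[has_uprn].drop_duplicates(subset=["UPRN"], keep="last")
    without_uprn = df[~has_uprn].drop_duplicates(
        subset=["ADDRESS1", "ADDRESS2", "POSTCODE"], keep="last"
    )
    return pd.concat([with_uprn, without_uprn]).sort_index()


# Made-up sample: one property with a UPRN (two records) and two distinct
# properties without a UPRN (three records in total).
sample = pd.DataFrame(
    {
        "UPRN": [1, 1, None, None, None],
        "ADDRESS1": ["1 A St", "1 A St", "2 B Rd", "2 B Rd", "3 C Ln"],
        "ADDRESS2": ["", "", "", "", ""],
        "POSTCODE": ["AA1 1AA", "AA1 1AA", "BB2 2BB", "BB2 2BB", "CC3 3CC"],
    }
)

# Deduplicating on UPRN alone treats all missing UPRNs as one property.
naive = sample.drop_duplicates(subset=["UPRN"])
deduplicated = drop_duplicate_properties(sample)
```

In this sample the UPRN-only approach collapses the two properties without a UPRN into a single row, while the address fallback keeps them apart.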
CONSTRUCTION_AGE_BAND #68: currently this variable has approximately 10% missing values. We enhance it by filling the missing values with the inspection year if the transaction type is "new dwelling". With this enhancement, the share of missing values decreases to 1%. Changes live in feature_engineering.py and data_cleaning.py
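The fill rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name ends in _sketch to mark it as hypothetical, the column names TRANSACTION_TYPE and INSPECTION_DATE are assumed, and the sketch writes the bare year as a string, whereas the real enhance_construction_age_band() may map the year to an age band (including the COUNTRY-dependent cases).

```python
import pandas as pd


def enhance_construction_age_band_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing CONSTRUCTION_AGE_BAND for new dwellings (illustrative only)."""
    df = df.copy()
    missing = df["CONSTRUCTION_AGE_BAND"].isna()
    # Missing transaction types compare as False, so they are never filled.
    new_dwelling = df["TRANSACTION_TYPE"].str.lower().eq("new dwelling")
    # Unparseable or missing dates become NaT, so their year is NaN.
    year = pd.to_datetime(df["INSPECTION_DATE"], errors="coerce").dt.year
    fill = missing & new_dwelling & year.notna()
    # Assumption: store the year itself; the real code may store a band label.
    df.loc[fill, "CONSTRUCTION_AGE_BAND"] = year[fill].astype(int).astype(str)
    return df


# Made-up records for illustration.
records = pd.DataFrame(
    {
        "CONSTRUCTION_AGE_BAND": ["England and Wales: 1950-1966", None, None, None],
        "TRANSACTION_TYPE": ["marketed sale", "new dwelling", "rental", "new dwelling"],
        "INSPECTION_DATE": ["2012-03-01", "2015-06-30", "2018-01-15", None],
    }
)
enhanced = enhance_construction_age_band_sketch(records)
```

Only the second record is filled here: the third is not a new dwelling and the fourth has no inspection date to take a year from.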
The enhance_construction_age_band() call is in clean_epc_data() (line 520); maybe it could be called in clean_CONSTRUCTION_AGE_BAND() (line 158) instead. Any strong feelings?
Does enhance_construction_age_band() make sense? I am only wondering about the last ones, which depend on COUNTRY (I don't think the year needs to be in the if statement at that point, but I left it there since it helps with readability).
pipeline/preprocessing/feature_engineering.py
#70: the get_heating_features() function in feature_engineering.py had a bug. We fix this bug (by allowing None values to be treated as NaNs) and improve the function by removing the for loop and using .apply() and np.where(). This speeds up the processing pipeline by a few minutes and improves code readability. I double-checked the processed data before and after this change, and a few values in the variables created by this function are different. The reason is that MAINHEAT_DESCRIPTION sometimes contains multiple applicable values (e.g. "gas" and "electric" for HEATING_FUEL) and, depending on the order in which we check for their presence in MAINHEAT_DESCRIPTION, we return either one or the other. This is something we should fix in the future (issue Deal with multiple heating systems and fuel types for the same EPC record #73), for example by returning every applicable value (e.g. HEATING_FUEL="gas and electric", HEATING_SYSTEM="boiler and radiators and underfloor heating").
closes #68
closes #70
closes #71
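To make the order dependence mentioned for get_heating_features() concrete, here is a small vectorised sketch using np.where(). It is an assumption-laden illustration, not the function from this PR: the name get_heating_fuel_sketch, the priority parameter, the two-fuel vocabulary, and the "unknown" fallback are all invented for the example.

```python
import numpy as np
import pandas as pd


def get_heating_fuel_sketch(df, priority=("gas", "electric")):
    """Derive HEATING_FUEL from MAINHEAT_DESCRIPTION (illustrative only).

    Fuels earlier in `priority` win when a description mentions several.
    """
    df = df.copy()
    # None and NaN are handled alike: both become an empty description.
    desc = df["MAINHEAT_DESCRIPTION"].fillna("").astype(str).str.lower()
    fuel = np.full(len(df), "unknown", dtype=object)  # assumed fallback value
    # Apply the lowest priority first so higher priorities overwrite it.
    for name in reversed(priority):
        fuel = np.where(desc.str.contains(name, regex=False), name, fuel)
    df["HEATING_FUEL"] = fuel
    return df


# Made-up descriptions; the last one mentions both fuels.
heating = pd.DataFrame(
    {
        "MAINHEAT_DESCRIPTION": [
            "Boiler and radiators, mains gas",
            None,
            "Electric storage heaters",
            "Boiler and radiators, mains gas, electric underfloor heating",
        ]
    }
)
gas_first = get_heating_fuel_sketch(heating, priority=("gas", "electric"))
electric_first = get_heating_fuel_sketch(heating, priority=("electric", "gas"))
```

The two calls agree on every record except the last one, whose description mentions both fuels; that record is the kind of case issue #73 is about.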
Instructions for Reviewer(s)
Review
Dear @ch-williamson,
It would be great if you could review the changes to the following scripts:
asf_core_data/pipeline/preprocessing/feature_engineering.py
asf_core_data/pipeline/preprocessing/data_cleaning.py
asf_core_data/getters/epc/epc_data.py
@sqr00t / @Jack-Vines - tagging you FYI
Setup
In case you want/need to run anything:
git clone git@github.com:nestauk/asf_core_data.git
git checkout 70_issue_feature_engineeringpy
make install
direnv allow
conda activate asf_core_data
Checklist:
notebooks/
pre-commit and addressed any issues not automatically fixed
dev
READMEs
output/reports/