Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

65 reweighting Scottish EPC records #90

Open
wants to merge 13 commits into
base: dev
Choose a base branch
from

Conversation

crispy-wonton
Copy link
Collaborator

@crispy-wonton crispy-wonton commented Nov 29, 2024

Fixes #65 and #89

Description

Add reweighting of Scottish EPC records into pipeline. Summary of changes:

  • config/base.yaml - add new Scottish census datasets for reweighting
  • get_target.py - add new getters to get Scottish target data for reweighting. (for property type and tenure but not build year)
  • prepare_target.py - use new getter to prepare target dataset for reweighting
  • prepare_sample.py - changes to shorten category names in property type feature
  • run_compute_epc_weights.py - update run script to reweight Scotland with 2 features and England and Wales with 3
  • reweight_epc.py - add TODO note

NB: information about data sources will be added to the README in a separate PR.

Instructions for Reviewer

To run code:
python -i asf_heat_pump_suitability/pipeline/run_scripts/run_compute_epc_weights.py --epc s3://asf-daps/lakehouse/processed/epc/old/deduplicated/processed_dedupl-0.parquet -y 2023 -q 4

NB: The whole pipeline will take a few hours to run so please could you just test to make sure it works for Scotland and then you can interrupt the run.

Here are the latest outputs of the pipeline which you can load and check instead of running it all through:
Weights: s3://asf-heat-pump-suitability/outputs/2023Q4/20241129_2023_Q4_EPC_weights.parquet
Stats: s3://asf-heat-pump-suitability/outputs/2023Q4/20241129_2023_Q4_EPC_weights_stats.parquet

Please pay special attention to ...

  • Reweighting has been applied correctly to each nation in the code. I.e. Scotland is reweighted appropriately on property_type and tenure features only, whereas England and Wales should also be reweighted on build_year
  • Joins to make sure they are correct and we are not accidentally losing data we should keep. My intention is to retain all the rows that get assigned a weight, but not to keep the unweighted ones. The idea is to join this dataset back onto the full EPC dataset in a separate processing script.
  • The preprocessing steps in run_compute_epc_weights.py before reweighting occurs to ensure they are applied in the appropriate order before and after filtering for country
  • Feel free to suggest places where we can add test/assertions to ensure the pipeline is executing the expected behaviours. I'd like to look at this properly myself later and add some in a separate PR.

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

@crispy-wonton crispy-wonton marked this pull request as ready for review November 29, 2024 17:46
Copy link
Collaborator

@helloaidank helloaidank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Roisín,

Great piece of coding! 👍
I can imagine it must have been mildly annoying to create two separate sets of getter functions for EW and Scottish data, but you've done a good job. Thank you for clearly documenting your expectations in the PR—those guidelines were really helpful to follow! 👍

Minor Suggestions:

  • Config File: A few tweaks to improve clarity.
  • Documentation: Some additional details could make the code easier to understand for others.
  • Assertions: I’ve added a couple of suggestions.

Output Review:
I checked the output files and performed a sanity check on some of the results. Everything looks fine for both datasets! However, I noticed we lost a total of 23,801 rows, which averages out to about 0.5 rows per LSOA. This doesn’t seem significant but might be worth exploring further to understand why we're losing those rows. That said, it's more of a second/third-order concern at this stage.

Performance:
The processing times range from 0.05 seconds to 6.9 seconds, which seems reasonable. It's expected that some LSOAs take longer due to differences in their distribution.

  • Reweighting has been applied correctly to each nation in the code. I.e. Scotland is reweighted appropriately on property_type and tenure features only, whereas England and Wales should also be reweighted on build_year Checked this, seems to be good.
  • Joins to make sure they are correct and we are not accidentally losing data we should keep. My intention is to retain all the rows that get assigned a weight, but not to keep the unweighted ones. The idea is to join this dataset back onto the full EPC dataset in a separate processing script. I double checked the joins and the logic, it checks out
  • The preprocessing steps in run_compute_epc_weights.py before reweighting occurs to ensure they are applied in the appropriate order before and after filtering for country Preprocessing steps seem to be in the correct order.
  • Feel free to suggest places where we can add test/assertions to ensure the pipeline is executing the expected behaviours. I'd like to look at this properly myself later and add some in a separate PR. I've added some suggestions!

Hope this is helpful!

@@ -2,8 +2,8 @@ data_source:
gb_ons_postcode_dir_url: "https://www.arcgis.com/sharing/rest/content/items/487a5ba62c8b4da08f01eb3c08e304f6/data" # Aug 2023 data
gb_ons_postcode_dir_file_path: "Data/ONSPD_AUG_2023_UK.csv" # Aug 2023 data
UK_ons_postcode_dir: "s3://asf-heat-pump-suitability/source_data/ONSPD_AUG_2023_UK.csv"
EW_census_housing_characteristics: "s3://asf-heat-pump-suitability/source_data/2021census_Mar2023update_housing_characteristics_E_W.xlsx" # 2021 census, Mar 2023 update
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my understanding, why did we get rid of this in case I've misunderstood? It's no longer relevant?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I searched in the repo for references to it and it isn't used anywhere so I deleted it.

@@ -48,25 +49,6 @@ data_source:
EW_inspire_url: "https://use-land-property-data.service.gov.uk/datasets/inspire/download"
S_scottish_gov_DZ2011_boundaries: "s3://asf-heat-pump-suitability/source_data/2014_Scottish_Government_DataZoneBoundaries_2011_S/SG_DataZone_Bdry_2011.shp"
S_NRScotland_dwellings: "s3://asf-heat-pump-suitability/source_data/June2024_NRScotland_households_and_dwellings_S.xlsx"
usecols:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where has this gone? or we just don't use it anymore?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was originally used in run_compute_epc_weights.py to load the EPC data with these columns. I thought it was easier and more readable to change the columns in the script itself, so it's moved there now. Let me know if you think it should be returned.

@@ -2,8 +2,8 @@ data_source:
gb_ons_postcode_dir_url: "https://www.arcgis.com/sharing/rest/content/items/487a5ba62c8b4da08f01eb3c08e304f6/data" # Aug 2023 data
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially we can clean up all these links up with something like

data_source:
  # ONS Postcode Directory
  gb_ons_postcode_dir:
    url: "https://www.arcgis.com/sharing/rest/content/items/487a5ba62c8b4da08f01eb3c08e304f6/data" # Aug 2023 data
    file_path: "Data/ONSPD_AUG_2023_UK.csv" # Aug 2023 data
    s3_path: "s3://asf-heat-pump-suitability/source_data/ONSPD_AUG_2023_UK.csv"

  # Census Data
  census:
    EW_tenure: "s3://asf-heat-pump-suitability/source_data/2021census_2023Mar_tenure_E_W.csv"
    S_tenure: "s3://asf-heat-pump-suitability/source_data/2022_Scotlands_census_tenure_S.csv"
    EW_number_of_rooms: "s3://asf-heat-pump-suitability/source_data/2021census_Mar2023_number_of_rooms_E_W.csv"
    EW_number_of_households: "s3://asf-heat-pump-suitability/source_data/2021_vMar2023_census_numberofhouseholds_EW.csv"
    EW_land_area: "s3://asf-heat-pump-suitability/source_data/2021_vMar2021_census_landareaKM_EW.csv"
    EW_accommodation_type: "s3://asf-heat-pump-suitability/source_data/2021census_Mar2023_accommodation_type_E_W.csv"
    S_accommodation_type: "s3://asf-heat-pump-suitability/source_data/2022_Scotlands_census_accommodation_type_S.csv"

  # Dwelling Age Data
  dwelling_age:
    EW_cdrc: "s3://asf-heat-pump-suitability/source_data/2015cdrc_dwelling_ages_E_W.csv"
    EW_cdrc_la: "s3://asf-heat-pump-suitability/source_data_minor_edits/2015cdrc_dwelling_ages_E_W_per_la_02.parquet"

  # Garden Space Access
  garden_space_access:
    GB_ons: "s3://asf-heat-pump-suitability/source_data/ONS_Apr2020_access_to_garden_space.xlsx"

  # UPRN Data
  uprn:
    GB_osopen: "s3://asf-heat-pump-suitability/source_data/osopenuprn_202405_csv.zip"

  # Local Authority Districts Boundaries
  lad_bounds:
    UK_ons: "s3://asf-heat-pump-suitability/source_data/Local_Authority_Districts_December_2023_Boundaries_UK_BFE_-2600600853110041429/LAD_DEC_2023_UK_BFE.shp"

  # Inspire Land Extent
  inspire_land_extent:
    EW: "s3://asf-heat-pump-suitability/source_data/inspire_ew/"
    S: "s3://asf-heat-pump-suitability/source_data/inspire_scotland/"

  # Microsoft Building Footprint Links
  microsoft_building_footprint:
    global: "https://minedbuildings.z5.web.core.windows.net/global-buildings/dataset-links.csv"

  # Conservation Areas
  conservation_areas:
    E_historic_england: "s3://asf-heat-pump-suitability/source_data/Aug2024_historic_england_conservation_areas_E.geojson"
    W_welsh_gov: "s3://asf-heat-pump-suitability/source_data/2022_welsh_gov_building_conservation_areas_W.gpkg"

  # Off Gas Grid
  off_gas_grid:
    UK_spa: "s3://asf-heat-pump-suitability/source_data/2024_vMar2024_SPA_offgaspostcode_UK.xlsx"

  # Listed Buildings
  listed_buildings:
    E_historic_england: "s3://asf-heat-pump-suitability/source_data/Jun2024_vJul2024_HistoricEngland_listedbuilding_E.gpkg"
    W_cadw: "s3://asf-heat-pump-suitability/source_data/May2024_vMay2024_Cadw_listedbuilding_W.gpkg"
    S_scottish_gov: "s3://asf-heat-pump-suitability/source_data/lb_scotland/Listed_Buildings.shp"

  # LSOA to LAD Lookup
  lsoa_lad_lookup:
    EW_ons: "s3://asf-heat-pump-suitability/source_data/2021_vApr2023_ons_lsoa_to_lad_lookup_EW.csv"

  # LSOA Boundaries
  lsoa_bounds:
    EW: "s3://asf-heat-pump-suitability/source_data/Lower_layer_Super_Output_Areas_2021_EW_BFE_V9_-9107090204806789093/LSOA_2021_EW_BFE_V9.shp"

  # Points of Interest Locations
  poi_locations:
    UK: "s3://asf-heat-pump-suitability/source_data/poi_uk.gpkg"

  # Grid Operators Data
  grid_operators:
    E_ENW:
      dfes_primaries: "s3://asf-heat-pump-suitability/source_data/grid_operators/ENW/dfes-2023-primary-data0.parquet"
      ndp_headroom: "s3://asf-heat-pump-suitability/source_data/grid_operators/ENW/ndp-pry-bsp-headroom.parquet"
      ndp_voronoi: "s3://asf-heat-pump-suitability/source_data/grid_operators/ENW/ndp-pry-voronoi.parquet"
    E_NPg:
      heatmap: "s3://asf-heat-pump-suitability/source_data/grid_operators/NPg/heatmapdemanddata.parquet"
      ndp_demand: "s3://asf-heat-pump-suitability/source_data/grid_operators/NPg/npg_ndp_demand_headroom.parquet"
    S_SPEN:
      spd_substations: "s3://asf-heat-pump-suitability/source_data/grid_operators/SPEN/distributed-generation-sp-distribution-heat-maps-spd-primary-substations.parquet"
      spd_polygons: "s3://asf-heat-pump-suitability/source_data/grid_operators/SPEN/ndp-spd-primary-substation-polygons.parquet"
    W_SPEN:
      spm_substations: "s3://asf-heat-pump-suitability/source_data/grid_operators/SPEN/distributed-generation-sp-manweb-heat-maps-spm-primary-substations.parquet"
      spm_polygons: "s3://asf-heat-pump-suitability/source_data/grid_operators/SPEN/ndp-spm-primary-group-polygons.parquet"
    ES_SSEN:
      demand: "s3://asf-heat-pump-suitability/source_data/grid_operators/SSEN/demand-heat-map-update-feb-24.xlsx - PS.csv"
    S_SSEN:
      sepd_bounds: "s3://asf-heat-pump-suitability/source_data/grid_operators/SSEN/SEPD Primary Sub Boundaries"
    E_SSEN:
      shepd_bounds: "s3://asf-heat-pump-suitability/source_data/grid_operators/SSEN/SHEPD Primary Sub Boundaries"
    E_UKPN:
      primaries: "s3://asf-heat-pump-suitability/source_data/grid_operators/UKPN/ukpn_primary_postcode_area.parquet"
    EW_WPD:
      capacity: "s3://asf-heat-pump-suitability/source_data/grid_operators/WPD/wpd-network-capacity-map.csv"
    E_WPD:
      east_midlands_bounds: "s3://asf-heat-pump-suitability/source_data/grid_operators/WPD/east-midlands-primary.gpkg"
      south_west_bounds: "s3://asf-heat-pump-suitability/source_data/grid_operators/WPD/south-west-primary.gpkg"
      west_midlands_bounds: "s3://asf-heat-pump-suitability/source_data/grid_operators/WPD/west-midlands-primary.gpkg"
    W_WPD:
      south_wales_bounds: "s3://asf-heat-pump-suitability/source_data/grid_operators/WPD/south-wales-primary.gpkg"

  # World Heritage Sites
  world_heritage_sites:
    S_historic_environment_scotland: "s3://asf-heat-pump-suitability/source_data/WHS/World_Heritage_Sites.shp"

  # Inspire URLs
  inspire_urls:
    S_ros: "https://ros.locationcentre.co.uk/inspire/"
    EW: "https://use-land-property-data.service.gov.uk/datasets/inspire/download"

  # Scottish Government Data Zone Boundaries
  scottish_gov_dz_boundaries:
    S: "s3://asf-heat-pump-suitability/source_data/2014_Scottish_Government_DataZoneBoundaries_2011_S/SG_DataZone_Bdry_2011.shp"

  # NRS Scotland Dwellings
  nrs_scotland_dwellings:
    S: "s3://asf-heat-pump-suitability/source_data/June2024_NRScotland_households_and_dwellings_S.xlsx"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great idea. Thank you for taking the time to do this! We should definitely clean the data sources in here as it is getting messy and confusing to find things. I especially like that you came up with the ["EW"] and ["S"] keys!

I will suggest we create a separate issue and PR for this as this would require a lot of refactoring. Although it should be easy to do, we have to do it carefully to ensure that all the places where the config is called is updated with the appropriate keys.

We also will have to update the config/README config key column.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have opened a new issue #92 for this :) thank you!


from asf_heat_pump_suitability import config
from asf_heat_pump_suitability.getters import base_getters


def get_df_target_nrooms() -> pl.DataFrame:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why we don't include Scotland in the get_df_target_nrooms() function?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good spot thank you! At the moment we don't use the get_df_target_nrooms() function at all but I left it in the code from when I originally wrote it in case we wanted to implement it at a later point. I will add a # TODO note to flag that Scotland would need to be added if we do use it.

)
.drop(["Area Name"])
.rename({"Area Code": "lsoa"})
.rename({"Type of accomodation": "lsoa"})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be misunderstanding this, but why are we renaming this from "Type of accommodation" to "lsoa"?

"BUILT_FORM",
"CONSTRUCTION_AGE_BAND",
],
)

# Join ONSPD LSOA col
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does ONSPD stand for ONS Postcode Directory?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes :)

"BUILT_FORM",
"CONSTRUCTION_AGE_BAND",
],
)

# Join ONSPD LSOA col
epc_df = output_areas.standardise_col_postcode(epc_df, pcd_col="POSTCODE")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe after standardisation you do a regex check to see if the postcode is in the correct format:

valid_postcode_mask = epc_df["POSTCODE"].str.contains(r"^[A-Z0-9 ]+$")
assert valid_postcode_mask.all(), "Some standardised postcodes contain invalid characters."

@@ -102,66 +110,79 @@ def parse_arguments() -> argparse.Namespace:
# 1. Add standardised weighting feature columns to EPC and drop rows missing data required for reweighting
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially something like:
assert (epc_df["lsoa"].is_null().sum() / epc_df.shape[0]) < 0.05, "More than 5% of rows missing lsoa after ONSPD join, which is unexpected."

logging.info(
f"Running reweighting for {key}. Reweighting using the following features: {features}"
)
epc_cleaned_df = epc_df.filter(pl.col("COUNTRY") == key)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert epc_cleaned_df.shape[0] > 0, f"No rows found for {key}, expected some data."

weights["UPRN"].extend(_weights["UPRN"])
# Adding LSOA required for dummy rows
weights["lsoa"].extend([lsoa for i in range(len(_weights["UPRN"]))])
weights["weight"].extend(_weights["weight"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question more so for you, but maybe we can make an assert statement here for the weights to make sure that they all add up to 1 overall (if that makes sense?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Scottish datasets for reweighting
2 participants