65 reweighting Scottish EPC records #90

Open · wants to merge 16 commits into base: dev

Commits (16)
13b82be
add getters to load scottish census tenure and property type datasets…
crispy-wonton Sep 24, 2024
a489ed2
add new file paths to config for tenure and property type from scotti…
crispy-wonton Sep 24, 2024
3f57d9f
update property type category names in prepare_sample.py
crispy-wonton Sep 24, 2024
0ccc110
update functions in prepare_target.py to use EW and Scottish data for…
crispy-wonton Sep 24, 2024
5e2b02a
update run_compute_epc_weights to allow reweighting of Scotland with …
crispy-wonton Nov 28, 2024
7bc5030
Merge branch 'dev' into 65_scottish_reweighting
crispy-wonton Nov 28, 2024
8daf295
update tenure categories in target data to align with EPC in get_targ…
crispy-wonton Nov 28, 2024
20be010
fix bug in get_target.py
crispy-wonton Nov 29, 2024
ae39b1e
add TODO note to reweight_epc.py to prevent erroring out of pipeline …
crispy-wonton Nov 29, 2024
82861c7
Merge branch 'dev' into 65_scottish_reweighting
crispy-wonton Nov 29, 2024
a2026d7
remove epc use_cols from base config
crispy-wonton Nov 29, 2024
288ff37
remove unused property_type raw getter from get_datasets.py and base …
crispy-wonton Nov 29, 2024
12d3d4c
add LSOA to weights dict in run_compute_epc_weights.py to retain LSOA…
crispy-wonton Nov 29, 2024
c0a3264
add TODO note in get_target.py to add Scottish nrooms target data if …
crispy-wonton Dec 20, 2024
ad74000
add notation to get_target.py for clarity
crispy-wonton Dec 20, 2024
0de7a2c
update docstring at the top of run_compute_epc_weights.py to explain …
crispy-wonton Dec 20, 2024
22 changes: 2 additions & 20 deletions asf_heat_pump_suitability/config/base.yaml
@@ -2,8 +2,8 @@ data_source:
gb_ons_postcode_dir_url: "https://www.arcgis.com/sharing/rest/content/items/487a5ba62c8b4da08f01eb3c08e304f6/data" # Aug 2023 data
gb_ons_postcode_dir_file_path: "Data/ONSPD_AUG_2023_UK.csv" # Aug 2023 data
UK_ons_postcode_dir: "s3://asf-heat-pump-suitability/source_data/ONSPD_AUG_2023_UK.csv"
EW_census_housing_characteristics: "s3://asf-heat-pump-suitability/source_data/2021census_Mar2023update_housing_characteristics_E_W.xlsx" # 2021 census, Mar 2023 update
EW_census_tenure: "s3://asf-heat-pump-suitability/source_data/2021census_2023Mar_tenure_E_W.csv"
S_census_tenure: "s3://asf-heat-pump-suitability/source_data/2022_Scotlands_census_tenure_S.csv"
EW_census_number_of_rooms: "s3://asf-heat-pump-suitability/source_data/2021census_Mar2023_number_of_rooms_E_W.csv"
EW_census_number_of_households: "s3://asf-heat-pump-suitability/source_data/2021_vMar2023_census_numberofhouseholds_EW.csv"
EW_census_land_area: "s3://asf-heat-pump-suitability/source_data/2021_vMar2021_census_landareaKM_EW.csv"
@@ -12,6 +12,7 @@ data_source:
GB_ons_garden_space_access: "s3://asf-heat-pump-suitability/source_data/ONS_Apr2020_access_to_garden_space.xlsx"
GB_osopen_uprn_latlon: "s3://asf-heat-pump-suitability/source_data/osopenuprn_202405_csv.zip"
EW_census_accommodation_type: "s3://asf-heat-pump-suitability/source_data/2021census_Mar2023_accommodation_type_E_W.csv"
S_census_accommodation_type: "s3://asf-heat-pump-suitability/source_data/2022_Scotlands_census_accommodation_type_S.csv"
UK_ons_lad_bounds: "s3://asf-heat-pump-suitability/source_data/Local_Authority_Districts_December_2023_Boundaries_UK_BFE_-2600600853110041429/LAD_DEC_2023_UK_BFE.shp"
EW_inspire_land_extent_dir: "s3://asf-heat-pump-suitability/source_data/inspire_ew/"
S_inspire_land_extent_dir: "s3://asf-heat-pump-suitability/source_data/inspire_scotland/"
@@ -48,25 +49,6 @@ data_source:
EW_inspire_url: "https://use-land-property-data.service.gov.uk/datasets/inspire/download"
S_scottish_gov_DZ2011_boundaries: "s3://asf-heat-pump-suitability/source_data/2014_Scottish_Government_DataZoneBoundaries_2011_S/SG_DataZone_Bdry_2011.shp"
S_NRScotland_dwellings: "s3://asf-heat-pump-suitability/source_data/June2024_NRScotland_households_and_dwellings_S.xlsx"
usecols:
epc:
- COUNTRY
- ADDRESS1
- ADDRESS2
- POSTCODE
- CURRENT_ENERGY_EFFICIENCY
- CURRENT_ENERGY_RATING
- CURR_ENERGY_RATING_NUM
- ENERGY_RATING_CAT
- UPRN
- TENURE
- PROPERTY_TYPE
- BUILT_FORM
- CONSTRUCTION_AGE_BAND
- CO2_EMISSIONS_CURRENT
- NUMBER_HABITABLE_ROOMS
- HEATING_SYSTEM
- HEATING_FUEL
mapping:
build_year_pre_cols:
- BP_PRE_1900
212 changes: 122 additions & 90 deletions asf_heat_pump_suitability/getters/get_target.py
@@ -1,10 +1,10 @@
import polars as pl
import warnings
import polars.selectors as cs

from asf_heat_pump_suitability import config
from asf_heat_pump_suitability.getters import base_getters


# TODO will need to add number of rooms target data for Scotland if we revert to using it
def get_df_target_nrooms() -> pl.DataFrame:
"""
Get dataframe of counts of total number of rooms for properties in all LSOAs in England and Wales. Where number of rooms
@@ -34,10 +34,23 @@ def get_df_target_nrooms() -> pl.DataFrame:
return df


def get_df_target_property_type_uncensored() -> pl.DataFrame:
def transform_df_target_property_type() -> pl.DataFrame:
"""
Get dataframe of property type counts for all LSOAs in England and Wales. Dataframe has no censored values. Source:
census data 2021.
Load and transform property type counts per LSOA/data zone for England, Scotland, and Wales from census data.

Returns:
pl.DataFrame: property type counts for England, Scotland, and Wales per LSOA
"""
ew_df = load_transform_df_target_property_type_ew()
s_df = load_transform_df_target_property_type_scotland()
s_df = s_df.select(ew_df.columns)

return pl.concat([ew_df, s_df], how="vertical")


def load_transform_df_target_property_type_ew() -> pl.DataFrame:
"""
Get dataframe of property type counts for all LSOAs in England and Wales from census data.

Returns:
pl.Dataframe: counts of property type for all LSOAs in England and Wales
@@ -78,50 +91,130 @@ def get_df_target_property_type_uncensored() -> pl.DataFrame:
)
.rename(
{
"Detached": "Detached whole house or bungalow",
"Semi-detached": "Semi-detached whole house or bungalow",
"Terraced": "Terraced (including end-terrace) whole house or bungalow",
"Terraced": "Terraced (including end-terrace)",
"A caravan or other mobile or temporary structure": "Caravan or other mobile or temporary structure",
}
)
)

return df


def get_df_target_property_type(fill_censored: int = 1) -> pl.DataFrame:
def load_transform_df_target_property_type_scotland() -> pl.DataFrame:
"""
Get dataframe of property type counts for all LSOAs in England and Wales, and fill censored values (counts below 10)
with given constant. Source: census data 2021.

Args:
fill_censored (int): value to fill censored values with, [0-10]. Default 0.
Load and transform dataframe of property type counts for data zones in Scotland from census data.

Returns:
pl.Dataframe: counts of property type for all LSOAs in England and Wales
pl.Dataframe: counts of property type for all data zones in Scotland
"""
content = base_getters.get_content_from_s3_path(
config["data_source"]["EW_census_housing_characteristics"]
df = pl.read_csv(
config["data_source"]["S_census_accommodation_type"],
skip_rows=10,
columns=list(range(0, 11)),
infer_schema_length=10000,
)
df = pl.read_excel(content, sheet_name="2c", engine="calamine")

# Remove empty header rows
df = (
df.rename(df[2].to_dicts().pop())
.slice(
3,
df[1:]
.drop_nulls(subset=cs.numeric())
.drop(
[
"Whole house or bungalow: Total",
"Flat, maisonette or apartment: Total",
"All occupied households",
]
)
.drop(["Area Name"])
.rename({"Area Code": "lsoa"})
)
df = _fill_df_censored_values(df, fill_censored)
flats_cols = [col for col in df.columns if "Flat" in col]
df = (
df.with_columns(
pl.sum_horizontal(flats_cols).alias("Flat, maisonette or apartment")
)
.drop(flats_cols)
.rename(
{
col: col.replace("Whole house or bungalow: ", "")
for col in df.select(cs.numeric()).columns
}
)
.rename(
{"Type of accomodation": "lsoa"}
) # The Data Zone (lsoa) column name is mislabelled due to .csv formatting
)

# A small number of rows seem to erroneously have zero values for all property types, we need to remove them
df = df.filter(
pl.sum_horizontal(
[
"Detached",
"Semi-detached",
"Terraced (including end-terrace)",
"Caravan or other mobile or temporary structure",
"Flat, maisonette or apartment",
]
)
!= 0
)

return df


def get_df_target_tenure_uncensored() -> pl.DataFrame:
def transform_df_target_tenure() -> pl.DataFrame:
"""
Load and transform tenure type counts per LSOA/data zone for England, Scotland, and Wales from census data.

Returns:
pl.DataFrame: tenure type counts per LSOA/data zone for England, Scotland, and Wales
"""
ew_df = load_transform_df_target_tenure_ew()
s_df = load_transform_df_target_tenure_scotland()
s_df = s_df.select(ew_df.columns)

return pl.concat([ew_df, s_df], how="vertical")


def load_transform_df_target_tenure_scotland() -> pl.DataFrame:
"""
Load and transform tenure type counts per data zone in Scotland from census data.

Returns:
pl.DataFrame: tenure type counts per data zone in Scotland
"""
df = pl.read_csv(
config["data_source"]["S_census_tenure"],
skip_rows=10,
columns=list(range(1, 4)),
infer_schema_length=10000,
)
df = (
df.drop_nulls()
.rename({"Intermediate Zone - Data Zone 2011": "lsoa"})
.pivot("Household Tenure", index="lsoa", values="Count")
.drop([col for col in df.columns if "Total" in col])
)
private_rental = [col for col in df.columns if "Private" in col]
private_rental.extend(["Lives Rent Free"])
df = df.with_columns(
pl.sum_horizontal([col for col in df.columns if "Owned" in col]).alias(
"owner-occupied"
),
pl.sum_horizontal(private_rental).alias("rental (private)"),
pl.sum_horizontal([col for col in df.columns if "Social" in col]).alias(
"rental (social)"
),
)

# A small number of rows seem to erroneously have zero values for all tenure types, we need to remove them
df = df.filter(
pl.sum_horizontal(["owner-occupied", "rental (social)", "rental (private)"])
!= 0
)

return df.select(["lsoa", "owner-occupied", "rental (social)", "rental (private)"])


def load_transform_df_target_tenure_ew() -> pl.DataFrame:
"""
Get dataframe of tenure type counts for all LSOAs in England and Wales. Dataframe has no censored values. Source:
census data 2021.
Get dataframe of tenure type counts for all LSOAs in England and Wales from census data.

Returns:
pl.Dataframe: counts of tenure type for all LSOAs in England and Wales
@@ -168,44 +261,6 @@ def get_df_target_tenure_uncensored() -> pl.DataFrame:
return df


def get_df_target_tenure(fill_censored: int = 1) -> pl.DataFrame:
"""
Get dataframe of tenure type counts for all LSOAs in England and Wales, and fill censored values (counts below 10)
with given constant. Source: census data 2021.

Args:
fill_censored (int): value to fill censored values with, [0-10]. Default 0.

Returns:
pl.Dataframe: counts of tenure type for all LSOAs in England and Wales
"""
content = base_getters.get_content_from_path(
config["data_source"]["EW_census_housing_characteristics"]
)
df = pl.read_excel(content, sheet_name="3c", engine="calamine")

# Remove empty header rows
df = (
df.rename(df[2].to_dicts().pop())
.slice(
3,
)
.drop(["Area Name"])
.rename(
{
"Area Code": "lsoa",
"Owned or shared ownership": "owner-occupied",
"Social Rented": "rental (social)",
"Private Rented or lives rent free": "rental (private)",
}
)
)

df = _fill_df_censored_values(df, fill_censored)

return df


def get_df_target_build_year(
pre_cols: list = config["mapping"]["build_year_pre_cols"],
post_cols: list = config["mapping"]["build_year_post_cols"],
@@ -250,26 +305,3 @@ def get_df_target_build_year_la() -> pl.DataFrame:
df = df.select(["lsoa", "pre_1930", "post_1930", "unknown"])

return df


def _fill_df_censored_values(df: pl.DataFrame, val: int) -> pl.DataFrame:
"""
Fill censored values in a target dataframe with a given value.

Args:
df (pl.DataFrame): dataframe
val (int): value to fill censored values with, [0-10]

Returns:
pl.DataFrame: dataframe with filled values
"""
if not (0 <= val <= 10):
warnings.warn(
"Value to fill censored target data should be within range [0-10]. "
"Values outside this range may significantly change target proportions."
)
cols = df.columns
cols.remove("lsoa")
df = df.with_columns([pl.col(cols).str.replace("c", f"{val}").cast(pl.Int64)])

return df
@@ -71,21 +71,21 @@ def add_col_property_type(df: pl.DataFrame) -> pl.DataFrame:
pl.col("PROPERTY_TYPE").is_in(["House", "Bungalow"]),
pl.col("BUILT_FORM") == "Detached",
)
.then(pl.lit("Detached whole house or bungalow"))
.then(pl.lit("Detached"))
.when(
pl.col("PROPERTY_TYPE").is_in(["House", "Bungalow"]),
pl.col("BUILT_FORM") == "Semi-Detached",
)
.then(pl.lit("Semi-detached whole house or bungalow"))
.then(pl.lit("Semi-detached"))
.when(
pl.col("PROPERTY_TYPE").is_in(["House", "Bungalow"]),
pl.col("BUILT_FORM").is_in(terraced),
)
.then(pl.lit("Terraced (including end-terrace) whole house or bungalow"))
.then(pl.lit("Terraced (including end-terrace)"))
.when(pl.col("PROPERTY_TYPE").is_in(["Flat", "Maisonette"]))
.then(pl.lit("Flat, maisonette or apartment"))
.when(pl.col("PROPERTY_TYPE").is_in(["Park home"]))
.then(pl.lit("A caravan or other mobile or temporary structure"))
.then(pl.lit("Caravan or other mobile or temporary structure"))
.alias("property_type")
)

@@ -63,11 +63,9 @@ def get_dict_dfs_counts(
count_dict = {}

if "property_type" in features:
count_dict["property_type"] = (
get_target.get_df_target_property_type_uncensored()
)
count_dict["property_type"] = get_target.transform_df_target_property_type()
if "tenure" in features:
count_dict["tenure"] = get_target.get_df_target_tenure_uncensored()
count_dict["tenure"] = get_target.transform_df_target_tenure()
if "build_year" in features:
if not use_la_build_year:
count_dict["build_year"] = get_target.get_df_target_build_year()
@@ -88,6 +88,8 @@ def generate_balance_sample(
sample = sample.filter(~pl.col(feature).is_in(missing))
lost_rows = len_before - len(sample)

# TODO generating dummies will fail and cause pipeline error if all rows are removed from sample in code above
Collaborator:
What should be done about this? Raise an error when this occurs?

Collaborator Author:
See line below :)

# TODO we need to check if len(sample) > 0 and only proceed with remaining code if it is

I don't think it should error out.

Basically there is an edge case where all the rows from the EPC sample for a specific LSOA are removed in the preprocessing for weighting. The case I found was a Scottish Data Zone which had only 1 EPC record left in the sample by the time it got to this stage of the reweighting pipeline.

At the point in the code where I left these comments, the pipeline checks that the EPC subsample (in this case, 1 row) only contains categories that appear in the target (census) data for that LSOA. E.g. say the row has property type == flat, but the census data for this LSOA records a count of 0 for flats. The current pipeline will remove all rows with flats from the EPC subsample, because we can't reweight something to 0 (we are reweighting across multiple dimensions).

Therefore, in this edge case, all rows are removed.

The next step is to check which categories are present in the target but missing from the sample. E.g. the target data may show the LSOA has 500 detached houses. The pipeline then appends dummy rows with these missing categories to the subsample, e.g. it will try to add a dummy row with property type == detached house.

In this edge case, the error is actually caused by the dummy generation: the code attempts to append a dummy row onto a dataframe, but because the dataframe has no rows left, it fails.

Ultimately, we need to update the pipeline so that it recognises when this happens and skips reweighting that LSOA. If there are no rows, there is no data left to reweight anyway.
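To illustrate the check proposed in the TODO, here is a minimal sketch of how the pipeline could skip an LSOA whose sub-sample is empty before the dummy-generation step. The helper name sample_has_rows, the warning message, and the skip-rather-than-error behaviour are assumptions for illustration, not part of this PR.

import warnings

import polars as pl


def sample_has_rows(sample: pl.DataFrame, lsoa: str) -> bool:
    """Hypothetical guard: return True if any EPC rows remain for this LSOA after preprocessing."""
    if sample.is_empty():
        # Nothing left to reweight for this LSOA, so warn and let the caller skip it
        # instead of failing when dummy rows are appended to an empty dataframe.
        warnings.warn(f"No EPC rows left for {lsoa} after preprocessing; skipping reweighting.")
        return False
    return True

The caller would then only run generate_df_dummies and the subsequent reweighting when sample_has_rows(sample, lsoa) returns True, and move on to the next LSOA otherwise.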

# TODO we need to check if len(sample) > 0 and only proceed with remaining code if it is
# Add dummy rows for feature categories missing from sample but present in target
dummies = generate_df_dummies(lsoa_marginals=lsoa_marginals, sample=sample)
sample = pl.concat([sample, dummies[sample.columns]])