`run_add_features.py` has lines that correct the LAD code for edge cases where it is incorrectly assigned (due to errors in the original EPC postcode data). The correction uses a geospatial join, but this adds over an hour to the script's runtime for the small value of correcting a few edge cases. It needs to be sped up.
The original edge case was found when we were adding the conservation area feature based on LAD code: a single EPC record had an erroneous postcode and had therefore been attributed to the wrong local authority. As a result it was flagged as being in an LA without conservation area data while simultaneously being flagged as in a conservation area.
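A contradictory flag pair like that is easy to detect in pandas. This is a minimal sketch with toy data; the column names (`la_has_conservation_data`, `in_conservation_area`) are hypothetical stand-ins for whatever the real feature columns are called:

```python
import pandas as pd

# Toy EPC records: the flag pair for UPRN 3 is contradictory --
# "no conservation-area data for this LA" yet "in a conservation area".
epc = pd.DataFrame(
    {
        "UPRN": [1, 2, 3],
        "la_has_conservation_data": [True, True, False],
        "in_conservation_area": [False, True, True],
    }
)

# Rows where the two flags contradict each other point at a bad LAD assignment.
suspect = epc[~epc["la_has_conservation_data"] & epc["in_conservation_area"]]
print(suspect["UPRN"].tolist())  # → [3]
```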
Something that might be quicker is to do the sjoin in batches in `sjoin_df_uprn_lad_code`:
```python
from tqdm import tqdm

import geopandas as gpd
import pandas as pd
import polars as pl


def list_chunks(orig_list: list, chunk_size: int = 100):
    """Chunks a list (or positionally sliceable dataframe) into batches of chunk_size."""
    for i in range(0, len(orig_list), chunk_size):
        yield orig_list[i : i + chunk_size]


def sjoin_df_uprn_lad_code(gdf: gpd.GeoDataFrame) -> pl.DataFrame:
    """
    Geospatial join between UPRNs with x,y coordinates and local authority (LAD)
    boundaries to match UPRNs with the code for the local authority they are
    located in. Null LAD codes are filled with LAD codes matched to UPRN on postcode.

    Args:
        gdf (gpd.GeoDataFrame): dataframe with point geometries per UPRN in BNG,
            and LAD code from postcode
    Returns:
        pl.DataFrame: UPRNs with matched local authority code
    """
    # get_datasets is a project module imported elsewhere in the script
    lad_bounds_gdf = get_datasets.load_gdf_ons_council_bounds(
        columns=["LAD23CD", "geometry"]
    )
    # Collect chunk results and concatenate once at the end; concatenating
    # inside the loop copies the accumulated frame on every iteration.
    chunk_results = []
    for chunk_gdf in tqdm(list_chunks(gdf, chunk_size=10000)):
        chunk_results.append(
            chunk_gdf.sjoin(lad_bounds_gdf, how="left", predicate="intersects")
        )
    gdf = pd.concat(chunk_results)
    gdf["lad_code"] = gdf["LAD23CD"].fillna(gdf["lad_code"])
    return pl.from_pandas(gdf[["UPRN", "lad_code"]])
```
Excuse my variable naming.
Investigating different values of `chunk_size`, I got the following run times per iteration:
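For reference, the per-chunk timings can be gathered with a small harness like the sketch below. The `process_chunk` callable stands in for the `chunk_gdf.sjoin(...)` call in the real function; here it is a dummy workload so the snippet runs standalone:

```python
import time


def list_chunks(orig_list, chunk_size=100):
    """Yield successive batches of chunk_size items."""
    for i in range(0, len(orig_list), chunk_size):
        yield orig_list[i : i + chunk_size]


def time_per_iteration(data, process_chunk, chunk_size):
    """Return the mean wall-clock time per chunk for a given chunk_size."""
    times = []
    for chunk in list_chunks(data, chunk_size):
        start = time.perf_counter()
        process_chunk(chunk)  # in the real script: chunk.sjoin(lad_bounds_gdf, ...)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)


data = list(range(100_000))
for size in (1_000, 10_000, 50_000):
    mean_t = time_per_iteration(data, lambda c: sum(x * x for x in c), size)
    print(f"chunk_size={size}: {mean_t:.4f}s per iteration")
```

Note that mean time per iteration alone can mislead: a larger `chunk_size` means fewer (slower) iterations, so total runtime is what ultimately matters.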
Originally posted by @lizgzil in #93 (comment)