Update correcting sjoin in run_add_features #95

crispy-wonton · 2024-12-20T15:48:51Z

run_add_features.py contains lines that correct the LAD code in edge cases where it was assigned incorrectly (due to errors in the original EPC postcode data). The correction uses a geospatial join, which adds over an hour to the script's run time. That is a high cost for correcting a few edge cases, so the join needs to be sped up.

The original edge case was found while we were adding the conservation area feature, which is based on LAD code. A single EPC record had an erroneous postcode and had therefore been attributed to the wrong local authority. As a result, it was flagged as being in an LA without conservation area data while also being flagged as in a conservation area.
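A contradiction like this can be caught with a simple consistency check. The sketch below is illustrative only: the record layout and the field names (`la_has_conservation_data`, `in_conservation_area`) are hypothetical and are not the project's actual schema.

```python
# Hypothetical records; the field names are illustrative, not the project's schema.
records = [
    {"uprn": 1, "la_has_conservation_data": True, "in_conservation_area": True},
    {"uprn": 2, "la_has_conservation_data": False, "in_conservation_area": False},
    {"uprn": 3, "la_has_conservation_data": False, "in_conservation_area": True},
]


def find_contradictory(rows):
    """Return UPRNs flagged as in a conservation area although their LA has no such data."""
    return [
        row["uprn"]
        for row in rows
        if row["in_conservation_area"] and not row["la_has_conservation_data"]
    ]


contradictory_uprns = find_contradictory(records)
print(contradictory_uprns)  # [3]
```

Any UPRN returned by a check of this kind is a candidate for a mis-assigned LAD code.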

See @lizgzil's suggestion below:

Something that might be quicker is to do the sjoin in batches in sjoin_df_uprn_lad_code.

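from typing import Iterator

import geopandas as gpd
import pandas as pd
import polars as pl
from tqdm import tqdm

# get_datasets (used below) is the project's own dataset-loading module


def list_chunks(gdf: gpd.GeoDataFrame, chunk_size: int = 100) -> Iterator[gpd.GeoDataFrame]:
    """Yields consecutive row batches of at most chunk_size rows."""
    for i in range(0, len(gdf), chunk_size):
        # iloc makes the positional row slice explicit
        yield gdf.iloc[i : i + chunk_size]
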

def sjoin_df_uprn_lad_code(gdf: gpd.GeoDataFrame) -> pl.DataFrame:
    """
    Geospatial join between UPRNs with x,y coordinates and local authority (LAD) boundaries to match UPRNs with the code for
    the local authority they are located in. Null LAD codes are filled with LAD codes matched to UPRN on postcode.

    Args:
        gdf (gpd.GeoDataFrame): dataframe with point geometries per UPRN in BNG, and LAD code from postcode

    Returns:
        pl.DataFrame: UPRNs with matched local authority code
    """
    lad_bounds_gdf = get_datasets.load_gdf_ons_council_bounds(
        columns=["LAD23CD", "geometry"]
    )
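    chunk_size = 10000
    n_chunks = -(-len(gdf) // chunk_size)  # ceiling division, so tqdm can show progress
    chunk_results = []
    for chunk_gdf in tqdm(list_chunks(gdf, chunk_size=chunk_size), total=n_chunks):
        chunk_results.append(
            chunk_gdf.sjoin(lad_bounds_gdf, how="left", predicate="intersects")
        )
    # Concatenate once at the end; concatenating inside the loop re-copies all earlier rows
    gdf = pd.concat(chunk_results)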
    gdf["lad_code"] = gdf["LAD23CD"].fillna(gdf["lad_code"])
    return pl.from_pandas(gdf[["UPRN", "lad_code"]])

Excuse my variable naming.
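To show the batching pattern in isolation, here is a minimal, dependency-free sketch. `iter_batches` and `process` are hypothetical names; `process` is only a stand-in for the per-batch spatial join, and plain lists stand in for the GeoDataFrame.

```python
from typing import Iterator, List


def iter_batches(items: List[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most batch_size items (hypothetical helper)."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def process(batch: List[int]) -> List[int]:
    """Stand-in for the per-batch spatial join; here it just doubles each value."""
    return [value * 2 for value in batch]


rows = list(range(25))

# Collect the per-batch results and combine them once at the end
results = [process(batch) for batch in iter_batches(rows, batch_size=10)]
combined = [value for part in results for value in part]

batch_sizes = [len(part) for part in results]
print(batch_sizes)  # [10, 10, 5]
```

The last batch is simply shorter when the row count is not a multiple of the batch size, so no rows are dropped.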

Investigating different chunk_size values, I got the following run times per iteration:

# 50k: 34.52s/it, 33.90s/it, 33.72s/it
# 10k: 9.13s/it, 6.87s/it, 6.99s/it, 7.08s/it, 6.90s/it
# 1k: 1.46it/s, 1.59it/s, 1.41it/s, 1.39it/s, 1.43it/s, 1.46it/s, 1.40it/s (iterations per second, not seconds per iteration)
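Because the three settings are reported in different units, it can help to convert them to a common rows-per-second figure. The sketch below does that arithmetic using only the timings quoted above. The conversion is my own calculation from those figures, not an additional measurement.

```python
# Timings as reported above, keyed by chunk_size, in seconds per iteration.
# The 1k figures were reported in iterations per second, so they are inverted here.
seconds_per_iter = {
    50_000: [34.52, 33.90, 33.72],
    10_000: [9.13, 6.87, 6.99, 7.08, 6.90],
    1_000: [1 / rate for rate in [1.46, 1.59, 1.41, 1.39, 1.43, 1.46, 1.40]],
}


def rows_per_second(chunk_size: int, timings: list) -> float:
    """Rows processed per second, based on the mean time per iteration."""
    mean_seconds = sum(timings) / len(timings)
    return chunk_size / mean_seconds


throughput = {
    size: rows_per_second(size, timings)
    for size, timings in seconds_per_iter.items()
}

for size, rate in throughput.items():
    print(f"{size:>6} rows per chunk: {rate:,.0f} rows/s")
```

On these numbers, all three chunk sizes come out at roughly 1,350 to 1,470 rows per second. Taken at face value, that suggests the per-row cost is similar across chunk sizes, so it may be worth timing the full run before assuming that batching alone shortens it.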

Originally posted by @lizgzil in #93 (comment)

crispy-wonton mentioned this issue Dec 20, 2024

86 update run scripts #93

Open

10 tasks
