Update correcting sjoin in run_add_features #95

Open
crispy-wonton opened this issue Dec 20, 2024 · 0 comments

run_add_features.py includes a step that corrects the LAD code in edge cases where it was wrongly assigned because of errors in the original EPC postcode data. The correction uses a geospatial join, which adds more than an hour to the script's run time for the small benefit of fixing a few edge cases. It needs to be sped up.

The original edge case surfaced while we were adding the conservation area feature, which is based on LAD code. A single EPC record had an erroneous postcode and had therefore been attributed to the wrong local authority. As a result, it was flagged as being in an LA without conservation area data while also being flagged as in a conservation area.

See @lizgzil's suggestion below:

Something that might be quicker is to do the sjoin in batches in sjoin_df_uprn_lad_code.

import geopandas as gpd
import pandas as pd
import polars as pl
from tqdm import tqdm

# get_datasets is assumed to be already imported in the module where this function lives.


def list_chunks(rows, chunk_size: int = 100):
    """Yields consecutive batches of chunk_size items from any sliceable sequence (a list, or the rows of a dataframe)."""
    for i in range(0, len(rows), chunk_size):
        yield rows[i : i + chunk_size]


def sjoin_df_uprn_lad_code(gdf: gpd.GeoDataFrame, chunk_size: int = 10_000) -> pl.DataFrame:
    """
    Geospatial join between UPRNs with x,y coordinates and local authority (LAD) boundaries to match UPRNs with the code for
    the local authority they are located in. Null LAD codes are filled with LAD codes matched to UPRN on postcode.

    Args:
        gdf (gpd.GeoDataFrame): dataframe with point geometries per UPRN in BNG, and LAD code from postcode
        chunk_size (int): number of rows joined per batch

    Returns:
        pl.DataFrame: UPRNs with matched local authority code
    """
    lad_bounds_gdf = get_datasets.load_gdf_ons_council_bounds(
        columns=["LAD23CD", "geometry"]
    )
    # Ceiling division gives the number of batches, so tqdm can show overall progress.
    n_chunks = -(-len(gdf) // chunk_size)
    # Join each batch, then concatenate once; pd.concat inside the loop re-copies all earlier rows every iteration.
    joined_chunks = [
        chunk_gdf.sjoin(lad_bounds_gdf, how="left", predicate="intersects")
        for chunk_gdf in tqdm(list_chunks(gdf, chunk_size), total=n_chunks)
    ]
    joined = pd.concat(joined_chunks)
    # Prefer the spatially matched code; fall back to the postcode-derived one where no boundary matched.
    joined["lad_code"] = joined["LAD23CD"].fillna(joined["lad_code"])
    return pl.from_pandas(joined[["UPRN", "lad_code"]])


Excuse my variable naming.

Trying different values of chunk_size, I got the following run times per iteration (note that the 1k figures are in it/s rather than s/it):

# 50k: 34.52s/it, 33.90s/it, 33.72s/it
# 10k: 9.13s/it, 6.87s/it, 6.99s/it, 7.08s/it, 6.90s/it
# 1k: 1.46it/s, 1.59it/s, 1.41it/s, 1.39it/s, 1.43it/s, 1.46it/s, 1.40it/s

Originally posted by @lizgzil in #93 (comment)
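
The timings quoted above mix two units (seconds per iteration and iterations per second), which makes the batch sizes hard to compare directly. A rough way to put them on the same footing is to convert each to rows joined per second. The sketch below does only that arithmetic on the figures from the comment; it assumes the unlabelled 1k values are all in it/s, and the names `SECONDS_PER_ITERATION`, `ITERATIONS_PER_SECOND` and `rows_per_second` are made up for illustration.

```python
from statistics import mean

# Timings copied from the quoted comment, keyed by chunk_size (rows per batch).
# The 50k and 10k runs were reported in seconds per iteration.
SECONDS_PER_ITERATION = {
    50_000: [34.52, 33.90, 33.72],
    10_000: [9.13, 6.87, 6.99, 7.08, 6.90],
}
# The 1k run was reported in iterations per second
# (assumption: every value on that line uses the same unit).
ITERATIONS_PER_SECOND = {
    1_000: [1.46, 1.59, 1.41, 1.39, 1.43, 1.46, 1.40],
}


def rows_per_second() -> dict[int, float]:
    """Convert both timing formats to mean rows joined per second."""
    throughput = {}
    for chunk_size, timings in SECONDS_PER_ITERATION.items():
        # rows per batch divided by seconds per batch
        throughput[chunk_size] = chunk_size / mean(timings)
    for chunk_size, rates in ITERATIONS_PER_SECOND.items():
        # rows per batch multiplied by batches per second
        throughput[chunk_size] = chunk_size * mean(rates)
    return throughput


if __name__ == "__main__":
    for chunk_size, rate in sorted(rows_per_second().items()):
        print(f"chunk_size={chunk_size:>6}: ~{rate:,.0f} rows/s")
```

On these figures the three batch sizes land in a similar range of rows per second, which may mean that batching by itself does not shorten the total run by much. Timing the unbatched join on the same data would be needed to confirm that.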
