Fix duplicated UPRN column after reweighting #88

crispy-wonton · 2024-11-27T17:58:09Z

Reweighting outputs two UPRN columns: UPRN and UPRN_right.

UPRN_right seems to come from the reweighting pipeline, because it already exists in the dataset that we load at the beginning of run_add_features.py. I think it originates from the outer join we use to join the weights to epc_df. This is expected polars behaviour that I forgot to account for.

This means that rows with nulls in UPRN_right should be missing weights. They will be missing weights for one of the following reasons:

- They are in an LSOA which was skipped because we are missing census data for it (mainly affects Scotland, but also England and Wales in some instances).
- We had to drop the row for weighting because the row had a category which was not found in the target data for that LSOA. For example, if the property type of the row is 'flat' but the census has 0% flats for that LSOA, the row will be dropped.
- They are in Scotland. This dataset was run before reweighting for Scotland was added, so all Scottish rows should be missing a weight.

Here is the count of rows with missing UPRN_right for each country:

>>> epc.filter(pl.col("UPRN_right").is_null())["COUNTRY"].value_counts()
shape: (3, 2)
┌──────────┬─────────┐
│ COUNTRY  ┆ count   │
│ ---      ┆ ---     │
│ cat      ┆ u32     │
╞══════════╪═════════╡
│ Wales    ┆ 38285   │
│ England  ┆ 788475  │
│ Scotland ┆ 1473612 │
└──────────┴─────────┘

All of them have null weights, as expected:

>>> epc.filter(pl.col("UPRN_right").is_null())["weight"].value_counts()
shape: (1, 2)
┌────────┬─────────┐
│ weight ┆ count   │
│ ---    ┆ ---     │
│ f64    ┆ u32     │
╞════════╪═════════╡
│ null   ┆ 2300372 │
└────────┴─────────┘

Originally posted by @crispy-wonton in #70 (comment)

crispy-wonton added the "bug" label (Something isn't working) on Nov 27, 2024
