Fix duplicated UPRN column after reweighting #88

Open
crispy-wonton opened this issue Nov 27, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@crispy-wonton
Collaborator

Reweighting outputs two UPRN columns: UPRN and UPRN_right.

The extra column seems to come from the reweighting pipeline, because it already exists in the dataset that we load at the beginning of run_add_features.py. I think it originates from the outer join we use to join the weights to epc_df: the right-hand join key is kept as a separate column with the _right suffix. This is expected polars behaviour that I forgot to account for.

This means that rows with a null UPRN_right should be missing weights, for one of the following reasons:

  • They are in an LSOA which was skipped because we are missing census data for it (mainly affects Scotland, but also England and Wales in some instances)
  • We had to drop the row before weighting because it had a category which was not found in the target data for that LSOA. E.g. if the property type of the row is 'flat' but the census has 0% flats for that LSOA, the row will be dropped.
  • They are in Scotland. This dataset was generated before reweighting for Scotland was added, so all Scottish rows should be missing a weight.

Here is the count of rows with missing UPRN_right for each country:

>>> epc.filter(pl.col("UPRN_right").is_null())["COUNTRY"].value_counts()
shape: (3, 2)
┌──────────┬─────────┐
│ COUNTRY  ┆ count   │
│ ---      ┆ ---     │
│ cat      ┆ u32     │
╞══════════╪═════════╡
│ Wales    ┆ 38285   │
│ England  ┆ 788475  │
│ Scotland ┆ 1473612 │
└──────────┴─────────┘

All of them have null weights, as expected:

>>> epc.filter(pl.col("UPRN_right").is_null())["weight"].value_counts()
shape: (1, 2)
┌────────┬─────────┐
│ weight ┆ count   │
│ ---    ┆ ---     │
│ f64    ┆ u32     │
╞════════╪═════════╡
│ null   ┆ 2300372 │
└────────┴─────────┘

Originally posted by @crispy-wonton in #70 (comment)

@crispy-wonton crispy-wonton added the bug Something isn't working label Nov 27, 2024