BUGFIX avg garden size #85

Merged
merged 5 commits into from
Jan 8, 2025

Collaborator

@crispy-wonton crispy-wonton commented Nov 26, 2024

Fixes #84

Please could you double check the reasoning in the issue description and that the bugfix is appropriate.

I have tested it and it works. If you want to test it, run:

python -i asf_heat_pump_suitability/pipeline/run_scripts/run_add_features.py --epc_path s3://asf-heat-pump-suitability/outputs/2023Q4/20240824_2023_Q4_EPC_weighted.parquet -y 2023 -q 4

Before running, I would advise commenting out everything from line 106 onwards in run_add_features.py and then checking the output of epc_df in the terminal. This will be much quicker.

Outputs from my test:

>>> epc_df["msoa_avg_outdoor_space_property_type"].unique()
shape: (3,)
Series: 'msoa_avg_outdoor_space_property_type' [str]
[
	"unknown"
	"Flats"
	"Houses"
]

>>> garden_space_avg_msoa_df["msoa_avg_outdoor_space_property_type"].unique()
shape: (3,)
Series: 'msoa_avg_outdoor_space_property_type' [str]
[
	"Houses"
	"Flats"
	"unknown"
]

We can see that the categories of msoa_avg_outdoor_space_property_type now match between the supplementary dataset (garden_space_avg_msoa_df) and the EPC dataset (epc_df).
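The comparison above can also be written as an explicit check. This is a minimal sketch, assuming the unique values of each column have been pulled out as plain Python lists (for example with `.unique().to_list()` on each polars Series). The helper name `categories_match` is hypothetical and is not part of the repository:

```python
def categories_match(left, right):
    """Compare two iterables of category labels, ignoring order.

    Returns a tuple: (labels match, labels only in left, labels only in right).
    """
    left_set, right_set = set(left), set(right)
    return left_set == right_set, left_set - right_set, right_set - left_set


# Values copied from the two outputs above; only the order differs.
epc_categories = ["unknown", "Flats", "Houses"]
garden_categories = ["Houses", "Flats", "unknown"]

match, only_in_epc, only_in_garden = categories_match(epc_categories, garden_categories)
print(match, only_in_epc, only_in_garden)
```

Returning the two set differences as well as the boolean makes it easier to see which label is misspelled or missing when the check fails.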

@crispy-wonton crispy-wonton requested a review from lizgzil January 2, 2025 15:57
lizgzil previously approved these changes Jan 7, 2025
Collaborator

@lizgzil lizgzil left a comment

Makes more sense! The code ran fine for me, with this output:

epc_df["msoa_avg_outdoor_space_property_type"].unique()
shape: (3,)
Series: 'msoa_avg_outdoor_space_property_type' [str]
[
	"Houses"
	"unknown"
	"Flats"
]

@crispy-wonton
Collaborator Author

crispy-wonton commented Jan 8, 2025

@lizgzil this is ready for another review.

I had to change the way the function identifies houses. This function is used in the run_add_features.py script. Because of changes to the processing of the EPC property_type column that have now been merged to dev, the property_type values for detached, semi-detached, and terraced houses no longer contain the substring "house" as they did before, so these property types now have to be identified differently.
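To illustrate the kind of change described, here is a minimal sketch rather than the actual function in the repository. The name `outdoor_space_category` and both keyword lists are assumptions; the idea is to match on built-form keywords, case-insensitively, instead of looking for the substring "house":

```python
from typing import Optional

# Hypothetical keyword lists; the real function may use different rules.
HOUSE_KEYWORDS = ("detached", "terraced")  # "detached" also matches "semi-detached"
FLAT_KEYWORDS = ("flat", "maisonette", "apartment")


def outdoor_space_category(property_type: Optional[str]) -> str:
    """Map a processed property_type value to "Houses", "Flats" or "unknown"."""
    if property_type is None:
        return "unknown"
    value = property_type.lower()  # compare in lower case to avoid case mismatches
    if any(keyword in value for keyword in HOUSE_KEYWORDS):
        return "Houses"
    if any(keyword in value for keyword in FLAT_KEYWORDS):
        return "Flats"
    return "unknown"


print(outdoor_space_category("Semi-detached"))
```

Lower-casing first matters here: a case-sensitive search for "Detached" would not match "Semi-detached", because the "d" after the hyphen is lower case.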

I have tested the run_add_features.py script as before (this time commenting out everything from line 138 onwards); see the results below:

python -i asf_heat_pump_suitability/pipeline/run_scripts/run_add_features.py --epc s3://asf-daps/lakehouse/processed/epc/old/deduplicated/processed_dedupl-0.parquet -y 2023 -q 4

>>> epc_df["msoa_avg_outdoor_space_property_type"].value_counts()
shape: (3, 2)
┌─────────────────────────────────┬─────────┐
│ msoa_avg_outdoor_space_propert… ┆ count   │
│ ---                             ┆ ---     │
│ str                             ┆ u32     │
╞═════════════════════════════════╪═════════╡
│ Flats                           ┆ 6204920 │
│ Houses                          ┆ 9112213 │
│ unknown                         ┆ 4910012 │
└─────────────────────────────────┴─────────┘

@lizgzil
Collaborator

lizgzil commented Jan 8, 2025

hey @crispy-wonton - I just ran:

import polars as pl

epc_path = 's3://asf-daps/lakehouse/processed/epc/old/deduplicated/processed_dedupl-0.parquet'
epc_df = pl.read_parquet(epc_path)
list(epc_df['PROPERTY_TYPE'].unique())

and got ['Bungalow', 'House', 'Maisonette', 'Flat', 'Park home']. Am I using the right EPC file?

@crispy-wonton
Collaborator Author

That's the input EPC data file. To generate property_type from this EPC dataset, we combine the PROPERTY_TYPE and BUILT_FORM columns. If you want to look at the results, I have just created and saved a test file:

s3://asf-heat-pump-suitability/outputs/2023Q4/20250108_2023_Q4_TEST.parquet

Results:

>>> import polars as pl
>>> epc_df = pl.read_parquet("s3://asf-heat-pump-suitability/outputs/2023Q4/20250108_2023_Q4_TEST.parquet")
>>> epc_df["property_type"].value_counts()
shape: (6, 2)
┌─────────────────────────────────┬─────────┐
│ property_type                   ┆ count   │
│ ---                             ┆ ---     │
│ str                             ┆ u32     │
╞═════════════════════════════════╪═════════╡
│ Terraced (including end-terrac… ┆ 5292277 │
│ Flat, maisonette or apartment   ┆ 6204920 │
│ Detached                        ┆ 3819936 │
│ null                            ┆ 60589   │
│ Caravan or other mobile or tem… ┆ 10557   │
│ Semi-detached                   ┆ 4838866 │
└─────────────────────────────────┴─────────┘
>>> epc_df["msoa_avg_outdoor_space_property_type"].value_counts()
shape: (3, 2)
┌─────────────────────────────────┬─────────┐
│ msoa_avg_outdoor_space_propert… ┆ count   │
│ ---                             ┆ ---     │
│ str                             ┆ u32     │
╞═════════════════════════════════╪═════════╡
│ Houses                          ┆ 9112213 │
│ Flats                           ┆ 6204920 │
│ unknown                         ┆ 4910012 │
└─────────────────────────────────┴─────────┘
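As a sanity check, the two tables above can be reconciled arithmetically. This is only a sketch using the counts as posted; the dictionary keys are shortened labels chosen for illustration, not the full category strings. With these numbers, Detached plus Terraced equals the Houses count, and Semi-detached plus null plus Caravan equals the unknown count. That is consistent with semi-detached rows being labelled unknown rather than Houses, though the counts alone do not prove it, so it may be worth confirming against the function.

```python
# Counts copied from the property_type table above (keys are shortened labels).
property_type_counts = {
    "Terraced": 5292277,
    "Flat": 6204920,
    "Detached": 3819936,
    "null": 60589,
    "Caravan": 10557,
    "Semi-detached": 4838866,
}
# Counts copied from the msoa_avg_outdoor_space_property_type table above.
outdoor_space_counts = {"Houses": 9112213, "Flats": 6204920, "unknown": 4910012}

total_rows = sum(property_type_counts.values())
detached_plus_terraced = (
    property_type_counts["Detached"] + property_type_counts["Terraced"]
)
# Everything that is neither detached, terraced, nor a flat.
remainder = total_rows - detached_plus_terraced - property_type_counts["Flat"]

print(total_rows == sum(outdoor_space_counts.values()))  # both tables cover the same rows
print(detached_plus_terraced == outdoor_space_counts["Houses"])
print(remainder == outdoor_space_counts["unknown"])
```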

Collaborator

@lizgzil lizgzil left a comment

@crispy-wonton thanks! This looks good! I got the same results as you. My previous comment came from some confusion after not having pulled the branch properly!

@crispy-wonton crispy-wonton merged commit a6c6c23 into dev Jan 8, 2025
Development

Successfully merging this pull request may close these issues.

BUG in code for avg MSOA garden size
2 participants