Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
65 reweighting Scottish EPC records #90
base: dev
Are you sure you want to change the base?
65 reweighting Scottish EPC records #90
Changes from all commits
13b82be
a489ed2
3f57d9f
0ccc110
5e2b02a
7bc5030
8daf295
20be010
ae39b1e
82861c7
a2026d7
288ff37
12d3d4c
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why we don't include Scotland in the
get_df_target_nrooms()
function?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good spot thank you! At the moment we don't use the
get_df_target_nrooms()
function at all but I left it in the code from when I originally wrote it in case we wanted to implement it at a later point. I will add a # TODO note to flag that Scotland would need to be added if we do use it.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be misunderstanding this, but why are we renaming this from "Type of accommodation" to "lsoa"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What kind of number are we talking about here? Any idea why this might be the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was 4 rows in the target dataset (equivalent to 4 Data Zones). I'm not sure why this is the case but I assume this is an error in the NRS dataset as I checked, and it is present in the raw data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is to do on this? Raise an error when this occurs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See line below :)
# TODO we need to check if len(sample) > 0 and only proceed with remaining code if it is
I don't think it should error out.
Basically there is an edge case where all the rows from the EPC sample for a specific LSOA are removed in the preprocessing for weighting. The case I found was a Scottish Data Zone which had only 1 EPC record left in the sample by the time it got to this stage of the reweighting pipeline.
At this point in the code where I left these comments, the code checks that the EPC subsample (in this case 1 row) only has categories that appear in the target (census) data for that LSOA. E.g. let's say the row has property type == flat. But in the census data maybe this LSOA has a count of 0 for flats. The current pipeline will remove all rows with flats from the EPC subsample, because we can't reweight something to 0 (because we are reweighting in multiple dimensions).
Therefore in this edge case, all rows are removed.
The next step is to check which categories are present in the target but missing from the sample. E.g. the target data may show the LSOA has 500 detached houses. Then the pipeline will append dummy rows with these missing variables to the subsample. E.g. it will try to add a dummy row with property type == detached house.
The error is actually caused in this edge case by this dummy generation. The code attempts to append a dummy row onto a dataframe, but as the dataframe has no data it fails.
Ultimately, we need to update it so that the pipeline recognises when this happens and then skips reweighting an LSOA in this case. If there are no rows, there is no data left to reweight anyway.