-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for model-build failure due to presence of survey inputs as a dictionary #954
Merged
Merged
Changes from 2 commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that this is a performance check, but you should clarify why you are doing it. Note also that there is a long-term plan to unify
multilabel
andsurvey
data so that both of them are stored astrip_user_input
. That will make the rest of the code much simpler because we don't need to special case input handling in general.It is OK to not handle that now, since we haven't yet identified exactly what that would look like.
But we should flag a TODO here to handle it in the future, or handle it right now by a more principled check to distinguish between survey and multilabel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I saw that the survey inputs were being stored in
trip_user_input
, I decided to run the filter operation only on dataframes containing those inputs. So I didn't mean for it to be a performance check, but might be helpful.With regards the difference between multilabel and survey, I looked up for documentation but didn't really find much info - either in the codebase or on GitHub.
There's one I found from your explanation in Satyam's work on "survey assist" in this issue:
I understand we do plan to combine multilabel and survey under one roof, but I have these queries mainly:
I found some code in
emission/tests/analysisTests/userInputTests/TestUserInputFakeData.py
.I noticed survey inputs have these keys:
manual/trip_user_input, manual/trip_addition_input
.While multilabel inputs have these ones:
manual/mode_confirm, manual/purpose_confirm
.However, I see that this is raw data and not processed data present in bins, dataframes or in the part of the model build pipeline in generate_predictions().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I say this is because at this stage survey inputs are still reaching this stage as a list of dictionary with XML data:
While we have the multilabel data as the labels simply as a list of tuples.
So, I can't really see how to check whether we have multilabels or survey inputs if we do have them both under 'trip_user_input'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is not a performance check, then why did you need it?
Basically, since you are already handling mixed inputs, it seems like the check for
trip_user_input
is unnecessary unless you want to use it as a performance fix.We will investigate enketo survey-based label assist models once @humbleOldSage is done with switching the label inputs to random forest.
My comment was:
"But we should flag a TODO here to handle it in the future, or handle it right now by a more principled check to distinguish between survey and multilabel"
So your options are:
user_input
data model for different configurations e-mission-docs#1045dict
check without the initialtrip_user_input
checkThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand what you mean now, with regards to the if check for the column name.
However, the way I’m filtering the values is by accessing the
trip_user_input
column of the data frame like df[col_name] before applying the filter:user_label_df['trip_user_input']
As the current code stands and inspecting the data coming in the dataframe, not all bins have the trip_user_input column, some have only mode_confirm, purpose_confirm as columns.
Without the initial check it raises a key error if the column doesn’t exist.
Once the unification is done under
trip_user_input
column, then yes, removing the if check would be alright, as we can be assured that the ‘trip_user_input’ column (or whatever unified name we decided on) will definitely exist in all data frames.P.S. I've modified the filtering approach as mentioned in below comments.