-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for model-build failure due to presence of survey inputs as a dictionary #954
Conversation
The idea is to check the data type using isinstance() and then apply this check on the entire data frame as a whole instead of doing it iteratively on each row which is much slower. These rows are then filtered out of the original dataframe leaving behind only the non-dict rows.
Added log statement to the greedy_similarity_binning, to indicate filtering is being done for the dictionary elements in the dataframe if the column is 'trip_user_input'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs more thought and documentation (see below).
emission/analysis/modelling/trip_model/greedy_similarity_binning.py
Outdated
Show resolved
Hide resolved
@@ -295,6 +295,10 @@ def _generate_predictions(self): | |||
# compute unique label sets and their probabilities in one cluster | |||
# 'p' refers to probability | |||
group_cols = user_label_df.columns.tolist() | |||
# Filtering out rows from the user_label_df if they are dictionary objects which come from the survey inputs provided by the users instead of multilabels | |||
if 'trip_user_input' in group_cols: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that this is a performance check, but you should clarify why you are doing it. Note also that there is a long-term plan to unify multilabel
and survey
data so that both of them are stored as trip_user_input
. That will make the rest of the code much simpler because we don't need to special case input handling in general.
It is OK to not handle that now, since we haven't yet identified exactly what that would look like.
But we should flag a TODO here to handle it in the future, or handle it right now by a more principled check to distinguish between survey and multilabel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I saw that the survey inputs were being stored in trip_user_input
, I decided to run the filter operation only on dataframes containing those inputs. So I didn't mean for it to be a performance check, but might be helpful.
With regards the difference between multilabel and survey, I looked up for documentation but didn't really find much info - either in the codebase or on GitHub.
There's one I found from your explanation in Satyam's work on "survey assist" in this issue:
"the match for non MULTILABEL programs will be survey responses, in XML and JSON, not a tuple of labels."
I understand we do plan to combine multilabel and survey under one roof, but I have these queries mainly:
- Would the dict type entries for survey inputs be considered for building models in the future or will they need to be ignored in the future too (in which case the filtering still works) ?
- On what basis to distinguish?
I found some code in emission/tests/analysisTests/userInputTests/TestUserInputFakeData.py
.
# multi entry multilabel
multi_entry_multilabel = [
{"metadata": {"key": "manual/mode_confirm", "write_ts": 1, "write_fmt_time": "1"},
"data": {"start_ts": 8, "label": "foo"}},
...]
# multi entry survey
multi_entry_multilabel = [
{"metadata": {"key": "manual/trip_user_input", "write_ts": 1, "write_fmt_time": "1"},
"data": {"xmlResponse": "<foo></foo>", "start_ts": 8, "start_fmt_time": 8}},
...]
I noticed survey inputs have these keys: manual/trip_user_input, manual/trip_addition_input
.
While multilabel inputs have these ones: manual/mode_confirm, manual/purpose_confirm
.
However, I see that this is raw data and not processed data present in bins, dataframes or in the part of the model build pipeline in generate_predictions().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I say this is because at this stage survey inputs are still reaching this stage as a list of dictionary with XML data:
2024-01-25 19:51:29,027:DEBUG:{'_id': ObjectId('640bd0ee80ea0c11040a8def'), 'user_id': UUID('9c084ef4-2f97-4196-bd37-950c17938ec6'), 'metadata': {'key': 'manual/trip_user_input', 'platform': 'android', 'read_ts': 0, 'time_zone': 'America/Los_Angeles', 'type': 'message', 'write_ts': 1678409128.236, 'write_local_dt': {'year': 2023, 'month': 3, 'day': 9, 'hour': 16, 'minute': 45, 'second': 28, 'weekday': 3, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2023-03-09T16:45:28.236000-08:00'}, 'data': {'label': '1 purpose, 1 mode', 'name': 'TripConfirmSurvey', 'version': 1.2, 'xmlResponse': '<data xmlns:jr="http://openrosa.org/javarosa" xmlns:odk="http://www.opendatakit.org/xforms" xmlns:orx="http://openrosa.org/xforms" id="snapshot_xml">\n <start>2023-03-09T16:45:05.143-08:00</start>\n <end>2023-03-09T16:45:05.144-08:00</end>\n <destination_purpose>buy_something</destination_purpose>\n <travel_mode>bus</travel_mode>\n <Total_people_in_trip_party>2</Total_people_in_trip_party>\n <Non_household_member_s_on_trip>0</Non_household_member_s_on_trip>\n <Vehicle_trip_Parking_location>1</Vehicle_trip_Parking_location>\n <Parking_cost>0</Parking_cost>\n <Total_toll_charges_p_uring_the_trip_AUD>0</Total_toll_charges_p_uring_the_trip_AUD>\n <Transit_fees_AUD>2.5</Transit_fees_AUD>\n <Taxi_fees/>\n <meta>\n <instanceID>uuid:1faa14c9-7d59-4293-bbaf-6efedda67295</instanceID>\n </meta>\n </data>', 'jsonDocResponse': {'data': {'attr': {'xmlns:jr': 'http://openrosa.org/javarosa', 'xmlns:odk': 'http://www.opendatakit.org/xforms', 'xmlns:orx': 'http://openrosa.org/xforms', 'id': 'snapshot_xml'}, 'start': '2023-03-09T16:45:05.143-08:00', 'end': '2023-03-09T16:45:05.144-08:00', 'destination_purpose': 'buy_something', 'travel_mode': 'bus', 'Total_people_in_trip_party': '2', 'Non_household_member_s_on_trip': '0', 'Vehicle_trip_Parking_location': '1', 'Parking_cost': '0', 'Total_toll_charges_p_uring_the_trip_AUD': '0', 'Transit_fees_AUD': '2.5', 'Taxi_fees': '', 'meta': {'attr': {}, 'instanceID': 'uuid:1faa14c9-7d59-4293-bbaf-6efedda67295'}}}, 'start_ts': 1673820507.3668675, 'end_ts': 1673823369.047, 'match_id': 'e566bdbb-5f6e-42cd-a3ba-ba9cb6201f8d', 'start_local_dt': {'year': 2023, 'month': 1, 'day': 15, 'hour': 14, 'minute': 8, 'second': 27, 'weekday': 6, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2023-01-15T14:08:27.366868-08:00', 'end_local_dt': {'year': 2023, 'month': 1, 'day': 15, 'hour': 14, 'minute': 56, 'second': 9, 'weekday': 6, 'timezone': 'America/Los_Angeles'}, 'end_fmt_time': '2023-01-15T14:56:09.047000-08:00'}}
While we have the multilabel data as the labels simply as a list of tuples.
2024-01-23 16:15:02,195:INFO:map_labels_mode: no replaced mode column found, early return
2024-01-23 16:15:02,200:DEBUG:User_label_df
mode_confirm purpose_confirm
0 shared_ride shopping
1 shared_ride shopping
2 shared_ride shopping
3 shared_ride shopping
4 shared_ride shopping
.. ... ...
348 shared_ride shopping
349 shared_ride shopping
350 shared_ride shopping
351 shared_ride shopping
352 shared_ride shopping
[353 rows x 2 columns]
2024-01-23 16:15:02,200:DEBUG:group cols : ['mode_confirm', 'purpose_confirm']
2024-01-23 16:15:02,206:DEBUG:unique_labels
mode_confirm purpose_confirm uniqcount
0 shared_ride shopping 353
So, I can't really see how to check whether we have multilabels or survey inputs if we do have them both under 'trip_user_input'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I saw that the survey inputs were being stored in trip_user_input, I decided to run the filter operation only on dataframes containing those inputs. So I didn't mean for it to be a performance check, but might be helpful.
If it is not a performance check, then why did you need it?
- For surveys the column will be a dict anyway and be caught by the check within the if
- For non-surveys, the column will not be a dict and will not be filtered
Basically, since you are already handling mixed inputs, it seems like the check for trip_user_input
is unnecessary unless you want to use it as a performance fix.
user_label_df = user_label_df.loc[user_label_df['trip_user_input'].apply(lambda x: not isinstance(x, dict))]
Would the dict type entries for survey inputs be considered for building models in the future or will they need to be ignored in the future too (in which case the filtering still works) ?
We will investigate enketo survey-based label assist models once @humbleOldSage is done with switching the label inputs to random forest.
So, I can't really see how to check whether we have multilabels or survey inputs if we do have them both under 'trip_user_input'.
My comment was:
"But we should flag a TODO here to handle it in the future, or handle it right now by a more principled check to distinguish between survey and multilabel"
So your options are:
- You could just flag it as a TODO pending design and resolution of Unify
user_input
data model for different configurations e-mission-docs#1045 - You could use the
dict
check without the initialtrip_user_input
check - You could check to see what the keys of the dict are....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand what you mean now, with regards to the if check for the column name.
However, the way I’m filtering the values is by accessing the trip_user_input
column of the data frame like df[col_name] before applying the filter: user_label_df['trip_user_input']
As the current code stands and inspecting the data coming in the dataframe, not all bins have the trip_user_input column, some have only mode_confirm, purpose_confirm as columns.
Without the initial check it raises a key error if the column doesn’t exist.
Once the unification is done under trip_user_input
column, then yes, removing the if check would be alright, as we can be assured that the ‘trip_user_input’ column (or whatever unified name we decided on) will definitely exist in all data frames.
P.S. I've modified the filtering approach as mentioned in below comments.
I identified another issue which is resulting in empty dataframes despite there being multilabel data. The data is stored like this:
But on running So, on inspecting the data, I noticed an issue wherein empty dataframes were being generated for those bins having a combination of mode_confirm, purpose_confirm and trip_user_input in the list of dictionaries of labels. To summarize, currently identified issues are:
|
To handle both the current issues, my fix involves: Solved the empty data frame issue by filtering the list of dictionaries before even converting to data frame.
This filtered list of dictionaries is used for conversion to data frame which should not break any more as the data frame should only contain the labels as columns and values as non-dict values. On testing locally for a bin and the values contained under
|
Now filtering survey inputs before dataframe itself is created by checking whether each dictionary value is a value or a nested dictionary.
Possible future issue:
While the latest filtering approach works for the current issues, it would fail when we unify the labels and store as dictionaries. I thought a possible fix for future issue could be to store the dictionaries with keys matching multilabel keys only and then convert them to a data frame. But this would again go back to the original issue of having dictionaries which breaks the code when applying data frame functions. So will need to think of this while the unification is ongoing. Also, found user input keys in this config file; might be of help if making changes at a global level. |
@MukuFlash03 you still haven't added the TODO to flag this as an area to revisit when we unify. From #954 (comment)
What if we only get to the unification after you leave? What if we only get the unification after I leave? I am happy at NREL but I am not an indentured servant. I can theoretically quit tomorrow and go back to work in industry. I don't want to hold this up further so I am adding the TODO myself. |
Squash-merging to avoid messy commit history, please pull accordingly |
This PR is for fixing the failures identified in the model building pipeline in the AWS Cloudwatch logs as highlighted in this issue.
The goal is to fix the "TypeError: unhashable type: 'dict'" error.
The reason for this was that some users had data containing survey inputs in the form of a dictionary instead of multilabels which was breaking the groupby() function applied on a dataframe.
The fix involves checking the data type using isinstance() and then apply this check on the entire data frame as a whole instead of doing it iteratively on each row which is much slower.
These rows are then filtered out of the original dataframe leaving behind only the non-dict rows.
The groupby() then works on this new filtered dataframe which should now work and build models.