
Fix for model-build failure due to presence of survey inputs as a dictionary #954

Merged (4 commits into e-mission:master on Feb 10, 2024)

Conversation

MukuFlash03 (Contributor)

This PR fixes the failures identified in the model-building pipeline in the AWS CloudWatch logs, as highlighted in this issue.

The goal is to fix the "TypeError: unhashable type: 'dict'" error.
The root cause was that some users had data containing survey inputs in the form of a dictionary instead of multilabels, which was breaking the groupby() function applied to a dataframe.

The fix checks the data type using isinstance(), applying the check to the entire dataframe at once instead of iterating over each row, which is much slower. The dict rows are then filtered out of the original dataframe, leaving behind only the non-dict rows. groupby() then operates on this new filtered dataframe, which should now work and build models.

A log statement was also added to greedy_similarity_binning to indicate that filtering of the dictionary elements in the dataframe is being done when the column is 'trip_user_input'.
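As a rough sketch of this filter-then-groupby flow (illustrative only, not the exact PR code; the column values here are made up, and the `unique_labels`/`uniqcount` names are borrowed from the debug logs later in this thread):

```python
import pandas as pd

# Hypothetical user_label_df mixing multilabel rows with a survey row,
# whose 'trip_user_input' value is a dict (the unhashable type).
user_label_df = pd.DataFrame({
    "mode_confirm": ["shared_ride", "shared_ride", None],
    "purpose_confirm": ["shopping", "shopping", None],
    "trip_user_input": [None, None, {"xmlResponse": "<data/>"}],
})

# Vectorized isinstance() check over the whole column instead of a row loop.
keep = user_label_df["trip_user_input"].apply(lambda x: not isinstance(x, dict))
filtered_df = user_label_df[keep]

# groupby() now succeeds because no unhashable dict values remain.
unique_labels = (filtered_df.groupby(["mode_confirm", "purpose_confirm"])
                 .size().reset_index(name="uniqcount"))
```

Without the filter, the groupby() call on the third row's dict value is what raises the "unhashable type: 'dict'" error.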
shankari (Contributor) left a comment:

This needs more thought and documentation (see below).

@@ -295,6 +295,10 @@ def _generate_predictions(self):
# compute unique label sets and their probabilities in one cluster
# 'p' refers to probability
group_cols = user_label_df.columns.tolist()
# Filtering out rows from the user_label_df if they are dictionary objects which come from the survey inputs provided by the users instead of multilabels
if 'trip_user_input' in group_cols:
shankari (Contributor):

I assume that this is a performance check, but you should clarify why you are doing it. Note also that there is a long-term plan to unify multilabel and survey data so that both of them are stored as trip_user_input. That will make the rest of the code much simpler because we don't need to special case input handling in general.

It is OK to not handle that now, since we haven't yet identified exactly what that would look like.
But we should flag a TODO here to handle it in the future, or handle it right now by a more principled check to distinguish between survey and multilabel.

MukuFlash03 (Contributor, Author) commented Feb 2, 2024:

Since I saw that the survey inputs were being stored in trip_user_input, I decided to run the filter operation only on dataframes containing those inputs. So I didn't mean for it to be a performance check, but might be helpful.

With regard to the difference between multilabel and survey, I looked for documentation but didn't really find much info, either in the codebase or on GitHub.
The one explanation I found is from your comment on Satyam's work on "survey assist" in this issue:

"the match for non MULTILABEL programs will be survey responses, in XML and JSON, not a tuple of labels."

I understand we do plan to combine multilabel and survey under one roof, but I mainly have these queries:

  1. Would the dict-type entries for survey inputs be considered for building models in the future, or will they need to be ignored in the future too (in which case the filtering still works)?

  2. On what basis should we distinguish them?

I found some code in emission/tests/analysisTests/userInputTests/TestUserInputFakeData.py.

        # multi entry multilabel
        multi_entry_multilabel = [
            {"metadata": {"key": "manual/mode_confirm", "write_ts": 1, "write_fmt_time": "1"},
             "data": {"start_ts": 8, "label": "foo"}},
            ...]

        # multi entry survey
        multi_entry_survey = [
            {"metadata": {"key": "manual/trip_user_input", "write_ts": 1, "write_fmt_time": "1"},
             "data": {"xmlResponse": "<foo></foo>", "start_ts": 8, "start_fmt_time": 8}},
            ...]

I noticed survey inputs have these keys: manual/trip_user_input, manual/trip_addition_input.
While multilabel inputs have these ones: manual/mode_confirm, manual/purpose_confirm.

However, I see that this is raw data and not processed data present in bins, dataframes or in the part of the model build pipeline in generate_predictions().
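For what it's worth, a more principled check on the raw entries could be keyed off those metadata keys. This is only a sketch (the key sets are assumed from the excerpts above, and as noted, processed dataframes at the generate_predictions() stage no longer carry this metadata):

```python
# Keys observed on multilabel vs survey entries in the raw data above.
MULTILABEL_KEYS = {"manual/mode_confirm", "manual/purpose_confirm"}
SURVEY_KEYS = {"manual/trip_user_input", "manual/trip_addition_input"}

def classify_entry(entry):
    """Return 'multilabel', 'survey', or 'unknown' based on the metadata key."""
    key = entry.get("metadata", {}).get("key")
    if key in MULTILABEL_KEYS:
        return "multilabel"
    if key in SURVEY_KEYS:
        return "survey"
    return "unknown"
```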


MukuFlash03 (Contributor, Author):

The reason I say this is that, at this stage, survey inputs are still reaching this point as a list of dictionaries with XML data:

2024-01-25 19:51:29,027:DEBUG:{'_id': ObjectId('640bd0ee80ea0c11040a8def'), 'user_id': UUID('9c084ef4-2f97-4196-bd37-950c17938ec6'), 'metadata': {'key': 'manual/trip_user_input', 'platform': 'android', 'read_ts': 0, 'time_zone': 'America/Los_Angeles', 'type': 'message', 'write_ts': 1678409128.236, 'write_local_dt': {'year': 2023, 'month': 3, 'day': 9, 'hour': 16, 'minute': 45, 'second': 28, 'weekday': 3, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2023-03-09T16:45:28.236000-08:00'}, 'data': {'label': '1 purpose, 1 mode', 'name': 'TripConfirmSurvey', 'version': 1.2, 'xmlResponse': '<data xmlns:jr="http://openrosa.org/javarosa" xmlns:odk="http://www.opendatakit.org/xforms" xmlns:orx="http://openrosa.org/xforms" id="snapshot_xml">\n          <start>2023-03-09T16:45:05.143-08:00</start>\n          <end>2023-03-09T16:45:05.144-08:00</end>\n          <destination_purpose>buy_something</destination_purpose>\n          <travel_mode>bus</travel_mode>\n          <Total_people_in_trip_party>2</Total_people_in_trip_party>\n          <Non_household_member_s_on_trip>0</Non_household_member_s_on_trip>\n          <Vehicle_trip_Parking_location>1</Vehicle_trip_Parking_location>\n          <Parking_cost>0</Parking_cost>\n          <Total_toll_charges_p_uring_the_trip_AUD>0</Total_toll_charges_p_uring_the_trip_AUD>\n          <Transit_fees_AUD>2.5</Transit_fees_AUD>\n          <Taxi_fees/>\n          <meta>\n            <instanceID>uuid:1faa14c9-7d59-4293-bbaf-6efedda67295</instanceID>\n          </meta>\n        </data>', 'jsonDocResponse': {'data': {'attr': {'xmlns:jr': 'http://openrosa.org/javarosa', 'xmlns:odk': 'http://www.opendatakit.org/xforms', 'xmlns:orx': 'http://openrosa.org/xforms', 'id': 'snapshot_xml'}, 'start': '2023-03-09T16:45:05.143-08:00', 'end': '2023-03-09T16:45:05.144-08:00', 'destination_purpose': 'buy_something', 'travel_mode': 'bus', 'Total_people_in_trip_party': '2', 'Non_household_member_s_on_trip': '0', 'Vehicle_trip_Parking_location': '1', 
'Parking_cost': '0', 'Total_toll_charges_p_uring_the_trip_AUD': '0', 'Transit_fees_AUD': '2.5', 'Taxi_fees': '', 'meta': {'attr': {}, 'instanceID': 'uuid:1faa14c9-7d59-4293-bbaf-6efedda67295'}}}, 'start_ts': 1673820507.3668675, 'end_ts': 1673823369.047, 'match_id': 'e566bdbb-5f6e-42cd-a3ba-ba9cb6201f8d', 'start_local_dt': {'year': 2023, 'month': 1, 'day': 15, 'hour': 14, 'minute': 8, 'second': 27, 'weekday': 6, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2023-01-15T14:08:27.366868-08:00', 'end_local_dt': {'year': 2023, 'month': 1, 'day': 15, 'hour': 14, 'minute': 56, 'second': 9, 'weekday': 6, 'timezone': 'America/Los_Angeles'}, 'end_fmt_time': '2023-01-15T14:56:09.047000-08:00'}}

The multilabel data, in contrast, reaches this stage simply as a list of label tuples:

2024-01-23 16:15:02,195:INFO:map_labels_mode: no replaced mode column found, early return
2024-01-23 16:15:02,200:DEBUG:User_label_df 
     mode_confirm purpose_confirm
0    shared_ride        shopping
1    shared_ride        shopping
2    shared_ride        shopping
3    shared_ride        shopping
4    shared_ride        shopping
..           ...             ...
348  shared_ride        shopping
349  shared_ride        shopping
350  shared_ride        shopping
351  shared_ride        shopping
352  shared_ride        shopping

[353 rows x 2 columns] 

2024-01-23 16:15:02,200:DEBUG:group cols : ['mode_confirm', 'purpose_confirm'] 

2024-01-23 16:15:02,206:DEBUG:unique_labels 
   mode_confirm purpose_confirm  uniqcount
0  shared_ride        shopping        353 

So, I can't really see how to check whether we have multilabels or survey inputs if we do have them both under 'trip_user_input'.

shankari (Contributor):

Since I saw that the survey inputs were being stored in trip_user_input, I decided to run the filter operation only on dataframes containing those inputs. So I didn't mean for it to be a performance check, but might be helpful.

If it is not a performance check, then why did you need it?

  • For surveys the column will be a dict anyway and be caught by the check within the if
  • For non-surveys, the column will not be a dict and will not be filtered

Basically, since you are already handling mixed inputs, it seems like the check for trip_user_input is unnecessary unless you want to use it as a performance fix.

user_label_df = user_label_df.loc[user_label_df['trip_user_input'].apply(lambda x: not isinstance(x, dict))]

Would the dict type entries for survey inputs be considered for building models in the future or will they need to be ignored in the future too (in which case the filtering still works) ?

We will investigate enketo survey-based label assist models once @humbleOldSage is done with switching the label inputs to random forest.

So, I can't really see how to check whether we have multilabels or survey inputs if we do have them both under 'trip_user_input'.

My comment was:

"But we should flag a TODO here to handle it in the future, or handle it right now by a more principled check to distinguish between survey and multilabel"

So your options are:

MukuFlash03 (Contributor, Author):

I understand what you mean now with regard to the if check for the column name.
However, the way I'm filtering the values is by accessing the trip_user_input column of the dataframe, like df[col_name], before applying the filter: user_label_df['trip_user_input'].

As the current code stands, on inspecting the data coming into the dataframe, not all bins have the trip_user_input column; some have only mode_confirm and purpose_confirm as columns.
Without the initial check, indexing raises a KeyError if the column doesn't exist.

Once the unification is done under the trip_user_input column, then yes, removing the if check would be fine, as we can be assured that the 'trip_user_input' column (or whatever unified name we decide on) will definitely exist in all dataframes.
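A sketch of that guarded filter (illustrative only; the function name is made up, and the PR later switched to filtering before dataframe creation):

```python
import pandas as pd

def filter_survey_rows(user_label_df):
    """Drop rows whose 'trip_user_input' value is a dict (a survey response).

    The column-existence guard handles bins that only have
    mode_confirm/purpose_confirm columns; without it,
    user_label_df['trip_user_input'] raises a KeyError for those bins.
    """
    if "trip_user_input" in user_label_df.columns:
        keep = user_label_df["trip_user_input"].apply(
            lambda x: not isinstance(x, dict))
        return user_label_df[keep]
    return user_label_df

# A bin with only multilabel columns passes through unchanged.
multilabel_only = pd.DataFrame({"mode_confirm": ["walk"],
                                "purpose_confirm": ["home"]})
assert filter_survey_rows(multilabel_only).equals(multilabel_only)
```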


P.S. I've modified the filtering approach as mentioned in below comments.

MukuFlash03 (Contributor, Author) commented Feb 8, 2024:

I identified another issue, which results in empty dataframes despite there being multilabel data.

The data is stored like this:

{"mode_confirm": <non-dict string value>, "purpose_confirm": <non-dict string value>}, {"trip_user_input": <dict value>}

But on running pd.DataFrame(bin_record['labels']), an empty dataframe was being generated.
I'm assuming the dataframe generation fails internally on getting a nested dictionary as the value for a column, in this case for the column 'trip_user_input'.

On inspecting the data, I noticed that empty dataframes were being generated for those bins having a combination of mode_confirm, purpose_confirm, and trip_user_input in the list of dictionaries of labels.
When I logged the actual data before it was converted to dataframes, I saw that the data was present in the bins in bin_record['labels'].


To summarize, the currently identified issues are:

  • trip_user_input has nested dictionary values, which fail with pandas dataframe functions.
  • Empty dataframes are generated when bin_record['labels'] has both multilabels and surveys in its list of dictionaries.

MukuFlash03 (Contributor, Author) commented Feb 8, 2024:

To handle both current issues, my fix involves:

I solved the empty-dataframe issue by filtering the list of dictionaries before even converting to a dataframe.
Here, I check for any nested dictionaries and skip them, while adding the non-nested dictionaries corresponding to multilabels to the filtered list.

filtered_label_dicts = [label_dict for label_dict in bin_record['labels'] if not any(isinstance(x, dict) for x in label_dict.values())]

This filtered list of dictionaries is used for the conversion to a dataframe, which should no longer break, as the dataframe will contain only the labels as columns and non-dict values.

On testing locally for a bin and the values contained under bin_record['labels']:

A = 2872 - Total number of entries
B = 2593 - Total number of filtered entries with only mode_confirm, purpose_confirm
C = 279  - Total number of entries with trip_user_input containing survey responses (nested dictionary values along with XML)
A = B + C
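A small self-contained illustration of this pre-dataframe filtering (the sample data is made up; the real local test had A=2872, B=2593, C=279):

```python
import pandas as pd

# Hypothetical bin_record['labels']: two multilabel entries plus one survey
# entry whose value is a nested dictionary.
bin_record = {"labels": [
    {"mode_confirm": "shared_ride", "purpose_confirm": "shopping"},
    {"mode_confirm": "shared_ride", "purpose_confirm": "shopping"},
    {"trip_user_input": {"xmlResponse": "<data>...</data>"}},
]}

# Skip any label dict that contains a nested dict value (i.e. a survey entry).
filtered_label_dicts = [label_dict for label_dict in bin_record["labels"]
                        if not any(isinstance(x, dict)
                                   for x in label_dict.values())]

# The resulting dataframe has only multilabel columns with non-dict values,
# so later groupby() calls no longer hit unhashable dicts.
user_label_df = pd.DataFrame(filtered_label_dicts)
```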

Now filtering survey inputs before the dataframe itself is created, by checking whether each dictionary value is a plain value or a nested dictionary.
MukuFlash03 (Contributor, Author) commented Feb 8, 2024:

Possible future issue:

  • The proposed changes mentioned here state that the multilabels would be converted into dictionaries as well, in which case the current fix of ignoring dictionaries would fail, as both multilabels and surveys would be ignored.

While the latest filtering approach works for the current issues, it would fail once we unify the labels and store them as dictionaries.

A possible fix for the future issue could be to keep only the dictionaries whose keys match the multilabel keys, and then convert those to a dataframe.
This could be done by separating multilabel keys = ['mode_confirm', 'purpose_confirm'] from survey keys = ['trip_user_input'] (or, after the proposed changes mentioned here, ['TripConfirmSurvey']).

But this would again run into the original issue of having dictionaries, which breaks the code when applying dataframe functions.

So we will need to think about this while the unification is ongoing.
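One way that key-based separation could look, as a hedged sketch (the key sets are taken from this thread; the unified key name is still undecided, so treat the names as placeholders):

```python
# Assumed key sets; after the proposed unification the survey key
# might become 'TripConfirmSurvey' instead of 'trip_user_input'.
MULTILABEL_KEYS = {"mode_confirm", "purpose_confirm"}
SURVEY_KEYS = {"trip_user_input"}

def split_labels(label_dicts):
    """Split label dicts into (multilabel, survey) lists by their keys."""
    multilabel, survey = [], []
    for d in label_dicts:
        if d.keys() & SURVEY_KEYS:
            survey.append(d)
        elif d.keys() & MULTILABEL_KEYS:
            multilabel.append(d)
    return multilabel, survey
```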


Also, I found the user input keys in this config file, which might be of help if making changes at a global level:
conf/analysis/debug.conf.json.sample

shankari (Contributor):

@MukuFlash03 you still haven't added the TODO to flag this as an area to revisit when we unify.

From #954 (comment)

You could just flag it as a TODO pending design and resolution of e-mission/e-mission-docs#1045

What if we only get to the unification after you leave? What if we only get the unification after I leave? I am happy at NREL but I am not an indentured servant. I can theoretically quit tomorrow and go back to work in industry.

I don't want to hold this up further so I am adding the TODO myself.

shankari (Contributor):

Squash-merging to avoid messy commit history, please pull accordingly

@shankari shankari merged commit 174bfb1 into e-mission:master Feb 10, 2024
5 checks passed
@MukuFlash03 MukuFlash03 deleted the model-build-fail branch April 18, 2024 19:18