
Fix for model-build failure due to presence of survey inputs as a dictionary #954

Merged (4 commits into e-mission:master on Feb 10, 2024)

Conversation

MukuFlash03 (Contributor)

This PR fixes the failures identified in the model-building pipeline in the AWS CloudWatch logs, as highlighted in this issue.

The goal is to fix the "TypeError: unhashable type: 'dict'" error.
The root cause was that some users had data containing survey inputs in the form of a dictionary instead of multilabels, which was breaking the groupby() function applied to a dataframe.

The fix checks the data type using isinstance(), applying the check to the entire dataframe at once instead of iterating over each row, which is much slower. The dict rows are then filtered out of the original dataframe, leaving behind only the non-dict rows. groupby() then operates on this new filtered dataframe, which should now work and build models.

A log statement was also added to greedy_similarity_binning to indicate that filtering of the dictionary elements in the dataframe is being done when the column is 'trip_user_input'.
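As a rough sketch of this filter-then-groupby flow (illustrative only, not the exact PR code; the column values here are made up, and the `unique_labels`/`uniqcount` names are borrowed from the debug logs later in this thread):

```python
import pandas as pd

# Hypothetical user_label_df mixing multilabel rows with a survey row,
# whose 'trip_user_input' value is a dict (the unhashable type).
user_label_df = pd.DataFrame({
    "mode_confirm": ["shared_ride", "shared_ride", None],
    "purpose_confirm": ["shopping", "shopping", None],
    "trip_user_input": [None, None, {"xmlResponse": "<data/>"}],
})

# Vectorized isinstance() check over the whole column instead of a row loop.
keep = user_label_df["trip_user_input"].apply(lambda x: not isinstance(x, dict))
filtered_df = user_label_df[keep]

# groupby() now succeeds because no unhashable dict values remain.
unique_labels = (filtered_df.groupby(["mode_confirm", "purpose_confirm"])
                 .size().reset_index(name="uniqcount"))
```

Without the filter, the groupby() call on the third row's dict value is what raises the "unhashable type: 'dict'" error.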
shankari (Contributor) left a comment:

This needs more thought and documentation (see below).

@@ -295,6 +295,10 @@ def _generate_predictions(self):
# compute unique label sets and their probabilities in one cluster
# 'p' refers to probability
group_cols = user_label_df.columns.tolist()
# Filtering out rows from the user_label_df if they are dictionary objects which come from the survey inputs provided by the users instead of multilabels
if 'trip_user_input' in group_cols:
shankari (Contributor):

I assume that this is a performance check, but you should clarify why you are doing it. Note also that there is a long-term plan to unify multilabel and survey data so that both of them are stored as trip_user_input. That will make the rest of the code much simpler because we don't need to special case input handling in general.

It is OK to not handle that now, since we haven't yet identified exactly what that would look like.
But we should flag a TODO here to handle it in the future, or handle it right now by a more principled check to distinguish between survey and multilabel.

MukuFlash03 (Contributor, Author) commented Feb 2, 2024:

Since I saw that the survey inputs were being stored in trip_user_input, I decided to run the filter operation only on dataframes containing those inputs. So I didn't mean for it to be a performance check, but might be helpful.

With regard to the difference between multilabel and survey, I looked for documentation but didn't really find much info, either in the codebase or on GitHub.
The one explanation I found is from your comment on Satyam's work on "survey assist" in this issue:

"the match for non MULTILABEL programs will be survey responses, in XML and JSON, not a tuple of labels."

I understand we do plan to combine multilabel and survey under one roof, but I mainly have these queries:

  1. Would the dict-type entries for survey inputs be considered for building models in the future, or will they need to be ignored in the future too (in which case the filtering still works)?

  2. On what basis should we distinguish them?

I found some code in emission/tests/analysisTests/userInputTests/TestUserInputFakeData.py.

        # multi entry multilabel
        multi_entry_multilabel = [
            {"metadata": {"key": "manual/mode_confirm", "write_ts": 1, "write_fmt_time": "1"},
             "data": {"start_ts": 8, "label": "foo"}},
            ...]

        # multi entry survey
        multi_entry_survey = [
            {"metadata": {"key": "manual/trip_user_input", "write_ts": 1, "write_fmt_time": "1"},
             "data": {"xmlResponse": "<foo></foo>", "start_ts": 8, "start_fmt_time": 8}},
            ...]

I noticed survey inputs have these keys: manual/trip_user_input, manual/trip_addition_input.
While multilabel inputs have these ones: manual/mode_confirm, manual/purpose_confirm.

However, I see that this is raw data and not processed data present in bins, dataframes or in the part of the model build pipeline in generate_predictions().
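For what it's worth, a more principled check on the raw entries could be keyed off those metadata keys. This is only a sketch (the key sets are assumed from the excerpts above, and as noted, processed dataframes at the generate_predictions() stage no longer carry this metadata):

```python
# Keys observed on multilabel vs survey entries in the raw data above.
MULTILABEL_KEYS = {"manual/mode_confirm", "manual/purpose_confirm"}
SURVEY_KEYS = {"manual/trip_user_input", "manual/trip_addition_input"}

def classify_entry(entry):
    """Return 'multilabel', 'survey', or 'unknown' based on the metadata key."""
    key = entry.get("metadata", {}).get("key")
    if key in MULTILABEL_KEYS:
        return "multilabel"
    if key in SURVEY_KEYS:
        return "survey"
    return "unknown"
```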


MukuFlash03 (Contributor, Author):

The reason I say this is that, at this stage, survey inputs are still reaching this point as a list of dictionaries with XML data:

2024-01-25 19:51:29,027:DEBUG:{'_id': ObjectId('640bd0ee80ea0c11040a8def'), 'user_id': UUID('9c084ef4-2f97-4196-bd37-950c17938ec6'), 'metadata': {'key': 'manual/trip_user_input', 'platform': 'android', 'read_ts': 0, 'time_zone': 'America/Los_Angeles', 'type': 'message', 'write_ts': 1678409128.236, 'write_local_dt': {'year': 2023, 'month': 3, 'day': 9, 'hour': 16, 'minute': 45, 'second': 28, 'weekday': 3, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2023-03-09T16:45:28.236000-08:00'}, 'data': {'label': '1 purpose, 1 mode', 'name': 'TripConfirmSurvey', 'version': 1.2, 'xmlResponse': '<data xmlns:jr="http://openrosa.org/javarosa" xmlns:odk="http://www.opendatakit.org/xforms" xmlns:orx="http://openrosa.org/xforms" id="snapshot_xml">\n          <start>2023-03-09T16:45:05.143-08:00</start>\n          <end>2023-03-09T16:45:05.144-08:00</end>\n          <destination_purpose>buy_something</destination_purpose>\n          <travel_mode>bus</travel_mode>\n          <Total_people_in_trip_party>2</Total_people_in_trip_party>\n          <Non_household_member_s_on_trip>0</Non_household_member_s_on_trip>\n          <Vehicle_trip_Parking_location>1</Vehicle_trip_Parking_location>\n          <Parking_cost>0</Parking_cost>\n          <Total_toll_charges_p_uring_the_trip_AUD>0</Total_toll_charges_p_uring_the_trip_AUD>\n          <Transit_fees_AUD>2.5</Transit_fees_AUD>\n          <Taxi_fees/>\n          <meta>\n            <instanceID>uuid:1faa14c9-7d59-4293-bbaf-6efedda67295</instanceID>\n          </meta>\n        </data>', 'jsonDocResponse': {'data': {'attr': {'xmlns:jr': 'http://openrosa.org/javarosa', 'xmlns:odk': 'http://www.opendatakit.org/xforms', 'xmlns:orx': 'http://openrosa.org/xforms', 'id': 'snapshot_xml'}, 'start': '2023-03-09T16:45:05.143-08:00', 'end': '2023-03-09T16:45:05.144-08:00', 'destination_purpose': 'buy_something', 'travel_mode': 'bus', 'Total_people_in_trip_party': '2', 'Non_household_member_s_on_trip': '0', 'Vehicle_trip_Parking_location': '1', 
'Parking_cost': '0', 'Total_toll_charges_p_uring_the_trip_AUD': '0', 'Transit_fees_AUD': '2.5', 'Taxi_fees': '', 'meta': {'attr': {}, 'instanceID': 'uuid:1faa14c9-7d59-4293-bbaf-6efedda67295'}}}, 'start_ts': 1673820507.3668675, 'end_ts': 1673823369.047, 'match_id': 'e566bdbb-5f6e-42cd-a3ba-ba9cb6201f8d', 'start_local_dt': {'year': 2023, 'month': 1, 'day': 15, 'hour': 14, 'minute': 8, 'second': 27, 'weekday': 6, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2023-01-15T14:08:27.366868-08:00', 'end_local_dt': {'year': 2023, 'month': 1, 'day': 15, 'hour': 14, 'minute': 56, 'second': 9, 'weekday': 6, 'timezone': 'America/Los_Angeles'}, 'end_fmt_time': '2023-01-15T14:56:09.047000-08:00'}}

The multilabel data, in contrast, reaches this stage simply as a list of label tuples:

2024-01-23 16:15:02,195:INFO:map_labels_mode: no replaced mode column found, early return
2024-01-23 16:15:02,200:DEBUG:User_label_df 
     mode_confirm purpose_confirm
0    shared_ride        shopping
1    shared_ride        shopping
2    shared_ride        shopping
3    shared_ride        shopping
4    shared_ride        shopping
..           ...             ...
348  shared_ride        shopping
349  shared_ride        shopping
350  shared_ride        shopping
351  shared_ride        shopping
352  shared_ride        shopping

[353 rows x 2 columns] 

2024-01-23 16:15:02,200:DEBUG:group cols : ['mode_confirm', 'purpose_confirm'] 

2024-01-23 16:15:02,206:DEBUG:unique_labels 
   mode_confirm purpose_confirm  uniqcount
0  shared_ride        shopping        353 

So, I can't really see how to check whether we have multilabels or survey inputs if we do have them both under 'trip_user_input'.

shankari (Contributor):

Since I saw that the survey inputs were being stored in trip_user_input, I decided to run the filter operation only on dataframes containing those inputs. So I didn't mean for it to be a performance check, but might be helpful.

If it is not a performance check, then why did you need it?

  • For surveys the column will be a dict anyway and be caught by the check within the if
  • For non-surveys, the column will not be a dict and will not be filtered

Basically, since you are already handling mixed inputs, it seems like the check for trip_user_input is unnecessary unless you want to use it as a performance fix.

user_label_df = user_label_df.loc[user_label_df['trip_user_input'].apply(lambda x: not isinstance(x, dict))]

Would the dict type entries for survey inputs be considered for building models in the future or will they need to be ignored in the future too (in which case the filtering still works) ?

We will investigate enketo survey-based label assist models once @humbleOldSage is done with switching the label inputs to random forest.

So, I can't really see how to check whether we have multilabels or survey inputs if we do have them both under 'trip_user_input'.

My comment was:

"But we should flag a TODO here to handle it in the future, or handle it right now by a more principled check to distinguish between survey and multilabel"

So your options are:

MukuFlash03 (Contributor, Author):

I understand what you mean now with regard to the if check for the column name.
However, the way I'm filtering the values is by accessing the trip_user_input column of the dataframe, like df[col_name], before applying the filter: user_label_df['trip_user_input'].

As the current code stands, on inspecting the data coming into the dataframe, not all bins have the trip_user_input column; some have only mode_confirm and purpose_confirm as columns.
Without the initial check, indexing raises a KeyError if the column doesn't exist.

Once the unification is done under the trip_user_input column, then yes, removing the if check would be fine, as we can be assured that the 'trip_user_input' column (or whatever unified name we decide on) will definitely exist in all dataframes.
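A sketch of that guarded filter (illustrative only; the function name is made up, and the PR later switched to filtering before dataframe creation):

```python
import pandas as pd

def filter_survey_rows(user_label_df):
    """Drop rows whose 'trip_user_input' value is a dict (a survey response).

    The column-existence guard handles bins that only have
    mode_confirm/purpose_confirm columns; without it,
    user_label_df['trip_user_input'] raises a KeyError for those bins.
    """
    if "trip_user_input" in user_label_df.columns:
        keep = user_label_df["trip_user_input"].apply(
            lambda x: not isinstance(x, dict))
        return user_label_df[keep]
    return user_label_df

# A bin with only multilabel columns passes through unchanged.
multilabel_only = pd.DataFrame({"mode_confirm": ["walk"],
                                "purpose_confirm": ["home"]})
assert filter_survey_rows(multilabel_only).equals(multilabel_only)
```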


P.S. I've modified the filtering approach as mentioned in below comments.

MukuFlash03 (Contributor, Author) commented Feb 8, 2024:

I identified another issue, which results in empty dataframes despite there being multilabel data.

The data is stored like this:

{"mode_confirm": <non-dict string value>, "purpose_confirm": <non-dict string value>}, {"trip_user_input": <dict value>}

But on running pd.DataFrame(bin_record['labels']), an empty dataframe was being generated.
I'm assuming the dataframe generation fails internally on getting a nested dictionary as the value for a column, in this case for the column 'trip_user_input'.

On inspecting the data, I noticed that empty dataframes were being generated for those bins having a combination of mode_confirm, purpose_confirm, and trip_user_input in the list of dictionaries of labels.
When I logged the actual data before it was converted to dataframes, I saw that the data was present in the bins in bin_record['labels'].


To summarize, the currently identified issues are:

  • trip_user_input has nested dictionary values, which fail with pandas dataframe functions.
  • Empty dataframes are generated when bin_record['labels'] has both multilabels and surveys in its list of dictionaries.

MukuFlash03 (Contributor, Author) commented Feb 8, 2024:

To handle both current issues, my fix involves:

I solved the empty-dataframe issue by filtering the list of dictionaries before even converting to a dataframe.
Here, I check for any nested dictionaries and skip them, while adding the non-nested dictionaries corresponding to multilabels to the filtered list.

filtered_label_dicts = [label_dict for label_dict in bin_record['labels'] if not any(isinstance(x, dict) for x in label_dict.values())]

This filtered list of dictionaries is used for the conversion to a dataframe, which should no longer break, as the dataframe will contain only the labels as columns and non-dict values.

On testing locally for a bin and the values contained under bin_record['labels']:

A = 2872 - Total number of entries
B = 2593 - Total number of filtered entries with only mode_confirm, purpose_confirm
C = 279  - Total number of entries with trip_user_input containing survey responses (nested dictionary values along with XML)
A = B + C
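A small self-contained illustration of this pre-dataframe filtering (the sample data is made up; the real local test had A=2872, B=2593, C=279):

```python
import pandas as pd

# Hypothetical bin_record['labels']: two multilabel entries plus one survey
# entry whose value is a nested dictionary.
bin_record = {"labels": [
    {"mode_confirm": "shared_ride", "purpose_confirm": "shopping"},
    {"mode_confirm": "shared_ride", "purpose_confirm": "shopping"},
    {"trip_user_input": {"xmlResponse": "<data>...</data>"}},
]}

# Skip any label dict that contains a nested dict value (i.e. a survey entry).
filtered_label_dicts = [label_dict for label_dict in bin_record["labels"]
                        if not any(isinstance(x, dict)
                                   for x in label_dict.values())]

# The resulting dataframe has only multilabel columns with non-dict values,
# so later groupby() calls no longer hit unhashable dicts.
user_label_df = pd.DataFrame(filtered_label_dicts)
```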

Now filtering survey inputs before the dataframe itself is created, by checking whether each dictionary value is a plain value or a nested dictionary.
MukuFlash03 (Contributor, Author) commented Feb 8, 2024:

Possible future issue:

  • The proposed changes mentioned here state that the multilabels would be converted into dictionaries as well, in which case the current fix of ignoring dictionaries would fail, as both multilabels and surveys would be ignored.

While the latest filtering approach works for the current issues, it would fail once we unify the labels and store them as dictionaries.

A possible fix for the future issue could be to keep only the dictionaries whose keys match the multilabel keys, and then convert those to a dataframe.
This could be done by separating multilabel keys = ['mode_confirm', 'purpose_confirm'] from survey keys = ['trip_user_input'] (or, after the proposed changes mentioned here, ['TripConfirmSurvey']).

But this would again run into the original issue of having dictionaries, which breaks the code when applying dataframe functions.

So we will need to think about this while the unification is ongoing.
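One way that key-based separation could look, as a hedged sketch (the key sets are taken from this thread; the unified key name is still undecided, so treat the names as placeholders):

```python
# Assumed key sets; after the proposed unification the survey key
# might become 'TripConfirmSurvey' instead of 'trip_user_input'.
MULTILABEL_KEYS = {"mode_confirm", "purpose_confirm"}
SURVEY_KEYS = {"trip_user_input"}

def split_labels(label_dicts):
    """Split label dicts into (multilabel, survey) lists by their keys."""
    multilabel, survey = [], []
    for d in label_dicts:
        if d.keys() & SURVEY_KEYS:
            survey.append(d)
        elif d.keys() & MULTILABEL_KEYS:
            multilabel.append(d)
    return multilabel, survey
```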


Also, I found the user input keys in this config file, which might be of help if making changes at a global level:
conf/analysis/debug.conf.json.sample

shankari (Contributor):

@MukuFlash03 you still haven't added the TODO to flag this as an area to revisit when we unify.

From #954 (comment)

You could just flag it as a TODO pending design and resolution of e-mission/e-mission-docs#1045

What if we only get to the unification after you leave? What if we only get the unification after I leave? I am happy at NREL but I am not an indentured servant. I can theoretically quit tomorrow and go back to work in industry.

I don't want to hold this up further so I am adding the TODO myself.

shankari (Contributor):

Squash-merging to avoid messy commit history, please pull accordingly

@shankari shankari merged commit 174bfb1 into e-mission:master Feb 10, 2024
5 checks passed
@MukuFlash03 MukuFlash03 deleted the model-build-fail branch April 18, 2024 19:18