Fix for model-build failure due to presence of survey inputs as a dictionary #954

shankari · 2024-01-31T23:47:09Z

I assume that this is a performance check, but you should clarify why you are doing it. Note also that there is a long-term plan to unify multilabel and survey data so that both of them are stored as trip_user_input. That will make the rest of the code much simpler because we don't need to special case input handling in general.

It is OK to not handle that now, since we haven't yet identified exactly what that would look like.
But we should flag a TODO here to handle it in the future, or handle it right now by a more principled check to distinguish between survey and multilabel.

Since I saw that the survey inputs were being stored in trip_user_input, I decided to run the filter operation only on dataframes containing those inputs. So I didn't mean for it to be a performance check, but might be helpful.

Regarding the difference between multilabel and survey, I looked for documentation but didn't find much, either in the codebase or on GitHub.
The one thing I found is your explanation in Satyam's work on "survey assist" in this issue:

"the match for non MULTILABEL programs will be survey responses, in XML and JSON, not a tuple of labels."

I understand we do plan to combine multilabel and survey under one roof, but I have these queries mainly:

Would the dict type entries for survey inputs be considered for building models in the future or will they need to be ignored in the future too (in which case the filtering still works) ?

On what basis should we distinguish between the two?

I found some code in emission/tests/analysisTests/userInputTests/TestUserInputFakeData.py.

    # multi entry multilabel
    multi_entry_multilabel = [
        {"metadata": {"key": "manual/mode_confirm", "write_ts": 1, "write_fmt_time": "1"},
         "data": {"start_ts": 8, "label": "foo"}},
        ...]

    # multi entry survey
    multi_entry_multilabel = [
        {"metadata": {"key": "manual/trip_user_input", "write_ts": 1, "write_fmt_time": "1"},
         "data": {"xmlResponse": "<foo></foo>", "start_ts": 8, "start_fmt_time": 8}},
        ...]

I noticed survey inputs have these keys: manual/trip_user_input, manual/trip_addition_input.
While multilabel inputs have these ones: manual/mode_confirm, manual/purpose_confirm.
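
As a rough illustration of that key-based split on raw entries, here is a minimal sketch. The helper name `classify_raw_input` and the two key sets are assumptions for illustration; they only mirror the keys listed above and are not an existing function in the codebase.

```python
# Hypothetical helper: the key sets below only mirror the keys listed in this comment.
SURVEY_KEYS = {"manual/trip_user_input", "manual/trip_addition_input"}
MULTILABEL_KEYS = {"manual/mode_confirm", "manual/purpose_confirm"}


def classify_raw_input(entry):
    """Return 'survey', 'multilabel', or 'unknown' based on the entry's metadata key."""
    key = entry.get("metadata", {}).get("key")
    if key in SURVEY_KEYS:
        return "survey"
    if key in MULTILABEL_KEYS:
        return "multilabel"
    return "unknown"


# Sample entries shaped like the fake test data quoted above
multilabel_entry = {
    "metadata": {"key": "manual/mode_confirm", "write_ts": 1, "write_fmt_time": "1"},
    "data": {"start_ts": 8, "label": "foo"},
}
survey_entry = {
    "metadata": {"key": "manual/trip_user_input", "write_ts": 1, "write_fmt_time": "1"},
    "data": {"xmlResponse": "<foo></foo>", "start_ts": 8, "start_fmt_time": 8},
}

print(classify_raw_input(multilabel_entry))
print(classify_raw_input(survey_entry))
```

This only works while the metadata key is still attached to the entry, which is the limitation described next.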

However, this is raw data, not the processed data present in bins, in dataframes, or in the part of the model-build pipeline in generate_predictions().

I say this because, by the time they reach this stage, survey inputs are still a list of dictionaries containing XML data:

2024-01-25 19:51:29,027:DEBUG:{'_id': ObjectId('640bd0ee80ea0c11040a8def'), 'user_id': UUID('9c084ef4-2f97-4196-bd37-950c17938ec6'), 'metadata': {'key': 'manual/trip_user_input', 'platform': 'android', 'read_ts': 0, 'time_zone': 'America/Los_Angeles', 'type': 'message', 'write_ts': 1678409128.236, 'write_local_dt': {'year': 2023, 'month': 3, 'day': 9, 'hour': 16, 'minute': 45, 'second': 28, 'weekday': 3, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2023-03-09T16:45:28.236000-08:00'}, 'data': {'label': '1 purpose, 1 mode', 'name': 'TripConfirmSurvey', 'version': 1.2, 'xmlResponse': '<data xmlns:jr="http://openrosa.org/javarosa" xmlns:odk="http://www.opendatakit.org/xforms" xmlns:orx="http://openrosa.org/xforms" id="snapshot_xml">\n <start>2023-03-09T16:45:05.143-08:00</start>\n <end>2023-03-09T16:45:05.144-08:00</end>\n <destination_purpose>buy_something</destination_purpose>\n <travel_mode>bus</travel_mode>\n <Total_people_in_trip_party>2</Total_people_in_trip_party>\n <Non_household_member_s_on_trip>0</Non_household_member_s_on_trip>\n <Vehicle_trip_Parking_location>1</Vehicle_trip_Parking_location>\n <Parking_cost>0</Parking_cost>\n <Total_toll_charges_p_uring_the_trip_AUD>0</Total_toll_charges_p_uring_the_trip_AUD>\n <Transit_fees_AUD>2.5</Transit_fees_AUD>\n <Taxi_fees/>\n <meta>\n <instanceID>uuid:1faa14c9-7d59-4293-bbaf-6efedda67295</instanceID>\n </meta>\n </data>', 'jsonDocResponse': {'data': {'attr': {'xmlns:jr': 'http://openrosa.org/javarosa', 'xmlns:odk': 'http://www.opendatakit.org/xforms', 'xmlns:orx': 'http://openrosa.org/xforms', 'id': 'snapshot_xml'}, 'start': '2023-03-09T16:45:05.143-08:00', 'end': '2023-03-09T16:45:05.144-08:00', 'destination_purpose': 'buy_something', 'travel_mode': 'bus', 'Total_people_in_trip_party': '2', 'Non_household_member_s_on_trip': '0', 'Vehicle_trip_Parking_location': '1', 'Parking_cost': '0', 'Total_toll_charges_p_uring_the_trip_AUD': '0', 'Transit_fees_AUD': '2.5', 'Taxi_fees': '', 'meta': {'attr': {}, 
'instanceID': 'uuid:1faa14c9-7d59-4293-bbaf-6efedda67295'}}}, 'start_ts': 1673820507.3668675, 'end_ts': 1673823369.047, 'match_id': 'e566bdbb-5f6e-42cd-a3ba-ba9cb6201f8d', 'start_local_dt': {'year': 2023, 'month': 1, 'day': 15, 'hour': 14, 'minute': 8, 'second': 27, 'weekday': 6, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2023-01-15T14:08:27.366868-08:00', 'end_local_dt': {'year': 2023, 'month': 1, 'day': 15, 'hour': 14, 'minute': 56, 'second': 9, 'weekday': 6, 'timezone': 'America/Los_Angeles'}, 'end_fmt_time': '2023-01-15T14:56:09.047000-08:00'}}

The multilabel data, by contrast, arrives simply as labels, a list of tuples:

2024-01-23 16:15:02,195:INFO:map_labels_mode: no replaced mode column found, early return
2024-01-23 16:15:02,200:DEBUG:User_label_df
    mode_confirm purpose_confirm
0    shared_ride        shopping
1    shared_ride        shopping
2    shared_ride        shopping
3    shared_ride        shopping
4    shared_ride        shopping
..           ...             ...
348  shared_ride        shopping
349  shared_ride        shopping
350  shared_ride        shopping
351  shared_ride        shopping
352  shared_ride        shopping

[353 rows x 2 columns]
2024-01-23 16:15:02,200:DEBUG:group cols : ['mode_confirm', 'purpose_confirm']
2024-01-23 16:15:02,206:DEBUG:unique_labels
  mode_confirm purpose_confirm  uniqcount
0  shared_ride        shopping        353

So, I can't really see how to check whether we have multilabels or survey inputs if we do have them both under 'trip_user_input'.

Since I saw that the survey inputs were being stored in trip_user_input, I decided to run the filter operation only on dataframes containing those inputs. So I didn't mean for it to be a performance check, but might be helpful.

If it is not a performance check, then why did you need it?

For surveys, the column will be a dict anyway and will be caught by the check within the if.

For non-surveys, the column will not be a dict and will not be filtered.

Basically, since you are already handling mixed inputs, it seems like the check for trip_user_input is unnecessary unless you want to use it as a performance fix.

user_label_df = user_label_df.loc[user_label_df['trip_user_input'].apply(lambda x: not isinstance(x, dict))]

Would the dict type entries for survey inputs be considered for building models in the future or will they need to be ignored in the future too (in which case the filtering still works) ?

We will investigate enketo survey-based label assist models once @humbleOldSage is done with switching the label inputs to random forest.

So, I can't really see how to check whether we have multilabels or survey inputs if we do have them both under 'trip_user_input'.

My comment was:

"But we should flag a TODO here to handle it in the future, or handle it right now by a more principled check to distinguish between survey and multilabel"

So your options are:

You could just flag it as a TODO pending design and resolution of "Unify user_input data model for different configurations" (e-mission-docs#1045)

You could use the dict check without the initial trip_user_input check

You could check to see what the keys of the dict are....
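
For the last option, a minimal sketch of what a key-based check might look like. The function name `looks_like_survey_response` is hypothetical, and the field names are an assumption taken from the logged survey entry earlier in this thread (`xmlResponse` and `jsonDocResponse` under `data`); the eventual unified data model may differ.

```python
# Assumption: survey responses carry at least one of these fields,
# as seen in the logged 'manual/trip_user_input' entry in this thread.
SURVEY_RESPONSE_FIELDS = {"xmlResponse", "jsonDocResponse"}


def looks_like_survey_response(value):
    """Return True if `value` is a dict that carries survey response fields.

    Accepts either a full entry (fields nested under 'data') or the
    'data' payload itself.
    """
    if not isinstance(value, dict):
        return False
    payload = value.get("data", value)
    if not isinstance(payload, dict):
        return False
    return any(field in payload for field in SURVEY_RESPONSE_FIELDS)


# A full entry, a bare payload, a plain label, and an unrelated dict
print(looks_like_survey_response({"data": {"xmlResponse": "<foo></foo>"}}))
print(looks_like_survey_response({"jsonDocResponse": {"data": {}}}))
print(looks_like_survey_response("shared_ride"))
print(looks_like_survey_response({"label": "foo"}))
```

Compared with a bare `isinstance(x, dict)` test, this would leave dict values that are not survey responses untouched, at the cost of depending on field names that have not been finalized.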

I understand what you mean now, with regards to the if check for the column name.
However, the way I’m filtering the values is by accessing the trip_user_input column of the data frame like df[col_name] before applying the filter: user_label_df['trip_user_input']

As the current code stands, and from inspecting the data coming into the dataframe, not all bins have the trip_user_input column; some have only mode_confirm and purpose_confirm as columns.
Without the initial check it raises a key error if the column doesn’t exist.
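
To make the KeyError point concrete, here is a small sketch with toy dataframes. The helper `drop_survey_rows` and the sample values are made up for illustration; only the column names and the filter expression come from this PR.

```python
import pandas as pd


def drop_survey_rows(user_label_df, col="trip_user_input"):
    """Drop rows whose `col` value is a dict; return the frame unchanged if `col` is absent."""
    if col not in user_label_df.columns:
        # Bins with only mode_confirm / purpose_confirm land here
        return user_label_df
    keep = user_label_df[col].apply(lambda x: not isinstance(x, dict))
    return user_label_df.loc[keep]


# Toy bin with multilabel columns only (no trip_user_input column)
multilabel_only = pd.DataFrame({
    "mode_confirm": ["shared_ride", "shared_ride"],
    "purpose_confirm": ["shopping", "shopping"],
})

# Toy bin where trip_user_input holds one survey dict and one non-dict value
mixed = pd.DataFrame({
    "trip_user_input": [{"data": {"xmlResponse": "<foo></foo>"}}, "some_label"],
})

print(drop_survey_rows(multilabel_only))
print(drop_survey_rows(mixed))
```

Indexing `multilabel_only["trip_user_input"]` directly raises a `KeyError`, which is why the column-name check comes first.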

Once the unification is done under the trip_user_input column, then yes, removing the if check would be alright, as we can be assured that the ‘trip_user_input’ column (or whatever unified name we decide on) will definitely exist in all data frames.

P.S. I've modified the filtering approach as mentioned in the comments below.

@@ -295,6 +295,10 @@ def _generate_predictions(self):
                 # compute unique label sets and their probabilities in one cluster
                 # 'p' refers to probability
                 group_cols = user_label_df.columns.tolist()
+                # Filtering out rows from the user_label_df if they are dictionary objects which come from the survey inputs provided by the users instead of multilabels
+                if 'trip_user_input' in group_cols:
+                    logging.debug("Filtering out any dictionary rows from the dataframe provided as survey inputs")
+                    user_label_df = user_label_df.loc[user_label_df['trip_user_input'].apply(lambda x: not isinstance(x, dict))]
                 unique_labels = user_label_df.groupby(group_cols).size().reset_index(name='uniqcount')
                 unique_labels['p'] = unique_labels.uniqcount / sum_trips
                 labels_columns = user_label_df.columns.to_list()

shankari Jan 31, 2024

MukuFlash03 Feb 2, 2024 (edited)

MukuFlash03 Feb 2, 2024

shankari Feb 7, 2024

MukuFlash03 Feb 8, 2024
