Fix for model-build failure due to presence of survey inputs as a dictionary #954

Merged
merged 4 commits into from
Feb 10, 2024
Changes from 2 commits
@@ -295,6 +295,10 @@ def _generate_predictions(self):
# compute unique label sets and their probabilities in one cluster
# 'p' refers to probability
group_cols = user_label_df.columns.tolist()
# Filtering out rows from the user_label_df if they are dictionary objects which come from the survey inputs provided by the users instead of multilabels
if 'trip_user_input' in group_cols:
Contributor

I assume that this is a performance check, but you should clarify why you are doing it. Note also that there is a long-term plan to unify multilabel and survey data so that both of them are stored as trip_user_input. That will make the rest of the code much simpler because we don't need to special case input handling in general.

It is OK to not handle that now, since we haven't yet identified exactly what that would look like.
But we should flag a TODO here to handle it in the future, or handle it right now by a more principled check to distinguish between survey and multilabel.

Contributor Author

@MukuFlash03 MukuFlash03 Feb 2, 2024

Since I saw that the survey inputs were being stored in trip_user_input, I decided to run the filter operation only on dataframes containing those inputs. So I didn't mean for it to be a performance check, but might be helpful.

With regard to the difference between multilabel and survey, I looked for documentation but didn't find much, either in the codebase or on GitHub.
The one thing I found is your explanation in Satyam's work on "survey assist" in this issue:

"the match for non MULTILABEL programs will be survey responses, in XML and JSON, not a tuple of labels."

I understand we do plan to combine multilabel and survey under one roof, but I mainly have these questions:

  1. Would the dict type entries for survey inputs be considered for building models in the future or will they need to be ignored in the future too (in which case the filtering still works) ?

  2. On what basis should we distinguish between the two?

I found some code in emission/tests/analysisTests/userInputTests/TestUserInputFakeData.py.

        # multi entry multilabel
        multi_entry_multilabel = [
            {"metadata": {"key": "manual/mode_confirm", "write_ts": 1, "write_fmt_time": "1"},
                "data": {"start_ts": 8, "label": "foo"}},
       ...]

        # multi entry survey
        multi_entry_multilabel = [
            {"metadata": {"key": "manual/trip_user_input", "write_ts": 1, "write_fmt_time": "1"},
                "data": {"xmlResponse": "<foo></foo>", "start_ts": 8, "start_fmt_time": 8}},
        ...]

I noticed that survey inputs have these keys: manual/trip_user_input, manual/trip_addition_input.
Multilabel inputs have these: manual/mode_confirm, manual/purpose_confirm.

However, this is raw data, not the processed data that appears in bins, in dataframes, or in the part of the model-build pipeline in generate_predictions().
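
For the raw entries, a distinction based only on the metadata keys listed above could be sketched as follows. The helper name `classify_user_input` and the two key sets are assumptions made for illustration and are not taken from the codebase. This only covers raw entries, so it would not by itself answer the question for the processed dataframes.

```python
# Hypothetical helper: classify a raw user-input entry by its metadata key.
# The key sets mirror the keys seen in the test fixtures quoted above;
# the function itself is illustrative and not part of the codebase.
SURVEY_KEYS = {"manual/trip_user_input", "manual/trip_addition_input"}
MULTILABEL_KEYS = {"manual/mode_confirm", "manual/purpose_confirm"}


def classify_user_input(entry):
    """Return 'survey', 'multilabel', or 'unknown' for a raw entry dict."""
    key = entry.get("metadata", {}).get("key")
    if key in SURVEY_KEYS:
        return "survey"
    if key in MULTILABEL_KEYS:
        return "multilabel"
    return "unknown"


# Entries shaped like the fixtures in TestUserInputFakeData.py
multilabel_entry = {
    "metadata": {"key": "manual/mode_confirm", "write_ts": 1, "write_fmt_time": "1"},
    "data": {"start_ts": 8, "label": "foo"},
}
survey_entry = {
    "metadata": {"key": "manual/trip_user_input", "write_ts": 1, "write_fmt_time": "1"},
    "data": {"xmlResponse": "<foo></foo>", "start_ts": 8, "start_fmt_time": 8},
}

print(classify_user_input(multilabel_entry))
print(classify_user_input(survey_entry))
```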


Contributor Author

The reason I say this is that survey inputs still reach this stage as a list of dictionaries with XML data:

2024-01-25 19:51:29,027:DEBUG:{'_id': ObjectId('640bd0ee80ea0c11040a8def'), 'user_id': UUID('9c084ef4-2f97-4196-bd37-950c17938ec6'), 'metadata': {'key': 'manual/trip_user_input', 'platform': 'android', 'read_ts': 0, 'time_zone': 'America/Los_Angeles', 'type': 'message', 'write_ts': 1678409128.236, 'write_local_dt': {'year': 2023, 'month': 3, 'day': 9, 'hour': 16, 'minute': 45, 'second': 28, 'weekday': 3, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2023-03-09T16:45:28.236000-08:00'}, 'data': {'label': '1 purpose, 1 mode', 'name': 'TripConfirmSurvey', 'version': 1.2, 'xmlResponse': '<data xmlns:jr="http://openrosa.org/javarosa" xmlns:odk="http://www.opendatakit.org/xforms" xmlns:orx="http://openrosa.org/xforms" id="snapshot_xml">\n          <start>2023-03-09T16:45:05.143-08:00</start>\n          <end>2023-03-09T16:45:05.144-08:00</end>\n          <destination_purpose>buy_something</destination_purpose>\n          <travel_mode>bus</travel_mode>\n          <Total_people_in_trip_party>2</Total_people_in_trip_party>\n          <Non_household_member_s_on_trip>0</Non_household_member_s_on_trip>\n          <Vehicle_trip_Parking_location>1</Vehicle_trip_Parking_location>\n          <Parking_cost>0</Parking_cost>\n          <Total_toll_charges_p_uring_the_trip_AUD>0</Total_toll_charges_p_uring_the_trip_AUD>\n          <Transit_fees_AUD>2.5</Transit_fees_AUD>\n          <Taxi_fees/>\n          <meta>\n            <instanceID>uuid:1faa14c9-7d59-4293-bbaf-6efedda67295</instanceID>\n          </meta>\n        </data>', 'jsonDocResponse': {'data': {'attr': {'xmlns:jr': 'http://openrosa.org/javarosa', 'xmlns:odk': 'http://www.opendatakit.org/xforms', 'xmlns:orx': 'http://openrosa.org/xforms', 'id': 'snapshot_xml'}, 'start': '2023-03-09T16:45:05.143-08:00', 'end': '2023-03-09T16:45:05.144-08:00', 'destination_purpose': 'buy_something', 'travel_mode': 'bus', 'Total_people_in_trip_party': '2', 'Non_household_member_s_on_trip': '0', 'Vehicle_trip_Parking_location': '1', 
'Parking_cost': '0', 'Total_toll_charges_p_uring_the_trip_AUD': '0', 'Transit_fees_AUD': '2.5', 'Taxi_fees': '', 'meta': {'attr': {}, 'instanceID': 'uuid:1faa14c9-7d59-4293-bbaf-6efedda67295'}}}, 'start_ts': 1673820507.3668675, 'end_ts': 1673823369.047, 'match_id': 'e566bdbb-5f6e-42cd-a3ba-ba9cb6201f8d', 'start_local_dt': {'year': 2023, 'month': 1, 'day': 15, 'hour': 14, 'minute': 8, 'second': 27, 'weekday': 6, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2023-01-15T14:08:27.366868-08:00', 'end_local_dt': {'year': 2023, 'month': 1, 'day': 15, 'hour': 14, 'minute': 56, 'second': 9, 'weekday': 6, 'timezone': 'America/Los_Angeles'}, 'end_fmt_time': '2023-01-15T14:56:09.047000-08:00'}}

In contrast, the multilabel data arrives with the labels simply as a list of tuples:

2024-01-23 16:15:02,195:INFO:map_labels_mode: no replaced mode column found, early return
2024-01-23 16:15:02,200:DEBUG:User_label_df 
     mode_confirm purpose_confirm
0    shared_ride        shopping
1    shared_ride        shopping
2    shared_ride        shopping
3    shared_ride        shopping
4    shared_ride        shopping
..           ...             ...
348  shared_ride        shopping
349  shared_ride        shopping
350  shared_ride        shopping
351  shared_ride        shopping
352  shared_ride        shopping

[353 rows x 2 columns] 

2024-01-23 16:15:02,200:DEBUG:group cols : ['mode_confirm', 'purpose_confirm'] 

2024-01-23 16:15:02,206:DEBUG:unique_labels 
   mode_confirm purpose_confirm  uniqcount
0  shared_ride        shopping        353 

So, I can't really see how to check whether we have multilabels or survey inputs if we do have them both under 'trip_user_input'.

Contributor

"Since I saw that the survey inputs were being stored in trip_user_input, I decided to run the filter operation only on dataframes containing those inputs. So I didn't mean for it to be a performance check, but might be helpful."

If it is not a performance check, then why did you need it?

  • For surveys the column will be a dict anyway and be caught by the check within the if
  • For non-surveys, the column will not be a dict and will not be filtered

Basically, since you are already handling mixed inputs, it seems like the check for trip_user_input is unnecessary unless you want to use it as a performance fix.

user_label_df = user_label_df.loc[user_label_df['trip_user_input'].apply(lambda x: not isinstance(x, dict))]

"Would the dict type entries for survey inputs be considered for building models in the future or will they need to be ignored in the future too (in which case the filtering still works) ?"

We will investigate enketo survey-based label assist models once @humbleOldSage is done with switching the label inputs to random forest.

"So, I can't really see how to check whether we have multilabels or survey inputs if we do have them both under 'trip_user_input'."

My comment was:

"But we should flag a TODO here to handle it in the future, or handle it right now by a more principled check to distinguish between survey and multilabel"

So your options are:

Contributor Author

I understand what you mean now regarding the if check for the column name.
However, I filter the values by accessing the trip_user_input column of the dataframe, as in df[col_name], before applying the filter: user_label_df['trip_user_input']

As the code currently stands, and from inspecting the data coming into the dataframe, not all bins have the trip_user_input column; some have only mode_confirm and purpose_confirm as columns.
Without the initial check, it raises a KeyError if the column doesn't exist.

Once the unification under the trip_user_input column is done, then yes, removing the if check would be fine, since we can be sure that the 'trip_user_input' column (or whatever unified name we decide on) exists in all dataframes.
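
To illustrate the point about the missing column, here is a minimal sketch of the guarded filter. The helper name `drop_survey_rows` and the sample dataframes are assumptions made up for this example; only the column names and the `isinstance` filter come from the discussion above.

```python
import pandas as pd


def drop_survey_rows(df, col="trip_user_input"):
    """Drop rows whose `col` value is a dict; return df unchanged if `col` is absent.

    Hypothetical helper for illustration only.
    """
    if col not in df.columns:
        # Without this guard, df[col] would raise a KeyError for
        # dataframes that only carry the multilabel columns.
        return df
    keep = df[col].apply(lambda x: not isinstance(x, dict))
    return df.loc[keep]


# A bin with only multilabel columns (no trip_user_input column at all).
multilabel_df = pd.DataFrame({
    "mode_confirm": ["shared_ride", "shared_ride"],
    "purpose_confirm": ["shopping", "shopping"],
})

# A bin where trip_user_input holds one survey dict and one non-dict value.
# The non-dict value is a made-up placeholder.
mixed_df = pd.DataFrame({
    "trip_user_input": [{"data": {"name": "TripConfirmSurvey"}}, "placeholder_label"],
})

print(drop_survey_rows(multilabel_df))
print(drop_survey_rows(mixed_df))
```

With the guard in place, the same call works on both kinds of bins, which is the behaviour the column-name check provides in the diff.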


P.S. I've modified the filtering approach as mentioned in the comments below.

logging.debug("Filtering out any dictionary rows from the dataframe provided as survey inputs")
user_label_df = user_label_df.loc[user_label_df['trip_user_input'].apply(lambda x: not isinstance(x, dict))]
shankari marked this conversation as resolved.
unique_labels = user_label_df.groupby(group_cols).size().reset_index(name='uniqcount')
unique_labels['p'] = unique_labels.uniqcount / sum_trips
labels_columns = user_label_df.columns.to_list()
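
As a rough illustration of what the grouping lines above compute, the following standalone sketch runs them on a small made-up dataframe. The sample rows are invented, and `sum_trips` is assumed here to be the number of rows in the cluster; in the real method it is defined outside the lines shown in this diff.

```python
import pandas as pd

# Made-up labels for one cluster (illustrative data only).
user_label_df = pd.DataFrame({
    "mode_confirm": ["shared_ride", "shared_ride", "bike", "shared_ride"],
    "purpose_confirm": ["shopping", "shopping", "work", "shopping"],
})

group_cols = user_label_df.columns.tolist()
sum_trips = len(user_label_df)  # assumption: total number of trips in the cluster

# Count each unique label combination, then convert counts to probabilities.
unique_labels = user_label_df.groupby(group_cols).size().reset_index(name="uniqcount")
unique_labels["p"] = unique_labels.uniqcount / sum_trips

print(unique_labels)
```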