Testing on staging for loading models and improving scalability #1039

Open
MukuFlash03 opened this issue Dec 20, 2023 · 14 comments

@MukuFlash03
Contributor

Notes on Testing on Staging

*** Moving stuff over from this issue + added more findings ***


I learned that the staging environment we have set up internally isn't something you can just execute and test.
It has already been executed, and we can observe the logs directly to see whether things are working properly.

I set the time frame for searching from 12/03 to 12/16.
Since this implementation included model loading for the label inference pipeline stage, I figured the correct CloudWatch log group to observe would be: openpath-stage-analysis-intake

It looks like the model isn't loaded for the two users present in the time range checked, from 12/03 to 12/16.
Additionally, some of my print statements are not being logged. I am not sure whether that is because I used print() instead of logging.debug(), since the pipeline stage messages with (***) are also logged using print().


@MukuFlash03
Contributor Author

MukuFlash03 commented Dec 20, 2023

A. Print statements not seen in AWS CloudWatch logs but visible locally

Here, I tried searching for the print statements I had added for model loading time; however, I did not find them, which suggests that the model loading was not executed:

  1. Line 171 in emission.pipeline.intake_stage.py: "inferring labels"
    This is printed using print() statements and is seen in the logs.
    These are the two unique users for whom the LABEL INFERENCE stage is run and who do not have a model.
  2. Line 56 in eam.trip_model.model_storage.py: "no GREEDY_SIMILARITY_BINNING model found for user ..."
  3. Line 187 in eacil.inferrers.py: "In predict_cluster_confidence_discounting: .... returning as-is"
    This is printed using logging.debug() statements and is seen in the logs.
    It is printed when no model data is found.
  4. Line 181 in eacil.inferrers.py: "Inside predict_labels_n: ... Model load time"
    This is printed using print() statements and is supposed to show up, but I cannot see it in the logs.

@MukuFlash03
Contributor Author

MukuFlash03 commented Dec 20, 2023

I did some more testing by re-running the tests locally and running the model load mechanism, to debug why the statements are not being printed.

I had the following observations:

  1. I can see the statements logged using simple print() on my local system:
2023-12-18T18:05:01.322225-07:00**********UUID 44a5d9a6-90eb-4dec-a9a9-44005f19951b: inferring labels**********
2023-12-18T18:05:01.334133-07:00 Inside predict_cluster_confidence_discounting: Model load time = 0.00089900000000398
2023-12-18T18:05:01.334184-07:00 Inside predict_labels_n: Predicting...
2023-12-18T18:05:01.334364-07:00 Inside predict_labels_n: Predictions complete for trip_list in time = 0.00017300000001796434
2023-12-18T18:05:01.369309-07:00**********UUID 44a5d9a6-90eb-4dec-a9a9-44005f19951b: populating expectations**********

These print statements are not inside any if-else or other conditional block, so I am unclear why I cannot see them in the logs.

For the next observation, I tried replacing print() with logging.debug(), which does not log anything at all for the statements I added.


  2. The debug statements are not logged when I use logging.debug():
2023-12-18T15:19:19.407894-07:00**********UUID be840053-5620-4207-b9ac-92cbaa28f27a: inferring transportation mode**********
2023-12-18T15:19:33.574887-07:00**********UUID be840053-5620-4207-b9ac-92cbaa28f27a: inferring labels**********
2023-12-18T15:19:33.620778-07:00**********UUID be840053-5620-4207-b9ac-92cbaa28f27a: populating expectations**********

This is strange, since in the same file (eacil.inferrers.py) there is another set of logging.debug() statements that are logged, while the ones I added are not.

    model = eamur._load_stored_trip_model(user_id, model_type, model_storage)

    # Does not get logged; the log statements below in the for loop, however, are logged.
    logging.debug(f"{arrow.now()} Inside predict_cluster_confidence_discounting: Model load time = {time.process_time() - start_model_load_time}")

    labels_n_list = eamur.predict_labels_with_n(trip_list, model)
    predictions_list = []
    for labels, n in labels_n_list:
        if n <= 0:  # No model data or trip didn't match a cluster
            logging.debug(f"In predict_cluster_confidence_discounting: n={n}; returning as-is")
            predictions_list.append(labels)
            continue
        confidence_coeff = n_to_confidence_coeff(n, max_confidence, first_confidence, confidence_multiplier)
        logging.debug(f"In predict_cluster_confidence_discounting: n={n}; discounting with coefficient {confidence_coeff}")

@MukuFlash03
Contributor Author

B. Exception while building / loading model

I found some exceptions when checking the openpath-stage-analysis-build-trip-model log group.
I believe this relates to building / storing the trip model in eamur.run_model.py.


The traceback below was logged at 2023-12-16T18:00:17-07:00:

    Traceback (most recent call last):
      File "/usr/src/app/emission/analysis/modelling/trip_model/run_model.py", line 77, in update_trip_model
        model.fit(trips)
      File "/usr/src/app/emission/analysis/modelling/trip_model/greedy_similarity_binning.py", line 146, in fit
        self._generate_predictions()
      File "/usr/src/app/emission/analysis/modelling/trip_model/greedy_similarity_binning.py", line 298, in _generate_predictions
        unique_labels = user_label_df.groupby(group_cols).size().reset_index(name='uniqcount')
      File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 2420, in size
        result = self.grouper.size()
      File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pandas/core/groupby/ops.py", line 912, in size
        ids, _, ngroups = self.group_info
      File "pandas/_libs/properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
      File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pandas/core/groupby/ops.py", line 946, in group_info
        comp_ids, obs_group_ids = self._get_compressed_codes()
      File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pandas/core/groupby/ops.py", line 977, in _get_compressed_codes
        return ping.codes, np.arange(len(ping.group_index), dtype=np.intp)
      File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pandas/core/groupby/grouper.py", line 621, in codes
        return self._codes_and_uniques[0]
      File "pandas/_libs/properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
      File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pandas/core/groupby/grouper.py", line 692, in _codes_and_uniques
        codes, uniques = algorithms.factorize(  # type: ignore[assignment]
      File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pandas/core/algorithms.py", line 822, in factorize
        codes, uniques = factorize_array(
      File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pandas/core/algorithms.py", line 578, in factorize_array
        uniques, codes = table.factorize(
      File "pandas/_libs/hashtable_class_helper.pxi", line 5943, in pandas._libs.hashtable.PyObjectHashTable.factorize
      File "pandas/_libs/hashtable_class_helper.pxi", line 5857, in pandas._libs.hashtable.PyObjectHashTable._unique
    TypeError: unhashable type: 'dict'

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "/usr/src/app/bin/build_label_model.py", line 93, in <module>
        eamur.update_trip_model(user_id, model_type, model_storage, min_trips)
      File "/usr/src/app/emission/analysis/modelling/trip_model/run_model.py", line 96, in update_trip_model
        raise IOError(msg) from e
    OSError: failure updating user label pipeline state for user 520819a4-abc4-437a-bb9f-7e2ab4b12767

@MukuFlash03
Contributor Author

I followed the Traceback exception stack and took a look at these two files:

  1. eamur.run_model.py - uses model.fit() whose implementation is in the below file.
  2. eamur.greedy_similarity_binning.py - error occurring in _generate_predictions().

The error is TypeError: unhashable type: 'dict', which occurs at line 298 (as given by the trace), where unique_labels is assigned a value.

    def _generate_predictions(self):
        """
        helper function to transform binned features and labels into predictions.
        taken from [https://github.com/e-mission/e-mission-server/blob/10772f892385d44e11e51e796b0780d8f6609a2c/emission/analysis/modelling/tour_model_first_only/build_save_model.py#L40]

        for each bin, the unique label combinations are counted. their 
        probability is estimated with label_count / total_labels.
        """
        for _, bin_record in self.bins.items():
            user_label_df = pd.DataFrame(bin_record['labels'])
            user_label_df = lp.map_labels(user_label_df).dropna()
            # compute the sum of trips in this cluster
            sum_trips = len(user_label_df)
            # compute unique label sets and their probabilities in one cluster
            # 'p' refers to probability
            group_cols = user_label_df.columns.tolist()
            unique_labels = user_label_df.groupby(group_cols).size().reset_index(name='uniqcount')

It seems the error is in the way the dataframe user_label_df is grouped: pandas hashes the grouping values internally, and a dictionary is unhashable, so groupby() fails when a cell contains a dict.
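
For reference, a minimal standalone sketch (with made-up data) reproduces the same failure mode:

    import pandas as pd

    # Made-up data mimicking a bin whose trip_user_input column holds dicts.
    df = pd.DataFrame({
        "mode_confirm": ["shared_ride", "shared_ride"],
        "trip_user_input": [{"_id": "a"}, {"_id": "b"}],
    })
    # groupby() has to hash the grouping values, and dicts are unhashable,
    # so this raises: TypeError: unhashable type: 'dict'
    df.groupby(df.columns.tolist()).size()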

@shankari
Contributor

@MukuFlash03 now that this has been deployed on production, is this happening on production systems as well? If so, this is a showstopper and we have to drop everything and fix it. If it is not happening on production, you need to investigate why this is happening.

I have shared a fairly recent staging snapshot with you; you can use it to run locally and see what is going on. In parallel, you can add additional logs to debug directly on staging.

@shankari
Contributor

shankari commented Jan 21, 2024

@MukuFlash03 it would also be helpful, while checking "is this happening on production systems as well?", to quantify the impact of the change. What was the % improvement in load times? The % improvement in this stage?
As you can see from the pipeline, we track the time spent in each stage in the client stats, so if you write a script to quantify this, I can run it on production and get the numbers out.
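
Something along these lines should be enough (a rough sketch, assuming the per-stage timings are stored as "stats/pipeline_time" entries whose data has "name" and "reading" fields; adjust to the actual schema):

    import pandas as pd
    import emission.core.get_database as edb
    import emission.storage.timeseries.abstract_timeseries as esta

    # Rough sketch: collect the label inference stage timings for every user.
    # Assumes stage timings live in "stats/pipeline_time" with "name"/"reading" fields.
    stage_dfs = []
    for ue in edb.get_uuid_db().find():
        ts = esta.TimeSeries.get_time_series(ue["uuid"])
        df = ts.get_data_df("stats/pipeline_time")
        if not df.empty:
            stage_dfs.append(df[df["name"] == "LABEL_INFERENCE"])

    if stage_dfs:
        all_runs = pd.concat(stage_dfs)
        # Splitting this by the deployment date would give the % improvement.
        print(all_runs["reading"].describe())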

@MukuFlash03
Contributor Author

I took a look at the CloudWatch logs for the openpath-stage-analysis-build-trip-model log group, in which the error was seen in the staging environment.
I searched for the query string "TypeError: unhashable type: 'dict'" in all log streams for the production log groups.
I found the error in the log groups of two production systems.


The other production systems' log groups did not have this specific error, but they did have some errors relating to insufficient or low storage space.
I came across these when querying for "OSError" instead of "TypeError", which is the error currently under consideration.

@MukuFlash03
Contributor Author

In the staging logs, I saw three distinct UUIDs that were encountering this error:
[UUID("9c084ef4-2f97-4196-bd37-950c17938ec6"), UUID("79ad933d-b6ee-4ee0-9d53-5e4f0c7e5f61"), UUID("520819a4-abc4-437a-bb9f-7e2ab4b12767")]

Of these, the first two were present as valid users in the earlier staging snapshot dataset from October that I've been testing with.
So I proceeded to debug while focusing on these two.

I found information about the model build pipeline here and launched the pipeline for each of these users:

./e-mission-py.bash bin/build_label_model.py --user_list <user ID string without "UUID">

I added some log statements for debugging and observed that one particular column name in the dataframe (representing a key in our travel data) stood out: trip_user_input.

@MukuFlash03
Contributor Author

The following data observations were made when running the build pipeline for the 9c084ef4... user_id:

  • Total number of bins = 75

  • But the loop in _generate_predictions() runs for only 35 bins, as it exits after failing on the 35th bin.

  • From these, 16 bins do not have trip_user_input as a column in the dataframe and this data looks correct:

       mode_confirm purpose_confirm
       0    shared_ride            home
       1    shared_ride            home
       2    shared_ride            home
       3    shared_ride            home
       4    shared_ride            home
       ..           ...             ...
       192  shared_ride            home
       193  shared_ride            home
       194  shared_ride            home
       195  shared_ride            home
       196  shared_ride            home
       
       [197 rows x 2 columns] 
    
  • The remaining 19 bins contain trip_user_input as a dataframe column.

  • Of these 19 bins, 18 have an empty DataFrame, which might be a problem; perhaps the data is not loading correctly?

      Empty DataFrame
     Columns: [mode_confirm, purpose_confirm, trip_user_input, uniqcount]
     Index: [] 
    
  • The final 35th bin, where the pipeline fails, has the error-causing dictionary data in the dataframe:

                                           trip_user_input
     0  {'_id': 640bd0ee80ea0c11040a8def, 'user_id': 9... 
    

Getting back to the main exception:

# Occurs in this code
unique_labels = user_label_df.groupby(group_cols).size().reset_index(name='uniqcount')
# Exception
TypeError: unhashable type: 'dict'

This is why the exception was occurring: DataFrame operations such as groupby() fail when an unhashable object like a dictionary or list is present in one of the grouping columns.

@shankari
Contributor

The final 35th bin, where the pipeline fails, has the error-causing dictionary data in the dataframe:

This is because the user is providing survey inputs and not labels (MULTILABEL). This is expected on staging, and one of the production environments (washingtoncommons). But there should be users on staging that are not using surveys, and there should not be a second production environment (IIRC) that uses surveys. What was the second production environment that was seeing this error?

@MukuFlash03
Contributor Author

The final 35th bin, where the pipeline fails, has the error-causing dictionary data in the dataframe:

This is because the user is providing survey inputs and not labels (MULTILABEL). This is expected on staging, and one of the production environments (washingtoncommons). But there should be users on staging that are not using surveys, and there should not be a second production environment (IIRC) that uses surveys. What was the second production environment that was seeing this error?

Oh alright, I see.
Yes, there is another production environment: nrel-commute.
washingtoncommons has this error logged quite recently, from Jan '24,
while nrel-commute has had it since Aug '23.

@shankari
Contributor

shankari commented Jan 25, 2024

Ah I think that one of the OpenPATH devs has been using nrel-commute for testing. Can you check the related UUID and post it internally to confirm?

If so, we understand the situation; we just have to handle it properly so that having a few survey responses won't break the rest of the model building. This is not critical right now because our inputs are typically either multilabel or survey, but as our use cases expand, we may want to support mixed inputs.

I think that the fix to ignore dict entries should be fairly trivial, so we should do it now, since it will also ensure that any other errors on staging are not masked by this issue.

@MukuFlash03
Contributor Author

The debugging process after this involved confirming whether the user on the production environment in question was one of ours (an OpenPATH dev). It was indeed confirmed as ours once we had fetched the opcode (email / token) found in the CloudWatch webserver logs.

The opcode is the generated email / token, which I found by first searching through all the webserver logs for the specific UUID. This was taking a lot of time, but I finally found some logs that had these lines:

Using the skip method to verify id token ...
retUUID = <UUID>

This gave me the hint to first search for retUUID = <UUID under question>, to see when the opcode / token was being generated. I then filtered the AWS logs to specific time windows and finally found the opcode for this UUID.
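
For reference, the narrowed-down search could also be scripted; this is just a rough boto3 sketch with the log group name and UUID left as placeholders, not the exact console queries I ran:

    import boto3

    # Hypothetical sketch: find the "retUUID = ..." lines in the webserver log group
    # to locate when the opcode was generated, then narrow the time window from there.
    logs = boto3.client("logs")
    resp = logs.filter_log_events(
        logGroupName="<webserver log group>",
        filterPattern='"retUUID"',
    )
    for event in resp.get("events", []):
        if "<UUID under question>" in event["message"]:
            print(event["timestamp"], event["message"])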

@MukuFlash03
Contributor Author

MukuFlash03 commented Jan 26, 2024

Code fixes for this issue are in this PR.
The high-level plan, as suggested by Shankari, was to ignore the dictionary entries in user_label_df that were breaking the df.groupby() call.

After quite some trial and testing with the code, I was able to do this in a reasonably efficient manner. Some approaches I tried and then scrapped:

  1. Manually iterating through a specific row in the dataframe using a for loop.
    Cons:
  • This could potentially be really slow, since the apply() methods are much more optimized for operations on dataframes in Python. The official documentation here says so too.
  2. Using a boolean value to check whether a dictionary is present in the dataframe, either row-wise or collectively (a logical OR implemented using any()).
    Cons:
  • The code was getting too complex, with 2-3 nested if-else blocks, and I would have had to move the existing code logic into the if blocks.
  • Also, in this logic I was skipping the entire bin even if only one row in the dataframe had a dictionary, since groupby() is applied to the entire dataframe and I thought it would be better to skip the whole dataframe.
  • But then I realized there is a chance that both dictionary and non-dictionary elements exist in the dataframe, in which case skipping the entire bin would be incorrect.
  3. Filtering out dictionary rows while preserving the non-dictionary ones in the dataframe.
  • This is the logic I finalized on: it doesn't skip the entire bin but replaces the dataframe with a filtered version, removing the error-causing unhashable dictionary elements by checking the data type with isinstance() (see the sketch below).
  • This was much cleaner code, a simple one-liner fix, and the model building process now completes.
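
A sketch of that filtering idea, for illustration only (the exact change is in the linked PR):

    # Inside _generate_predictions(), just before the groupby() call:
    # drop any rows where a cell holds a dict; non-dictionary rows are kept.
    user_label_df = user_label_df[
        ~user_label_df.applymap(lambda v: isinstance(v, dict)).any(axis=1)
    ]
    group_cols = user_label_df.columns.tolist()
    unique_labels = user_label_df.groupby(group_cols).size().reset_index(name='uniqcount')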
