Testing on staging for loading models and improving scalability #1039
I did some more testing by re-running the tests locally and running the model load mechanism to debug why the statements are not being printed. I had the following observations:
The print statements are not inside any if-else or other conditional block, so I am unclear why I cannot observe them in the logs. For my next observation, I tried replacing print() with logging.debug(), which does not print anything at all for the statements I added.
This is strange since, in the same file:

B. Exception while building / loading model
Found some exceptions when checking the
I followed the Traceback exception stack and took a look at these two files:
The error is TypeError: unhashable type: 'dict', which (per the trace) is raised at Line 298, where unique_labels is assigned a value.
The error seems to lie in the way the dataframe user_label_df is accessed: somewhere internally, pandas ends up using a dictionary as a key, which it does not support since dictionaries are unhashable.
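pandas relies on hashing values internally (for unique(), groupby() keys, and set operations), so the same failure can be reproduced in plain Python. A minimal sketch with made-up label values (the list below is illustrative, not the actual user_label_df contents):

```python
# Hypothetical label values: multilabel entries are plain strings, while
# survey responses arrive as nested dicts.
labels = ["walk", "bike", {"survey": {"q1": "yes"}}]

try:
    # Computing unique labels requires hashing each element; the dict fails.
    unique_labels = set(labels)
except TypeError as err:
    print(err)  # → unhashable type: 'dict'
```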
@MukuFlash03 now that this has been deployed to production, is this happening on production systems as well? If so, this is a showstopper and we have to drop everything and fix it. If it is not happening on production, you need to investigate why it is happening here. I have shared a fairly recent staging snapshot with you; you can use it to run locally and see what is going on. In parallel, you can add additional logs to debug directly on staging.
@MukuFlash03 while checking "is this happening on production systems as well?", it would also be helpful to quantify the impact of the change. What was the % improvement in load times? The % improvement in this stage?
I took a look at the Cloudwatch logs. The other production systems' log groups did not have this specific error, but they did have some errors relating to insufficient or low storage space.
In the staging logs, I saw three distinct UUIDs that were encountering this error. Of these, the first two were present as valid users in the earlier staging snapshot dataset from October that I've been testing with. I found information about the model build pipeline here and launched the pipeline for each of these users:
I added some log statements for debugging and observed that a particular column name in the dataframe (representing a key in our travel data) stood out:
The following data observations were seen on running the build pipeline for the
Getting back to the main exception:
This is why the exception was occurring: dataframe operations such as groupby() fail when an unhashable object like a dictionary or a list is present in the values being grouped.
This is because the user is providing survey inputs and not labels (MULTILABEL). This is expected on staging and in one of the production environments (washingtoncommons). But there should be users on staging that are not using surveys, and (IIRC) there should not be a second production environment that uses surveys. What was the second production environment that was seeing this error?
Oh alright, I see.
Ah, I think that one of the OpenPATH devs has been using If so, we understand the situation; we just have to handle it properly so that if we have a few survey responses, they won't break the rest of the model building. This is not critical right now because our inputs are typically either multilabel or survey, but as our use cases expand, we may want to support mixed inputs. I think that the fix to ignore
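A minimal sketch of that kind of "ignore and continue" handling, using hypothetical names (is_hashable, filter_model_inputs, and the 'label' key are illustrative stand-ins, not the actual PR code): drop entries whose label cannot be hashed, e.g. survey-response dicts, before the model-building step, so a few survey responses don't break the rest of model building.

```python
def is_hashable(value):
    """Return True if value can be hashed (usable in sets / groupby keys)."""
    try:
        hash(value)
        return True
    except TypeError:
        return False

def filter_model_inputs(entries):
    """Keep only entries whose 'label' is hashable; skip survey-style dicts.

    `entries` and the 'label' key are hypothetical stand-ins for the real
    trip-label records used by the model build pipeline.
    """
    kept = [e for e in entries if is_hashable(e.get("label"))]
    skipped = len(entries) - len(kept)
    if skipped:
        print(f"ignoring {skipped} unhashable label entries (likely survey responses)")
    return kept

entries = [{"label": "walk"}, {"label": {"survey": "response"}}, {"label": "bike"}]
print([e["label"] for e in filter_model_inputs(entries)])  # → ['walk', 'bike']
```

The same filtering could be done on the dataframe side by dropping rows whose label column fails a hashability check before any groupby() is attempted.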
The debugging process after this involved confirming whether the production environment in question was one of ours. It was indeed confirmed as ours once we had fetched the opcode (email / token) found in the Cloudwatch webserver logs. The opcode is the generated email / token, which I found by first searching through all the webserver logs for the specific UUID. This was taking a lot of time to search, but I finally found some logs which had these lines:
This gave me a hint to first search for
Code fixes for this issue are in this PR. After quite some trial and testing with the code, I was able to do this in a reasonably efficient manner. Some approaches I tried and then scrapped were:
Notes on Testing on Staging
*** Moving stuff over from this issue + added more findings ***
I learned that the staging environment we have set up internally isn't something you can just execute and test.
It has already been executed, and we can observe the logs directly to see if things are working properly.
I set the time frame for searching from 12/03 to 12/16.
Since this implementation included model loading for the Label inference pipeline stage, I figured the correct Cloudwatch logs to observe would be: openpath-stage-analysis-intake
It looks like the model isn't loaded for the two users present in the checked time range from 12/03 to 12/16.
Additionally, some of my print statements are not being logged. I am not sure whether that's because I used print() instead of logging.debug(), since the pipeline stage messages with (***) are also logged using print().
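One buffering-related possibility worth ruling out (an assumption on my part, not confirmed from these logs): when stdout is not a TTY, as in containerized pipelines feeding CloudWatch, Python block-buffers print() output, so lines can appear late or be lost if the process exits abruptly. Forcing a flush, or running with PYTHONUNBUFFERED=1, removes that variable:

```python
import sys

# When stdout is a pipe (non-TTY), Python block-buffers output; flush=True
# forces each print to be written out immediately.
print("*** debug: model load reached ***", flush=True)

# Equivalent: write and flush the stream explicitly.
sys.stdout.write("*** debug: model load reached ***\n")
sys.stdout.flush()
```

If the (***) pipeline messages appear but mine don't, buffering is less likely the cause and the missing statements may simply never be executed on that code path.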