
👔 Switch the survey assist to use the random forest model #972

Open
shankari opened this issue Sep 7, 2023 · 25 comments

Comments

@shankari
Contributor

shankari commented Sep 7, 2023

The survey assist code currently uses GreedySimilarityBinning in production.
As we can see from the survey assist paper by Hannah Lu (https://www.nrel.gov/docs/fy23osti/84502.pdf), random forest models that include distance and duration perform better than pure spatial models. So let's switch to them, at least for label-based studies.

High level changes:

  • create a new algorithm for the random forest, including a new feature extractor, a new model class, and new save/load functionality
  • if the trip survey in the dynamic config is MULTILABEL, use the random forest model
    • if there is a mode of interest, predict the replaced mode, otherwise ignore
  • if the trip survey is not MULTILABEL, ignore

Next set of changes:

  • if the trip survey is not MULTILABEL, use the spatial algorithm
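
A rough sketch of the dispatch described in the lists above (the config key names and return values are assumptions, not the final implementation):

def select_trip_model(dynamic_config):
    # hypothetical key names; the real dynamic config layout may differ
    survey_type = dynamic_config.get("survey_info", {}).get("trip-labels")
    if survey_type == "MULTILABEL":
        # label-based study: use the new random forest model
        # (predict the replaced mode only if there is a mode of interest)
        return "forest"
    # non-MULTILABEL deployments: ignore for now; later, fall back to the
    # spatial (GreedySimilarityBinning) algorithm
    return None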

In both cases, you may need to change the format of the inferred_labels. It is currently an array of label tuples with different probabilities, so there are two things to work through:

  1. The random forest will give us one result by default. We may want to use predict_proba to keep the structure consistent, or change the structure, which would also require UI changes.
  2. The match for non-MULTILABEL programs will be survey responses, in XML and JSON, not a tuple of labels. Again, we will need to think about what the inferred_labels data structure should look like in this case and potentially change the phone code to match.
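
For example, predict_proba could be used to keep the current structure (a sketch; the label keys and the probability threshold are assumptions):

def predict_with_probabilities(mode_predictor, trip_features):
    # predict_proba returns one probability per class for each trip, which lets us
    # keep the current "array of label tuples with different probabilities" format
    # instead of the single prediction that predict() would give us
    proba = mode_predictor.predict_proba(trip_features)
    labels = mode_predictor.classes_
    inferred_labels = []
    for row in proba:
        inferred_labels.append([
            {"labels": {"mode_confirm": labels[i]}, "p": float(p)}
            for i, p in enumerate(row) if p > 0.05
        ])
    return inferred_labels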
@shankari
Contributor Author

shankari commented Sep 8, 2023

@humbleOldSage can you please tackle this one?

@humbleOldSage

To understand the model-building flow, which we'll have to change, I began by searching for mentions of GreedySimilarityBinning in the e-mission-server repository:

$ grep -rl GreedySimilarityBinning e-mission-server/ | grep -v __pycache__ | grep -v .git
e-mission-server//emission/analysis/modelling/trip_model/model_type.py
e-mission-server//emission/analysis/modelling/trip_model/greedy_similarity_binning.py
e-mission-server//emission/tests/modellingTests/TestBackwardsCompat.py
e-mission-server//emission/tests/modellingTests/TestRunGreedyModel.py
e-mission-server//emission/tests/modellingTests/TestGreedySimilarityBinning.py

The test files are not relevant to us at this point. greedy_similarity_binning.py is where the greedy model is defined, but there is no way to build the model there. That leaves model_type.py, which has a build function, which is what we were looking for. It is referenced by update_trip_model in run_model.py.

@humbleOldSage

humbleOldSage commented Sep 21, 2023

update_trip_model is referenced by build_label_model.py in main:

$ grep -rl update_trip_model | grep -v __pycache__
./e-mission-server/emission/analysis/modelling/trip_model/run_model.py
./e-mission-server/emission/tests/modellingTests/TestRunGreedyIncrementalModel.py
./e-mission-server/emission/tests/modellingTests/TestRunGreedyModel.py
./e-mission-server/bin/build_label_model.py

So this will be the starting point of all the changes once we have implemented the RF model.

Additionally, we'll have to change configs in trip_model.conf.json.sample
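
For instance, the sample config might gain a forest section along these lines (a sketch; the key names and values are assumptions, not the final format):

{
    "model_type": "forest",
    "model_storage": "document_database",
    "minimum_trips": 14,
    "model_parameters": {
        "forest": {
            "n_estimators": 100,
            "bootstrap": true,
            "random_state": 42
        }
    }
}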

@humbleOldSage

For the RF model, there's an implementation used for analysis in TRB_label_assist's models.py; however, it might lack some functionality. Let me draw parallels between the way GreedySimilarityBinning is used in production and in analysis. This will help us bring the RF model to production in the same way that GreedySimilarityBinning was.

@shankari
Contributor Author

Just to clarify, we are not going to change the current model directly.
Instead, we will create a new "random forest algorithm" class, similar to GreedySimilarityBinning

Additionally, we'll have to change configs in trip_model.conf.json.sample

We will define a config for random forest (maybe with the hyperparameters or ....)
And then we will say that the "selected" model is random forest.

So if we ever wanted to switch back to greedysimilarity, it would be as simple as changing the configuration
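
As a sketch of what that could look like (the RANDOM_FOREST entry, its string value, and the ForestClassifier class/module are hypothetical at this point):

from enum import Enum

import emission.analysis.modelling.trip_model.greedy_similarity_binning as eamtg
import emission.analysis.modelling.trip_model.forest_classifier as eamtf  # new module proposed in this issue

class ModelType(Enum):
    GREEDY_SIMILARITY_BINNING = "greedy"
    RANDOM_FOREST = "forest"

    def build(self, config):
        # whichever model is "selected" in trip_model.conf.json.sample gets built here,
        # so switching back to greedy similarity is just a config change
        if self == ModelType.GREEDY_SIMILARITY_BINNING:
            return eamtg.GreedySimilarityBinning(config)
        if self == ModelType.RANDOM_FOREST:
            return eamtf.ForestClassifier(config)
        raise KeyError(f"unknown model type {self}")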

@shankari
Contributor Author

@rahulkulhalli identified a couple of issues with the current TRB label assist code

  1. From Investigating the high variance counts for certain users and modes in label assist #951 (comment)

The bootstrap parameter was set to False. Was this intentional? If random forests are allowed to bootstrap, they create trees on different sub-samples of data. This allows the model to generalize much better as compared to fitting a tree on the entire dataset. By disabling the bootstrap feature, we're basically creating n_estimator trees, and each tree is fit on the entire dataset.

I don't think it is going to make a huge difference, but for the paper, we might as well change it and see if it improves things. First I should merge your changes without any parameter changes, but then we can actually change the model to try and get better results.
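
For reference, the change being discussed is just the bootstrap flag on the scikit-learn estimator (illustrative values):

from sklearn.ensemble import RandomForestClassifier

# bootstrap=False fits every tree on the entire dataset; bootstrap=True gives each
# tree a different bootstrap sample, which is what helps the ensemble generalize
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=42)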

@shankari
Contributor Author

@humbleOldSage I found the other note/cleanup from @rahulkulhalli. It was in Teams chat and not in an issue, which is why I couldn't find it

I also wanted to report something I came across while reading the TRB code during the weekend - the ForestEstimator model creates three model instances using the same model hyper-parameters. We're not tuning each model separately
We should ideally be using different hyper-parameters for each model. Here, we pass the same configuration to the three models.
https://github.com/e-mission/e-mission-eval-private-data/blob/master/TRB_label_assist/models.py#L1515
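
For illustration, something along these lines instead of reusing one configuration three times (the parameter values are placeholders, not tuned choices):

from sklearn.ensemble import RandomForestClassifier

# each predictor gets its own hyper-parameters instead of sharing a single config
purpose_predictor = RandomForestClassifier(n_estimators=150, max_depth=20)
mode_predictor = RandomForestClassifier(n_estimators=100, max_depth=15)
replaced_predictor = RandomForestClassifier(n_estimators=100, max_depth=10)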

@humbleOldSage

Yeah, that's the plan.

@humbleOldSage

While adding the forest classifier, I realized that the random forest as implemented in sklearn (and used during analysis) works on DataFrame-type data. However, GreedySimilarityBinning in e-mission-server works on ConfirmedTrip-type data. One way to make both work and keep the code as generic as possible is to pass both the list of ConfirmedTrip objects and the DataFrame to the fit function, as below:

model.fit(trips,trips_df)

and then each model can use whichever type of data it wants.

@humbleOldSage

About model storage:

So far, the only model storage implementation present is for GreedySimilarityBinning. Let's check how generic it is.

@humbleOldSage

Storage for GreedySimilarityBinning works by saving just the bins generated after clustering, as a single dictionary. The flow is as follows:

update_trip_model in run_model builds the model, then loads the data and finally fits the model; after fitting, it calls to_dict as:

model_data_next = model.to_dict()

which, in the GreedySimilarityBinning case, simply returns the bins (greedy_similarity_binning.py):

def to_dict(self) -> Dict:
    return self.bins

which is then saved as:

eamums.save_model(user_id, model_type, model_data_next, last_done_ts, model_storage)
Now, in the case of the random forest, I implemented this to_dict function in forest_classifier.py in the last commit, as below:

def to_dict(self) -> Dict:
    buf = BytesIO()
    joblib.dump(self, buf, compress=3)
    return {"serialized_model": buf.getvalue()}

However, even though this might work, the problem is that it dumps the entire ForestClassifier model for saving, which also includes the training data in the form of self.Xy_train, which is not desirable.

The other way is to just save the three predictors (self.purpose_predictor, self.mode_predictor and self.replaced_predictor) and the encoder (in the case where we set the RF to use clustering).

But then loading the model (using model_storage.py) will not remain generic across the GreedySimilarity and RF models.

@shankari what would you suggest I do?

@humbleOldSage

Regarding testing:

For consistency testing, this is what I had in mind:

From your training set, take duplicate 10% of your data and keep it to the side. This will become your mock data.
Train, validate, and test your model as-is. Now, using the trained model, run an inference pass through the 10% of data and record the results/predictions. This is your ground truth.
Next time onwards, when the model is to be tested, run the same 10% of data through the model and compare the model's outputs with the ones you've stored.

Since this requires the predict part as well, I will have to implement it at a later stage. For now, since I have only written the fit function, the only test I can write other than the build test would be to check for nulls and runtime exceptions.
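
A sketch of that flow (featurize and the baseline path are hypothetical placeholders):

import joblib
import numpy as np

def record_ground_truth(model, mock_trips, path="rf_consistency_baseline.gz"):
    # run the trained model on the held-out 10% once and store its predictions;
    # featurize() is a hypothetical helper that turns trips into model features
    baseline = model.predict(featurize(mock_trips))
    joblib.dump(baseline, path)

def check_consistency(model, mock_trips, path="rf_consistency_baseline.gz"):
    # later test runs must reproduce the stored predictions exactly
    baseline = joblib.load(path)
    current = model.predict(featurize(mock_trips))
    assert np.array_equal(baseline, current), "predictions drifted from the stored baseline"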

@shankari
Contributor Author

shankari commented Oct 2, 2023

model.fit(trips,trips_df)

This is bad design. You should not pass duplicate data into the function - what happens if the calling code uses different datasets for the two? Yes, we could check that they are the same, but it is better to tailor the interface so it cannot be wrong in the first place. You should accept one of them (either trip list or trips_df) and convert to the other internally. I would suggest accepting trips and converting to trips_df using to_data_df from the timeseries.
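
i.e. roughly this shape (a sketch; it assumes to_data_df can be used the way other conversion code in the codebase uses it):

import emission.storage.timeseries.abstract_timeseries as esta

class ForestClassifier:
    def fit(self, trips):
        # accept only the trip list and derive the dataframe internally, so callers
        # cannot pass two inconsistent views of the same data
        trips_df = esta.TimeSeries.to_data_df("analysis/confirmed_trip", trips)
        # ... feature extraction and training on trips_df ...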

@humbleOldSage

humbleOldSage commented Oct 2, 2023

You should accept one of them (either trip list or trips_df) and convert to the other internally. I would suggest accepting trips and converting to trips_df using to_data_df from the timeseries.

Will do. I figured this would require passing at least a timeseries-type object (ts):

ts = esta.TimeSeries.get_time_series(u)

(where u is the user id)
to the fit function, since to_data_df is part of the TimeSeries class. So fit would be something like:

model.fit(trip,ts).

@shankari
Contributor Author

shankari commented Oct 2, 2023

For consistency testing, this is what I had in mind:

Where is this quote "From your training set, take duplicate 10% of your data and keep it to the side." from?
Please be sure to cite all your sources - it is required from an ethical perspective, and it also lets me judge the reliability of the source.

As for the idea itself, it is interesting, but it needs some more clarity. What is "your dataset" in this case?

Will do. I figured this would require passing at least the object (ts) of timeseries type :

This is wrong. Please look at prior conversion examples in the codebase.

@humbleOldSage

humbleOldSage commented Oct 2, 2023

Where is this quote "From your training set, take duplicate 10% of your data and keep it to the side." from?

This is directly from a conversation between me and Rahul regarding tests this morning. Just our thoughts.

Since we are testing, "Your data" here would be randomly generated data for testing purposes.

@shankari
Contributor Author

shankari commented Oct 2, 2023

Then it is good to credit @rahulkulhalli as well!

@shankari
Contributor Author

shankari commented Oct 3, 2023

However, even though this might work, the problem is that it dumps the entire ForestClassifier model for saving, which also includes the training data in the form of self.Xy_train, which is not desirable.

Dumping Xy_train is sub-optimal but is not really wrong - we currently do dump all the start/end points of our training set because to predict, we have to calculate the distances between the new trip and all the existing clusters. So this won't be substantially different.

The other way is to just save the three predictors (self.purpose_predictor, self.mode_predictor and self.replaced_predictor) and the encoder (in the case where we set the RF to use clustering).

This does seem better.

But then loading the model (using model_storage.py) will not remain generic across the GreedySimilarity and RF models.

I don't understand this. Why would this not be generic? It seems like the "model" can be any dictionary that we want.

@humbleOldSage

humbleOldSage commented Oct 3, 2023

But then loading the model (using model_storage.py) will not remain generic across the GreedySimilarity and RF models.

I don't understand this. Why would this not be generic? It seems like the "model" can be any dictionary that we want.

GreedySimilarityBinning saves a dictionary of lists, where the keys are bin names/numbers and the value of each key is the list of trips that fall into that bin.
For the RF model, we'll be storing the 3 predictors and 1 encoder in a dictionary.

And in this way, the data stored for each model is different.

@shankari
Contributor Author

We discussed this in a meeting - the idea is that we can write a custom loader and saver for each model. It is fully expected that the data stored for each model is different.
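
Concretely, each model can provide its own to_dict / from_dict pair, and the storage layer just round-trips whatever dictionary it is given (a sketch; the attribute names follow the discussion above and are not final):

from io import BytesIO

import joblib

class ForestClassifier:
    def to_dict(self):
        # store only the fitted predictors and the encoder, not the training data
        buf = BytesIO()
        joblib.dump({
            "purpose_predictor": self.purpose_predictor,
            "mode_predictor": self.mode_predictor,
            "replaced_predictor": self.replaced_predictor,
            "encoder": self.encoder,
        }, buf, compress=3)
        return {"serialized_model": buf.getvalue()}

    def from_dict(self, model_data):
        # restore the predictors and encoder from the stored dictionary
        loaded = joblib.load(BytesIO(model_data["serialized_model"]))
        self.purpose_predictor = loaded["purpose_predictor"]
        self.mode_predictor = loaded["mode_predictor"]
        self.replaced_predictor = loaded["replaced_predictor"]
        self.encoder = loaded["encoder"]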

@humbleOldSage

humbleOldSage commented Nov 26, 2023

@shankari After FirstPoint#35, as discussed, I'll now move all the model files from e-mission-eval-private-data to e-mission-server (along with their git history). The analysis code stays in e-mission-eval-private-data.

@humbleOldSage

humbleOldSage commented Dec 15, 2023

One of the tests currently failing in PR938 is in TestPipelineRealData.py. When I ran it the first time, the file threw a circular import error within about 2 minutes. I fixed that and ran the tests again. However, this time the tests in this file have been running for upwards of 2 hours and are still running.

[screenshot of the test run output]

Sort of stuck at this point. There is activity in the docker container, so it's not actually stuck, that's for sure. I ran the script below to reset the pipeline in case it was in an invalid state:

e-mission-server/bin/monitor/reset_invalid_pipeline_states.py

Looks like it's not in an invalid state.

@humbleOldSage

humbleOldSage commented Dec 15, 2023

@shankari Is there any pipeline resetting between consecutive runs that I might be missing here?

@humbleOldSage

It is in fact moving. I believe this might be due to the size of the database.

@humbleOldSage

This took 2 days to complete, but eventually the tests passed.
