
👔 Switch the survey assist to use the random forest model #972

Open
shankari opened this issue Sep 7, 2023 · 25 comments

Comments

@shankari
Contributor

shankari commented Sep 7, 2023

The survey assist code currently uses GreedySimilarityBinning in production.
As we can see from the survey assist paper by Hannah Lu (https://www.nrel.gov/docs/fy23osti/84502.pdf), random forest models that include distance and duration perform better than pure spatial models. So let's switch to them, at least for label-based studies.

High level changes:

  • create a new algorithm for the random forest, including a new feature extractor, a new model class, and new save/load functionality
  • if the trip survey in the dynamic config is MULTILABEL, use the random forest model
    • if there is a mode of interest, predict the replaced mode, otherwise ignore
  • if the trip survey is not MULTILABEL, ignore

Next set of changes:

  • if the trip survey is not MULTILABEL, use the spatial algorithm
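
A rough sketch of the dispatch described in the lists above (the config key names and return values are assumptions, not the final implementation):

def select_trip_model(dynamic_config):
    # hypothetical key names; the real dynamic config layout may differ
    survey_type = dynamic_config.get("survey_info", {}).get("trip-labels")
    if survey_type == "MULTILABEL":
        # label-based study: use the new random forest model
        # (predict the replaced mode only if there is a mode of interest)
        return "forest"
    # non-MULTILABEL deployments: ignore for now; later, fall back to the
    # spatial (GreedySimilarityBinning) algorithm
    return None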

In both cases, you may need to change the format of the inferred_labels. It is currently an array of label tuples with different probabilities, so there are two things to work through:

  1. The random forest will give us one result by default. We may want to use predict_proba to keep the structure consistent, or change the structure, which would also require UI changes.
  2. The match for non-MULTILABEL programs will be survey responses, in XML and JSON, not a tuple of labels. Again, we will need to think about what the inferred_labels data structure should look like in this case and potentially change the phone code to match.
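
For example, predict_proba could be used to keep the current structure (a sketch; the label keys and the probability threshold are assumptions):

def predict_with_probabilities(mode_predictor, trip_features):
    # predict_proba returns one probability per class for each trip, which lets us
    # keep the current "array of label tuples with different probabilities" format
    # instead of the single prediction that predict() would give us
    proba = mode_predictor.predict_proba(trip_features)
    labels = mode_predictor.classes_
    inferred_labels = []
    for row in proba:
        inferred_labels.append([
            {"labels": {"mode_confirm": labels[i]}, "p": float(p)}
            for i, p in enumerate(row) if p > 0.05
        ])
    return inferred_labels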
@shankari
Contributor Author

shankari commented Sep 8, 2023

@humbleOldSage can you please tackle this one?

@humbleOldSage

To understand the model-building flow, which we'll have to change, I began by searching for mentions of GreedySimilarityBinning in the e-mission-server repository:

$ grep -rl GreedySimilarityBinning e-mission-server/ | grep -v __pycache__ | grep -v .git
e-mission-server//emission/analysis/modelling/trip_model/model_type.py
e-mission-server//emission/analysis/modelling/trip_model/greedy_similarity_binning.py
e-mission-server//emission/tests/modellingTests/TestBackwardsCompat.py
e-mission-server//emission/tests/modellingTests/TestRunGreedyModel.py
e-mission-server//emission/tests/modellingTests/TestGreedySimilarityBinning.py

The test files are not relevant to us at this point. greedy_similarity_binning.py is where the greedy model is defined, but there is no way to build the model there. That leaves model_type.py, which has a build function, which is what we were looking for. It is referenced by update_trip_model in run_model.py.

@humbleOldSage

humbleOldSage commented Sep 21, 2023

update_trip_model is referenced by build_label_model.py in main:

$ grep -rl update_trip_model | grep -v __pycache__
./e-mission-server/emission/analysis/modelling/trip_model/run_model.py
./e-mission-server/emission/tests/modellingTests/TestRunGreedyIncrementalModel.py
./e-mission-server/emission/tests/modellingTests/TestRunGreedyModel.py
./e-mission-server/bin/build_label_model.py

So this will be the starting point of all the changes once we have implemented the RF model.

Additionally, we'll have to change configs in trip_model.conf.json.sample
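
For instance, the sample config might gain a forest section along these lines (a sketch; the key names and values are assumptions, not the final format):

{
    "model_type": "forest",
    "model_storage": "document_database",
    "minimum_trips": 14,
    "model_parameters": {
        "forest": {
            "n_estimators": 100,
            "bootstrap": true,
            "random_state": 42
        }
    }
}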

@humbleOldSage

For the RF model, there's an implementation used for analysis in TRB_label_assist's models.py; however, it might lack some functionality. Let me draw parallels between the way GreedySimilarityBinning is used in production and in analysis. This will help us bring the RF model to production in the same way that GreedySimilarityBinning was.

@shankari
Contributor Author

Just to clarify, we are not going to change the current model directly.
Instead, we will create a new "random forest algorithm" class, similar to GreedySimilarityBinning

Additionally, we'll have to change configs in trip_model.conf.json.sample

We will define a config for random forest (maybe with the hyperparameters or ....)
And then we will say that the "selected" model is random forest.

So if we ever wanted to switch back to greedysimilarity, it would be as simple as changing the configuration
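
As a sketch of what that could look like (the RANDOM_FOREST entry, its string value, and the ForestClassifier class/module are hypothetical at this point):

from enum import Enum

import emission.analysis.modelling.trip_model.greedy_similarity_binning as eamtg
import emission.analysis.modelling.trip_model.forest_classifier as eamtf  # new module proposed in this issue

class ModelType(Enum):
    GREEDY_SIMILARITY_BINNING = "greedy"
    RANDOM_FOREST = "forest"

    def build(self, config):
        # whichever model is "selected" in trip_model.conf.json.sample gets built here,
        # so switching back to greedy similarity is just a config change
        if self == ModelType.GREEDY_SIMILARITY_BINNING:
            return eamtg.GreedySimilarityBinning(config)
        if self == ModelType.RANDOM_FOREST:
            return eamtf.ForestClassifier(config)
        raise KeyError(f"unknown model type {self}")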

@shankari
Contributor Author

@rahulkulhalli identified a couple of issues with the current TRB label assist code

  1. From Investigating the high variance counts for certain users and modes in label assist #951 (comment)

The bootstrap parameter was set to False. Was this intentional? If random forests are allowed to bootstrap, they create trees on different sub-samples of data. This allows the model to generalize much better as compared to fitting a tree on the entire dataset. By disabling the bootstrap feature, we're basically creating n_estimator trees, and each tree is fit on the entire dataset.

I don't think it is going to make a huge difference, but for the paper, we might as well change it and see if it improves things. First I should merge your changes without any parameter changes, but then we can actually change the model to try and get better results.
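
For reference, the change being discussed is just the bootstrap flag on the scikit-learn estimator (illustrative values):

from sklearn.ensemble import RandomForestClassifier

# bootstrap=False fits every tree on the entire dataset; bootstrap=True gives each
# tree a different bootstrap sample, which is what helps the ensemble generalize
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=42)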

@shankari
Contributor Author

@humbleOldSage I found the other note/cleanup from @rahulkulhalli. It was in Teams chat and not in an issue, which is why I couldn't find it

I also wanted to report something I came across while reading the TRB code during the weekend - the ForestEstimator model creates three model instances using the same model hyper-parameters. We're not tuning each model separately
We should ideally be using different hyper-parameters for each model. Here, we pass the same configuration to the three models.
https://github.com/e-mission/e-mission-eval-private-data/blob/master/TRB_label_assist/models.py#L1515
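
For illustration, something along these lines instead of reusing one configuration three times (the parameter values are placeholders, not tuned choices):

from sklearn.ensemble import RandomForestClassifier

# each predictor gets its own hyper-parameters instead of sharing a single config
purpose_predictor = RandomForestClassifier(n_estimators=150, max_depth=20)
mode_predictor = RandomForestClassifier(n_estimators=100, max_depth=15)
replaced_predictor = RandomForestClassifier(n_estimators=100, max_depth=10)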

@humbleOldSage

Yeah, that's the plan.

@humbleOldSage

While adding the forest classifier, I realized that the random forest as implemented in sklearn (and used during analysis) works on DataFrame-type data. However, GreedySimilarityBinning in e-mission-server works on ConfirmedTrip-type data. One way to make both work and keep the code as generic as possible is to pass both the list of ConfirmedTrip objects and the DataFrame to the fit function, as below:

model.fit(trips,trips_df)

and then each model can use whichever type of data it wants.

@humbleOldSage

About model storage:

So far, the only model storage implementation present is for GreedySimilarityBinning. Let's check how generic it is.

@humbleOldSage

Storage for GreedySimilarityBinning works by saving just the bins generated after clustering, as a single dictionary. The flow is as follows:

update_trip_model in run_model builds the model, then loads the data and finally fits the model; after fitting, it calls to_dict as:

model_data_next = model.to_dict()

which, in the GreedySimilarityBinning case, simply returns the bins (greedy_similarity_binning.py):

def to_dict(self) -> Dict:
    return self.bins

which is then saved as:

eamums.save_model(user_id, model_type, model_data_next, last_done_ts, model_storage)
Now, in the case of the random forest, I implemented this to_dict function in forest_classifier.py in the last commit, as below:

def to_dict(self) -> Dict:
    buf = BytesIO()
    joblib.dump(self, buf, compress=3)
    return {"serialized_model": buf.getvalue()}

However, even though this might work, the problem is that it dumps the entire ForestClassifier model for saving, which also includes the training data in the form of self.Xy_train, which is not desirable.

The other way is to just save the three predictors (self.purpose_predictor, self.mode_predictor and self.replaced_predictor) and the encoder (in the case where we set the RF to use clustering).

But then loading the model (using model_storage.py) will not remain generic across the GreedySimilarity and RF models.

@shankari what would you suggest I do?

@humbleOldSage

Regarding testing:

For consistency testing, this is what I had in mind:

From your training set, take duplicate 10% of your data and keep it to the side. This will become your mock data.
Train, validate, and test your model as-is. Now, using the trained model, run an inference pass through the 10% of data and record the results/predictions. This is your ground truth.
Next time onwards, when the model is to be tested, run the same 10% of data through the model and compare the model's outputs with the ones you've stored.

Since this requires the predict part as well, I will have to implement it at a later stage. For now, since I have only written the fit function, the only test I can write other than the build test would be to check for nulls and runtime exceptions.
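
A sketch of that flow (featurize and the baseline path are hypothetical placeholders):

import joblib
import numpy as np

def record_ground_truth(model, mock_trips, path="rf_consistency_baseline.gz"):
    # run the trained model on the held-out 10% once and store its predictions;
    # featurize() is a hypothetical helper that turns trips into model features
    baseline = model.predict(featurize(mock_trips))
    joblib.dump(baseline, path)

def check_consistency(model, mock_trips, path="rf_consistency_baseline.gz"):
    # later test runs must reproduce the stored predictions exactly
    baseline = joblib.load(path)
    current = model.predict(featurize(mock_trips))
    assert np.array_equal(baseline, current), "predictions drifted from the stored baseline"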

@shankari
Contributor Author

shankari commented Oct 2, 2023

model.fit(trips,trips_df)

This is bad design. You should not pass duplicate data into the function - what happens if the calling code uses different datasets for the two? Yes, we could check that they are the same, but it is better to tailor the interface so it cannot be wrong in the first place. You should accept one of them (either trip list or trips_df) and convert to the other internally. I would suggest accepting trips and converting to trips_df using to_data_df from the timeseries.
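
i.e. roughly this shape (a sketch; it assumes to_data_df can be used the way other conversion code in the codebase uses it):

import emission.storage.timeseries.abstract_timeseries as esta

class ForestClassifier:
    def fit(self, trips):
        # accept only the trip list and derive the dataframe internally, so callers
        # cannot pass two inconsistent views of the same data
        trips_df = esta.TimeSeries.to_data_df("analysis/confirmed_trip", trips)
        # ... feature extraction and training on trips_df ...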

@humbleOldSage

humbleOldSage commented Oct 2, 2023

You should accept one of them (either trip list or trips_df) and convert to the other internally. I would suggest accepting trips and converting to trips_df using to_data_df from the timeseries.

Will do. I figured this would require passing at least a timeseries-type object (ts):

ts = esta.TimeSeries.get_time_series(u)

(where u is the user id)
to the fit function, since to_data_df is part of the TimeSeries class. So fit would be something like:

model.fit(trip,ts).

@shankari
Contributor Author

shankari commented Oct 2, 2023

For consistency testing, this is what I had in mind:

Where is this quote "From your training set, take duplicate 10% of your data and keep it to the side." from?
Please be sure to cite all your sources - it is required from an ethical perspective, and it also lets me judge the reliability of the source.

As for the idea itself, it is interesting, but it needs some more clarity. What is "your dataset" in this case?

Will do. I figured this would require passing at least the object (ts) of timeseries type :

This is wrong. Please look at prior conversion examples in the codebase.

@humbleOldSage

humbleOldSage commented Oct 2, 2023

Where is this quote "From your training set, take duplicate 10% of your data and keep it to the side." from?

This is directly from a conversation between me and Rahul regarding tests this morning. Just our thoughts.

Since we are testing, "Your data" here would be randomly generated data for testing purposes.

@shankari
Contributor Author

shankari commented Oct 2, 2023

Then it is good to credit @rahulkulhalli as well!

@shankari
Contributor Author

shankari commented Oct 3, 2023

However, even though this might work, the problem is that it dumps the entire ForestClassifier model for saving, which also includes the training data in the form of self.Xy_train, which is not desirable.

Dumping Xy_train is sub-optimal but is not really wrong - we currently do dump all the start/end points of our training set because to predict, we have to calculate the distances between the new trip and all the existing clusters. So this won't be substantially different.

The other way is to just save the three predictors (self.purpose_predictor, self.mode_predictor and self.replaced_predictor) and the encoder (in the case where we set the RF to use clustering).

This does seem better.

But then loading the model (using model_storage.py) will not remain generic across the GreedySimilarity and RF models.

I don't understand this. Why would this not be generic? It seems like the "model" can be any dictionary that we want.

@humbleOldSage

humbleOldSage commented Oct 3, 2023

But then loading the model (using model_storage.py) will not remain generic across the GreedySimilarity and RF models.

I don't understand this. Why would this not be generic? It seems like the "model" can be any dictionary that we want.

GreedySimilarityBinning saves a dictionary of lists, where the keys are bin names/numbers and the value of each key is the list of trips that fall into that bin.
For the RF model, we'll be storing the 3 predictors and 1 encoder in a dictionary.

And in this way, the data stored for each model is different.

@shankari
Contributor Author

We discussed this in a meeting - the idea is that we can write a custom loader and saver for each model. It is fully expected that the data stored for each model is different.
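
Concretely, each model can provide its own to_dict / from_dict pair, and the storage layer just round-trips whatever dictionary it is given (a sketch; the attribute names follow the discussion above and are not final):

from io import BytesIO

import joblib

class ForestClassifier:
    def to_dict(self):
        # store only the fitted predictors and the encoder, not the training data
        buf = BytesIO()
        joblib.dump({
            "purpose_predictor": self.purpose_predictor,
            "mode_predictor": self.mode_predictor,
            "replaced_predictor": self.replaced_predictor,
            "encoder": self.encoder,
        }, buf, compress=3)
        return {"serialized_model": buf.getvalue()}

    def from_dict(self, model_data):
        # restore the predictors and encoder from the stored dictionary
        loaded = joblib.load(BytesIO(model_data["serialized_model"]))
        self.purpose_predictor = loaded["purpose_predictor"]
        self.mode_predictor = loaded["mode_predictor"]
        self.replaced_predictor = loaded["replaced_predictor"]
        self.encoder = loaded["encoder"]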

@humbleOldSage

humbleOldSage commented Nov 26, 2023

@shankari After FirstPoint#35, as discussed, I'll now move all the model files from e-mission-eval-private-data to e-mission-server (along with their git history). The analysis code stays in e-mission-eval-private-data.

@humbleOldSage

humbleOldSage commented Dec 15, 2023

One of the tests currently failing in PR938 is in TestPipelineRealData.py. When I ran it the first time, the file threw a circular import error within about 2 minutes. I fixed that and ran the tests again. However, this time the tests in this file have been running for upwards of 2 hours and are still running.

[screenshot of the test run output]

Sort of stuck at this point. There is activity in the docker container, so it's not actually stuck, that's for sure. I ran the script below to reset the pipeline in case it was in an invalid state:

e-mission-server/bin/monitor/reset_invalid_pipeline_states.py

Looks like it's not in an invalid state.

@humbleOldSage

humbleOldSage commented Dec 15, 2023

@shankari Is there any pipeline resetting between consecutive runs that I might be missing here?

@humbleOldSage

It is in fact moving. I believe this might be due to the size of the database.

@humbleOldSage

This took 2 days to complete, but eventually the tests passed.
