👔 Switch the survey assist to use the random forest model #972
@humbleOldSage can you please tackle this one?
To understand the build flow, which we'll have to change, I began by searching for mentions of
The test files are not relevant to us at this point. The
and so this will be the starting point of all the changes once we have implemented the RF model. Additionally, we'll have to change the configs in
For the RF model, there's an implementation being used in |
Just to clarify, we are not going to change the current model directly.
We will define a config for the random forest (maybe with the hyperparameters, or similar). So if we ever wanted to switch back to greedysimilarity, it would be as simple as changing the configuration.
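For illustration, a minimal sketch of what such a config could look like; the keys, values, and defaults here are assumptions, not the actual config schema:

```python
# Hypothetical model config sketch: keys and defaults are assumptions for
# illustration, not the real e-mission configuration file.
SURVEY_ASSIST_MODEL_CONFIG = {
    "model_type": "forest",            # switch back to "greedysimilarity" to revert
    "model_parameters": {
        "forest": {
            "n_estimators": 100,       # sklearn RandomForestClassifier hyperparameters
            "max_depth": None,
            "min_samples_leaf": 1,
        },
        "greedysimilarity": {
            "similarity_threshold_meters": 500,
        },
    },
}
```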
@rahulkulhalli identified a couple of issues with the current TRB label assist code
I don't think it is going to make a huge difference, but for the paper, we might as well change it and see if it improves things. First, I should merge your changes without any parameter changes, but then we can actually change the model to try and get better results.
@humbleOldSage I found the other note/cleanup from @rahulkulhalli. It was in Teams chat and not in an issue, which is why I couldn't find it.
Yeah, that's the plan.
While including the forest classifier, I realized that the random forest as implemented in sklearn (and used during the analysis) works on DataFrame-type data. However, greedy similarity binning in
and then each model can use the type of data it wants.
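A rough sketch of that idea, assuming trips arrive as a list of dicts; the wrapper function and column names are placeholders, not the actual implementation:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def fit_forest_model(trips):
    """Hypothetical sketch: accept the same list-of-trip-dicts input that greedy
    similarity binning consumes, and convert to a DataFrame internally, so each
    model gets the data type it wants. Column names here are placeholders."""
    trips_df = pd.DataFrame(trips)                 # list of dicts -> DataFrame
    features = trips_df[["distance", "duration"]]  # assumed feature columns
    labels = trips_df["mode_confirm"]              # assumed label column
    model = RandomForestClassifier(n_estimators=100)
    model.fit(features, labels)
    return model
```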
About model storage: until now, the only model storage implementation present is just for
So the storage for
which is basically a call to get back the bins in the case of
which is then saved as:
in the last commit. However, even though this might work, the problem is that it dumps the entire forestClassifier model for saving, which also includes
The other way is to just save the three predictors (
but then loading the model (using
@shankari what would you suggest I do?
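For reference, a rough sketch of the two options being weighed. The wrapper object, attribute names, and file-based storage below are all assumptions for illustration; the actual storage backend and names may differ:

```python
import joblib

def save_whole_model(forest_model, path="forest_model.joblib"):
    # Option 1: dump the entire fitted wrapper. Simple, but the saved blob
    # contains everything attached to the object, not just the fitted estimators.
    joblib.dump(forest_model, path)

def save_predictors_only(forest_model, path="forest_predictors.joblib"):
    # Option 2: save only the fitted predictors (attribute names are hypothetical).
    # Smaller payload, but the wrapper must be recreated and the estimators
    # reattached by matching load-side code.
    joblib.dump(
        {
            "mode_predictor": forest_model.mode_predictor,
            "purpose_predictor": forest_model.purpose_predictor,
            "replaced_predictor": forest_model.replaced_predictor,
        },
        path,
    )
```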
For testing, specifically consistency testing, this is what I had in mind:
Since this requires the predict part as well, I will have to implement it at a later stage. For now, since I have only written the fit function, the only test other than a build test that I can write would be to check for null values and runtime exceptions.
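A rough sketch of that kind of minimal test, assuming a hypothetical ForestClassifier wrapper and a throwaway trip generator; the names are placeholders:

```python
import random
import unittest

def generate_mock_trips(n=50):
    # Hypothetical helper: randomly generated trips purely for testing.
    modes = ["walk", "bike", "drove_alone"]
    return [
        {"distance": random.uniform(100, 10000),
         "duration": random.uniform(60, 3600),
         "mode_confirm": random.choice(modes)}
        for _ in range(n)
    ]

class TestForestModelFit(unittest.TestCase):
    def test_fit_does_not_raise(self):
        # Until predict() exists, just check that fit() runs on generated data
        # without raising and leaves a non-null fitted model behind.
        trips = generate_mock_trips()
        model = ForestClassifier()   # hypothetical wrapper class under test
        try:
            model.fit(trips)
        except Exception as e:
            self.fail(f"fit() raised an unexpected exception: {e}")
        self.assertIsNotNone(getattr(model, "model", None))
```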
This is bad design. You should not pass duplicate data into the function - what happens if the code that calls it uses different datasets for the two? Yes, we can check that they are the same, but it is better to tailor the interface so that it cannot be wrong in the first place. You should accept one of them (either the trip list or trips_df) and convert to the other internally. I would suggest accepting trips and converting to trips_df using
Will do. I figured this would require passing at least an object (ts) of the timeseries type:
(where u is the user id)
Where is this quote "From your training set, take duplicate 10% of your data and keep it to the side." from? As for the idea itself, it is interesting, but it needs some more clarity. What is "your dataset" in this case?
This is wrong. Please look at prior conversion examples in the codebase.
This is directly from a conversation between me and Rahul regarding tests this morning. Just our thoughts. Since we are testing, "your data" here would be randomly generated data for testing purposes.
Then it is good to credit @rahulkulhalli as well!
Dumping
This does seem better.
I don't understand this. Why would this not be generic? It seems like the "model" can be any dictionary that we want.
GreedySimilarityBinning saves a dictionary of lists, where the keys are bin names/numbers and the value of each key is the list of trips that fall into that bin. In this way, the data stored for each model is different.
We discussed this in a meeting - the idea is that we can write a custom loader and saver for each model. It is fully expected that the data stored for each model is different.
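For illustration, a sketch of that per-model serialization idea; the class and method names (to_dict / from_dict) are assumptions about how the custom saver/loader hooks could look, not the existing code:

```python
import base64
import io
import joblib

class GreedySimilarityBinningStore:
    # Sketch: this model's payload is just a dict of bins -> list of trips.
    def __init__(self, bins=None):
        self.bins = bins or {}          # {bin_id: [trip, trip, ...]}

    def to_dict(self):
        return {"bins": self.bins}

    def from_dict(self, model_dict):
        self.bins = model_dict["bins"]

class ForestClassifierStore:
    # Sketch: this model's payload is a serialized sklearn estimator instead.
    def __init__(self, model=None):
        self.model = model              # a fitted sklearn estimator

    def to_dict(self):
        buf = io.BytesIO()
        joblib.dump(self.model, buf)    # serialize the fitted estimator
        return {"forest": base64.b64encode(buf.getvalue()).decode()}

    def from_dict(self, model_dict):
        buf = io.BytesIO(base64.b64decode(model_dict["forest"]))
        self.model = joblib.load(buf)
```

Each model stores a different dict, and the generic storage layer only ever sees a dictionary.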
@shankari After FirstPoint#35, as discussed, I'll now move all the model files from e-mission-eval-private to e-mission-server (along with their git history). The analysis code stays in e-mission-eval-private-data.
One of the tests currently failing in PR938 includes the test in
I'm sort of stuck at this point. There's activity in the docker container, so it's not stuck, that's for sure. I ran the file below to reset the pipeline in case it was in an invalid state: `e-mission-server/bin/monitor/reset_invalid_pipeline_states.py`. Looks like it's not.
@shankari Is there any pipeline resetting between consecutive runs that I might be missing here?
It is in fact moving. I believe this might be due to the size of the database.
This took 2 days to complete, but eventually the tests were OK.
The survey assist code currently uses `GreedySimilarityBinning` in production. As we can see from the survey assist paper by Hannah Lu (https://www.nrel.gov/docs/fy23osti/84502.pdf), random forest models that include distance and duration perform better than pure spatial models. So let's switch to them, at least for label-based studies.
High level changes:
Next set of changes:
In both cases, you may need to change the format of the `inferred_labels`. It is currently an array of label tuples with different probabilities. You can either try to use `predict_proba` to keep the structure consistent, or change the structure, which will also require UI changes. We should figure out what the `inferred_labels` datastructure would look like in this case and potentially change the phone code to match.
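For context, a hedged sketch of the mapping being discussed. The exact `inferred_labels` schema below is an assumption based on the description above (an array of label tuples with probabilities), and the helper shows roughly how sklearn's `predict_proba` output could be folded back into that shape:

```python
# Hypothetical example of the current inferred_labels structure: a list of
# label tuples, each with a probability (field names assumed for illustration).
example_inferred_labels = [
    {"labels": {"mode_confirm": "bike", "purpose_confirm": "exercise"}, "p": 0.75},
    {"labels": {"mode_confirm": "walk", "purpose_confirm": "exercise"}, "p": 0.25},
]

def labels_from_predict_proba(model, trip_features, label_key="mode_confirm"):
    # Sketch: map RandomForestClassifier.predict_proba output back onto the
    # tuple-plus-probability structure so the existing phone code could stay
    # unchanged. A real version would combine the predictors for all labels.
    probs = model.predict_proba([trip_features])[0]   # class probabilities for one trip
    return [
        {"labels": {label_key: str(cls)}, "p": float(p)}
        for cls, p in zip(model.classes_, probs)
        if p > 0
    ]
```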