Survey Assist Using RF #938
Conversation
…result generation more carefully
…VM_decision_boundaries` compatible with changes in `clustering.py` and `mapping.py` files. Also porting these 3 notebooks (`cluster_performance.ipynb`, `generate_figs_for_poster` and `SVM_decision_boundaries`) to trip_model; they now have no dependence on the custom branch. Plot results are attached to show there is no difference between their previous and current outputs.
Unified interface for the fit function across all models. Passing 'Entry'-type data from the notebooks down to the binning functions. Default set to 'none'.
…results.py` Prior to this update, `NaiveBinningClassifier` in 'models.py' had dependencies on both the tour model and the trip model. Now, this classifier is completely dependent on the trip model. All the other notebooks (except `classification_performance.ipynb`) were tested as well and they are working as usual. Other minor fixes to support previous changes.
1. Removed mentions of `tour_model` or `tour_model_first_only`. 2. Removed two reads from the database. 3. Removed notebook outputs (this could be the reason a few diffs are too big to view).
RF initialisation and fit function. Build test written and tested. The Random Forest fit uses a DataFrame.
Predict is now included. Just need to figure out model storage and model testing.
Model loading and storing is now improved since it just stores the required predictors and encoders. Regression test and null value test included in tests.
It is fine to include DBSCAN and clustering, but it should be in a different module instead of having a large bolus of code that is copy-pasted from one repo to another. DRY!
@@ -57,7 +57,6 @@ def update_trip_model(
    logging.debug(f'time query for training data collection: {time_query}')

    trips = _get_training_data(user_id, time_query)
extraneous whitespace.
I still see this extraneous whitespace
Removed.
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
we generally don't use `from X import Y` in our python code. Please change to:
- from sklearn.preprocessing import OneHotEncoder
- from sklearn.pipeline import make_pipeline
- from sklearn.impute import SimpleImputer
+ import sklearn as skl
or
- from sklearn.preprocessing import OneHotEncoder
- from sklearn.pipeline import make_pipeline
- from sklearn.impute import SimpleImputer
+ import sklearn.preprocessing as sklp
this is now moved to models.py.
all of this is essentially copied over from
https://github.com/e-mission/e-mission-eval-private-data/blob/master/TRB_label_assist/models.py
At the very least, you should cite it.
Even better would be if you could import it directly with full history and commit it here directly so I could know that the code is the same as the paper. And then we could change the code in the notebook to call the code from the server, so that again, we will be testing the code that we publish.
Followed this now. At the top of the models.py file, I have included the path it was taken from.
Did you try "importing it directly with full history and committing it here"?
This seems to be copied over and re-committed which was the "at the very least"
Also, when do you plan to finally finish e-mission/e-mission-eval-private-data#37 so that we can move on to
change the code in the notebook to call the code from the server, so that again, we will be testing the code that we publish.
This is further updated to include the models.py commit history.
1. Switching a model is as simple as changing model_type in the config file. 2. ForestModel is now working. The main model is in the model.py file, which is copied from label_assist. 3. TestRunForestModel.py is working. 4. Regression tests in TestForestModel.py are still under construction.
Removed redundancies and unnecessary code segments.
Config copy not required.
for attribute_name in attr:
    buffer = BytesIO()
    joblib.dump(getattr(self.model, attribute_name), buffer)
    buffer.seek(0)
    data[attribute_name] = buffer.getvalue()
Is this the recommended approach to serialize a random forest model? I have serialized random forest models before and this does not look familiar. Happy to keep it if it is in fact the currently recommended method, but need a citation
The method implemented above utilizes joblib, which is a recommended practice for scikit-learn models in their documentation. joblib and pickle were the only methods I knew, so I went with joblib. However, there is another recommendation, skops, in the documentation; I didn't use it since I was unfamiliar with it. I can move to skops or any other way you recommend if required.
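For context on the review question above, a minimal sketch of the reverse direction of this joblib approach — rebuilding the attributes from the stored bytes — assuming the same `attr` list and the `data` dict produced by the dump loop quoted earlier (the helper name is illustrative, not code from this PR):

```python
import joblib
from io import BytesIO

# Hypothetical counterpart to the dump loop above: restore each predictor/encoder
# from the raw bytes that were previously produced by joblib.dump().
def load_model_attributes(model, data, attr):
    for attribute_name in attr:
        buffer = BytesIO(data[attribute_name])
        setattr(model, attribute_name, joblib.load(buffer))
    return model
```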
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
I don't see any testing done filled in.
If you are adding a new test, it would be helpful to add logs showing that it works (in addition to the CI passing)
And the testing is very incomplete - it is not at all clear that this code will actually work in production
And there is one very egregious error
And it would be good to import model.py with full history
Other than that, looks good.
:return: a vector containing features to predict from
:rtype: List[float]
"""
pass
you should add a code comment here about why this is "pass"
a normal expectation would be that all models would extract features.
Will do. It's because the pre-implemented models.py file handles extraction in this case.
This is included in the upcoming commit.
I don't see this code comment in here yet
This is done.
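For reference, a hypothetical sketch of the kind of explanatory comment the reviewer asked for next to the `pass` (the method body shown is illustrative, not the actual commit):

```python
def extract_features(self, trip):
    """
    :return: a vector containing features to predict from
    :rtype: List[float]
    """
    # Intentionally a no-op: for the forest model, feature extraction is handled
    # by the pre-implemented pipeline in models.py (copied from TRB_label_assist),
    # so this hook required by the common model interface has nothing to do here.
    pass
```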
########################################################################
## Copied from /e-mission-eval-private-data/TRB_label_assist/models.py##
########################################################################
Is this merged over with commit history? Does not appear to be
Currently, it's not. I was unaware of this way of moving files. I'll do it now.
I did this and was able to move the models.py file with its commit history. However, models.py has dependencies on the clustering.py and data_wrangling.py files in TRB_label_assist.
I can similarly copy data_wrangling.py (with its commit history) since it doesn't seem to have any further dependencies.
However, clustering.py has further dependencies on other files. But the functions (from clustering.py) required by models.py are get_distance_matrix and single_cluster_purity. Neither of these two functions has further dependencies, so I can copy clustering.py (along with the commit history) to e-mission-server and remove the unnecessary bits.
@shankari Let me know if this strategy works.
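A rough sketch of the refactoring strategy under discussion — pulling only the two needed helpers into their own small module so that models.py no longer drags in all of clustering.py. The module name and the function bodies below are illustrative placeholders, not the actual TRB_label_assist implementations:

```python
# hypothetical module: emission/analysis/modelling/trip_model/clustering_utils.py
import numpy as np
import pandas as pd

def get_distance_matrix(points):
    """Pairwise distance matrix for a set of 2-D points.
    Simplified Euclidean placeholder; the real helper may use haversine distances."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def single_cluster_purity(labels_in_cluster):
    """Fraction of points in a single cluster that share the most common label."""
    counts = pd.Series(list(labels_in_cluster)).value_counts()
    return counts.iloc[0] / counts.sum()

# models.py would then only need something like:
# import emission.analysis.modelling.trip_model.clustering_utils as eamtcu
```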
ok, so models.py was not a direct copy from TRB_label_assist, you edited it further?
I don't think you should copy clustering.py over - you should refactor it to pull out the two functions that are dependencies and copy only the file with the pulled-out code.
However, note that you have pending changes to clustering.py in e-mission/e-mission-eval-private-data#37. You should probably resolve those first, and get that PR merged before refactoring.
Note that I am also fine with importing the bulk of the python files implementing the algorithm in here, and just having the notebooks and the evaluation code in there (performance_eval.py, regenerate_classification_performance_results.py). This is particularly important if we want to retrain or reevaluate the model periodically.
ok, so models.py was not a direct copy from TRB_label_assist, you edited it further?
Yes. In the previous version (when I simply dumped the code), I removed all the classes not required by the RF model. I merged 'clustering.py' with the relevant class there and moved a few functions from 'utils.py' here into e-mission-server.
In its current form, models.py (with its commit history) is the same as in eval-private-data, with no changes.
However, note that you have pending changes to clustering.py e-mission/e-mission-eval-private-data#37. You should probably resolve those first, and get that PR merged before refactoring.
The two functions get_distance_matrix and single_cluster_purity required here are not affected in that PR and will remain unaffected in the future.
Note that I am also fine with importing the bulk of the python files implementing the algorithm in here. And just having the notebooks and the evaluation code in there.
We can do this. There's just clustering.py, mapping.py and data_wrangling.py to move (with their commit histories).
However, as mentioned we'll have to wait for e-mission/e-mission-eval-private-data#37 to be merged before we begin this movement.
emission/core/wrapper/entry.py (outdated):

#necessary values required by forest model
result_entry['data']['start_local_dt'] = result_entry['metadata']['write_local_dt']
result_entry['data']['end_local_dt'] = result_entry['metadata']['write_local_dt']
wait what?! the trip's start and end timestamps are almost always guaranteed to not be the same as the time it was created. And by putting this into core/wrapper/entry.py, you are adding this to every single entry of any type, created in any way.
This is just wrong.
This is done in the create_fake_entry function, which is used only for testing purposes:
$ grep -rl create_fake_entry | grep -v __pycache__
./emission/core/wrapper/entry.py
./emission/tests/modellingTests/modellingTestAssets.py
Since the start_local_dt and end_local_dt components were not required for greedy binning, they were not being added to the trips created via generate_mock_trips. For the Random Forest model pipeline, they need to be part of the Entry data that we generate.
Since TestRunRandomForest (which is just the first test file written) simply checks that the pipeline runs, any value would suffice for now.
These values are bound to evolve as we keep adding more and more tests as per the requirements.
Since the start_local_dt and end_local_dt components were not required for greedy binning, they were not being added to the trips created via generate_mock_trips. For the Random Forest model pipeline, they need to be part of the Entry data that we generate.
In this case, you should add the start_local_dt and end_local_dt to the call site for create_fake_entry, not change the base implementation of create_fake_entry in the core. With this implementation, every entry created for every potential purpose will have the hack required for one unfinished test case. This is wrong.
These values are bound to evolve as we keep adding more and more tests as per the requirements.
The values should evolve in the tests, not in the core.
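A minimal sketch of the suggested call-site fix — attaching the fields in the modelling test assets (or the test's setUp) instead of inside core/wrapper/entry.py. The helper shown is hypothetical; only the field names come from the discussion above:

```python
# e.g. in emission/tests/modellingTests/modellingTestAssets.py, applied to the
# trips returned by generate_mock_trips, NOT in the core create_fake_entry
def add_local_dt_fields(mock_trips):
    for trip in mock_trips:
        # any placeholder value suffices for the pipeline smoke test; reusing the
        # metadata write time mirrors what the removed entry.py hack did
        trip['data']['start_local_dt'] = trip['metadata']['write_local_dt']
        trip['data']['end_local_dt'] = trip['metadata']['write_local_dt']
    return mock_trips
```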
"""these tests were copied forward during a refactor of the tour model | ||
[https://github.com/e-mission/e-mission-server/blob/10772f892385d44e11e51e796b0780d8f6609a2c/emission/analysis/modelling/tour_model_first_only/load_predict.py#L114]
I don't see what the MAC failures, that happened several weeks ago, have to do with an extra file. You can always remove the file if it is no longer relevant. But it is not removed, so is it relevant or not?
I was complaining about:
(1) the class name being different from the file name
(2) the comment being incorrect (the tests were hopefully not copied forward during the tour model refactor)
"""these tests were copied forward during a refactor of the tour model
[https://github.com/e-mission/e-mission-server/blob/10772f892385d44e11e51e796b0780d8f6609a2c/emission/analysis/modelling/tour_model_first_only/load_predict.py#L114]
Ignoring this for now while waiting for clarification.
This test is a fine start, but:
- I don't think it tests the code end to end. It invokes the code using import emission.analysis.modelling.trip_model.run_model as eamur, which is not the method that is run directly from the pipeline. I don't see where you are testing whether or not the pipeline invokes this code, and how well it works.
- I also don't see any testing around load/store. You have a completely new load and store method; how do you know that they work, and how do you know that the data stored is sufficient to predict properly when loaded?
Currently, I am focused on validating the core functionality of the model. This includes unit tests that are similar to the ones we have for other greedy binning modules, which will help us confirm that the model performs as expected in isolation.
I also don't see any testing around load/store. You have a completely new load and store method, how do you know that they work, and how do you know that the data stored is sufficient to predict properly when loaded.
I didn't think I'd need this, but I'll begin working on it.
Once I am confident that the model's internal operations are sound, I will proceed to develop tests for the integration with the pipeline.
To test that the model loads and stores correctly, these are the tests I am writing:
1. Round-trip test: serialize an object with to_dict and then immediately deserialize it with from_dict. After deserialization, the object should have the same state as the original.
2. Consistency test: check that the serialization and deserialization process is consistent across multiple executions.
3. Integrity test: verify that the serialized dictionary contains all necessary keys and that the deserialized object has all required attributes initialized correctly.
4. Error handling test: attempt to serialize and deserialize with incorrect or corrupted data to ensure that the methods handle errors gracefully.
5. Type preservation test: check that the serialization and deserialization process maintains the data types of all model attributes.
6. Completeness test: ensure that no parts of the model are lost during serialization or deserialization.
A sketch of the first of these is shown below.
The plan is that this file will have the regression test.
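A minimal sketch of what the round-trip test (item 1 above) could look like, assuming a built model exposing fit/predict/to_dict/from_dict as described; the attribute and config names mirror other snippets in this PR but are illustrative here:

```python
def testRoundTripSerialization(self):
    # train the original model on the mock trips generated in setUp
    model = eamumt.ModelType.RANDOM_FOREST_CLASSIFIER.build(self.forest_model_config)
    model.fit(self.train_trips)

    # serialize, then rebuild a fresh instance from the resulting dictionary
    model_data = model.to_dict()
    restored = eamumt.ModelType.RANDOM_FOREST_CLASSIFIER.build(self.forest_model_config)
    restored.from_dict(model_data)

    # the restored model should behave exactly like the original
    self.assertEqual(model.predict(self.test_trips), restored.predict(self.test_trips))
```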
…' into SurveyAssistgetModel
Updating branch with changes from master. Moved models from the eval repo to the server repo.
This will fail due to the testForest.py file. Changes here include: 1. Integrated the move of the randomForest model from eval to server. 2. Unit tests for model save and load. 3. Regression test for the RF model in testRandomForest.py.
try:
    if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
        with open(file_path, 'r') as f:
            prev_predictions_list = json.load(f)
        logging.debug("Loaded previous predictions for comparison")
        self.assertEqual(prev_predictions_list, curr_predictions_list,
                         "previous predictions should match current predictions")
    else:
        with open(file_path, 'w') as file:
            json.dump(curr_predictions_list, file, indent=4)
        logging.debug("Previous predictions stored for future matching")
except json.JSONDecodeError:
    logging.debug("JSONDecodeError while reading stored predictions")
This part is questionable. I wanted to know whether dumping (writing the predictions of the first-ever run to be used as ground truth) into a JSON file here is good practice or not.
The other option is to store the past predictions in the db, but I think the tear-down will clean it up while exiting, so I opted for JSON. Is there a way to protect the write? Let me know if I am missing something here.
raise Exception(f"test invariant failed, there should be no entries for user {self.user_id}")

# load in trips from a test file source
input_file = 'emission/tests/data/real_examples/shankari_2016-06-20.expected_confirmed_trips'
I opted for this file because it was used in one of the other tests for greedySimilarityBinning. However, for the user mentioned (above), there are just 6 trips. I wanted to confirm with you if I am free to use other files from this location.
The other files in this location are used in other tests so I don't see any reason why they cannot be used. As you point out, these are daily snapshots, so they are unlikely to give very high quality results, but there is no restriction on their use.
The tests from …
This is replaced by models.py (moved with history).
Improving the test file by changing the way previous predictions are stored.
Tested. Supposed to fail. Working on it.
Fixing circular import
Tested. Passed tests.
Will fail due to …
Although the current …
# for entry in entries:
#     self.assertGreater(len(entry["data"]["prediction"]), 0)
This is commented out for now since there are no predictions: no model is being built in this test. However, once we bring in the model building, there should be predictions here.
Waiting for updates on the read/write and the high-level discussion around why we are testing.
The changes in this iteration are improvements in the tests for the forest model: 1. Post discussion last week, the regression test was removed (`TestForestModel.py`) since it won't be useful when model performance improves. Rather, the structure of the predictions is checked; this check is merged with TestForestModel.py. 2. After e-mission#944, `predict_labels_with_n` in `run_model.py` expects a list and then iterates over it. The forest model and the rest of the tests were updated accordingly.
Tested. All tests should pass.
I believe this is in regard to the Forest model's read and write.
@humbleOldSage there are several comments that have not been addressed from the first review rounds. I have some additional comments on the tests as well. Please make sure that all pending comments are addressed before resubmitting for review.
import emission.analysis.modelling.trip_model.config as eamtc
import emission.storage.timeseries.builtin_timeseries as estb
import emission.storage.decorations.trip_queries as esdtq
from emission.analysis.modelling.trip_model.models import ForestClassifier
I still see this form of import, already flagged in #938 (comment) and #938 (comment), but not yet fixed.
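For reference, a sketch of the `import X as Y` form the reviewer is asking for here (the alias for the last line is illustrative):

```python
import emission.analysis.modelling.trip_model.config as eamtc
import emission.storage.timeseries.builtin_timeseries as estb
import emission.storage.decorations.trip_queries as esdtq
# instead of `from emission.analysis.modelling.trip_model.models import ForestClassifier`:
import emission.analysis.modelling.trip_model.models as eamtm

# call sites then go through the module alias, e.g.
# model = eamtm.ForestClassifier(**config)
```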
Fixed.
I would have preferred this to be called something other than models.py (maybe something like rf_for_label_models.py) but will not hold up the release for this change.
I haven't done this yet since this file has other models as well (it was brought in as-is from e-mission-eval-private-data). I can try to move the other models to a separate file (along with their blame history) and then change its name.
I assume this file is currently unused, it is certainly not tested. Is there a reason we need to include it in this PR?
Yes, this is not needed. The dbscan_svm.py file was created before we moved all the models-related code from e-mission-eval-private-data. After the move, this code is already present in models.py. Will remove this file.
    'max_features',
    'bootstrap',
]
cluster_expected_keys = [
why do we have cluster_expected_keys in the forest classifier file? The forest classifier should be separate from the cluster classifier.
This was used when we cluster the coordinates (loc_cluster parameter = True) before passing them to the Random Forest. However, since we are working with a configuration where loc_cluster is False for RF, I'll mention the same there and comment out this and other related parts.
self.model = ForestClassifier(loc_feature=config['loc_feature'],
                              radius=config['radius'],
                              size_thresh=config['radius'],
                              purity_thresh=config['purity_thresh'],
                              gamma=config['gamma'],
                              C=config['C'],
                              n_estimators=config['n_estimators'],
                              criterion=config['criterion'],
                              max_depth=maxdepth,
                              min_samples_split=config['min_samples_split'],
                              min_samples_leaf=config['min_samples_leaf'],
                              max_features=config['max_features'],
                              bootstrap=config['bootstrap'],
                              random_state=config['random_state'],
                              # drop_unclustered=False,
                              use_start_clusters=config['use_start_clusters'],
                              use_trip_clusters=config['use_trip_clusters'])
Isn't this just boilerplate code to copy parameters from one data structure to another? Why not use something like kwargs instead?
yeah, I have used kwargs now.
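A minimal sketch of the kwargs-style construction being referred to, assuming the config dict keys match the ForestClassifier constructor parameters (the key list and filtering are illustrative):

```python
# copy only the keys the classifier expects, then expand them as keyword arguments
forest_keys = ['loc_feature', 'radius', 'size_thresh', 'purity_thresh', 'gamma', 'C',
               'n_estimators', 'criterion', 'max_depth', 'min_samples_split',
               'min_samples_leaf', 'max_features', 'bootstrap', 'random_state',
               'use_start_clusters', 'use_trip_clusters']
forest_config = {k: config[k] for k in forest_keys if k in config}
self.model = ForestClassifier(**forest_config)
```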
# test data can be saved between test invocations, check if data exists before generating
ts = esta.TimeSeries.get_time_series(user_id)
test_data = list(ts.find_entries(["analysis/confirmed_trip"]))
why would test data be saved between test invocations? The tearDown method should clean up after the test is run
True. Here we are checking whether there were any trips left over after the tearDown of the last run. The comment was misleading; I have changed it to say so.
test_data = esda.get_entries(key="analysis/confirmed_trip", user_id=user_id, time_query=None)
if len(test_data) != self.total_trips:
    logging.debug(f'test invariant failed after generating test data')
    self.fail()
else:
moving this out of the setup would also ensure that we don't have to call self.fail()
Fixed an error on line 59.
moving this out of the setup would also ensure that we don't have to call self.fail()
Since this is a part of setup, wouldn't it be better if we ensure that the setup is successful and then proceed towards our tests?
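A small sketch of the alternative being suggested — asserting the fixture invariant in a test method instead of calling self.fail() inside setUp (the method name is illustrative; esda.get_entries and self.total_trips come from the snippets above):

```python
def testSetupGeneratedExpectedTrips(self):
    # verify the fixture that setUp created before exercising the model,
    # rather than failing from inside setUp itself
    test_data = esda.get_entries(key="analysis/confirmed_trip",
                                 user_id=self.user_id, time_query=None)
    self.assertEqual(len(test_data), self.total_trips,
                     "test invariant failed after generating test data")
```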
model_data = model.to_dict()
loaded_model_type = eamumt.ModelType.RANDOM_FOREST_CLASSIFIER
loaded_model = loaded_model_type.build(self.forest_model_config)
loaded_model.from_dict(model_data)
I am confused here. The model from load_stored_trip_model is called model, and the model generated by building is loaded_model?
Improved with better identifiers.
for attr in ['purpose_predictor','mode_predictor','replaced_predictor','purpose_enc','mode_enc','train_df']:
    assert isinstance(getattr(loaded_model.model,attr),type(getattr(model.model,attr)))
why are you only checking types instead of actual values?
Included a pre- and post-serialization/deserialization prediction assertion on a random user now.
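A sketch of one way to compare actual values rather than only types — the pandas and scikit-learn helpers shown are real, but the attribute list mirrors the snippet above and the overall assertion strategy is illustrative:

```python
import pandas.testing as pdt

# compare the stored training dataframe value by value ...
pdt.assert_frame_equal(loaded_model.model.train_df, model.model.train_df)

# ... and compare fitted estimator hyper-parameters instead of only their types
for attr in ['purpose_predictor', 'mode_predictor', 'replaced_predictor']:
    self.assertEqual(getattr(loaded_model.model, attr).get_params(),
                     getattr(model.model, attr).get_params())
```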
eamumt.ModelType.RANDOM_FOREST_CLASSIFIER.build()
# success if it didn't throw
why not get the model and check some properties instead of just assuming success if it didn't throw?
e.g. something like the below (pseudo-code, do not blindly merge)
- eamumt.ModelType.RANDOM_FOREST_CLASSIFIER.build()
- # success if it didn't throw
+ built_model = eamumt.ModelType.RANDOM_FOREST_CLASSIFIER.build()
+ self.assertEqual(built_model.get_type(), eamumt.ModelType.RANDOM_FOREST_CLASSIFIER.text)
+ # success if it didn't throw
I am now checking the type of the built model, along with the types of the other 5 attributes.
1. Improved tests in `TestForestModelLoadandSave.py` 2. Better comments, imports and cleanup.
Tagging @JGreenlee this is an MLOps change, we are changing the survey assist algorithm to use random forest instead of the old binning algorithm. As you can see, the code is set up to have a common structure for survey assist algorithms (find trips, train, save the model) and (load model, find prediction, store prediction). LMK if you need help understanding the structure. We already had an implementation for the binning algorithm, this is an implementation of a second algorithm. LMK if you have any high level questions related to this.
Note that one UI implication for this is that, after the change, we will always have inferred labels for trips. With the previous binning algorithm, if it was a novel trip, there would not be any matches. But the random forest will generate predictions for every input, hopefully flagging that it is crappy. I believe we already ignore predictions with too low probability, but we should verify that it works properly end to end. Again, LMK if you have high-level questions on the underlying models.
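A hedged sketch of the common structure described above — the interface every survey assist algorithm (binning or random forest) plugs into. The class and method names below are illustrative, not the exact server interface:

```python
from abc import ABC, abstractmethod

class SurveyAssistModel(ABC):
    """Common shape shared by the greedy binning model and the new forest model."""

    @abstractmethod
    def fit(self, trips):
        """Training pipeline: given the user's confirmed trips, train the model."""

    @abstractmethod
    def predict(self, trip):
        """Inference pipeline: return (predicted_labels, confidence) for one trip."""

    @abstractmethod
    def to_dict(self):
        """Serialize the trained state so the pipeline can save the model."""

    @abstractmethod
    def from_dict(self, model_dict):
        """Restore a previously saved model before generating predictions."""
```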
While testing model integration, 2 forest-model-specific features are added in the `TestForestModelIntegration.py` file rather than in the `entry.py` file.
2 more (total 4) forest-model-specific features are now added after generating random trips for testing purposes.
Forest-model-specific values added in test setup for random trips.
The test in testRandomForestTypePreservation was using an `allceodata`-specific user id. The tests on GitHub use a different db. Fixed by generating random samples.
@MukuFlash03 I would suggest sending it over to @JGreenlee for peer review before it comes to me.
1. Newline fixes - reverted removal of line a. util.py e-mission#938 (comment) b. run_model.py e-mission#938 (comment) 2. Import format changed to "import X as Y" instead of "from X import Y" Files: clustering.py, models.py, TestForestModelIntegration.py, TestForestModelLoadandSave.py, TestRunForestMode.py
1. Newline fixes - reverted removal of line a. util.py e-mission#938 (comment) b. run_model.py e-mission#938 (comment) 2. Import format changed to "import X as Y" instead of "from X import Y" Files: clustering.py, models.py, TestForestModelIntegration.py, TestForestModelLoadandSave.py, TestRunForestMode.py e-mission#938 (comment)
a. Removed check for remaining test data from previous test runs. - This should not be possible if data is cleared correctly in tearDown(). - Improved database clearing in tearDown() just to be sure. b. Moved model build to setUp() since all tests need this. - I did see Shankari's comment stating that model building is a heavyweight process (e-mission@104dd9a#r1486605432). - But it is anyways required by all tests and moving it to setUp() helps reduce duplicate code. c. Merged EqualityTest with the Type Preservation test. - Shankari had left a comment to check for values versus checking for types (e-mission#938 (comment)). - Satyam had added changes to check the predictions list after serialization and deserialization respectively. - However, this equality test was already being done in a previous test. - Hence merged these two. d. Merged the serialization and deserialization error handling tests. - These tests were identical and mock functions were being used to assert raised exceptions. - Merged these as well. Merging tests helps reduce the number of times we have to build the model, as for all tests the common steps involve building the model and fetching predictions.
…tance variables 1. Split up mock trips data into train / test data. - Saw that this was being done in one of the tests in TestForestModelLoadandSave.py itself as well as in TestGreedySimilarityBinning.py - Hence added it for all tests in forest model tests for uniformity. 2. Reduced number of instance variables since they were used inside setUp() only. This addresses review comment mentioned originally for TestForestModelIntegration e-mission#938 (comment) 3. Cleaned up TestForestModeIntegration.py - Added equality tests that check for prediction values generated in pipeline. Address review comment: e-mission#938 (comment) - Added train / test data split. - Removed check for empty data in setUp() Addresses review comment: e-mission#938 (comment)
[PARTIALLY TESTED] RF initialization and fit function. Build test written and tested. The Random Forest fit uses a DataFrame. A DBSCAN & SVM based approach was added in case we want to switch from coordinate-based to cluster-based. Currently, in config we are passing coordinates.