
Survey Assist Using RF #938

Open

wants to merge 30 commits into master
Conversation


@humbleOldSage commented Oct 2, 2023

[PARTIALLY TESTED] RF initialization and fit function. Build test written and tested. The Random Forest fit uses a DataFrame. A DBSCAN & SVM based approach is added in case we want to switch from a coordinates-based to a cluster-based approach. Currently, the config passes coordinates.

hlu109 and others added 8 commits August 12, 2022 19:33
…VM_decision_boundaries` compatible with changes in `clustering.py` and `mapping.py` files. Also porting these 3 notebooks to trip_model

`cluster_performance.ipynb`, `generate_figs_for_poster` and `SVM_decision_boundaries` now have no dependence on the custom branch. Plots are attached to show no difference between their previous and current outputs.
Unified interface for the fit function across all models. 'Entry'-type data is passed from the notebooks down to the binning functions. Default set to 'none'.
…results.py`

Prior to this update, `NaiveBinningClassifier` in 'models.py' had dependencies on both the tour model and the trip model. Now this classifier depends only on the trip model. All the other notebooks (except `classification_performance.ipynb`) were tested as well and work as usual.

Other minor fixes to support previous changes.
1. Removed mentions of `tour_model` or `tour_model_first_only`.

2. Removed two reads from the database.

3. Removed notebook outputs (this could be why a few diffs are too big to view).
RF initialisation and fit function. Build test written and tested. The Random Forest fit uses a DataFrame.
@humbleOldSage marked this pull request as draft October 3, 2023 00:30
Predict is now included. Just need to figure out model storage and model testing.
Model loading and storing are now improved, since only the required predictors and encoders are stored.

Regression test and null-value test included in the tests.
Contributor

@shankari left a comment

It is fine to include DBSCAN and clustering, but it should be in a different module instead of having a large bolus of code that is copy-pasted from one repo to another. DRY!

@@ -57,7 +57,6 @@ def update_trip_model(
logging.debug(f'time query for training data collection: {time_query}')

trips = _get_training_data(user_id, time_query)

Contributor

extraneous whitespace.

Contributor

I still see this extraneous whitespace

Contributor Author

@humbleOldSage Mar 5, 2024

Removed.

emission/analysis/modelling/trip_model/run_model.py (outdated)
Comment on lines 8 to 10
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
Contributor

We generally don't use `from X import Y` in our Python code.
Please change it to

Suggested change
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
import sklearn as skl

or

Suggested change
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
import sklearn.preprocessing as sklp

Contributor Author

this is now moved to models.py.

emission/analysis/modelling/trip_model/model_type.py (outdated)
Contributor

all of this is essentially copied over from
https://github.com/e-mission/e-mission-eval-private-data/blob/master/TRB_label_assist/models.py

At the very least, you should cite it.

Even better would be if you could import it directly with full history and commit it here directly so I could know that the code is the same as the paper. And then we could change the code in the notebook to call the code from the server, so that again, we will be testing the code that we publish.

Contributor Author

Followed this now. At the top of the models.py file, I have included the path it is taken from.

Contributor

@shankari Nov 3, 2023

Did you try "importing it directly with full history and committing it here"?
This seems to be copied over and re-committed, which was the "at the very least" option.

Also, when do you plan to finally finish e-mission/e-mission-eval-private-data#37 so that we can move on to

change the code in the notebook to call the code from the server, so that again, we will be testing the code that we publish.

Contributor Author

This is further updated to include the models.py commit history.

emission/tests/modellingTests/TestRunForestModel.py (outdated)
conf/analysis/trip_model.conf.json.sample
1. Switching a model is as simple as changing model_type in the config file.

2. ForestModel is now working. The main model is in the model.py file, which is copied from label_assist.

3. TestRunForestModel.py is working.

4. Regression tests in TestForestModel.py are still under construction.
Removed redundancies and unnecessary code segments.
Config copy not required.
Comment on lines 130 to 134
for attribute_name in attr:
    buffer = BytesIO()
    joblib.dump(getattr(self.model, attribute_name), buffer)
    buffer.seek(0)
    data[attribute_name] = buffer.getvalue()
Contributor

Is this the recommended approach to serialize a random forest model? I have serialized random forest models before and this does not look familiar. Happy to keep it if it is in fact the currently recommended method, but need a citation

Contributor Author

@humbleOldSage Nov 9, 2023

The method implemented above uses joblib, which is a recommended practice for scikit-learn models in their documentation. joblib and pickle were the only methods I knew, and I went with joblib. The documentation also recommends skops, but I didn't use it since I was unfamiliar with it.

I can move to skops or any other approach you recommend if required.
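
For context, a minimal, self-contained sketch of the joblib-through-BytesIO round trip under discussion (a standalone toy example, not the PR's actual ForestClassifier code):

import io

import joblib
import numpy as np
import sklearn.ensemble as ske

# train a small model on toy data
X = np.random.rand(20, 3)
y = np.random.randint(0, 2, size=20)
model = ske.RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# serialize the fitted model into an in-memory buffer (what the PR code does per attribute)
buffer = io.BytesIO()
joblib.dump(model, buffer)
buffer.seek(0)
serialized_bytes = buffer.getvalue()

# deserialize and confirm that predictions survive the round trip
restored = joblib.load(io.BytesIO(serialized_bytes))
assert (model.predict(X) == restored.predict(X)).all()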

Contributor

@shankari left a comment

I don't see any "testing done" filled in.
If you are adding a new test, it would be helpful to add logs showing that it works (in addition to the CI passing).
And the testing is very incomplete - it is not at all clear that this code will actually work in production.
And there is one very egregious error.
And it would be good to import model.py with full history.

Other than that, looks good.

:return: a vector containing features to predict from
:rtype: List[float]
"""
pass
Contributor

You should add a code comment here about why this is "pass"; a normal expectation would be that all models would extract features.

Contributor Author

Will do. It's because the pre-implemented models.py file handles extraction in this case.

Contributor Author

This is included in the upcoming commit.

Contributor

I don't see this code comment in here yet

Contributor Author

This is done.

Comment on lines 1 to 4
########################################################################
## Copied from /e-mission-eval-private-data/TRB_label_assist/models.py##
########################################################################

Contributor

Is this merged over with commit history? Does not appear to be

Contributor Author

Currently, it's not. I was unaware of this way of moving files.
I'll do it now.

Contributor Author

I did this and was able to move the models.py file with its commit history. However, models.py depends on the clustering.py and data_wrangling.py files in TRB_label_assist.

I can similarly copy the data_wrangling.py file (with its commit history), since it doesn't seem to have any further dependencies.

However, clustering.py has further dependencies on other files. The only functions from clustering.py required by models.py are get_distance_matrix and single_cluster_purity, and neither has further dependencies. So I can copy clustering.py (along with its commit history) to e-mission-server and remove the unnecessary bits.

@shankari Let me know if this strategy works.

Contributor

ok, so models.py was not a direct copy from TRB_label_assist, you edited it further?
I don't think you should copy clustering.py over - you should refactor it to pull out the two functions that are dependencies and copy only the file with the pulled out code.
However, note that you have pending changes to clustering.py
e-mission/e-mission-eval-private-data#37

You should probably resolve those first, and get that PR merged before refactoring.

Contributor

Note that I am also fine with importing the bulk of the python files implementing the algorithm in here. And just having the notebooks and the evaluation code in there

performance_eval.py
regenerate_classification_performance_results.py

This is particularly important if we want to retrain or reevaluate the model periodically.

Contributor Author

@humbleOldSage Nov 9, 2023

ok, so models.py was not a direct copy from TRB_label_assist, you edited it further?

Yes. In the previous version (when I simply dumped the code), I removed all the classes not required by the RF model, merged 'clustering.py' with the relevant class there, and moved a few functions into 'utils.py' here in e-mission-server.

In the current form, models.py (with its commit history) is the same as in eval-private-data, with no changes.

However, note that you have pending changes to clustering.py e-mission/e-mission-eval-private-data#37

You should probably resolve those first, and get that PR merged before refactoring.

The two functions get_distance_matrix and single_cluster_purity required here are not affected in that PR and will remain unaffected in the future.

Note that I am also fine with importing the bulk of the python files implementing the algorithm in here. And just having the notebooks and the evaluation code in there.

We can do this. There's just clustering.py, mapping.py, and data_wrangling.py to move (with their commit histories).

Contributor Author

However, as mentioned, we'll have to wait for e-mission/e-mission-eval-private-data#37 to be merged before we begin this move.

Comment on lines 185 to 187
# necessary values required by forest model
result_entry['data']['start_local_dt'] = result_entry['metadata']['write_local_dt']
result_entry['data']['end_local_dt'] = result_entry['metadata']['write_local_dt']
Contributor

wait what?! the trip's start and end timestamps are almost always guaranteed to not be the same as the time it was created. And by putting this into core/wrapper/entry.py, you are adding this to every single entry of any type, created in any way.

This is just wrong.

Contributor Author

This is done in the create_fake_entry function, which is used only for testing purposes:

$ grep -rl create_fake_entry | grep -v __pycache__
./emission/core/wrapper/entry.py
./emission/tests/modellingTests/modellingTestAssets.py

Since the start_local_dt and end_local_dt components were not required for greedy binning, they were not being added to the trips created via generate_mock_trips. For the Random Forest model pipeline, they need to be part of the Entry data that we generate.

Since TestRunRandomForest (which is just the first test file written) simply checks the running of the pipeline, any value for now would suffice.

These values are bound to evolve as we keep adding more and more tests as per the requirements.

Contributor

Since the start_local_dt and end_local_dt components were not required for greedy binning, they were not being added to the trips created via generate_mock_trips. For the Random Forest model pipeline, they need to be part of the Entry data that we generate.

In this case, you should add the start_local_dt and end_local_dt to the call site for create_fake_entry, not change the base implementation of create_fake_entry in the core. With this implementation, every entry created for every potential purpose will have the hack required for one unfinished test case. This is wrong.

These values are bound to evolve as we keep adding more and more tests as per the requirements.

The values should evolve in the tests, not in the core.
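
Purely for illustration, the call-site approach being suggested might look roughly like the following inside the forest-model test fixture (the mock_trips name is a placeholder, not code from this PR):

# hypothetical sketch: set the extra fields on the generated mock trips inside the
# test setup, instead of changing create_fake_entry in core/wrapper/entry.py
for entry in mock_trips:
    entry['data']['start_local_dt'] = entry['metadata']['write_local_dt']
    entry['data']['end_local_dt'] = entry['metadata']['write_local_dt']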

Comment on lines 16 to 17
"""these tests were copied forward during a refactor of the tour model
[https://github.com/e-mission/e-mission-server/blob/10772f892385d44e11e51e796b0780d8f6609a2c/emission/analysis/modelling/tour_model_first_only/load_predict.py#L114]
Contributor

I don't see what the Mac failures that happened several weeks ago have to do with an extra file. You can always remove the file if it is no longer relevant. But it is not removed, so is it relevant or not?

I was complaining about:
(1) the class name being different from the file name
(2) the comment being incorrect (the tests were hopefully not copied forward during the tour model refactor)

Comment on lines 16 to 17
"""these tests were copied forward during a refactor of the tour model
[https://github.com/e-mission/e-mission-server/blob/10772f892385d44e11e51e796b0780d8f6609a2c/emission/analysis/modelling/tour_model_first_only/load_predict.py#L114]
Contributor

Ignoring this for now while waiting for clarification.

Contributor

This test is a fine start, but:

  • I don't think it tests the code end to end.
    It invokes the code using import emission.analysis.modelling.trip_model.run_model as eamur, which is not the method that is run directly from the pipeline. I don't see where you are testing whether or not the pipeline invokes this code, and how well it works.
  • I also don't see any testing around load/store. You have completely new load and store methods; how do you know that they work, and how do you know that the data stored is sufficient to predict properly when loaded?

Contributor Author

Currently, I am focused on validating the core functionality of the model. This includes unit tests that are similar to the ones we have for other greedy binning modules, which will help us confirm that the model performs as expected in isolation.

I also don't see any testing around load/store. You have a completely new load and store method, how do you know that they work, and how do you know that the data stored is sufficient to predict properly when loaded.

I didn't think I'd need this, but I'll begin working on it.

Once I am confident that the model's internal operations are sound, I will proceed to develop tests for the integration with the pipeline.

Contributor Author

To test whether the model loads and stores correctly, these are the tests I am writing:

1. Round-Trip Test: serialize an object with to_dict and then immediately deserialize it with from_dict. After deserialization, the object should have the same state as the original (see the sketch after this list).

2. Consistency Test: check whether the serialization and deserialization process is consistent across multiple executions.

3. Integrity Test: verify that the serialized dictionary contains all necessary keys and that the deserialized object has all required attributes initialized correctly.

4. Error Handling Test: attempt to serialize and deserialize with incorrect or corrupted data to ensure that the methods handle errors gracefully.

5. Type Preservation Test: check whether the serialization and deserialization process maintains the data types of all model attributes.

6. Completeness Test: ensure that no parts of the model are lost during serialization or deserialization.
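
A minimal sketch of the round-trip test, assuming setUp trains self.model and provides self.forest_model_config and self.test_trips, and reusing the to_dict/from_dict and ModelType.RANDOM_FOREST_CLASSIFIER.build names that appear elsewhere in this PR; the predict call and fixture names are illustrative rather than the final test code:

import unittest

import emission.analysis.modelling.trip_model.model_type as eamumt

class TestForestModelRoundTrip(unittest.TestCase):
    # setUp (omitted here) is assumed to train self.model and to provide
    # self.forest_model_config and self.test_trips
    def testRoundTrip(self):
        # serialize the trained model, then restore it into a freshly built one
        model_data = self.model.to_dict()
        restored_model = eamumt.ModelType.RANDOM_FOREST_CLASSIFIER.build(self.forest_model_config)
        restored_model.from_dict(model_data)
        # the restored model should behave the same as the original
        self.assertEqual(self.model.predict(self.test_trips),
                         restored_model.predict(self.test_trips))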

@humbleOldSage
Contributor Author

TestForestModel.py is a relevant file; however, the code there is not in its final form.

TestForestModel.py has tests that I wrote on my personal system with no access to the data, and so no way to check whether what I wrote worked. I haven't been able to come back to this file and finish it. I have been committing whatever small changes I can so that I have the latest work on the next system I switch to; I have changed office machines three times in the past three weeks.

The plan is for this file to hold the regression test.
For now, its contents should be ignored.

Updating branch with changes from master. Moved models from the eval repo to the server repo.
This will fail due to the testForest.py file.
Changes here include:

1. Integrated the move of the random forest model from eval to server.

2. Unit tests for model save and load.

3. Regression test for the RF model in testRandomForest.py.
Comment on lines 121 to 132
try:
    if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
        with open(file_path, 'r') as f:
            prev_predictions_list = json.load(f)
        logging.debug("Loaded previous predictions for comparison")
        self.assertEqual(prev_predictions_list, curr_predictions_list,
                         "previous predictions should match current predictions")
    else:
        with open(file_path, 'w') as file:
            json.dump(curr_predictions_list, file, indent=4)
        logging.debug("Previous predictions stored for future matching")
except json.JSONDecodeError:
    logging.debug("JSONDecodeError while reading previous predictions")
Contributor Author

This part is questionable. I wanted to know whether dumping (writing the predictions of the first-ever run, to be used as ground truth) to a JSON file here is good practice or not.

The other option is to store the past predictions in the db, but I think the tear-down will clean them up while exiting, so I opted for JSON. Is there a way to protect the writes? Let me know if I am missing something here.

raise Exception(f"test invariant failed, there should be no entries for user {self.user_id}")

# load in trips from a test file source
input_file = 'emission/tests/data/real_examples/shankari_2016-06-20.expected_confirmed_trips'
Contributor Author

I opted for this file because it was used in another test for greedySimilarityBinning. However, for the user mentioned above, there are just 6 trips. I wanted to confirm with you whether I am free to use other files from this location.

Contributor

The other files in this location are used in other tests so I don't see any reason why they cannot be used. As you point out, these are daily snapshots, so they are unlikely to give very high quality results, but there is no restriction on their use.

@humbleOldSage
Contributor Author

The tests in the TestForestModel.py file will fail.

This is replaced by models.py (moved with history).
Improving the test file by changing the way previous predictions are stored.
@humbleOldSage
Contributor Author

Tested. Supposed to fail; working on it.

Fixing circular import
@humbleOldSage
Contributor Author

Tested. Passed Tests

@humbleOldSage
Contributor Author

Will Fail due to TestForestModelIntegration.py.

@humbleOldSage
Contributor Author

humbleOldSage commented Jan 10, 2024

Although the current TestForestModelIntegration.py test works, I believe a better way would be to load data (possibly real-world data), build a model during the test, then run the prediction pipeline end to end and see whether we got predictions at the infer_label stage.

Comment on lines 49 to 51
# for entry in entries:
#     self.assertGreater(len(entry["data"]["prediction"]), 0)

Contributor Author

This is commented out for now since there are currently no predictions, because no model is being built in this test. However, once we bring in model building, there should be predictions here.

@shankari
Contributor

shankari commented Feb 2, 2024

Waiting for updates on the read/write and the high level discussion around why we are testing.

The changes in this iteration are improvements to the tests for the forest model:

1. Following last week's discussion, the regression test was removed (`TestForestModel.py`), since it won't be useful when model performance improves. Instead, the structure of the predictions is checked; this check is merged into TestForestModel.py.

2. After e-mission#944, `predict_labels_with_n` in `run_model.py` expects a list and then iterates over it. The forest model and the rest of the tests were updated accordingly.
@humbleOldSage
Contributor Author

Tested. All tests should pass.

@humbleOldSage marked this pull request as ready for review February 5, 2024 06:32
@humbleOldSage
Contributor Author

.... updates on the read/write

I believe this is in regard to the forest model's read and write. TestForestModelLoadandSave.py handles this, where I tried to cover the testing as per my comment above: #938 (comment).

and the high level discussion around why we are testing.

  1. Foremost, to make sure that the introduced feature (the forest model in this case) works as intended, even for unusual and edge cases.
  2. Integration testing, to ensure it works with the existing pipeline and doesn't introduce any other issues.
  3. Using real-world data is good practice to see how the feature will perform for the end user.
  4. Additionally, real-world examples uncover complexities that might not be visible in simplified test cases.
  5. In our case, we can get an idea of the accuracy and reliability of the results that the user gets.

Contributor

@shankari left a comment

@humbleOldSage there are several comments that have not been addressed from the first review rounds. I have some additional comments on the tests as well. Please make sure that all pending comments are addressed before resubmitting for review.

import emission.analysis.modelling.trip_model.config as eamtc
import emission.storage.timeseries.builtin_timeseries as estb
import emission.storage.decorations.trip_queries as esdtq
from emission.analysis.modelling.trip_model.models import ForestClassifier
Contributor

I still see this form of import
Already flagged in #938 (comment)
and #938 (comment)
but not yet fixed

Contributor Author

Fixed.

Contributor

I would have preferred this to be called something other than models.py (maybe something like rf_for_label_models.py), but I will not hold up the release for this change.

Contributor Author

@humbleOldSage Mar 14, 2024

I haven't done this yet since this file has other models as well (it was brought in as-is from e-mission-eval-private-data). I can try to move the other models to a separate file (along with their blame history) and then change its name.

Contributor

I assume this file is currently unused; it is certainly not tested. Is there a reason we need to include it in this PR?

Contributor Author

@humbleOldSage Feb 29, 2024

Yes, this is not needed. The dbscan_svm.py file was created before we moved all of the model-related code from e-mission-eval-private-data. After the move, this code is already present in models.py. I will remove this file.

'max_features',
'bootstrap',
]
cluster_expected_keys= [
Contributor

why do we have cluster_expected_keys in the forest classifier file? The forest classifier should be separate from the cluster classifier.

Contributor Author

@humbleOldSage Feb 29, 2024

This was used when we cluster the coordinates (loc_cluster parameter = True) before passing them to the Random Forest. However, since we are working with a configuration where loc_cluster is False for RF, I'll note that there and comment out this and other related parts.

Comment on lines 58 to 74
self.model = ForestClassifier(loc_feature=config['loc_feature'],
                              radius=config['radius'],
                              size_thresh=config['radius'],
                              purity_thresh=config['purity_thresh'],
                              gamma=config['gamma'],
                              C=config['C'],
                              n_estimators=config['n_estimators'],
                              criterion=config['criterion'],
                              max_depth=maxdepth,
                              min_samples_split=config['min_samples_split'],
                              min_samples_leaf=config['min_samples_leaf'],
                              max_features=config['max_features'],
                              bootstrap=config['bootstrap'],
                              random_state=config['random_state'],
                              # drop_unclustered=False,
                              use_start_clusters=config['use_start_clusters'],
                              use_trip_clusters=config['use_trip_clusters'])
Contributor

Isn't this just boilerplate code to copy parameters from one data structure to another? Why not use something like kwargs instead?

Contributor Author

@humbleOldSage Mar 5, 2024

yeah, I have used kwargs now.
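
For reference, the kwargs-based construction discussed here could look roughly like the sketch below; ForestClassifier and the key names are taken from the snippet above, while the helper itself is illustrative and not actual PR code (note that the original call maps size_thresh to config['radius'] and max_depth to a preprocessed maxdepth, whereas a straight mapping is shown here for simplicity):

def build_forest_classifier(config):
    # forward the configured parameters as keyword arguments instead of
    # copying each one by hand
    expected_keys = [
        'loc_feature', 'radius', 'size_thresh', 'purity_thresh', 'gamma', 'C',
        'n_estimators', 'criterion', 'max_depth', 'min_samples_split',
        'min_samples_leaf', 'max_features', 'bootstrap', 'random_state',
        'use_start_clusters', 'use_trip_clusters',
    ]
    return ForestClassifier(**{k: config[k] for k in expected_keys})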

Comment on lines 43 to 45
# test data can be saved between test invocations, check if data exists before generating
ts = esta.TimeSeries.get_time_series(user_id)
test_data = list(ts.find_entries(["analysis/confirmed_trip"]))
Contributor

why would test data be saved between test invocations? The tearDown method should clean up after the test is run

Contributor Author

True. Here we are checking whether any trips were left over after the teardown of the last run. The comment was misleading; I have changed it to say that.

Comment on lines +74 to +78
test_data = esda.get_entries(key="analysis/confirmed_trip", user_id=user_id, time_query=None)
if len(test_data) != self.total_trips:
    logging.debug(f'test invariant failed after generating test data')
    self.fail()
else:
Contributor

Moving this out of the setup would also ensure that we don't have to call self.fail().

Contributor Author

Fixed an error on line 59.

Moving this out of the setup would also ensure that we don't have to call self.fail().

Since this is part of the setup, wouldn't it be better to ensure that the setup is successful and then proceed to our tests?

Comment on lines 305 to 308
model_data = model.to_dict()
loaded_model_type = eamumt.ModelType.RANDOM_FOREST_CLASSIFIER
loaded_model = loaded_model_type.build(self.forest_model_config)
loaded_model.from_dict(model_data)
Contributor

I am confused here. The model from load_stored_trip_model is called model and the model generated by building is loaded_model?

Contributor Author

Improved with better identifiers.

Comment on lines 311 to 313
for attr in ['purpose_predictor', 'mode_predictor', 'replaced_predictor', 'purpose_enc', 'mode_enc', 'train_df']:
    assert isinstance(getattr(loaded_model.model, attr), type(getattr(model.model, attr)))

Contributor

why are you only checking types instead of actual values?

Contributor Author

Included a pre- and post-serialization/deserialization prediction assertion on a random user now.

Comment on lines 95 to 96
eamumt.ModelType.RANDOM_FOREST_CLASSIFIER.build()
# success if it didn't throw
Contributor

why not get the model and check some properties instead of just assuming success if it didn't throw?
e.g. something like the below (pseudo-code, do not blindly merge)

Suggested change
eamumt.ModelType.RANDOM_FOREST_CLASSIFIER.build()
# success if it didn't throw
built_model = eamumt.ModelType.RANDOM_FOREST_CLASSIFIER.build()
self.assertEqual(built_model.get_type(), eamumt.ModelType.RANDOM_FOREST_CLASSIFIER.text)
# success if it didn't throw

Contributor Author

I am now checking the type of built model, along with the types of other 5 attributes.

1. Improved tests in `TestForestModelLoadandSave.py`

2. Better comments, imports and cleanup
@shankari
Contributor

Tagging @JGreenlee: this is an MLOps change; we are changing the survey assist algorithm to use random forest instead of the old binning algorithm. As you can see, the code is set up to have a common structure for survey assist algorithms (find trips, train, save the model) and (load model, find prediction, store prediction). LMK if you need help understanding the structure.

We already had an implementation for the binning algorithm, this is an implementation of a second algorithm. LMK if you have any high level questions related to this.

Note that one UI implication of this is that, after the change, we will always have inferred labels for trips. With the previous binning algorithm, if a trip was novel, there would not be any matches. But the random forest will generate predictions for every input, hopefully flagging when the prediction is low quality. I believe we already ignore predictions with too low a probability, but we should verify that this works properly end to end.

Again, LMK if you have high-level questions on the underlying models
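
As a purely illustrative aside (not code from this PR or the server), the low-probability filtering mentioned above amounts to something like the following; the threshold value and the prediction structure are assumptions and may differ from what e-mission actually uses:

CONFIDENCE_THRESHOLD = 0.25  # assumed cutoff, not taken from the e-mission config

def filter_low_confidence(predictions):
    # keep only inferred labels whose probability clears the cutoff;
    # each prediction is assumed to look like {"labels": {...}, "p": 0.42}
    return [p for p in predictions if p.get("p", 0) >= CONFIDENCE_THRESHOLD]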

While testing model integration, 2 forest-model-specific features are added in the `TestForestModelIntegration.py` file rather than in the `entry.py` file.
2 more (4 total) forest-model-specific features are now added after generating random trips for testing purposes.
Forest-model-specific values added in test setup for random trips.
The test in testRandomForestTypePreservation was using an `allceodata`-specific user id. The tests on GitHub use a different db. Fixed by generating random samples.
@shankari
Contributor

@MukuFlash03 I would suggest:

  • starting with a review of the existing code
  • cross-correlating it with my comments
  • addressing my comments

Sending it over to @JGreenlee for peer review before it comes to me.

MukuFlash03 pushed a commit to MukuFlash03/e-mission-server that referenced this pull request Nov 18, 2024
1. Newline fixes - reverted removal of line
a. util.py
e-mission#938 (comment)

b. run_model.py
e-mission#938 (comment)

2. Import format changed to "import X as Y" instead of "from X import Y"
Files: clustering.py, models.py, TestForestModelIntegration.py, TestForestModelLoadandSave.py, TestRunForestMode.py
MukuFlash03 pushed a commit to MukuFlash03/e-mission-server that referenced this pull request Nov 21, 2024
a. Removed check for remaining test data from previous test runs.
- This should not be possible if data is cleared correctly in tearDown().
- Improved database clearing in tearDown() just to be sure.

b. Moved model build to setup() since all tests need this
- I did see Shankari's comment stating that model building is a heavyweight process (e-mission@104dd9a#r1486605432)
- But it is anyway required by all tests, and moving it to setup helps reduce duplicate code.

c. Merged EqualityTest with Type Preservation test
- Shankari had left a comment to check for values versus checking for types (e-mission#938 (comment)).
- Satyam had added changes to check the predictions list after serialization and deserialization respectively.
- However this equality test was already being done in a previous test.
- Hence merged these two.

d. Merged Serialization and Deserialization error handling test.
- These tests were identical and mock functions were being used to assert raised exceptions.
- Merged these as well.

Merging tests helps reduce the number of times we have to build the model, as for all tests the common steps involve building the model and fetching predictions.
MukuFlash03 pushed a commit to MukuFlash03/e-mission-server that referenced this pull request Nov 22, 2024
…tance variables

1. Split up mock trips data into train / test data.
- Saw that this was being done in one of the tests in TestForestModelLoadandSave.py itself as well as in TestGreedySimilarityBinning.py
- Hence added it for all forest model tests for uniformity.

2. Reduced number of instance variables since they were used inside setUp() only.
This addresses review comment mentioned originally for TestForestModelIntegration
e-mission#938 (comment)

3. Cleaned up TestForestModeIntegration.py
- Added equality tests that check for prediction values generated in pipeline.
Addresses review comment:
e-mission#938 (comment)

- Added train / test data split.

- Removed check for empty data in setUp()
Addresses review comment:
e-mission#938 (comment)
Labels: None yet
Projects: Status: Issues being worked on

3 participants