Survey Assist Using RF #938

humbleOldSage · 2023-10-02T13:38:11Z

[PARTIALLY TESTED] RF initialization and fit function. Build test written and tested. fit of Random forest uses df . DBSCAN & SVM based approach added in case we want to switch from coordinates based to cluster based. Currently, in config we are passing coordinates.

…result generation more carefully

…VM_decision_boundaries` compatible with changes in `clustering.py` and `mapping.py` files. Also porting these 3 notebooks to trip_model `cluster_performance.ipynb`, `generate_figs_for_poster` and `SVM_decision_boundaries` now have no dependence on the custom branch. Results of plots are attached to show no difference in theie previous and current outputs.

Unified Interface for fit function across all models. Passing 'Entry' Type data from the notebooks till the Binning functions. Default set to 'none'.

…results.py` Prior to this update, `NaiveBinningClassifier` in 'models.py' had dependencies on both of tour model and trip model. Now, this classifier is completely dependent on trip model. All the other notebooks (except `classification_performance.ipynb`) were tested as well and they are working as usual. Other minor fixes to support previous changes.

1. removed mentions of `tour_model` or `tour_model_first_only` . 2. removed two reads from database. 3. Removed notebook outputs ( this could be the reason a few diffs are too big to view)

RF initialisation and fit function. Build test written and tested. fit of Random forest uses df .

Predict is now included. Just need to figure out model storage and model testing.

Model loading and storing is now improved since it just stores the required predictors and encoders. Regression test and null value test included in tests.

shankari

It is fine to include DBSCAN and clustering, but it should be in a different module instead of having a large bolus of code that is copy-pasted from one repo to another. DRY!

shankari · 2023-10-21T19:10:10Z

emission/analysis/modelling/trip_model/run_model.py

@@ -57,7 +57,6 @@ def update_trip_model(
        logging.debug(f'time query for training data collection: {time_query}')

        trips = _get_training_data(user_id, time_query)
-


extraneous whitespace.

I still see this extraneous whitespace

emission/analysis/modelling/trip_model/run_model.py

shankari · 2023-10-21T19:12:13Z

emission/analysis/modelling/trip_model/util.py

+from sklearn.preprocessing import OneHotEncoder
+from sklearn.pipeline import make_pipeline
+from sklearn.impute import SimpleImputer


we generally don't use from X import Y in our python code.
Please change to

Suggested change

from sklearn.preprocessing import OneHotEncoder

from sklearn.pipeline import make_pipeline

from sklearn.impute import SimpleImputer

import sklearn as skl

or

Suggested change

from sklearn.preprocessing import OneHotEncoder

from sklearn.pipeline import make_pipeline

from sklearn.impute import SimpleImputer

import sklearn.preprocessing as sklp

this is now moved to models.py.

I still see this form of import
https://github.com/e-mission/e-mission-server/pull/938/files#diff-9fa11fd632d4ab408567f01777eef3ae0a89b8f66af3cda8b749c58bd0d3d4b7R4

emission/analysis/modelling/trip_model/model_type.py

shankari · 2023-10-21T19:29:27Z

emission/analysis/modelling/trip_model/util.py

all of this is essentially copied over from
https://github.com/e-mission/e-mission-eval-private-data/blob/master/TRB_label_assist/models.py

At the very least, you should cite it.

Even better would be if you could import it directly with full history and commit it here directly so I could know that the code is the same as the paper. And then we could change the code in the notebook to call the code from the server, so that again, we will be testing the code that we publish.

followed this now. At the top of the models.py file , I have included the path its taken from.

Did you try "importing it directly with full history and committing it here"?
This seems to be copied over and re-committed which was the "at the very least"

Also, when you plan to finally finish e-mission/e-mission-eval-private-data#37 so that we can move on to

change the code in the notebook to call the code from the server, so that again, we will be testing the code that we publish.

This is further updated to include include models.py commit history.

emission/analysis/modelling/trip_model/greedy_similarity_binning.py

emission/tests/modellingTests/TestRunForestModel.py

conf/analysis/trip_model.conf.json.sample

emission/analysis/modelling/trip_model/forest_classifier.py

1. switching a model is as simple as changing model_type in config file 2. ForestModel is now working. Main model is in model.py file which is copied from label_assist 3. TestRunForestModel.py is working. 3. Regression test in TestForestmodel.py are still under construction.

Removed redundancies and unnecessary code segments.

Config copy not required.

emission/analysis/modelling/trip_model/forest_classifier.py

shankari · 2023-11-03T04:48:33Z

emission/analysis/modelling/trip_model/forest_classifier.py

+        for attribute_name in attr:
+            buffer=BytesIO()
+            joblib.dump(getattr(self.model,attribute_name),buffer)
+            buffer.seek(0)
+            data[attribute_name]=buffer.getvalue()


Is this the recommended approach to serialize a random forest model? I have serialized random forest models before and this does not look familiar. Happy to keep it if it is in fact the currently recommended method, but need a citation

The method implemented above utilizes joblib, which was a recommended practice, especially for scikit-learn models in their documentation. joblib and pickle were the only methods I knew, I went with jolib. However there is another recommendation skops in the documentation. I didnt use it since I was unfamiliar with it.

Can move to skops or any other way you recommend if required.

shankari · 2023-11-03T04:57:16Z

emission/analysis/modelling/trip_model/util.py

+from sklearn.preprocessing import OneHotEncoder
+from sklearn.pipeline import make_pipeline
+from sklearn.impute import SimpleImputer


I still see this form of import
https://github.com/e-mission/e-mission-server/pull/938/files#diff-9fa11fd632d4ab408567f01777eef3ae0a89b8f66af3cda8b749c58bd0d3d4b7R4

shankari

I don't see any testing done filled in.
If you are adding a new test, it would be helpful to add logs showing that it works (in addition to the CI passing)
And the testing is very incomplete - it is not at all clear that this code will actually work in production
And there is one very egregious error
And it would be good to import model.py with full history

Other than that, looks good.

shankari · 2023-11-03T04:59:03Z

emission/analysis/modelling/trip_model/forest_classifier.py

+        :return: a vector containing features to predict from
+        :rtype: List[float]
+        """
+        pass


you should add a code comment here about why this is "pass"
a normal expectation would be that all models would extract features.

will do. Its because the pre implemented models.py file handles extraction in this case.

This is included in the upcoming commit.

I don't see this code comment in here yet

This is done.

emission/analysis/modelling/trip_model/model_type.py

shankari · 2023-11-03T05:01:39Z

emission/analysis/modelling/trip_model/models.py

+########################################################################
+## Copied from /e-mission-eval-private-data/TRB_label_assist/models.py##
+########################################################################
+


Is this merged over with commit history? Does not appear to be

Currently, its Not. I was unaware of this way of moving files.
I'll do it now.

I did this and was able to move models.py file with its commit history. However, the models.py has its dependency on clustering.py and data_wrangling.py file in TRB_label_assist.

I can similarly copy data_wrangling.py( with its commit history) file since it doesn't seem to have any further dependencies.

However, clustering.py has further dependencies on other files. But the functions ( from the clustering.py) required by models.py are get_distance_matrix and single_cluster_purity . None of these two functions have further dependencies . So I can copy the clustering.py ( along with the commit history) to e-mission-server and remove the unnecessary bit.

@shankari Let me know if this strategy work

ok, so models.py was not a direct copy from TRB_label_assist, you edited it further?
I don't think you should copy clustering.py over - you should refactor it to pull out the two functions that are dependencies and copy only the file with the pulled out code.
However, note that you have pending changes to clustering.py
e-mission/e-mission-eval-private-data#37

You should probably resolve those first, and get that PR merged before refactoring.

Note that I am also fine with importing the bulk of the python files implementing the algorithm in here. And just having the notebooks and the evaluation code in there

performance_eval.py regenerate_classification_performance_results.py

This is particularly important if we want to retrain or reevaluate the model periodically.

ok, so models.py was not a direct copy from TRB_label_assist, you edited it further?

Yes, In the previous version (when I simply dumped the code) I removed all the classes not required by the RF model. I merged 'clustering.py' with the relevant class there and moved a few functions in 'utils.py' here in e-mission-server.

In the current form, model.py( with its commit history) is the same as in eval-private-data with no changes.

However, note that you have pending changes to clustering.py e-mission/e-mission-eval-private-data#37

You should probably resolve those first, and get that PR merged before refactoring.

The two functions get_distance_matrix and single_cluster_purity required here are not affected in that PR and will remain unaffected in future.

Note that I am also fine with importing the bulk of the python files implementing the algorithm in here. And just having the notebooks and the evaluation code in there.

We can do this. There's just clustering.py , mapping.py and data_wrangling.py to move ( with their commit histories).

However, as mentioned we'll have to wait for e-mission/e-mission-eval-private-data#37 to be merged before we begin this movement.

shankari · 2023-11-03T05:08:28Z

emission/core/wrapper/entry.py

+      #necessary values required by forest model
+      result_entry['data']['start_local_dt']=result_entry['metadata']['write_local_dt']
+      result_entry['data']['end_local_dt']=result_entry['metadata']['write_local_dt']


wait what?! the trip's start and end timestamps are almost always guaranteed to not be the same as the time it was created. And by putting this into core/wrapper/entry.py, you are adding this to every single entry of any type, created in any way.

This is just wrong.

This is done in create_fake_entry function, which is used only for testing purpose

$ grep -rl create_fake_entry | grep -v __pycache__ ./emission/core/wrapper/entry.py ./emission/tests/modellingTests/modellingTestAssets.py

Since start_local_dt and end_local_dt' components were not required for greedy binning, they were not being added to the trips added via generate_mock_trips. For Random Forest model pipeline, they need to be part of the Entry` data that we generate.

Since TestRunRandomForest (which is just the first test file written) simply checks the running of the pipeline, any value for now would suffice.

These values are bound to evolve as we keep adding more and more tests as per the requirements.

Since start_local_dt and end_local_dt' components were not required for greedy binning, they were not being added to the trips added via generate_mock_trips. For Random Forest model pipeline, they need to be part of the Entry` data that we generate.

In this case, you should add the start_local_dt and end_local_dt to the call site for create_fake_entry, not change the base implementation of create_fake_entry in the core. With this implementation, every entry created for every potential purpose will have the hack required for one unfinished test case. This is wrong.

These values are bound to evolve as we keep adding more and more tests as per the requirements.

The values should evolve in the tests, not in the core.

shankari · 2023-11-03T05:12:44Z

emission/tests/modellingTests/TestForestModel.py

+    """these tests were copied forward during a refactor of the tour model
+    [https://github.com/e-mission/e-mission-server/blob/10772f892385d44e11e51e796b0780d8f6609a2c/emission/analysis/modelling/tour_model_first_only/load_predict.py#L114]


I don't see what the MAC failures, that happened several weeks ago, have to do with an extra file. You can always remove the file if it is no longer relevant. But it is not removed, so is it relevant or not?

I was complaining about:
(1) the class name being different from the file name
(2) the comment being incorrect (the tests were hopefully not copied forward during the tour model refactor)

shankari · 2023-11-03T05:15:16Z

emission/tests/modellingTests/TestForestModel.py

+    """these tests were copied forward during a refactor of the tour model
+    [https://github.com/e-mission/e-mission-server/blob/10772f892385d44e11e51e796b0780d8f6609a2c/emission/analysis/modelling/tour_model_first_only/load_predict.py#L114]


Ignoring this for now while waiting for clarification.

shankari · 2023-11-03T05:30:22Z

emission/tests/modellingTests/TestRunForestModel.py

This test is a fine start, but:

I don't think it tests the code end to end.
It invokes the code using import emission.analysis.modelling.trip_model.run_model as eamur, which is not the method that is run directly from the pipeline. I don't see where you are testing whether or not the pipeline invokes this code, and how well it works

I also don't see any testing around load/store. You have a completely new load and store method, how do you know that they work, and how do you know that the data stored is sufficient to predict properly when loaded.

Currently, I am focused on validating the core functionality of the model. This includes unit tests that are similar to the ones we have for other greedy binning modules, which will help us confirm that the model performs as expected in isolation.

I also don't see any testing around load/store. You have a completely new load and store method, how do you know that they work, and how do you know that the data stored is sufficient to predict properly when loaded.

I didn't think I'll need this, but I'll begin working on it .

Once I am confident that the model's internal operations are sound, I will proceed to develop tests for the integration with the pipeline.

To test if the model loads and stores correctly, these are the tests I am wiritng :

1.Round-Trip Test: Serialize an object with to_dict and then immediately deserialize it with from_dict. After deserialization, the object should have the same state as the original.

2.Consistency Test: To Check if the serialization and deserialization process is consistent across multiple executions.

3.Integrity Test: To Verify that the serialized dictionary contains all necessary keys and that the deserialized object has all required attributes initialized correctly.

4.Error Handling Test: I'll attempt to serialize and deserialize with incorrect or corrupted data to ensure that the methods handle errors gracefully.

5.Type Preservation Test: I'll check if the serialization and deserialization process maintains the data types of all model attributes.

6.Completeness Test: To ensure that no parts of the model are lost during serialization or deserialization.

humbleOldSage · 2023-11-03T15:30:14Z

The TestForestModel.py is a relevant file, however the code there is not in its final form.

The TestForestModel.py file has tests that I wrote on my personal system with no access to the data and so no way to test if whatever I wrote worked or not. I haven't been able to come back to this file and further work on it completely. I have been committing whatever little bits and pieces of changes I am able to do in this file so that I have latest work on the next system I switch to. I have changed office machines thrice in the past 3 weeks.

The plan is this file will have the regression test.
For now, the contents of this file must be ignored.

…' into SurveyAssistgetModel

Updating branch with changes from master. Moved mdoels from eval repo to server repo.

This will fail due to testForest.py file. Changes here include : 1. Integrated the shifting of randomForest model from eval to server. 2. unit tests for Model save and load 3. RegressionTest for RF model in testRandomForest.py.

humbleOldSage · 2023-12-09T05:10:01Z

emission/tests/modellingTests/TestForestModel.py

+        try:
+            if os.path.exists(file_path) and os.path.getsize(file_path)>0:
+                with open(file_path, 'r') as f:
+                    prev_predictions_list = json.load(f)
+                    logging.debug()
+                    self.assertEqual(prev_predictions_list,curr_predictions_list," previous predictions should match current predictions")
+            else:
+                with open(file_path,'w') as file:
+                    json.dump(curr_predictions_list,file,indent=4)
+                    logging.debug("Previous predicitons stored for future matching" )
+        except json.JSONDecodeError:
+            logging.debug("jsonDecodeErrorError")


This part is questionable. I wanted to know if dumping ( writing predictions of first run ever to be used as ground truth) in a JSON here is a good practice or not?

Other option is to store the past predictions in db. But I think the tear-down will clean it up while exiting. So I opted for JSON. Is there a way to protect the writing. Let me know if I am missing something here.

humbleOldSage · 2023-12-09T05:11:47Z

emission/tests/modellingTests/TestForestModel.py

+            raise Exception(f"test invariant failed, there should be no entries for user {self.user_id}")
+
+        # load in trips from a test file source
+        input_file = 'emission/tests/data/real_examples/shankari_2016-06-20.expected_confirmed_trips'


I opted for this file because it was used in one other tests for greedySimilarityBinning. However, for the user mentioned (above), there are just 6 trips. I wanted to confirm with you if I am free to use other files from this location.

The other files in this location are used in other tests so I don't see any reason why they cannot be used. As you point out, these are daily snapshots, so they are unlikely to give very high quality results, but there is no restriction on their use.

humbleOldSage · 2023-12-09T05:13:34Z

The tests from TestForestModel.py file will fail.

This is replaced by models.py (movied with history)

Improving the test file by changing the way previpous predictions are stored.

humbleOldSage · 2023-12-13T22:49:59Z

tested. supposed to fail.working on it.

Fixing circular import

humbleOldSage · 2023-12-16T18:52:31Z

Tested. Passed Tests

humbleOldSage · 2024-01-02T23:55:00Z

Will Fail due to TestForestModelIntegration.py.

humbleOldSage · 2024-01-10T20:58:15Z

Although the current TestForestModelIntegration.py test works, I believe a better way would be to load data ( possibly real world) and then build a model during the test and then run the prediction pipeline entirely and see if we got predictions at the infer_label stage .

humbleOldSage · 2024-01-10T21:06:08Z

emission/tests/modellingTests/TestForestModelIntegration.py

+            # for entry in entries:
+            #     self.assertGreater(len(entry["data"]["prediction"]), 0)
+


This is commented for now since currently, there are no predictions since there is no model being built in this test. However, once we bring in the model building, there should be predictions here.

shankari · 2024-02-02T22:04:56Z

Waiting for updates on the read/write and the high level discussion around why we are testing.

The changes in this iteration are improvements in test for forest model : 1. Post discussion last week, the regression test was removed ( `TestForestModel.py` )since it won't be useful when model performance improves. Rather, the structures of predictions is checked. This check is merged with TestForestModel.py 2. After e-mission#944 , `predict_labels_with_n` in `run_model.py` expectes a lists and then iterates over it. The forest model and rest of the tests were updated accordingly.

humbleOldSage · 2024-02-05T06:31:42Z

Tested. All tests should pass.

humbleOldSage · 2024-02-05T06:54:25Z

.... updates on the read/write

I believe this is in regards with the Forest model's read and write. TestForestModelLoadandSave.py handles this where I tried to cover testing as per my comment above. #938 (comment) .

and the high level discussion around why we are testing.

Foremost, to make sure that the introduced feature ( forest model in this case) works as intended, even for unusual and edge cases.
Integration testing to ensure it works with existing pipeline and doesn't introduce any other issues.
I guess using real world data is a good practice to see how feature will perform for end user.
Additionally, the real world example uncovers certain complexities that might not be visible in simplified test cases.
In our case, we can get idea about the accuracy and reliability of the results that user gets.

shankari

@humbleOldSage there are several comments that have not been addressed from the first review rounds. I have some additional comments on the tests as well. Please make sure that alll pending comments are addressed before resubmitting for review.

shankari · 2024-02-12T04:47:38Z

emission/analysis/modelling/trip_model/forest_classifier.py

+import emission.analysis.modelling.trip_model.config as eamtc
+import emission.storage.timeseries.builtin_timeseries as estb
+import emission.storage.decorations.trip_queries as esdtq
+from emission.analysis.modelling.trip_model.models import ForestClassifier


I still see this form of import
Already flagged in #938 (comment)
and #938 (comment)
but not yet fixed

shankari · 2024-02-12T05:50:28Z

emission/analysis/modelling/trip_model/models.py

I would have preferred this to be called something other than models.py (maybe something like rf_for_label_models.py but will not hold up the release for this change.

I haven't done this yet since this file has other models as well ( this was brought in as it is from the e-mission-eval-private-data ). I can try to move other models to a separate file ( along with their blame history) and then change its name.

shankari · 2024-02-12T06:21:54Z

emission/analysis/modelling/trip_model/dbscan_svm.py

I assume this file is currently unused, it is certainly not tested. Is there a reason we need to include it in this PR?

yes, this is not needed. The file dbscan_svm.py file was created before we moved all models related code from e-mission-eval-private-data. After moving, this code is already present in models.py. Will remove this file.

shankari · 2024-02-12T06:28:44Z

emission/analysis/modelling/trip_model/forest_classifier.py

+            'max_features',
+            'bootstrap',
+        ]            
+        cluster_expected_keys= [


why do we have cluster_expected_keys in the forest classifier file? The forest classifier should be separate from the cluster classifier.

This was used when we cluster the coordinates (loc_cluster parameter = True) before passing to Random Forest. However, since we are working with configuration where loc_cluster is False with RF , I'll mention the same there and comment this and other related parts.

shankari · 2024-02-12T06:30:43Z

emission/analysis/modelling/trip_model/forest_classifier.py

+        self.model=ForestClassifier( loc_feature=config['loc_feature'],
+                                     radius= config['radius'],
+                                     size_thresh=config['radius'],
+                                     purity_thresh=config['purity_thresh'],
+                                     gamma=config['gamma'],
+                                     C=config['C'],
+                                     n_estimators=config['n_estimators'],
+                                     criterion=config['criterion'],
+                                     max_depth=maxdepth, 
+                                     min_samples_split=config['min_samples_split'],
+                                     min_samples_leaf=config['min_samples_leaf'],
+                                     max_features=config['max_features'],
+                                     bootstrap=config['bootstrap'],
+                                     random_state=config['random_state'],
+                                     # drop_unclustered=False,
+                                     use_start_clusters=config['use_start_clusters'],
+                                     use_trip_clusters=config['use_trip_clusters'])


Isn't this just boilerplate code to copy parameters from one datastructure to another? Why not use something like kwargs instead?

yeah, I have used kwargs now.

shankari · 2024-02-12T18:38:01Z

emission/tests/modellingTests/TestForestModelLoadandSave.py

+        # test data can be saved between test invocations, check if data exists before generating
+        ts = esta.TimeSeries.get_time_series(user_id)
+        test_data = list(ts.find_entries(["analysis/confirmed_trip"]))  


why would test data be saved between test invocations? The tearDown method should clean up after the test is run

True. Here we are checking if there were any trips left after tear down of last run. The comment was misleading. I have changed the comment to imply the same.

shankari · 2024-02-12T18:38:56Z

emission/tests/modellingTests/TestForestModelLoadandSave.py

+            test_data = esda.get_entries(key="analysis/confirmed_trip", user_id=user_id, time_query=None)
+            if len(test_data) != self.total_trips:
+                logging.debug(f'test invariant failed after generating test data')
+                self.fail()
+            else:


moving this out of the setup would also ensure that we don't have to call self.fall()

Fixed an error on line 59.

moving this out of the setup would also ensure that we don't have to call self.fall()

Since this is a part of setup, wouldn't it be better if we ensure that the setup is successful and then proceed towards our tests?

shankari · 2024-02-12T19:01:10Z

emission/tests/modellingTests/TestForestModelLoadandSave.py

+        model_data=model.to_dict()
+        loaded_model_type=eamumt.ModelType.RANDOM_FOREST_CLASSIFIER
+        loaded_model = loaded_model_type.build(self.forest_model_config)
+        loaded_model.from_dict(model_data)


I am confused here. The model from load_stored_trip_model is called model and the model generated by building is loaded_model?

Improved with with better identifiers

shankari · 2024-02-12T19:33:28Z

emission/tests/modellingTests/TestForestModelLoadandSave.py

+        for attr in ['purpose_predictor','mode_predictor','replaced_predictor','purpose_enc','mode_enc','train_df']:
+            assert isinstance(getattr(loaded_model.model,attr),type(getattr(model.model,attr)))
+


why are you only checking types instead of actual values?

Included a pre and post serialization-deserialisation prediction assertion on a random user now.

shankari · 2024-02-12T19:35:46Z

emission/tests/modellingTests/TestRunForestModel.py

+        eamumt.ModelType.RANDOM_FOREST_CLASSIFIER.build()
+        # success if it didn't throw


why not get the model and check some properties instead of just assuming success if it didn't throw?
e.g. something like the below (pseudo-code, do not blindly merge)

Suggested change

eamumt.ModelType.RANDOM_FOREST_CLASSIFIER.build()

# success if it didn't throw

built_model = eamumt.ModelType.RANDOM_FOREST_CLASSIFIER.build()

self.assertEqual(build_model.get_type(), eamumt.ModelType.RANDOM_FOREST_CLASSIFIER.text)

# success if it didn't throw

I am now checking the type of built model, along with the types of other 5 attributes.

1. Improved tests in `TestForestModelLoadandSave.py` 2. Better comments, imports nd cleanup

shankari · 2024-03-16T01:19:27Z

Tagging @JGreenlee this is an MLOps change, we are changing the survey assist algorithm to use random forest instead of the old binning algorithm. As you can see, the code is set up to have a common structure for survey assist algorithms (find trips, train, save the model) and (load model, find prediction, store prediction). LMK if you need help understanding the structure.

We already had an implementation for the binning algorithm, this is an implementation of a second algorithm. LMK if you have any high level questions related to this.

Note that one UI implication for this is that, after the change, we will always have inferred labels for trips. With the previous binning algorithm, if it was a novel trip, there would not be any matches. But the random forest will generate predictions for every input, hopefully flagging that it is crappy. I believe we already ignore predictions with too low probability, but we should verify that it works properly end to end.

Again, LMK if you have high-level questions on the underlying models

While testing model integration, 2 forest model features specific features are added in the `TestForestModelIntegration.py` file rather than in `entry.py` file.

2 more ( total 4) Forest model specific features are now added after generating random trips for testing purpose.

Forst model specific values added in test setup for random Trips

test in testRandomForestTypePreservation as using `allceodata` specific user id. The tests on github use different db. Fixed by generating random samples.

shankari · 2024-10-29T20:10:55Z

@MukuFlash03 I would suggest:

starting with a review of the existing code
cross-correlating it with my comments
addressing my comments

Sending it over to @JGreenlee for peer review before it comes it me.

1. Newline fixes - reverted removal of line a. util.py e-mission#938 (comment) b. run_model.py e-mission#938 (comment) 2. Import format changed to "import X as Y" instead of "from X import Y" Files: clustering.py, models.py, TestForestModelIntegration.py, TestForestModelLoadandSave.py, TestRunForestMode.py

1. Newline fixes - reverted removal of line a. util.py e-mission#938 (comment) b. run_model.py e-mission#938 (comment) 2. Import format changed to "import X as Y" instead of "from X import Y" Files: clustering.py, models.py, TestForestModelIntegration.py, TestForestModelLoadandSave.py, TestRunForestMode.py e-mission#938 (comment)

a. Removed check for remaining test data from previous test runs. - This should not be possible if data is cleared correctly in tearDown(). - Improved database clearing in tearDown() just to be sure. b. Moved model build to setup() since all tests need this - I did see Shankari's comment stating that model building is a heavyweight process (e-mission@104dd9a#r1486605432) - But it is anyways required by all tests and moving it to setup helps reduce duplicate code. c. Merged EqualityTest with Type Preservation test - Shankari had left a comment to check for values versus checking for types (e-mission#938 (comment)). - Satyam had added changes to check the predictions list after serialization and deserialization respectively. - However this equality test was already being done in a previous test. - Hence merged these two. d. Merged Serialization and Deserialization error handling test. - These tests were identical and mock functions were being used to assert raised exceptions. - Merged these as well. Merging tests helps reduce the number of types we have to build the model as for all tests the common steps involve building model and fetching predictions.

…tance variables 1. Split up mock trips data into train / test data. - Saw that this was being done in one of the tests in TestForestModelLoadandSave.py itself as well as in TestGreedySimilarityBinning.py - Hence added it for all tests in forest model tests for uniformity. 2. Reduced number of instance variables since they were used inside setUp() only. This addresses review comment mentioned originally for TestForestModelIntegration e-mission#938 (comment) 3. Cleaned up TestForestModeIntegration.py - Added equality tests that check for prediction values generated in pipeline. Address review comment: e-mission#938 (comment) - Added train / test data split. - Removed check for empty data in setUp() Addresses review comment: e-mission#938 (comment)

hlu109 and others added 8 commits August 12, 2022 19:33

add copy of code used in TRB paper

6471a96

update user uuid lookup; add documentation note

9d6a1af

Add additional logging to the calculation so that we can monitor the …

9bd9b18

…result generation more carefully

Unified Interface for fit function

e7d2a14

Unified Interface for fit function across all models. Passing 'Entry' Type data from the notebooks till the Binning functions. Default set to 'none'.

[PARTIALLY TESTED] Single database read and Code Cleanuo

0adb5fe

1. removed mentions of `tour_model` or `tour_model_first_only` . 2. removed two reads from database. 3. Removed notebook outputs ( this could be the reason a few diffs are too big to view)

[PARTIALLY TESTED] Survey Assist Using RF

e9abd51

RF initialisation and fit function. Build test written and tested. fit of Random forest uses df .

humbleOldSage marked this pull request as draft October 3, 2023 00:30

humbleOldSage added 2 commits October 3, 2023 09:26

[NOT TESTED]Predict implemented

3820d87

Predict is now included. Just need to figure out model storage and model testing.

[NOT TESTED] Model storage and Model Testing included

5b2572e

Model loading and storing is now improved since it just stores the required predictors and encoders. Regression test and null value test included in tests.

shankari requested changes Oct 21, 2023

View reviewed changes

humbleOldSage requested a review from shankari November 2, 2023 15:33

humbleOldSage added 2 commits November 2, 2023 18:00

Minor fixes

1d7be5a

Removed redundancies and unnecessary code segments.

Delete Config file

b3d0db2

Config copy not required.

shankari reviewed Nov 3, 2023

View reviewed changes

shankari requested changes Nov 3, 2023

View reviewed changes

humbleOldSage added 5 commits November 3, 2023 17:38

Merge remote-tracking branch 'e-mission-eval-private-data/move-models…

c514fe0

…' into SurveyAssistgetModel

removedfile

3b038a9

Update model.py

94fc848

Merge branch 'master' into SurveyAssistUsingRandomForest

87f109c

Updating branch with changes from master. Moved mdoels from eval repo to server repo.

humbleOldSage commented Dec 9, 2023

View reviewed changes

humbleOldSage added 3 commits December 9, 2023 00:25

minor fix

01fcb2a

Delete model.py

f5fec64

This is replaced by models.py (movied with history)

Update TestForestModel.py

585cc90

Improving the test file by changing the way previpous predictions are stored.

Minor Fixes

61bbe3f

Fixing circular import

[Tested]Adding Integration test

a32ce4f

Improving test

052cb08

humbleOldSage commented Jan 10, 2024

View reviewed changes

humbleOldSage marked this pull request as ready for review February 5, 2024 06:32

shankari requested changes Feb 12, 2024

View reviewed changes

[Tested] Improvements for model integration

1b523ed

1. Improved tests in `TestForestModelLoadandSave.py` 2. Better comments, imports nd cleanup

humbleOldSage added 4 commits March 20, 2024 23:44

Forest Model related data additions

35a1346

While testing model integration, 2 forest model features specific features are added in the `TestForestModelIntegration.py` file rather than in `entry.py` file.

Update TestForestModelIntegration.py

19bb394

2 more ( total 4) Forest model specific features are now added after generating random trips for testing purpose.

[TESTED] Updated ForestModelLoadAndSave.py

450094c

Forst model specific values added in test setup for random Trips

Fixing TestForestModelLoadandSave.py

ad968de

test in testRandomForestTypePreservation as using `allceodata` specific user id. The tests on github use different db. Fixed by generating random samples.

MukuFlash03 mentioned this pull request Nov 22, 2024

Cleanup PR for existing Survey Assist RF PR #938 #995

Open

		@@ -57,7 +57,6 @@ def update_trip_model(
		logging.debug(f'time query for training data collection: {time_query}')

		trips = _get_training_data(user_id, time_query)

		"""these tests were copied forward during a refactor of the tour model
		[https://github.com/e-mission/e-mission-server/blob/10772f892385d44e11e51e796b0780d8f6609a2c/emission/analysis/modelling/tour_model_first_only/load_predict.py#L114]

		# for entry in entries:
		# self.assertGreater(len(entry["data"]["prediction"]), 0)

		for attr in ['purpose_predictor','mode_predictor','replaced_predictor','purpose_enc','mode_enc','train_df']:
		assert isinstance(getattr(loaded_model.model,attr),type(getattr(model.model,attr)))

		eamumt.ModelType.RANDOM_FOREST_CLASSIFIER.build()
		# success if it didn't throw

Survey Assist Using RF #938

Are you sure you want to change the base?

Survey Assist Using RF #938

Conversation

humbleOldSage commented Oct 2, 2023 • edited Loading

shankari left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

humbleOldSage Mar 5, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shankari Nov 3, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shankari Nov 3, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

humbleOldSage Nov 9, 2023 • edited Loading

Choose a reason for hiding this comment

shankari Nov 3, 2023 • edited Loading

Choose a reason for hiding this comment

shankari left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

humbleOldSage Nov 9, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

humbleOldSage commented Nov 3, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

humbleOldSage commented Dec 9, 2023

humbleOldSage commented Dec 13, 2023

humbleOldSage commented Dec 16, 2023

humbleOldSage commented Jan 2, 2024

humbleOldSage commented Jan 10, 2024 • edited Loading

Choose a reason for hiding this comment

shankari commented Feb 2, 2024

humbleOldSage commented Feb 5, 2024

humbleOldSage commented Feb 5, 2024

shankari left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

humbleOldSage Mar 14, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

humbleOldSage Feb 29, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

humbleOldSage Feb 29, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

humbleOldSage Mar 5, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

humbleOldSage commented Oct 2, 2023 •

edited

Loading

humbleOldSage Mar 5, 2024 •

edited

Loading

shankari Nov 3, 2023 •

edited

Loading

shankari Nov 3, 2023 •

edited

Loading

humbleOldSage Nov 9, 2023 •

edited

Loading

shankari Nov 3, 2023 •

edited

Loading

shankari left a comment •

edited

Loading

humbleOldSage Nov 9, 2023 •

edited

Loading

humbleOldSage commented Jan 10, 2024 •

edited

Loading

humbleOldSage Mar 14, 2024 •

edited

Loading

humbleOldSage Feb 29, 2024 •

edited

Loading

humbleOldSage Feb 29, 2024 •

edited

Loading

humbleOldSage Mar 5, 2024 •

edited

Loading