Update clustering.py #37

Merged: 21 commits, Nov 25, 2023

Conversation

Contributor

@humbleOldSage humbleOldSage commented Aug 11, 2023

This PR changes clustering.py to depend on the main branch's trip_model instead of hlu09's tour_model_extended. The type of data passed to the fit function still needs to change for this to work; that spot is marked with a TODO. Explained in detail at #35 (comment)

Changes in clustering.py file to shift dependency from hlu09's  tour_model_extended to main branch trip_model. Still need to change type of data being passed to fit function for this to work.
All dependencies of this notebook on the custom branch are removed. There currently seem to be no errors while generating maps in the clustering_examples notebook.
@humbleOldSage humbleOldSage requested a review from shankari August 16, 2023 00:21
With these changes, no change in e-mission-server should be required.
Contributor

@shankari shankari left a comment

Much better....

Have you run the code?
Please indicate "testing done".
Do you get the same graphs as the paper?
Note that the check is not "do I get a graph", it is "do I get the same graph"

TRB_label_assist/clustering.py Outdated Show resolved Hide resolved
TRB_label_assist/clustering.py Outdated Show resolved Hide resolved
TRB_label_assist/clustering.py Outdated Show resolved Hide resolved
TRB_label_assist/clustering.py Outdated Show resolved Hide resolved
TRB_label_assist/clustering.py Outdated Show resolved Hide resolved
TRB_label_assist/clustering.py Show resolved Hide resolved
@humbleOldSage
Contributor Author

humbleOldSage commented Aug 17, 2023

Have you run the code?

Yes

Do you get the same graphs as the paper?

Yet to confirm

Please indicate "testing done".

Ongoing. My plan is to compare the labels generated by the custom branch with those generated by the master branch. Matching labels would verify that the two branches behave the same way.

Is there any other way I can test this?
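
A minimal sketch of the label comparison described above, assuming each branch yields one cluster label per trip, in the same trip order. The function name `same_partition` is hypothetical. Cluster ids may be numbered differently between runs, so the check compares the grouping rather than the raw ids.

```python
def same_partition(labels_a, labels_b):
    """Return True when two label sequences group the trips identically,
    even if the cluster ids themselves are numbered differently."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each label on one side must map to exactly one label on the other.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


# Hypothetical labels from the custom branch and the master branch.
custom_branch_labels = [0, 0, 1, 2]
master_branch_labels = [5, 5, 3, 1]
print(same_partition(custom_branch_labels, master_branch_labels))
```

A check like this only confirms that the clusters agree; comparing the rendered graphs against the paper is still a separate step.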

@humbleOldSage
Contributor Author

humbleOldSage commented Aug 17, 2023

Do you get the same graphs as the paper?

They differ. Let me check why this is happening.

Passing the way of clustering to e-mission-server. It was 'origin-destination' by default. It can now take one of three values: 'origin', 'destination' or 'origin-destination'.
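
A minimal sketch of how the three clustering values could select which coordinates are compared, assuming trip features are laid out as `[start_lon, start_lat, end_lon, end_lat]`. The names `select_features` and `clustering_way` are hypothetical and are not taken from the e-mission-server code.

```python
VALID_CLUSTERING_WAYS = ("origin", "destination", "origin-destination")


def select_features(od_features, clustering_way="origin-destination"):
    """Pick the coordinates used for clustering.

    od_features is assumed to be [start_lon, start_lat, end_lon, end_lat].
    """
    if clustering_way not in VALID_CLUSTERING_WAYS:
        raise ValueError(
            f"clustering_way must be one of {VALID_CLUSTERING_WAYS}, "
            f"got {clustering_way!r}"
        )
    if clustering_way == "origin":
        return list(od_features[:2])   # start point only
    if clustering_way == "destination":
        return list(od_features[2:])   # end point only
    return list(od_features)           # both start and end points


trip = [-122.0, 37.0, -122.1, 37.1]
print(select_features(trip, "destination"))
```

Keeping 'origin-destination' as the default preserves the earlier behavior for callers that do not pass a value.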
@humbleOldSage
Contributor Author

Tested. This is running with no errors.
Can confirm this generates the same results.

@humbleOldSage humbleOldSage requested a review from shankari August 20, 2023 16:59
Contributor

@shankari shankari left a comment

this particular change seems fine if it works.

  • Turned out to be pretty simple after all?!
  • I would like to see more information in the PR issue that it works (screenshots, information about the model indicating that it works)
  • is this the only notebook that is affected by the change? I know that we have a notebook which generates the performance (accuracy/F-score) of various algorithms. I would expect that it would also need to be changed...

TRB_label_assist/clustering.py Show resolved Hide resolved
@humbleOldSage
Contributor Author

humbleOldSage commented Aug 20, 2023

  • is this the only notebook that is affected by the change? I know that we have a notebook which generates the performance (accuracy/F-score) of various algorithms. I would expect that it would also need to be changed...

Almost all the other notebooks have dependencies on this module.

@humbleOldSage humbleOldSage marked this pull request as draft August 20, 2023 19:31
previous suggestions to improve readability.
This reverts commit 3e19b32.
@humbleOldSage
Contributor Author

  • I would like to see more information in the PR issue that it works (screenshots, information about the model indicating that it works)

Screenshot from the latest run, showing no errors.

screencapture-localhost-8888-notebooks-clustering-examples-ipynb-2023-08-20-16_14_14

@humbleOldSage
Contributor Author

Left is current result. Right is from research paper.
Suburban 50m.

Suburban 100m

Suburban 150m

@humbleOldSage
Contributor Author

Left is current result. Right is from research paper.
College 50m.

College 100m

College 150m

Suggestions from previous comments to improve readability.
@humbleOldSage humbleOldSage marked this pull request as ready for review August 20, 2023 23:08
@humbleOldSage humbleOldSage requested a review from shankari August 20, 2023 23:09
@humbleOldSage humbleOldSage marked this pull request as draft August 22, 2023 03:25
@humbleOldSage humbleOldSage marked this pull request as ready for review August 22, 2023 18:45
…VM_decision_boundaries` compatible with changes in `clustering.py` and `mapping.py` files. Also porting these 3 notebooks to trip_model

`cluster_performance.ipynb`, `generate_figs_for_poster` and `SVM_decision_boundaries` now have no dependence on the custom branch. Results of plots are attached to show no difference between their previous and current outputs.
@humbleOldSage humbleOldSage marked this pull request as draft August 22, 2023 20:17
`Classification_performance` and `regenerate_classification_performance_results.py` are not tested yet as they would take too long to run. The itertools removal in these two files is tested in other notebooks and it works.  Other files, like models.py will be tested once  any of the above two are run.
[Partially Tested] Suggested changes implemented
bb404e9
`Classification_performance` and `regenerate_classification_performance_results.py` are not tested yet as they would take too long to run. The itertools removal in these two files is tested in other notebooks and it works. Other files, like models.py will be tested once any of the above two are run.
@humbleOldSage
Contributor Author

Since this is only partially tested, I'll keep the PR as a draft. As soon as I have completed the final testing, I'll mark it as ready to merge.

Contributor

@shankari shankari left a comment

Almost done, just a few minor changes

TRB_label_assist/SVM_decision_boundaries.ipynb Outdated Show resolved Hide resolved
TRB_label_assist/SVM_decision_boundaries.ipynb Outdated Show resolved Hide resolved
TRB_label_assist/models.py Outdated Show resolved Hide resolved
TRB_label_assist/models.py Show resolved Hide resolved
TRB_label_assist/models.py Outdated Show resolved Hide resolved
TRB_label_assist/models.py Outdated Show resolved Hide resolved
Fixed names of variables to be more self-explanatory
@humbleOldSage
Contributor Author

Not tested. Needs Testing.

Contributor

@shankari shankari left a comment

Even smaller cleanups

TRB_label_assist/clustering_examples.ipynb Outdated Show resolved Hide resolved
TRB_label_assist/clustering_examples.ipynb Outdated Show resolved Hide resolved
TRB_label_assist/generate_figs_for_poster.ipynb Outdated Show resolved Hide resolved
TRB_label_assist/get_performance_for_poster.ipynb Outdated Show resolved Hide resolved
1. Change in models file according to changes in greedy_similarity_binning in e-mission-server

2. Minor fixes
@humbleOldSage
Contributor Author

generate_figs_from_poster.ipynb :

plot after latest testing

Screenshot 2023-11-16 at 12 01 08 PM

snap from the research paper :
Screenshot 2023-11-16 at 3 40 13 PM

plot after latest testing
Screenshot 2023-11-16 at 12 01 40 PM

snaps from the research paper :
Screenshot 2023-11-16 at 3 43 06 PM

@humbleOldSage
Contributor Author

generate_figs_for_poster.ipynb :

On the left are plots from the current testing; on the right are images from runs of the notebook with @hlu109's custom branch:

naive fixed-width clustering from the first user's data

150m

50m

100m

DBSCAN without SVM: home cluster with a blue cluster to the south that was merged in

DBSCAN + SVM: home cluster and blue cluster to the south have been separated

@humbleOldSage
Contributor Author

Clustering_example.ipynb

Left is current test result. Right is from research paper.
Suburban 50m.

Suburban 100m

Suburban 150m

Left is current result. Right is from research paper.
College 50m.

College 100m

College 150m

@humbleOldSage
Contributor Author

SVM_decision_boundary.ipynb :

On the left are plots from current test, on the right are plots from old runs :

@humbleOldSage
Contributor Author

get_cluster_performance.ipynb :

For each pair, top one is the result of current test, bottom one is result from older runs :

output2

Screen Shot 2023-08-22 at 1 05 03 AM

output3

Screen Shot 2023-08-22 at 1 05 34 AM

output4

Screen Shot 2023-08-22 at 1 05 44 AM

output5

Screen Shot 2023-08-22 at 1 05 52 AM

@humbleOldSage humbleOldSage marked this pull request as ready for review November 16, 2023 22:45
@humbleOldSage
Contributor Author

All model results :

Screenshot 2023-11-16 at 5 53 00 PM

Contributor

@shankari shankari left a comment

@humbleOldSage I don't think you have addressed several of the prior review comments. They are fairly simple, so you might have just missed them - please check the review history carefully.

Please make sure that all comments are addressed before marking as ready for review.

TRB_label_assist/clustering_examples.ipynb Outdated Show resolved Hide resolved
TRB_label_assist/models.py Show resolved Hide resolved
@@ -378,13 +405,19 @@ def _distance_helper(self, tripa, tripb, loc_type):

copied from the Similarity class on the e-mission-server.
Contributor

Now that we are no longer on a custom branch, can't the distance calculation in e-mission-server be re-used here? Why do we need copy-pasted code? I am generally against copy-pasting, to support DRY. Since this change has already had multiple revisions, I am OK with deferring this to the next PR, but I want to make sure that it is not forgotten.

Contributor Author

@humbleOldSage humbleOldSage Nov 21, 2023

We are indeed using e-mission-server.

Line 416:

  dist= ecc.calDistance([pta_lon,pta_lat],[ptb_lon,ptb_lat])     

ecc here is on e-mission-server :

  import emission.core.common as ecc

That's an old comment from Hannah that we should remove.

Contributor Author

@humbleOldSage humbleOldSage Nov 22, 2023

While predicting in greedy_similarity_binning.py on e-mission-server, the flow goes like this:

predict -> _nearest_bin -> similar (in e-mission-server/emission/analysis/modelling/similarity/similarity_metric.py) -> similarity (in e-mission-server/emission/analysis/modelling/similarity/od_similarity.py) -> ecc.calDistance.

And so currently we are using this ecc.calDistance.
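
The call chain above can be sketched roughly as follows. This is a simplified, self-contained stand-in and not the e-mission-server code: `haversine_m` is a generic great-circle formula used in place of `ecc.calDistance`, the bin layout and the matching rule in `nearest_bin` are assumptions, and features are assumed to be `[start_lon, start_lat, end_lon, end_lat]`.

```python
import math


def haversine_m(pt_a, pt_b):
    """Great-circle distance in meters between two [lon, lat] points
    (stand-in for the distance function at the end of the chain)."""
    lon_a, lat_a, lon_b, lat_b = map(math.radians, (*pt_a, *pt_b))
    h = (math.sin((lat_b - lat_a) / 2) ** 2
         + math.cos(lat_a) * math.cos(lat_b)
         * math.sin((lon_b - lon_a) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))


def similar(features_a, features_b, threshold_m):
    """Two trips are similar when both their origins and their
    destinations lie within threshold_m of each other."""
    return (haversine_m(features_a[:2], features_b[:2]) <= threshold_m
            and haversine_m(features_a[2:], features_b[2:]) <= threshold_m)


def nearest_bin(trip_features, bins, threshold_m):
    """Return the id of the first bin whose stored trips are all similar
    to the given trip, or None when no bin matches."""
    for bin_id, stored in bins.items():
        if all(similar(trip_features, f, threshold_m) for f in stored):
            return bin_id
    return None


def predict(trip_features, bins, threshold_m=100):
    """Look up the matching bin for a trip; None means no prediction."""
    return nearest_bin(trip_features, bins, threshold_m)


bins = {"0": [[-122.0, 37.0, -122.1, 37.1]],
        "1": [[-120.0, 35.0, -120.1, 35.1]]}
print(predict([-122.0001, 37.0001, -122.1, 37.1], bins))
```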

Contributor

@shankari shankari Nov 23, 2023

I think that you have misunderstood the comment. The comment is not about ecc.calDistance - it is clear that calDistance is from
This says that _distance_helper is copied from e-mission-server, which it is:

$ grep -r distance_helper emission/
emission//analysis/modelling/tour_model_first_only_orig/similarity.py:            if not self.distance_helper(a, b):
emission//analysis/modelling/tour_model_first_only_orig/similarity.py:    def distance_helper(self, a, b):
emission//analysis/modelling/tour_model/similarity.py:            if not self.distance_helper(a, b):
emission//analysis/modelling/tour_model/similarity.py:    def distance_helper(self, a, b):
emission//incomplete_tests/TestSimilarity.py:                    self.assertTrue(sim.distance_helper(b,c))

However, the implementation does seem to be a bit different, and I don't see a function with this name in trip_model. There must be an equivalent in trip_model to calculate distances between trips, which we should reuse here.

Contributor Author

@humbleOldSage humbleOldSage Nov 24, 2023

Other than ecc.calDistance, there is just pre-processing in _distance_helper, in the form of coordinate extraction:

        #tripa is taken from the test dataframe.
        #tripb is taken from the stored bin list.
        
        pta_lat = tripa[[loc_type + '_lat']]
        pta_lon = tripa[[loc_type + '_lon']]
        if loc_type == 'start':
            ptb_lat = tripb[1]
            ptb_lon = tripb[0]
        elif loc_type == 'end':
            ptb_lat = tripb[3]
            ptb_lon = tripb[2]

We do have an extract_features function (at e-mission-server/emission/analysis/modelling/similarity/od_similarity.py) on e-mission-server that extracts the latitude and longitude of trips, but it works only for Entry type data (since data frames are not used in e-mission-server).

There must be an equivalent in trip_model to calculate distances between trips, which we should reuse here.

For the reason above, there isn't. However, this is what we can do to use the _nearest_bin function (e-mission-server/emission/analysis/modelling/trip_model/greedy_similarity_binning.py), which is the closest equivalent to _distance_helper:

  1. Convert tripa (the test trip) to Entry type data using df_row_to_entry (in e-mission-server/emission/storage/timeseries/builtin_timeseries.py).
  2. Pass this entry to the _nearest_bin function.

Let me know if this works, and I'll test this.
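
A minimal sketch of the conversion step in the proposal above. `row_to_entry_like` is a hypothetical stand-in, not the `df_row_to_entry` function in e-mission-server, and the nested shape it produces (GeoJSON-style points under `data`) is an assumption about what the downstream code would read. The flat column names follow the `start_lat` / `start_lon` / `end_lat` / `end_lon` pattern used in `_distance_helper`.

```python
def row_to_entry_like(row):
    """Build a nested, entry-shaped dict from a flat trip row.

    `row` can be any mapping-like object indexed by column name,
    for example a dict or a pandas Series.
    """
    return {
        "data": {
            # Coordinates are stored in [lon, lat] order (assumed).
            "start_loc": {
                "type": "Point",
                "coordinates": [row["start_lon"], row["start_lat"]],
            },
            "end_loc": {
                "type": "Point",
                "coordinates": [row["end_lon"], row["end_lat"]],
            },
        }
    }


test_row = {"start_lat": 37.0, "start_lon": -122.0,
            "end_lat": 37.1, "end_lon": -122.1}
entry = row_to_entry_like(test_row)
print(entry["data"]["start_loc"]["coordinates"])
```

The resulting object could then be handed to whatever bin lookup expects entry-shaped input, which keeps the coordinate handling in one place.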

Contributor

@shankari shankari Nov 25, 2023

From #37 (comment)

Since this change has already had multiple revisions, I am OK with deferring this to the next PR, but I want to make sure that it is not forgotten.

I am fine with returning to this later. However, eventually, we should simplify the codebase to either use only dataframes, or use only entries, or, if we are going to support some level of mix and match, have the utility functions support both combinations.

Can you please file an issue for this so that we don't forget it?
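
One possible shape for a utility that supports both inputs is sketched below. It is only an illustration under stated assumptions: the function name `od_features` is hypothetical, flat rows are assumed to use `start_lat` / `start_lon` / `end_lat` / `end_lon` keys, and entries are assumed to carry GeoJSON-style `[lon, lat]` points under `data`.

```python
def od_features(trip):
    """Return [start_lon, start_lat, end_lon, end_lat] for a trip given
    either as a flat row (dict or pandas Series) or as a nested entry."""
    if "data" in trip:
        # Entry-shaped input: coordinates live under data.*_loc.
        data = trip["data"]
        start_lon, start_lat = data["start_loc"]["coordinates"]
        end_lon, end_lat = data["end_loc"]["coordinates"]
    else:
        # Flat row input: one column per coordinate.
        start_lon, start_lat = trip["start_lon"], trip["start_lat"]
        end_lon, end_lat = trip["end_lon"], trip["end_lat"]
    return [start_lon, start_lat, end_lon, end_lat]


flat = {"start_lat": 37.0, "start_lon": -122.0,
        "end_lat": 37.1, "end_lon": -122.1}
nested = {"data": {"start_loc": {"coordinates": [-122.0, 37.0]},
                   "end_loc": {"coordinates": [-122.1, 37.1]}}}
print(od_features(flat) == od_features(nested))
```

Normalizing at the boundary like this would let the distance and binning code work on a single representation, whichever format the caller holds.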

Contributor Author

Filed here: #39.

Minor Fixes to improve readability.
@shankari
Contributor

@humbleOldSage two more comments.

Improved readability
@shankari
Contributor

Squash-merging since this is 21 commits for some fairly simple changes.
@humbleOldSage please account for this while making any future changes.

@shankari shankari merged commit 8d27847 into e-mission:master Nov 25, 2023
humbleOldSage added a commit to humbleOldSage/e-mission-eval-private-data that referenced this pull request Dec 1, 2023
* Update clustering.py

Changes in clustering.py file to shift dependency from hlu09's  tour_model_extended to main branch trip_model. Still need to change type of data being passed to fit function for this to work.

* moving clustering_examples.ipynb to trip_model

All dependencies of this notebook on the custom branch are removed. There currently seem to be no errors while generating maps in the clustering_examples notebook.

* Removing changes in builtimeseries.py

With these changes, no change in e-mission-server should be required.

* Changes to support TRB_Label_Assist

Passing the way of clustering to e-mission-server. It was 'origin-destination' by default. It can now take one of three values: 'origin', 'destination' or 'origin-destination'.

* suggestions

previous suggestions to improve readability.

* Revert "suggestions"

This reverts commit 3e19b32.

* Improving readability

Suggestions from previous comments to improve readability.

* making `cluster_performance.ipynb`, `generate_figs_for_poster` and  `SVM_decision_boundaries`  compatible with changes in `clustering.py` and `mapping.py` files. Also porting these 3 notebooks to trip_model

`cluster_performance.ipynb`, `generate_figs_for_poster` and `SVM_decision_boundaries` now have no dependence on the custom branch. Results of plots are attached to show no difference between their previous and current outputs.

* Unified Interface for fit function

Unified Interface for fit function across all models. Passing 'Entry' Type data from the notebooks till the Binning functions.  Default set to 'none'.

* Fixing `models.py` to support `regenerate_classification_performance_results.py`

Prior to this update, `NaiveBinningClassifier` in 'models.py' had dependencies on both of tour model and trip model. Now, this classifier is completely dependent on trip model. All the other notebooks (except `classification_performance.ipynb`) were tested as well and they are working as usual.

 Other minor fixes to support previous changes.

* [PARTIALLY TESTED] Single database read and Code Cleanup

1. removed mentions of `tour_model` or `tour_model_first_only` .

2. removed two reads from database.

3. Removed notebook outputs  ( this could be the reason a few diffs are too big to view)

* Delete TRB_label_assist/first_trial_results/cv results DBSCAN+SVM (destination).csv

not required.

* Reverting Notebook

Reverting notebooks to initial state, since running in the browser messed up the cell index numbers. This was causing unnecessary git diffs even when no changes were made. Running in VS Code should resolve this. Will do the subsequent changes in VS Code and commit again.

* [Partially Tested]Handled Whitespaces

Whitespaces corrected.

* [Partially Tested] Suggested changes implemented

`Classification_performance` and `regenerate_classification_performance_results.py` are not tested yet as they would take too long to run. The itertools removal in these two files is tested in other notebooks and it works.  Other files, like models.py will be tested once  any of the above two are run.

* Revert "[Partially Tested] Suggested changes implemented"

This reverts commit bb404e9.

* [Partially Tested] Suggested changes implemented

[Partially Tested] Suggested changes implemented
bb404e9
`Classification_performance` and `regenerate_classification_performance_results.py` are not tested yet as they would take too long to run. The itertools removal in these two files is tested in other notebooks and it works. Other files, like models.py will be tested once any of the above two are run.

* Minor variable fixes

Fixed names of variables to be more self-explanatory

* [TESTED] All the notebooks and files are tested

1. Change in models file according to changes in greedy_similarity_binning in e-mission-server

2. Minor fixes

* Minor Fixes

Minor Fixes to improve readability.

* Minor Fixes in models.py

Improved readability
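
Two of the commits above deal with notebook outputs and cell index numbers producing large or spurious git diffs. A minimal sketch of clearing both before committing, assuming the notebooks use the nbformat 4 JSON layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`). The function names are hypothetical, and this is not the approach taken in the PR itself.

```python
import json


def strip_notebook(nb):
    """Clear outputs and execution counts from the code cells of a
    notebook that has already been loaded into a dict."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb


def strip_notebook_file(path):
    """Rewrite the .ipynb file at `path` in place without outputs."""
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    strip_notebook(nb)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(nb, f, indent=1, ensure_ascii=False)
        f.write("\n")


example = {"cells": [
    {"cell_type": "code", "source": ["1 + 1"],
     "outputs": [{"output_type": "execute_result"}], "execution_count": 7},
    {"cell_type": "markdown", "source": ["# Title"]},
]}
print(strip_notebook(example)["cells"][0]["execution_count"])
```

Markdown cells are left untouched, so only the parts that change from run to run are reset.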