Update clustering.py (#37)
* Update clustering.py

Changes in the clustering.py file to shift the dependency from hlu09's tour_model_extended to the main-branch trip_model. The type of data passed to the fit function still needs to change for this to work.

* moving clustering_examples.ipynb to trip_model

All dependencies of this notebook on the custom branch are removed. There are currently no errors while generating maps in the clustering_examples notebook.

* Removing changes in builtin_timeseries.py

With these changes, no change in e-mission-server should be required.

* Changes to support TRB_Label_Assist

Passing the way of clustering to e-mission-server. It was 'origin-destination' by default; it can now take one of three values: 'origin', 'destination', or 'origin-destination'.
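A minimal sketch of how the new parameter's allowed values might be validated; the helper name is illustrative, not part of the actual e-mission-server API:

```python
# Illustrative only: the three values the new clustering_way parameter accepts.
VALID_CLUSTERING_WAYS = {'origin', 'destination', 'origin-destination'}

def check_clustering_way(clustering_way='origin-destination'):
    """Return clustering_way if valid, else raise (hypothetical helper)."""
    if clustering_way not in VALID_CLUSTERING_WAYS:
        raise ValueError(
            f"clustering_way must be one of {sorted(VALID_CLUSTERING_WAYS)}, "
            f"got {clustering_way!r}")
    return clustering_way

print(check_clustering_way('origin'))  # origin
```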

* suggestions

Applied previous suggestions to improve readability.

* Revert "suggestions"

This reverts commit 3e19b32.

* Improving readability

Suggestions from previous comments to improve readability.

* Making `cluster_performance.ipynb`, `generate_figs_for_poster` and `SVM_decision_boundaries` compatible with the changes in the `clustering.py` and `mapping.py` files; also porting these three notebooks to trip_model

`cluster_performance.ipynb`, `generate_figs_for_poster` and `SVM_decision_boundaries` now have no dependence on the custom branch. Plot results are attached to show there is no difference between their previous and current outputs.

* Unified Interface for fit function

Unified interface for the fit function across all models. 'Entry'-type data is now passed from the notebooks all the way to the binning functions. The default is set to 'none'.
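As a rough sketch (all names hypothetical, not the actual trip_model API), a unified fit interface that takes the Entry-type trip list directly might look like:

```python
from abc import ABC, abstractmethod

class TripModelSketch(ABC):
    """Hypothetical base class: every model exposes the same fit() signature."""

    @abstractmethod
    def fit(self, trip_entries):
        """Train on a list of Entry-like trip objects."""

class BinningModelSketch(TripModelSketch):
    def __init__(self, clustering_way='none'):
        # 'none' mirrors the default mentioned in the commit message
        self.clustering_way = clustering_way
        self.trip_labels = []

    def fit(self, trip_entries):
        # stand-in for the real binning step: label every trip 0
        self.trip_labels = [0 for _ in trip_entries]
        return self

model = BinningModelSketch(clustering_way='destination')
model.fit([{'_id': 1}, {'_id': 2}])
print(model.trip_labels)  # [0, 0]
```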

* Fixing `models.py` to support `regenerate_classification_performance_results.py`

Prior to this update, `NaiveBinningClassifier` in `models.py` depended on both the tour model and the trip model. Now this classifier depends entirely on the trip model. All the other notebooks (except `classification_performance.ipynb`) were tested as well and work as usual.

Other minor fixes to support the previous changes.

* [PARTIALLY TESTED] Single database read and code cleanup

1. Removed mentions of `tour_model` and `tour_model_first_only`.

2. Removed two reads from the database.

3. Removed notebook outputs (this could be why a few diffs are too big to view).
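The single-read pattern the notebooks now follow (load the Entry objects once, then derive the dataframe from them instead of issuing a second database read) can be sketched as follows; every name here is a hypothetical stand-in, not the real timeseries API:

```python
def make_single_read_loader(fetch_entries, entries_to_df):
    """Cache one DB read per user; derive the dataframe from the cached entries."""
    cache = {}

    def get(user):
        if user not in cache:
            entries = fetch_entries(user)          # the only database read
            cache[user] = (entries, entries_to_df(entries))
        return cache[user]

    return get

# toy stand-ins for the real fetch/convert functions
calls = []
def fake_fetch(user):
    calls.append(user)
    return [{'_id': 1, 'user': user}]

fake_to_df = lambda entries: [e['_id'] for e in entries]

get = make_single_read_loader(fake_fetch, fake_to_df)
get('u1'); get('u1')
print(len(calls))  # 1: the second call hit the cache
```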

* Delete TRB_label_assist/first_trial_results/cv results DBSCAN+SVM (destination).csv

Not required.

* Reverting Notebook

Reverting notebooks to their initial state, since running in the browser messed up the cell index numbers. This was causing unnecessary git diffs even when no changes were made. Running in VS Code should resolve this. Will make the subsequent changes in VS Code and commit again.

* [Partially Tested] Handled whitespace

Whitespace corrected.

* [Partially Tested] Suggested changes implemented

`Classification_performance` and `regenerate_classification_performance_results.py` are not tested yet, as they would take too long to run. The itertools removal in these two files is tested in other notebooks and works. Other files, like models.py, will be tested once either of the above two is run.

* Revert "[Partially Tested] Suggested changes implemented"

This reverts commit bb404e9.

* [Partially Tested] Suggested changes implemented

Re-applies commit bb404e9. `Classification_performance` and `regenerate_classification_performance_results.py` are not tested yet, as they would take too long to run. The itertools removal in these two files is tested in other notebooks and works. Other files, like models.py, will be tested once either of the above two is run.

* Minor variable fixes

Fixed names of variables to be more self-explanatory

* [TESTED] All the notebooks and files are tested

1. Changed the models file according to the changes in greedy_similarity_binning in e-mission-server.

2. Minor fixes.

* Minor Fixes

Minor Fixes to improve readability.

* Minor Fixes in models.py

Improved readability
humbleOldSage authored Nov 25, 2023
1 parent 97bcc3a commit 8d27847
Showing 11 changed files with 222 additions and 89 deletions.
7 changes: 6 additions & 1 deletion TRB_label_assist/SVM_decision_boundaries.ipynb
@@ -30,6 +30,7 @@
"import emission.storage.timeseries.abstract_timeseries as esta\n",
"import emission.storage.decorations.trip_queries as esdtq\n",
"import emission.core.get_database as edb\n",
"import emission.analysis.modelling.trip_model.run_model as eamtr\n",
"\n",
"import data_wrangling\n",
"from clustering import add_loc_clusters"
@@ -60,10 +61,12 @@
"uuids = [suburban_uuid, college_campus_uuid]\n",
"confirmed_trip_df_map = {}\n",
"labeled_trip_df_map = {}\n",
"ct_entry={}\n",
"expanded_trip_df_map = {}\n",
"for u in uuids:\n",
" ts = esta.TimeSeries.get_time_series(u)\n",
" ct_df = ts.get_data_df(\"analysis/confirmed_trip\")\n",
" ct_entry[u]=eamtr._get_training_data(u,None)\n",
" ct_df = ts.to_data_df(\"analysis/confirmed_trip\",ct_entry[u])\n",
" confirmed_trip_df_map[u] = ct_df\n",
" labeled_trip_df_map[u] = esdtq.filter_labeled_trips(ct_df)\n",
" expanded_trip_df_map[u] = esdtq.expand_userinputs(labeled_trip_df_map[u])"
@@ -110,6 +113,8 @@
" df_for_cluster = all_trips_df if cluster_unlabeled else labeled_trips_df\n",
"\n",
" df_for_cluster = add_loc_clusters(df_for_cluster,\n",
" ct_entry,\n",
" clustering_way='destination',\n",
" radii=radii,\n",
" alg=alg,\n",
" loc_type=loc_type,\n",
9 changes: 5 additions & 4 deletions TRB_label_assist/classification_performance.ipynb
@@ -19,15 +19,14 @@
"import pandas as pd\n",
"import numpy as np\n",
"from uuid import UUID\n",
"\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# import logging\n",
"# logging.basicConfig(level=logging.DEBUG)\n",
"\n",
"import emission.storage.timeseries.abstract_timeseries as esta\n",
"import emission.storage.decorations.trip_queries as esdtq\n",
"\n",
"import emission.analysis.modelling.trip_model.run_model as eamtr\n",
"from performance_eval import get_clf_metrics, cv_for_all_algs, PREDICTORS"
]
},
@@ -49,10 +48,11 @@
"labeled_trip_df_map = {}\n",
"expanded_labeled_trip_df_map = {}\n",
"expanded_all_trip_df_map = {}\n",
"ct_entry={}\n",
"for u in all_users:\n",
" ts = esta.TimeSeries.get_time_series(u)\n",
" ct_df = ts.get_data_df(\"analysis/confirmed_trip\")\n",
"\n",
" ct_entry[u]=eamtr._get_training_data(u,None)\n",
" ct_df = ts.to_data_df(\"analysis/confirmed_trip\",ct_entry[u])\n",
" confirmed_trip_df_map[u] = ct_df\n",
" labeled_trip_df_map[u] = esdtq.filter_labeled_trips(ct_df)\n",
" expanded_labeled_trip_df_map[u] = esdtq.expand_userinputs(\n",
@@ -132,6 +132,7 @@
"# load in all runs\n",
"model_names = list(PREDICTORS.keys())\n",
"cv_results = cv_for_all_algs(\n",
" ct_entry,\n",
" uuid_list=all_users,\n",
" expanded_trip_df_map=expanded_labeled_trip_df_map,\n",
" model_names=model_names,\n",
12 changes: 8 additions & 4 deletions TRB_label_assist/cluster_performance.ipynb
@@ -15,11 +15,10 @@
"source": [
"%load_ext autoreload\n",
"%autoreload 2\n",
"\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from matplotlib.gridspec import GridSpec\n",
"\n",
"import emission.analysis.modelling.trip_model.run_model as eamtr\n",
"import emission.storage.timeseries.abstract_timeseries as esta\n",
"import emission.storage.decorations.trip_queries as esdtq\n",
"import performance_eval\n",
@@ -45,10 +44,11 @@
"labeled_trip_df_map = {}\n",
"expanded_labeled_trip_df_map = {}\n",
"expanded_all_trip_df_map = {}\n",
"ct_entry={}\n",
"for u in all_users:\n",
" ts = esta.TimeSeries.get_time_series(u)\n",
" ct_df = ts.get_data_df(\"analysis/confirmed_trip\")\n",
"\n",
" ct_entry[u]=eamtr._get_training_data(u,None) \n",
" ct_df = ts.to_data_df(\"analysis/confirmed_trip\",ct_entry[u]) \n",
" confirmed_trip_df_map[u] = ct_df\n",
" labeled_trip_df_map[u] = esdtq.filter_labeled_trips(ct_df)\n",
" expanded_labeled_trip_df_map[u] = esdtq.expand_userinputs(\n",
@@ -87,6 +87,8 @@
"\n",
" all_results_df = performance_eval.run_eval_cluster_metrics(\n",
" expanded_labeled_trip_df_map,\n",
" ct_entry,\n",
" clustering_way='destination',\n",
" user_list=all_users,\n",
" radii=radii,\n",
" loc_type='end',\n",
@@ -265,6 +267,8 @@
"\n",
"SVM_results_df = performance_eval.run_eval_cluster_metrics(\n",
" expanded_labeled_trip_df_map,\n",
" ct_entry,\n",
" clustering_way=\"destination\",\n",
" user_list=all_users,\n",
" radii=radii,\n",
" loc_type='end',\n",
47 changes: 37 additions & 10 deletions TRB_label_assist/clustering.py
@@ -16,8 +16,8 @@
# our imports
# NOTE: this requires changing the branch of e-mission-server to
# eval-private-data-compatibility
import emission.analysis.modelling.tour_model_extended.similarity as eamts
import emission.storage.decorations.trip_queries as esdtq
import emission.analysis.modelling.trip_model.greedy_similarity_binning as eamtg

EARTH_RADIUS = 6371000
ALG_OPTIONS = [
@@ -28,9 +28,27 @@
'mean_shift'
]

def cleanEntryTypeData(loc_df,trip_entry):

"""
Removes, from the list of entries, those that were filtered out of the df by
esdtq.filter_labeled_trips() and esdtq.expand_userinputs().
loc_df : dataframe made from entry-type data
trip_entry : the entry-type equivalent of loc_df,
which was passed alongside the dataframe while loading the data
"""

ids_in_df=loc_df['_id']
filtered_trip_entry = list(filter(lambda entry: entry['_id'] in ids_in_df.values, trip_entry))
return filtered_trip_entry


def add_loc_clusters(
loc_df,
trip_entry,
clustering_way,
radii,
loc_type,
alg,
@@ -53,6 +71,9 @@ def add_loc_clusters(
Args:
loc_df (dataframe): must have columns 'start_lat' and 'start_lon'
or 'end_lat' and 'end_lon'
trip_entry ( list of Entry/confirmedTrip): list consisting all entries from the
time data was loaded. loc_df was obtained from this by converting to df and
then filtering out labeled trips and expanding user_inputs
radii (int list): list of radii to run the clustering algs with
loc_type (str): 'start' or 'end'
alg (str): 'DBSCAN', 'naive', 'OPTICS', 'SVM', 'fuzzy', or
@@ -98,19 +119,25 @@
loc_df.loc[:, f"{loc_type}_DBSCAN_clusters_{r}_m"] = labels

elif alg == 'naive':

cleaned_trip_entry= cleanEntryTypeData(loc_df,trip_entry)

for r in radii:
# this is using a modified Similarity class that bins start/end
# points separately before creating trip-level bins
sim_model = eamts.Similarity(loc_df,
radius_start=r,
radius_end=r,
shouldFilter=False,
cutoff=False)
# we only bin the loc_type points to speed up the alg. avoid
# unnecessary binning since this is really slow
sim_model.bin_helper(loc_type=loc_type)
labels = sim_model.data_df[loc_type + '_bin'].to_list()

model_config = {
"metric": "od_similarity",
"similarity_threshold_meters": r, # meters,
"apply_cutoff": False,
"clustering_way": clustering_way,
"shouldFilter":False,
"incremental_evaluation": False
}

sim_model = eamtg.GreedySimilarityBinning(model_config)
sim_model.fit(cleaned_trip_entry)
labels = [int(l) for l in sim_model.tripLabels]
# # pd.Categorical converts the type from int to category (so
# # numerical operations aren't possible)
# loc_df.loc[:, f"{loc_type}_{alg}_clusters_{r}_m"] = pd.Categorical(
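The new `cleanEntryTypeData` helper in clustering.py keeps only the Entry objects whose `_id` survived the dataframe filtering. A self-contained sketch of that logic, with plain dicts and lists standing in for Entry objects and the pandas dataframe:

```python
def clean_entry_type_data(df_ids, trip_entry):
    """Keep only entries whose _id is still present in the filtered dataframe."""
    ids_in_df = set(df_ids)  # ids that survived filtering/expansion
    return [entry for entry in trip_entry if entry['_id'] in ids_in_df]

entries = [{'_id': 'a'}, {'_id': 'b'}, {'_id': 'c'}]
# pretend the dataframe filtering dropped trip 'b'
print(clean_entry_type_data(['a', 'c'], entries))  # [{'_id': 'a'}, {'_id': 'c'}]
```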
21 changes: 17 additions & 4 deletions TRB_label_assist/clustering_examples.ipynb
@@ -26,12 +26,11 @@
"%autoreload 2\n",
"\n",
"from uuid import UUID\n",
"\n",
"import emission.storage.timeseries.abstract_timeseries as esta\n",
"import emission.storage.decorations.trip_queries as esdtq\n",
"import emission.core.get_database as edb\n",
"\n",
"import mapping"
"import emission.analysis.modelling.trip_model.run_model as eamtr\n",
"import mapping\n"
]
},
{
@@ -60,9 +59,11 @@
"confirmed_trip_df_map = {}\n",
"labeled_trip_df_map = {}\n",
"expanded_trip_df_map = {}\n",
"ct_entry={}\n",
"for u in uuids:\n",
" ts = esta.TimeSeries.get_time_series(u)\n",
" ct_df = ts.get_data_df(\"analysis/confirmed_trip\")\n",
" ct_entry[u]=eamtr._get_training_data(u,None) \n",
" ct_df = ts.to_data_df(\"analysis/confirmed_trip\",ct_entry[u]) \n",
" confirmed_trip_df_map[u] = ct_df\n",
" labeled_trip_df_map[u] = esdtq.filter_labeled_trips(ct_df)\n",
" expanded_trip_df_map[u] = esdtq.expand_userinputs(labeled_trip_df_map[u])"
@@ -83,8 +84,10 @@
"outputs": [],
"source": [
"fig = mapping.find_plot_clusters(expanded_trip_df_map[suburban_uuid],\n",
" ct_entry[suburban_uuid],\n",
" alg='naive',\n",
" loc_type='end',\n",
" clustering_way=\"destination\",\n",
" plot_unlabeled=False,\n",
" cluster_unlabeled=False,\n",
" radii=[50, 100, 150])\n",
@@ -98,8 +101,10 @@
"outputs": [],
"source": [
"fig = mapping.find_plot_clusters(expanded_trip_df_map[college_campus_uuid],\n",
" ct_entry[college_campus_uuid],\n",
" alg='naive',\n",
" loc_type='end',\n",
" clustering_way=\"destination\",\n",
" plot_unlabeled=False,\n",
" cluster_unlabeled=False,\n",
" radii=[50, 100, 150])\n",
@@ -121,9 +126,11 @@
"outputs": [],
"source": [
"fig = mapping.find_plot_clusters(expanded_trip_df_map[suburban_uuid],\n",
" ct_entry[suburban_uuid],\n",
" alg='DBSCAN',\n",
" SVM=False,\n",
" loc_type='end',\n",
" clustering_way=\"destination\",\n",
" plot_unlabeled=False,\n",
" cluster_unlabeled=False,\n",
" radii=[50, 100, 150, 200])\n",
@@ -137,9 +144,11 @@
"outputs": [],
"source": [
"fig = mapping.find_plot_clusters(expanded_trip_df_map[college_campus_uuid],\n",
" ct_entry[college_campus_uuid],\n",
" alg='DBSCAN',\n",
" SVM=False,\n",
" loc_type='end',\n",
" clustering_way=\"destination\",\n",
" plot_unlabeled=False,\n",
" cluster_unlabeled=False,\n",
" radii=[50, 100, 150, 200])\n",
@@ -161,9 +170,11 @@
"outputs": [],
"source": [
"fig = mapping.find_plot_clusters(expanded_trip_df_map[suburban_uuid],\n",
" ct_entry[suburban_uuid],\n",
" alg='DBSCAN',\n",
" SVM=True,\n",
" loc_type='end',\n",
" clustering_way=\"destination\",\n",
" plot_unlabeled=False,\n",
" cluster_unlabeled=False,\n",
" radii=[50, 100, 150, 200])\n",
@@ -177,9 +188,11 @@
"outputs": [],
"source": [
"fig = mapping.find_plot_clusters(expanded_trip_df_map[college_campus_uuid],\n",
" ct_entry[college_campus_uuid],\n",
" alg='DBSCAN',\n",
" SVM=True,\n",
" loc_type='end',\n",
" clustering_way=\"destination\",\n",
" plot_unlabeled=False,\n",
" cluster_unlabeled=False,\n",
" radii=[50, 100, 150, 200])\n",
21 changes: 17 additions & 4 deletions TRB_label_assist/generate_figs_for_poster.ipynb
@@ -29,15 +29,14 @@
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib\n",
"\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn import svm\n",
"\n",
"import emission.storage.timeseries.abstract_timeseries as esta\n",
"import emission.storage.decorations.trip_queries as esdtq\n",
"import emission.core.get_database as edb\n",
"\n",
"import emission.analysis.modelling.trip_model.run_model as eamtr\n",
"import mapping\n",
"import data_wrangling\n",
"from clustering import add_loc_clusters"
@@ -67,9 +66,11 @@
"confirmed_trip_df_map = {}\n",
"labeled_trip_df_map = {}\n",
"expanded_trip_df_map = {}\n",
"ct_entry={}\n",
"for u in uuids:\n",
" ts = esta.TimeSeries.get_time_series(u)\n",
" ct_df = ts.get_data_df(\"analysis/confirmed_trip\")\n",
" ct_entry[u]=eamtr._get_training_data(u,None) \n",
" ct_df = ts.to_data_df(\"analysis/confirmed_trip\",ct_entry[u]) \n",
" confirmed_trip_df_map[u] = ct_df\n",
" labeled_trip_df_map[u] = esdtq.filter_labeled_trips(ct_df)\n",
" expanded_trip_df_map[u] = esdtq.expand_userinputs(labeled_trip_df_map[u])"
@@ -98,8 +99,10 @@
"outputs": [],
"source": [
"fig = mapping.find_plot_clusters(expanded_trip_df_map[user1_uuid],\n",
" ct_entry[user1_uuid],\n",
" alg='naive',\n",
" loc_type='end',\n",
" clustering_way='destination',\n",
" plot_unlabeled=False,\n",
" cluster_unlabeled=False,\n",
" radii=[50, 100, 150])\n",
@@ -137,9 +140,11 @@
"outputs": [],
"source": [
"fig = mapping.find_plot_clusters(expanded_trip_df_map[user2_uuid],\n",
" ct_entry[user2_uuid],\n",
" alg='DBSCAN',\n",
" SVM=False,\n",
" loc_type='end',\n",
" clustering_way='destination',\n",
" plot_unlabeled=False,\n",
" cluster_unlabeled=False,\n",
" radii=[150])\n",
@@ -161,9 +166,11 @@
"outputs": [],
"source": [
"fig = mapping.find_plot_clusters(expanded_trip_df_map[user2_uuid],\n",
" ct_entry[user2_uuid],\n",
" alg='DBSCAN',\n",
" SVM=True,\n",
" loc_type='end',\n",
" clustering_way='destination',\n",
" plot_unlabeled=False,\n",
" cluster_unlabeled=False,\n",
" radii=[150])\n",
@@ -289,8 +296,14 @@
"\n",
" labeled_trips_df = all_trips_df.loc[all_trips_df.user_input != {}]\n",
" df_for_cluster = all_trips_df if cluster_unlabeled else labeled_trips_df\n",
"\n",
" if loc_type=='start':\n",
" clustering_way='origin'\n",
" else:\n",
" clustering_way='destination'\n",
" \n",
" df_for_cluster = add_loc_clusters(df_for_cluster,\n",
" ct_entry,\n",
" clustering_way=clustering_way,\n",
" radii=radii,\n",
" alg=alg,\n",
" loc_type=loc_type,\n",
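The `loc_type` to `clustering_way` mapping added in generate_figs_for_poster.ipynb reduces to a one-liner; a sketch of the same rule:

```python
def clustering_way_for(loc_type):
    # 'start' points are clustered by trip origin; everything else by destination
    return 'origin' if loc_type == 'start' else 'destination'

print(clustering_way_for('start'), clustering_way_for('end'))  # origin destination
```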
