Update clustering.py (#37)
* Update clustering.py

Changes in the clustering.py file to shift the dependency from hlu09's tour_model_extended to the main-branch trip_model. The type of data passed to the fit function still needs to change for this to work.

* moving clustering_examples.ipynb to trip_model

All dependencies of this notebook on the custom branch are removed. There are currently no errors while generating maps in the clustering_examples notebook.

* Removing changes in builtin_timeseries.py

With these changes, no change in e-mission-server should be required.

* Changes to support TRB_Label_Assist

Passing the way of clustering to e-mission-server. It was 'origin-destination' by default; it can now take one of three values: 'origin', 'destination', or 'origin-destination'.
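A minimal sketch of how the new parameter's allowed values might be validated; the helper name is illustrative, not part of the actual e-mission-server API:

```python
# Illustrative only: the three values the new clustering_way parameter accepts.
VALID_CLUSTERING_WAYS = {'origin', 'destination', 'origin-destination'}

def check_clustering_way(clustering_way='origin-destination'):
    """Return clustering_way if valid, else raise (hypothetical helper)."""
    if clustering_way not in VALID_CLUSTERING_WAYS:
        raise ValueError(
            f"clustering_way must be one of {sorted(VALID_CLUSTERING_WAYS)}, "
            f"got {clustering_way!r}")
    return clustering_way

print(check_clustering_way('origin'))  # origin
```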

* suggestions

Applied previous suggestions to improve readability.

* Revert "suggestions"

This reverts commit 3e19b32.

* Improving readability

Suggestions from previous comments to improve readability.

* Making `cluster_performance.ipynb`, `generate_figs_for_poster` and `SVM_decision_boundaries` compatible with the changes in the `clustering.py` and `mapping.py` files; also porting these three notebooks to trip_model

`cluster_performance.ipynb`, `generate_figs_for_poster` and `SVM_decision_boundaries` now have no dependence on the custom branch. Plot results are attached to show there is no difference between their previous and current outputs.

* Unified Interface for fit function

Unified interface for the fit function across all models. 'Entry'-type data is now passed from the notebooks all the way to the binning functions. The default is set to 'none'.
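As a rough sketch (all names hypothetical, not the actual trip_model API), a unified fit interface that takes the Entry-type trip list directly might look like:

```python
from abc import ABC, abstractmethod

class TripModelSketch(ABC):
    """Hypothetical base class: every model exposes the same fit() signature."""

    @abstractmethod
    def fit(self, trip_entries):
        """Train on a list of Entry-like trip objects."""

class BinningModelSketch(TripModelSketch):
    def __init__(self, clustering_way='none'):
        # 'none' mirrors the default mentioned in the commit message
        self.clustering_way = clustering_way
        self.trip_labels = []

    def fit(self, trip_entries):
        # stand-in for the real binning step: label every trip 0
        self.trip_labels = [0 for _ in trip_entries]
        return self

model = BinningModelSketch(clustering_way='destination')
model.fit([{'_id': 1}, {'_id': 2}])
print(model.trip_labels)  # [0, 0]
```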

* Fixing `models.py` to support `regenerate_classification_performance_results.py`

Prior to this update, `NaiveBinningClassifier` in `models.py` depended on both the tour model and the trip model. Now this classifier depends entirely on the trip model. All the other notebooks (except `classification_performance.ipynb`) were tested as well and work as usual.

Other minor fixes to support the previous changes.

* [PARTIALLY TESTED] Single database read and code cleanup

1. Removed mentions of `tour_model` and `tour_model_first_only`.

2. Removed two reads from the database.

3. Removed notebook outputs (this could be why a few diffs are too big to view).
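The single-read pattern the notebooks now follow (load the Entry objects once, then derive the dataframe from them instead of issuing a second database read) can be sketched as follows; every name here is a hypothetical stand-in, not the real timeseries API:

```python
def make_single_read_loader(fetch_entries, entries_to_df):
    """Cache one DB read per user; derive the dataframe from the cached entries."""
    cache = {}

    def get(user):
        if user not in cache:
            entries = fetch_entries(user)          # the only database read
            cache[user] = (entries, entries_to_df(entries))
        return cache[user]

    return get

# toy stand-ins for the real fetch/convert functions
calls = []
def fake_fetch(user):
    calls.append(user)
    return [{'_id': 1, 'user': user}]

fake_to_df = lambda entries: [e['_id'] for e in entries]

get = make_single_read_loader(fake_fetch, fake_to_df)
get('u1'); get('u1')
print(len(calls))  # 1: the second call hit the cache
```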

* Delete TRB_label_assist/first_trial_results/cv results DBSCAN+SVM (destination).csv

Not required.

* Reverting Notebook

Reverting notebooks to their initial state, since running in the browser messed up the cell index numbers. This was causing unnecessary git diffs even when no changes were made. Running in VS Code should resolve this. Will make the subsequent changes in VS Code and commit again.

* [Partially Tested] Handled whitespace

Whitespace corrected.

* [Partially Tested] Suggested changes implemented

`Classification_performance` and `regenerate_classification_performance_results.py` are not tested yet, as they would take too long to run. The itertools removal in these two files is tested in other notebooks and works. Other files, like models.py, will be tested once either of the above two is run.

* Revert "[Partially Tested] Suggested changes implemented"

This reverts commit bb404e9.

* [Partially Tested] Suggested changes implemented

Re-applies commit bb404e9. `Classification_performance` and `regenerate_classification_performance_results.py` are not tested yet, as they would take too long to run. The itertools removal in these two files is tested in other notebooks and works. Other files, like models.py, will be tested once either of the above two is run.

* Minor variable fixes

Fixed names of variables to be more self-explanatory

* [TESTED] All the notebooks and files are tested

1. Changed the models file according to the changes in greedy_similarity_binning in e-mission-server.

2. Minor fixes.

* Minor Fixes

Minor Fixes to improve readability.

* Minor Fixes in models.py

Improved readability
humbleOldSage authored Nov 25, 2023
1 parent 97bcc3a commit 8d27847
Showing 11 changed files with 222 additions and 89 deletions.
7 changes: 6 additions & 1 deletion TRB_label_assist/SVM_decision_boundaries.ipynb
@@ -30,6 +30,7 @@
"import emission.storage.timeseries.abstract_timeseries as esta\n",
"import emission.storage.decorations.trip_queries as esdtq\n",
"import emission.core.get_database as edb\n",
"import emission.analysis.modelling.trip_model.run_model as eamtr\n",
"\n",
"import data_wrangling\n",
"from clustering import add_loc_clusters"
@@ -60,10 +61,12 @@
"uuids = [suburban_uuid, college_campus_uuid]\n",
"confirmed_trip_df_map = {}\n",
"labeled_trip_df_map = {}\n",
"ct_entry={}\n",
"expanded_trip_df_map = {}\n",
"for u in uuids:\n",
" ts = esta.TimeSeries.get_time_series(u)\n",
" ct_df = ts.get_data_df(\"analysis/confirmed_trip\")\n",
" ct_entry[u]=eamtr._get_training_data(u,None)\n",
" ct_df = ts.to_data_df(\"analysis/confirmed_trip\",ct_entry[u])\n",
" confirmed_trip_df_map[u] = ct_df\n",
" labeled_trip_df_map[u] = esdtq.filter_labeled_trips(ct_df)\n",
" expanded_trip_df_map[u] = esdtq.expand_userinputs(labeled_trip_df_map[u])"
@@ -110,6 +113,8 @@
" df_for_cluster = all_trips_df if cluster_unlabeled else labeled_trips_df\n",
"\n",
" df_for_cluster = add_loc_clusters(df_for_cluster,\n",
" ct_entry,\n",
" clustering_way='destination',\n",
" radii=radii,\n",
" alg=alg,\n",
" loc_type=loc_type,\n",
9 changes: 5 additions & 4 deletions TRB_label_assist/classification_performance.ipynb
@@ -19,15 +19,14 @@
"import pandas as pd\n",
"import numpy as np\n",
"from uuid import UUID\n",
"\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# import logging\n",
"# logging.basicConfig(level=logging.DEBUG)\n",
"\n",
"import emission.storage.timeseries.abstract_timeseries as esta\n",
"import emission.storage.decorations.trip_queries as esdtq\n",
"\n",
"import emission.analysis.modelling.trip_model.run_model as eamtr\n",
"from performance_eval import get_clf_metrics, cv_for_all_algs, PREDICTORS"
]
},
@@ -49,10 +48,11 @@
"labeled_trip_df_map = {}\n",
"expanded_labeled_trip_df_map = {}\n",
"expanded_all_trip_df_map = {}\n",
"ct_entry={}\n",
"for u in all_users:\n",
" ts = esta.TimeSeries.get_time_series(u)\n",
" ct_df = ts.get_data_df(\"analysis/confirmed_trip\")\n",
"\n",
" ct_entry[u]=eamtr._get_training_data(u,None)\n",
" ct_df = ts.to_data_df(\"analysis/confirmed_trip\",ct_entry[u])\n",
" confirmed_trip_df_map[u] = ct_df\n",
" labeled_trip_df_map[u] = esdtq.filter_labeled_trips(ct_df)\n",
" expanded_labeled_trip_df_map[u] = esdtq.expand_userinputs(\n",
@@ -132,6 +132,7 @@
"# load in all runs\n",
"model_names = list(PREDICTORS.keys())\n",
"cv_results = cv_for_all_algs(\n",
" ct_entry,\n",
" uuid_list=all_users,\n",
" expanded_trip_df_map=expanded_labeled_trip_df_map,\n",
" model_names=model_names,\n",
12 changes: 8 additions & 4 deletions TRB_label_assist/cluster_performance.ipynb
@@ -15,11 +15,10 @@
"source": [
"%load_ext autoreload\n",
"%autoreload 2\n",
"\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from matplotlib.gridspec import GridSpec\n",
"\n",
"import emission.analysis.modelling.trip_model.run_model as eamtr\n",
"import emission.storage.timeseries.abstract_timeseries as esta\n",
"import emission.storage.decorations.trip_queries as esdtq\n",
"import performance_eval\n",
@@ -45,10 +44,11 @@
"labeled_trip_df_map = {}\n",
"expanded_labeled_trip_df_map = {}\n",
"expanded_all_trip_df_map = {}\n",
"ct_entry={}\n",
"for u in all_users:\n",
" ts = esta.TimeSeries.get_time_series(u)\n",
" ct_df = ts.get_data_df(\"analysis/confirmed_trip\")\n",
"\n",
" ct_entry[u]=eamtr._get_training_data(u,None) \n",
" ct_df = ts.to_data_df(\"analysis/confirmed_trip\",ct_entry[u]) \n",
" confirmed_trip_df_map[u] = ct_df\n",
" labeled_trip_df_map[u] = esdtq.filter_labeled_trips(ct_df)\n",
" expanded_labeled_trip_df_map[u] = esdtq.expand_userinputs(\n",
@@ -87,6 +87,8 @@
"\n",
" all_results_df = performance_eval.run_eval_cluster_metrics(\n",
" expanded_labeled_trip_df_map,\n",
" ct_entry,\n",
" clustering_way='destination',\n",
" user_list=all_users,\n",
" radii=radii,\n",
" loc_type='end',\n",
@@ -265,6 +267,8 @@
"\n",
"SVM_results_df = performance_eval.run_eval_cluster_metrics(\n",
" expanded_labeled_trip_df_map,\n",
" ct_entry,\n",
" clustering_way=\"destination\",\n",
" user_list=all_users,\n",
" radii=radii,\n",
" loc_type='end',\n",
47 changes: 37 additions & 10 deletions TRB_label_assist/clustering.py
@@ -16,8 +16,8 @@
# our imports
# NOTE: this requires changing the branch of e-mission-server to
# eval-private-data-compatibility
import emission.analysis.modelling.tour_model_extended.similarity as eamts
import emission.storage.decorations.trip_queries as esdtq
import emission.analysis.modelling.trip_model.greedy_similarity_binning as eamtg

EARTH_RADIUS = 6371000
ALG_OPTIONS = [
@@ -28,9 +28,27 @@
'mean_shift'
]

def cleanEntryTypeData(loc_df,trip_entry):

"""
Removes, from the list of entries, those that were filtered out of the df by
esdtq.filter_labeled_trips() and esdtq.expand_userinputs().
loc_df : dataframe made from entry-type data
trip_entry : the entry-type equivalent of loc_df,
which was passed alongside the dataframe while loading the data
"""

ids_in_df=loc_df['_id']
filtered_trip_entry = list(filter(lambda entry: entry['_id'] in ids_in_df.values, trip_entry))
return filtered_trip_entry


def add_loc_clusters(
loc_df,
trip_entry,
clustering_way,
radii,
loc_type,
alg,
@@ -53,6 +71,9 @@ def add_loc_clusters(
Args:
loc_df (dataframe): must have columns 'start_lat' and 'start_lon'
or 'end_lat' and 'end_lon'
trip_entry ( list of Entry/confirmedTrip): list consisting all entries from the
time data was loaded. loc_df was obtained from this by converting to df and
then filtering out labeled trips and expanding user_inputs
radii (int list): list of radii to run the clustering algs with
loc_type (str): 'start' or 'end'
alg (str): 'DBSCAN', 'naive', 'OPTICS', 'SVM', 'fuzzy', or
@@ -98,19 +119,25 @@
loc_df.loc[:, f"{loc_type}_DBSCAN_clusters_{r}_m"] = labels

elif alg == 'naive':

cleaned_trip_entry= cleanEntryTypeData(loc_df,trip_entry)

for r in radii:
# this is using a modified Similarity class that bins start/end
# points separately before creating trip-level bins
sim_model = eamts.Similarity(loc_df,
radius_start=r,
radius_end=r,
shouldFilter=False,
cutoff=False)
# we only bin the loc_type points to speed up the alg. avoid
# unnecessary binning since this is really slow
sim_model.bin_helper(loc_type=loc_type)
labels = sim_model.data_df[loc_type + '_bin'].to_list()

model_config = {
"metric": "od_similarity",
"similarity_threshold_meters": r, # meters,
"apply_cutoff": False,
"clustering_way": clustering_way,
"shouldFilter":False,
"incremental_evaluation": False
}

sim_model = eamtg.GreedySimilarityBinning(model_config)
sim_model.fit(cleaned_trip_entry)
labels = [int(l) for l in sim_model.tripLabels]
# # pd.Categorical converts the type from int to category (so
# # numerical operations aren't possible)
# loc_df.loc[:, f"{loc_type}_{alg}_clusters_{r}_m"] = pd.Categorical(
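The new `cleanEntryTypeData` helper in clustering.py keeps only the Entry objects whose `_id` survived the dataframe filtering. A self-contained sketch of that logic, with plain dicts and lists standing in for Entry objects and the pandas dataframe:

```python
def clean_entry_type_data(df_ids, trip_entry):
    """Keep only entries whose _id is still present in the filtered dataframe."""
    ids_in_df = set(df_ids)  # ids that survived filtering/expansion
    return [entry for entry in trip_entry if entry['_id'] in ids_in_df]

entries = [{'_id': 'a'}, {'_id': 'b'}, {'_id': 'c'}]
# pretend the dataframe filtering dropped trip 'b'
print(clean_entry_type_data(['a', 'c'], entries))  # [{'_id': 'a'}, {'_id': 'c'}]
```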
21 changes: 17 additions & 4 deletions TRB_label_assist/clustering_examples.ipynb
@@ -26,12 +26,11 @@
"%autoreload 2\n",
"\n",
"from uuid import UUID\n",
"\n",
"import emission.storage.timeseries.abstract_timeseries as esta\n",
"import emission.storage.decorations.trip_queries as esdtq\n",
"import emission.core.get_database as edb\n",
"\n",
"import mapping"
"import emission.analysis.modelling.trip_model.run_model as eamtr\n",
"import mapping\n"
]
},
{
@@ -60,9 +59,11 @@
"confirmed_trip_df_map = {}\n",
"labeled_trip_df_map = {}\n",
"expanded_trip_df_map = {}\n",
"ct_entry={}\n",
"for u in uuids:\n",
" ts = esta.TimeSeries.get_time_series(u)\n",
" ct_df = ts.get_data_df(\"analysis/confirmed_trip\")\n",
" ct_entry[u]=eamtr._get_training_data(u,None) \n",
" ct_df = ts.to_data_df(\"analysis/confirmed_trip\",ct_entry[u]) \n",
" confirmed_trip_df_map[u] = ct_df\n",
" labeled_trip_df_map[u] = esdtq.filter_labeled_trips(ct_df)\n",
" expanded_trip_df_map[u] = esdtq.expand_userinputs(labeled_trip_df_map[u])"
@@ -83,8 +84,10 @@
"outputs": [],
"source": [
"fig = mapping.find_plot_clusters(expanded_trip_df_map[suburban_uuid],\n",
" ct_entry[suburban_uuid],\n",
" alg='naive',\n",
" loc_type='end',\n",
" clustering_way=\"destination\",\n",
" plot_unlabeled=False,\n",
" cluster_unlabeled=False,\n",
" radii=[50, 100, 150])\n",
@@ -98,8 +101,10 @@
"outputs": [],
"source": [
"fig = mapping.find_plot_clusters(expanded_trip_df_map[college_campus_uuid],\n",
" ct_entry[college_campus_uuid],\n",
" alg='naive',\n",
" loc_type='end',\n",
" clustering_way=\"destination\",\n",
" plot_unlabeled=False,\n",
" cluster_unlabeled=False,\n",
" radii=[50, 100, 150])\n",
@@ -121,9 +126,11 @@
"outputs": [],
"source": [
"fig = mapping.find_plot_clusters(expanded_trip_df_map[suburban_uuid],\n",
" ct_entry[suburban_uuid],\n",
" alg='DBSCAN',\n",
" SVM=False,\n",
" loc_type='end',\n",
" clustering_way=\"destination\",\n",
" plot_unlabeled=False,\n",
" cluster_unlabeled=False,\n",
" radii=[50, 100, 150, 200])\n",
@@ -137,9 +144,11 @@
"outputs": [],
"source": [
"fig = mapping.find_plot_clusters(expanded_trip_df_map[college_campus_uuid],\n",
" ct_entry[college_campus_uuid],\n",
" alg='DBSCAN',\n",
" SVM=False,\n",
" loc_type='end',\n",
" clustering_way=\"destination\",\n",
" plot_unlabeled=False,\n",
" cluster_unlabeled=False,\n",
" radii=[50, 100, 150, 200])\n",
@@ -161,9 +170,11 @@
"outputs": [],
"source": [
"fig = mapping.find_plot_clusters(expanded_trip_df_map[suburban_uuid],\n",
" ct_entry[suburban_uuid],\n",
" alg='DBSCAN',\n",
" SVM=True,\n",
" loc_type='end',\n",
" clustering_way=\"destination\",\n",
" plot_unlabeled=False,\n",
" cluster_unlabeled=False,\n",
" radii=[50, 100, 150, 200])\n",
@@ -177,9 +188,11 @@
"outputs": [],
"source": [
"fig = mapping.find_plot_clusters(expanded_trip_df_map[college_campus_uuid],\n",
" ct_entry[college_campus_uuid],\n",
" alg='DBSCAN',\n",
" SVM=True,\n",
" loc_type='end',\n",
" clustering_way=\"destination\",\n",
" plot_unlabeled=False,\n",
" cluster_unlabeled=False,\n",
" radii=[50, 100, 150, 200])\n",
21 changes: 17 additions & 4 deletions TRB_label_assist/generate_figs_for_poster.ipynb
@@ -29,15 +29,14 @@
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib\n",
"\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn import svm\n",
"\n",
"import emission.storage.timeseries.abstract_timeseries as esta\n",
"import emission.storage.decorations.trip_queries as esdtq\n",
"import emission.core.get_database as edb\n",
"\n",
"import emission.analysis.modelling.trip_model.run_model as eamtr\n",
"import mapping\n",
"import data_wrangling\n",
"from clustering import add_loc_clusters"
@@ -67,9 +66,11 @@
"confirmed_trip_df_map = {}\n",
"labeled_trip_df_map = {}\n",
"expanded_trip_df_map = {}\n",
"ct_entry={}\n",
"for u in uuids:\n",
" ts = esta.TimeSeries.get_time_series(u)\n",
" ct_df = ts.get_data_df(\"analysis/confirmed_trip\")\n",
" ct_entry[u]=eamtr._get_training_data(u,None) \n",
" ct_df = ts.to_data_df(\"analysis/confirmed_trip\",ct_entry[u]) \n",
" confirmed_trip_df_map[u] = ct_df\n",
" labeled_trip_df_map[u] = esdtq.filter_labeled_trips(ct_df)\n",
" expanded_trip_df_map[u] = esdtq.expand_userinputs(labeled_trip_df_map[u])"
@@ -98,8 +99,10 @@
"outputs": [],
"source": [
"fig = mapping.find_plot_clusters(expanded_trip_df_map[user1_uuid],\n",
" ct_entry[user1_uuid],\n",
" alg='naive',\n",
" loc_type='end',\n",
" clustering_way='destination',\n",
" plot_unlabeled=False,\n",
" cluster_unlabeled=False,\n",
" radii=[50, 100, 150])\n",
@@ -137,9 +140,11 @@
"outputs": [],
"source": [
"fig = mapping.find_plot_clusters(expanded_trip_df_map[user2_uuid],\n",
" ct_entry[user2_uuid],\n",
" alg='DBSCAN',\n",
" SVM=False,\n",
" loc_type='end',\n",
" clustering_way='destination',\n",
" plot_unlabeled=False,\n",
" cluster_unlabeled=False,\n",
" radii=[150])\n",
@@ -161,9 +166,11 @@
"outputs": [],
"source": [
"fig = mapping.find_plot_clusters(expanded_trip_df_map[user2_uuid],\n",
" ct_entry[user2_uuid],\n",
" alg='DBSCAN',\n",
" SVM=True,\n",
" loc_type='end',\n",
" clustering_way='destination',\n",
" plot_unlabeled=False,\n",
" cluster_unlabeled=False,\n",
" radii=[150])\n",
@@ -289,8 +296,14 @@
"\n",
" labeled_trips_df = all_trips_df.loc[all_trips_df.user_input != {}]\n",
" df_for_cluster = all_trips_df if cluster_unlabeled else labeled_trips_df\n",
"\n",
" if loc_type=='start':\n",
" clustering_way='origin'\n",
" else:\n",
" clustering_way='destination'\n",
" \n",
" df_for_cluster = add_loc_clusters(df_for_cluster,\n",
" ct_entry,\n",
" clustering_way=clustering_way,\n",
" radii=radii,\n",
" alg=alg,\n",
" loc_type=loc_type,\n",
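The `loc_type` to `clustering_way` mapping added in generate_figs_for_poster.ipynb reduces to a one-liner; a sketch of the same rule:

```python
def clustering_way_for(loc_type):
    # 'start' points are clustered by trip origin; everything else by destination
    return 'origin' if loc_type == 'start' else 'destination'

print(clustering_way_for('start'), clustering_way_for('end'))  # origin destination
```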
