Merge pull request #17 from akapoorcern/master

Fixed weights cleanup

a-kapoor authored Jul 20, 2021
2 parents 70cd909 + 868ce0d commit 834c963

Showing 12 changed files with 330 additions and 41 deletions.
41 changes: 31 additions & 10 deletions README.md
100644 → 100755
@@ -21,7 +21,23 @@ Salient features:
6) Multi-Class training possible
7) Ability to customize thresholds

### Setting up
### What the trainer will output

1) Feature distributions
2) Statistics in training and testing
3) ROCs, loss plots, MVA scores
4) Confusion Matrices
5) Correlation plots
6) Trained models (h5 or pkl files)

#### Optional outputs

1) Threshold values of scores for chosen working points
2) Efficiency vs pT and Efficiency vs eta plots for all classes
3) Reweighting plots for pT and eta
4) Comparison of new ID performance with benchmark ID flags

# Setting up

#### Clone
```
@@ -35,11 +51,15 @@ In principle, you can set this up on your local computer by installing packages
Use LCG 97python3 and you will have all the dependencies! (Tested at lxplus and SWAN)
`source /cvmfs/sft.cern.ch/lcg/views/LCG_97python3/x86_64-centos7-gcc8-opt/setup.sh`

#### Run on CPUs and GPUs
#### Run on GPUs

The code can also transparently use a GPU, if a GPU card is available. The cvmfs release to use in that case is:
The code can also transparently use a GPU, if a GPU card is available, provided all packages are set up correctly.
For GPU support in TensorFlow, you can use this cvmfs release:
`source /cvmfs/sft.cern.ch/lcg/views/LCG_97py3cu10/x86_64-centos7-gcc7-opt/setup.sh`

For XGBoost, while the code will use the GPU automatically, it needs a GPU-compiled XGBoost built with CUDA >10.0, which is at the moment not available in any cvmfs release.
You can certainly set up the packages locally.
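
As an illustration only (nothing shipped with this framework), a locally built, CUDA-enabled XGBoost would typically be pointed at the GPU via the `tree_method` parameter of the standard XGBoost 1.x API:

```python
# Hypothetical sketch: using a locally installed, CUDA-enabled XGBoost build.
# 'gpu_hist' selects the GPU histogram tree method in XGBoost 1.x.
import xgboost as xgb

clf = xgb.XGBClassifier(tree_method='gpu_hist', gpu_id=0, n_estimators=500)
# clf.fit(X_train, y_train)  # runs on the GPU if the build was compiled with CUDA
```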


### Running the trainer

@@ -54,9 +74,11 @@ The Trainer will read the settings from the config file and run training

Projects where the framework has been helpful

1) Run-3 Ele MVA ID
2) Close photon analysis
3) H->eeg analysis
1) Run-3 Electron MVA ID
2) Run-3 PF Electron ID
3) Run-3 PF Photon ID
4) Close photon analysis
5) H->eeg analysis

##########################################

@@ -74,19 +96,18 @@ from tensorflow.keras.callbacks import EarlyStopping
```



#### All the Parameters
# All the Parameters

| Parameters |Type| Description|
| --------------- | ----------------| ---------------- |
| `OutputDirName` |string| All plots, models, and the config file will be stored in this directory, which is created automatically. If it already exists, everything in it will be overwritten when you run again with the same `OutputDirName`|
| `branches` |list of strings| Branches to read (they should be present in the input root files). Only these branches can later be used for any purpose. The '\*' is useful for selecting pattern-based branches. In principle one can do ``` branches=["*"] ```, but remember that the data loading time increases if you select more branches|
|`SaveDataFrameCSV`|boolean| If True, this will save the data frame as a parquet file and the next time you run the same training with different parameters, it will be much faster|
|`loadfromsaved`|boolean| If root files and branches are the same as previous training and SaveDataFrameCSV was True, you can assign this as `True`, and data loading time will reduce significantly. Remember that this will use the same output directory as mentioned using `OutputDirName`, so the data frame should be present there|
|`Classes` | list of strings | Two or more classes possible. For two classes the code will do a binary classification. For more than two classes Can be anything but samples will be later loaded under this scheme. Example: `Classes=['DY','TTBar']` or `Classes=['Class1','Class2','Class3']`. The order is important if you want to make an ID. In case of two classes, the first class has to be Signal of interest. The second has to be background.|
|`Classes` | list of strings | Two or more classes are possible. For two classes the code performs a binary classification. Class names can be anything, but samples will later be loaded under this scheme. Example: `Classes=['DY','TTBar']` or `Classes=['Class1','Class2','Class3']`. The order is important if you want to make an ID: in the two-class case, the first class has to be the signal of interest and the second the background. In the multiclass case the order does not matter, but it is highly recommended to put the signal class first, if it is known. |
|`ClassColors`|list of strings|Colors for `Classes` to use in plots. Standard python colors work!|
|`Tree`| string |Location of the tree inside the root file|
|`processes`| list of dictionaries| You can add as many process files as you like and assign them to a specific class. For example, WZ.root and TTBar.root could belong to the 'Background' class and DY.root to the 'Signal' class, or both 'Signal' and 'Background' can come from the same root file. In fact you can have, as an example, 4 classes and 5 root files; the Trainer will take care of it at the backend. Look at the sample config below to see how processes are added. It is a list of dictionaries, with one example dictionary looking like this: ` {'Class':'IsolatedSignal','path':['./DY.root','./Zee.root'], 'xsecwt': 1, 'selection':'(ele_pt > 5) & (abs(scl_eta) < 1.442) & (abs(scl_eta) < 2.5) & (matchedToGenEle==1)'} ` |
|`MVAs`|list of dictionaries| MVAs to use. You can add as many as you like. The MVAtypes XGB and DNN are keywords, so names can be XGB_new, DNN_old, etc., but keep XGB or DNN in the name (that is how the framework identifies which algorithm to run). Look at the sample config below to see how MVAs are added. |
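
To make the table concrete, here is a minimal, hypothetical config sketch; all file names, the tree path, and branch names are placeholders, and only the parameters listed above are shown:

```python
# Minimal, hypothetical config sketch (paths, tree name, and branches are placeholders)
OutputDirName = 'MyTrainingV1'        # all plots, models, and the config copy go here
branches = ['ele_*', 'scl_eta']       # only these branches are read from the root files
SaveDataFrameCSV = True               # cache the dataframe for faster reruns
loadfromsaved = False                 # set True on reruns with identical files/branches
Tree = 'ntuplizer/tree'               # location of the tree inside the root file
Classes = ['IsolatedSignal', 'Background']   # with two classes, signal must come first
ClassColors = ['#377eb8', '#e41a1c']
processes = [
    {'Class': 'IsolatedSignal', 'path': ['./DY.root', './Zee.root'],
     'xsecwt': 1, 'selection': '(ele_pt > 5) & (matchedToGenEle==1)'},
    {'Class': 'Background', 'path': ['./TTBar.root'],
     'xsecwt': 1, 'selection': '(ele_pt > 5) & (matchedToGenEle==0)'},
]
MVAs = [{'MVAtype': 'XGB_1', 'Label': 'XGB test', 'Color': 'green'}]  # keep 'XGB'/'DNN' in the name
```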

#### Optional Parameters
8 changes: 5 additions & 3 deletions Tools/readData.py
@@ -9,22 +9,24 @@
import gc

def daskframe_from_rootfiles(processes, treepath,branches,flatten='False',debug=False):
    def get_df(Class,file, xsecwt, selection, treepath=None,branches=['ele*']):
    def get_df(Class,file, xsecwt, selection, treepath=None,branches=['ele*'],multfactor=1):
        tree = uproot.open(file)[treepath]
        if debug:
            ddd=tree.pandas.df(branches=branches,flatten=flatten,entrystop=1000).query(selection)
        else:
            ddd=tree.pandas.df(branches=branches,flatten=flatten).query(selection)
        #ddd["Category"]=Category
        ddd["Class"]=Class
        if type(xsecwt) == type("hello"):
        if type(xsecwt) == type(('xsec',2)):
            ddd["xsecwt"]=ddd[xsecwt[0]]*xsecwt[1]
        elif type(xsecwt) == type("hello"):
            ddd["xsecwt"]=ddd[xsecwt]
        elif type(xsecwt) == type(0.1):
            ddd["xsecwt"]=xsecwt
        elif type(xsecwt) == type(1):
            ddd["xsecwt"]=xsecwt
        else:
            print("CAUTION: xsecwt should be a branch name or a number... Assigning the weight as 1")
            print("CAUTION: xsecwt should be a branch name or a number or a tuple... Assigning the weight as 1")
            print(file)
        return ddd
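# Accepted forms of the 'xsecwt' field in a process dictionary, matching the
# type checks above (the 'genWeight' branch name is hypothetical):
#     ('genWeight', 0.5)   -> tuple: per-event branch scaled by a constant factor
#     'genWeight'          -> string: per-event branch used as-is
#     0.5                  -> float: one constant weight for the whole sample
#     1                    -> int: likewise a constant weight
# Any other type falls through to the CAUTION branch and the weight defaults to 1.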

28 changes: 17 additions & 11 deletions Trainer.ipynb
@@ -333,9 +333,11 @@
"metadata": {},
"outputs": [],
"source": [
"if hasattr(Conf,'modifydf'):\n",
" if callable(getattr(Conf,'modifydf')):\n",
" Conf.modifydf(df_final)"
"try:\n",
" Conf.modifydf(df_final)\n",
" print(\"Dataframe modification is done using modifydf\")\n",
"except:\n",
" print(\"Looks fine\")"
]
},
{
@@ -579,7 +581,7 @@
{
"cell_type": "code",
"execution_count": 26,
"id": "114ef58a",
"id": "9b520a4d",
"metadata": {},
"outputs": [],
"source": [
@@ -1170,6 +1172,10 @@
" df_final.loc[TestIndices,MVA[\"MVAtype\"]+\"_pred\"]=np.sum([modelDNN.predict(X_test,batch_size=5000)[:, 0],modelDNN.predict(X_test,batch_size=5000)[:, 1]],axis=0)\n",
" \n",
" ###############DNN#######################################\n",
" \n",
" plotwt_train=np.asarray(df_final.loc[TrainIndices,'xsecwt'])\n",
" plotwt_test=np.asarray(df_final.loc[TestIndices,'xsecwt'])\n",
" \n",
" from sklearn.metrics import confusion_matrix\n",
" fig, axes = plt.subplots(1, 1, figsize=(len(Conf.Classes)*2, len(Conf.Classes)*2))\n",
" cm = confusion_matrix(Y_test.argmax(axis=1), y_test_pred.argmax(axis=1))\n",
@@ -1198,10 +1204,10 @@
" ax=axes[i]\n",
" for k in range(n_classes):\n",
" axMVA.hist(y_test_pred[:, i][Y_test[:, k]==1],bins=np.linspace(0, 1, 21),label=Conf.Classes[k]+'_test',\n",
" weights=Wt_test[Y_test[:, k]==1]/np.sum(Wt_test[Y_test[:, k]==1]),\n",
" weights=plotwt_test[Y_test[:, k]==1]/np.sum(plotwt_test[Y_test[:, k]==1]),\n",
" histtype='step',linewidth=2,color=Conf.ClassColors[k])\n",
" axMVA.hist(y_train_pred[:, i][Y_train[:, k]==1],bins=np.linspace(0, 1, 21),label=Conf.Classes[k]+'_train',\n",
" weights=Wt_train[Y_train[:, k]==1]/np.sum(Wt_train[Y_train[:, k]==1]),\n",
" weights=plotwt_train[Y_train[:, k]==1]/np.sum(plotwt_train[Y_train[:, k]==1]),\n",
" histtype='stepfilled',alpha=0.3,linewidth=2,color=Conf.ClassColors[k])\n",
" axMVA.set_title(MVA[\"MVAtype\"]+' Score: Node '+str(i+1),fontsize=10)\n",
" axMVA.set_xlabel('Score',fontsize=10)\n",
@@ -1210,8 +1216,8 @@
" if Conf.MVAlogplot:\n",
" axMVA.set_xscale('log')\n",
"\n",
" fpr, tpr, th = roc_curve(Y_test[:, i], y_test_pred[:, i])\n",
" fpr_tr, tpr_tr, th_tr = roc_curve(Y_train[:, i], y_train_pred[:, i])\n",
" fpr, tpr, th = roc_curve(Y_test[:, i], y_test_pred[:, i],sample_weight=plotwt_test)\n",
" fpr_tr, tpr_tr, th_tr = roc_curve(Y_train[:, i], y_train_pred[:, i],sample_weight=plotwt_train)\n",
" mask = tpr > 0.0\n",
" fpr, tpr = fpr[mask], tpr[mask]\n",
"\n",
@@ -1270,8 +1276,8 @@
" plot_single_roc_point(df_final.query('TrainDataset==0'), var=OverlayWpi, ax=axes, color=color, marker='o', markersize=8, label=OverlayWpi+\" Test dataset\", cat=cat,Wt=weight)\n",
" if len(Conf.MVAs)>0:\n",
" for MVAi in Conf.MVAs:\n",
" plot_roc_curve(df_final.query('TrainDataset==0'),MVAi[\"MVAtype\"]+\"_pred\", tpr_threshold=0.0, ax=axes, color=MVAi[\"Color\"], linestyle='--', label=MVAi[\"Label\"]+' Testing',cat=cat,Wt=weight)\n",
" plot_roc_curve(df_final.query('TrainDataset==1'),MVAi[\"MVAtype\"]+\"_pred\", tpr_threshold=0.0, ax=axes, color=MVAi[\"Color\"], linestyle='-', label=MVAi[\"Label\"]+' Training',cat=cat,Wt=weight)\n",
" plot_roc_curve(df_final.query('TrainDataset==0'),MVAi[\"MVAtype\"]+\"_pred\", tpr_threshold=0.0, ax=axes, color=MVAi[\"Color\"], linestyle='--', label=MVAi[\"Label\"]+' Testing',cat=cat,Wt='xsecwt')\n",
" plot_roc_curve(df_final.query('TrainDataset==1'),MVAi[\"MVAtype\"]+\"_pred\", tpr_threshold=0.0, ax=axes, color=MVAi[\"Color\"], linestyle='-', label=MVAi[\"Label\"]+' Training',cat=cat,Wt='xsecwt')\n",
" axes.set_ylabel(\"Background efficiency (%)\")\n",
" axes.set_xlabel(\"Signal efficiency (%)\")\n",
" axes.set_title(\"Final\")\n",
@@ -1528,7 +1534,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
"version": "3.9.5"
}
},
"nbformat": 4,
24 changes: 15 additions & 9 deletions Trainer.py
@@ -181,9 +181,11 @@ def modify(df):
# In[16]:


if hasattr(Conf,'modifydf'):
    if callable(getattr(Conf,'modifydf')):
        Conf.modifydf(df_final)
try:
    Conf.modifydf(df_final)
    print("Dataframe modification is done using modifydf")
except:
    print("Looks fine")


# In[17]:
@@ -490,6 +492,10 @@ def corre(df,Classes=[''],MVA={}):
df_final.loc[TestIndices,MVA["MVAtype"]+"_pred"]=np.sum([modelDNN.predict(X_test,batch_size=5000)[:, 0],modelDNN.predict(X_test,batch_size=5000)[:, 1]],axis=0)

###############DNN#######################################

plotwt_train=np.asarray(df_final.loc[TrainIndices,'xsecwt'])
plotwt_test=np.asarray(df_final.loc[TestIndices,'xsecwt'])
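# The 'xsecwt' weights extracted above serve as plot weights: they weight the
# MVA score histograms and the ROC curves below.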

from sklearn.metrics import confusion_matrix
fig, axes = plt.subplots(1, 1, figsize=(len(Conf.Classes)*2, len(Conf.Classes)*2))
cm = confusion_matrix(Y_test.argmax(axis=1), y_test_pred.argmax(axis=1))
@@ -518,10 +524,10 @@ def corre(df,Classes=[''],MVA={}):
ax=axes[i]
for k in range(n_classes):
axMVA.hist(y_test_pred[:, i][Y_test[:, k]==1],bins=np.linspace(0, 1, 21),label=Conf.Classes[k]+'_test',
weights=Wt_test[Y_test[:, k]==1]/np.sum(Wt_test[Y_test[:, k]==1]),
weights=plotwt_test[Y_test[:, k]==1]/np.sum(plotwt_test[Y_test[:, k]==1]),
histtype='step',linewidth=2,color=Conf.ClassColors[k])
axMVA.hist(y_train_pred[:, i][Y_train[:, k]==1],bins=np.linspace(0, 1, 21),label=Conf.Classes[k]+'_train',
weights=Wt_train[Y_train[:, k]==1]/np.sum(Wt_train[Y_train[:, k]==1]),
weights=plotwt_train[Y_train[:, k]==1]/np.sum(plotwt_train[Y_train[:, k]==1]),
histtype='stepfilled',alpha=0.3,linewidth=2,color=Conf.ClassColors[k])
axMVA.set_title(MVA["MVAtype"]+' Score: Node '+str(i+1),fontsize=10)
axMVA.set_xlabel('Score',fontsize=10)
@@ -530,8 +536,8 @@ def corre(df,Classes=[''],MVA={}):
if Conf.MVAlogplot:
axMVA.set_xscale('log')

fpr, tpr, th = roc_curve(Y_test[:, i], y_test_pred[:, i])
fpr_tr, tpr_tr, th_tr = roc_curve(Y_train[:, i], y_train_pred[:, i])
fpr, tpr, th = roc_curve(Y_test[:, i], y_test_pred[:, i],sample_weight=plotwt_test)
fpr_tr, tpr_tr, th_tr = roc_curve(Y_train[:, i], y_train_pred[:, i],sample_weight=plotwt_train)
mask = tpr > 0.0
fpr, tpr = fpr[mask], tpr[mask]

@@ -582,8 +588,8 @@ def corre(df,Classes=[''],MVA={}):
plot_single_roc_point(df_final.query('TrainDataset==0'), var=OverlayWpi, ax=axes, color=color, marker='o', markersize=8, label=OverlayWpi+" Test dataset", cat=cat,Wt=weight)
if len(Conf.MVAs)>0:
for MVAi in Conf.MVAs:
plot_roc_curve(df_final.query('TrainDataset==0'),MVAi["MVAtype"]+"_pred", tpr_threshold=0.0, ax=axes, color=MVAi["Color"], linestyle='--', label=MVAi["Label"]+' Testing',cat=cat,Wt=weight)
plot_roc_curve(df_final.query('TrainDataset==1'),MVAi["MVAtype"]+"_pred", tpr_threshold=0.0, ax=axes, color=MVAi["Color"], linestyle='-', label=MVAi["Label"]+' Training',cat=cat,Wt=weight)
plot_roc_curve(df_final.query('TrainDataset==0'),MVAi["MVAtype"]+"_pred", tpr_threshold=0.0, ax=axes, color=MVAi["Color"], linestyle='--', label=MVAi["Label"]+' Testing',cat=cat,Wt='xsecwt')
plot_roc_curve(df_final.query('TrainDataset==1'),MVAi["MVAtype"]+"_pred", tpr_threshold=0.0, ax=axes, color=MVAi["Color"], linestyle='-', label=MVAi["Label"]+' Training',cat=cat,Wt='xsecwt')
axes.set_ylabel("Background efficiency (%)")
axes.set_xlabel("Signal efficiency (%)")
axes.set_title("Final")