Merge pull request #17 from akapoorcern/master

Fixed weights cleanup

a-kapoor authored Jul 20, 2021
2 parents 70cd909 + 868ce0d commit 834c963

Showing 12 changed files with 330 additions and 41 deletions.
41 changes: 31 additions & 10 deletions README.md
100644 → 100755
@@ -21,7 +21,23 @@ Salient features:
6) Multi-Class training possible
7) Ability to customize thresholds

### Setting up
### What the trainer will output

1) Feature distributions
2) Statistics in training and testing
3) ROCs, loss plots, MVA scores
4) Confusion Matrices
5) Correlation plots
6) Trained models (h5 or pkl files)

#### Optional outputs

1) Threshold values of scores for chosen working points
2) Efficiency vs pT and Efficiency vs eta plots for all classes
3) Reweighting plots for pT and eta
4) Comparison of new ID performance with benchmark ID flags

# Setting up

#### Clone
```
@@ -35,11 +51,15 @@ In principle, you can set this up on your local computer by installing packages
Use LCG 97python3 and you will have all the dependencies! (Tested at lxplus and SWAN)
`source /cvmfs/sft.cern.ch/lcg/views/LCG_97python3/x86_64-centos7-gcc8-opt/setup.sh`

#### Run on CPUs and GPUs
#### Run on GPUs

The code can also transparently use a GPU, if a GPU card is available. The cvmfs release to use in that case is:
The code can also transparently use a GPU, if a GPU card is available, provided all packages are set up correctly.
For GPU support in TensorFlow, you can use this cvmfs release:
`source /cvmfs/sft.cern.ch/lcg/views/LCG_97py3cu10/x86_64-centos7-gcc7-opt/setup.sh`

For XGBoost, while the code will use the GPU automatically, it needs a GPU-compiled XGBoost built with CUDA >10.0, which is at the moment not available in any cvmfs release.
You can certainly set up the packages locally.
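
As an illustration only (nothing shipped with this framework), a locally built, CUDA-enabled XGBoost would typically be pointed at the GPU via the `tree_method` parameter of the standard XGBoost 1.x API:

```python
# Hypothetical sketch: using a locally installed, CUDA-enabled XGBoost build.
# 'gpu_hist' selects the GPU histogram tree method in XGBoost 1.x.
import xgboost as xgb

clf = xgb.XGBClassifier(tree_method='gpu_hist', gpu_id=0, n_estimators=500)
# clf.fit(X_train, y_train)  # runs on the GPU if the build was compiled with CUDA
```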


### Running the trainer

@@ -54,9 +74,11 @@ The Trainer will read the settings from the config file and run training

Projects where the framework has been helpful

1) Run-3 Ele MVA ID
2) Close photon analysis
3) H->eeg analysis
1) Run-3 Electron MVA ID
2) Run-3 PF Electron ID
3) Run-3 PF Photon ID
4) Close photon analysis
5) H->eeg analysis

##########################################

@@ -74,19 +96,18 @@ from tensorflow.keras.callbacks import EarlyStopping
```



#### All the Parameters
# All the Parameters

| Parameters |Type| Description|
| --------------- | ----------------| ---------------- |
| `OutputDirName` |string| All plots, models, and the config file will be stored in this directory, which is created automatically. If it already exists, everything in it will be overwritten when you run again with the same `OutputDirName`|
| `branches` |list of strings| Branches to read (they should be present in the input root files). Only these branches can later be used for any purpose. The '\*' is useful for selecting pattern-based branches. In principle one can do ``` branches=["*"] ```, but remember that the data loading time increases if you select more branches|
|`SaveDataFrameCSV`|boolean| If True, this will save the data frame as a parquet file and the next time you run the same training with different parameters, it will be much faster|
|`loadfromsaved`|boolean| If root files and branches are the same as previous training and SaveDataFrameCSV was True, you can assign this as `True`, and data loading time will reduce significantly. Remember that this will use the same output directory as mentioned using `OutputDirName`, so the data frame should be present there|
|`Classes` | list of strings | Two or more classes possible. For two classes the code will do a binary classification. For more than two classes Can be anything but samples will be later loaded under this scheme. Example: `Classes=['DY','TTBar']` or `Classes=['Class1','Class2','Class3']`. The order is important if you want to make an ID. In case of two classes, the first class has to be Signal of interest. The second has to be background.|
|`Classes` | list of strings | Two or more classes are possible. For two classes the code performs a binary classification. Class names can be anything, but samples will later be loaded under this scheme. Example: `Classes=['DY','TTBar']` or `Classes=['Class1','Class2','Class3']`. The order is important if you want to make an ID: in the two-class case, the first class has to be the signal of interest and the second the background. In the multiclass case the order does not matter, but it is highly recommended to put the signal class first, if it is known. |
|`ClassColors`|list of strings|Colors for `Classes` to use in plots. Standard python colors work!|
|`Tree`| string |Location of the tree inside the root file|
|`processes`| list of dictionaries| You can add as many process files as you like and assign them to a specific class. For example, WZ.root and TTBar.root could belong to the 'Background' class and DY.root to the 'Signal' class, or both 'Signal' and 'Background' can come from the same root file. In fact you can have, as an example, 4 classes and 5 root files; the Trainer will take care of it at the backend. Look at the sample config below to see how processes are added. It is a list of dictionaries, with one example dictionary looking like this: ` {'Class':'IsolatedSignal','path':['./DY.root','./Zee.root'], 'xsecwt': 1, 'selection':'(ele_pt > 5) & (abs(scl_eta) < 1.442) & (abs(scl_eta) < 2.5) & (matchedToGenEle==1)'} ` |
|`MVAs`|list of dictionaries| MVAs to use. You can add as many as you like. The MVAtypes XGB and DNN are keywords, so names can be XGB_new, DNN_old, etc., but keep XGB or DNN in the name (that is how the framework identifies which algorithm to run). Look at the sample config below to see how MVAs are added. |
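
To make the table concrete, here is a minimal, hypothetical config sketch; all file names, the tree path, and branch names are placeholders, and only the parameters listed above are shown:

```python
# Minimal, hypothetical config sketch (paths, tree name, and branches are placeholders)
OutputDirName = 'MyTrainingV1'        # all plots, models, and the config copy go here
branches = ['ele_*', 'scl_eta']       # only these branches are read from the root files
SaveDataFrameCSV = True               # cache the dataframe for faster reruns
loadfromsaved = False                 # set True on reruns with identical files/branches
Tree = 'ntuplizer/tree'               # location of the tree inside the root file
Classes = ['IsolatedSignal', 'Background']   # with two classes, signal must come first
ClassColors = ['#377eb8', '#e41a1c']
processes = [
    {'Class': 'IsolatedSignal', 'path': ['./DY.root', './Zee.root'],
     'xsecwt': 1, 'selection': '(ele_pt > 5) & (matchedToGenEle==1)'},
    {'Class': 'Background', 'path': ['./TTBar.root'],
     'xsecwt': 1, 'selection': '(ele_pt > 5) & (matchedToGenEle==0)'},
]
MVAs = [{'MVAtype': 'XGB_1', 'Label': 'XGB test', 'Color': 'green'}]  # keep 'XGB'/'DNN' in the name
```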

#### Optional Parameters
8 changes: 5 additions & 3 deletions Tools/readData.py
@@ -9,22 +9,24 @@
import gc

def daskframe_from_rootfiles(processes, treepath,branches,flatten='False',debug=False):
    def get_df(Class,file, xsecwt, selection, treepath=None,branches=['ele*']):
    def get_df(Class,file, xsecwt, selection, treepath=None,branches=['ele*'],multfactor=1):
        tree = uproot.open(file)[treepath]
        if debug:
            ddd=tree.pandas.df(branches=branches,flatten=flatten,entrystop=1000).query(selection)
        else:
            ddd=tree.pandas.df(branches=branches,flatten=flatten).query(selection)
        #ddd["Category"]=Category
        ddd["Class"]=Class
        if type(xsecwt) == type("hello"):
        if type(xsecwt) == type(('xsec',2)):
            ddd["xsecwt"]=ddd[xsecwt[0]]*xsecwt[1]
        elif type(xsecwt) == type("hello"):
            ddd["xsecwt"]=ddd[xsecwt]
        elif type(xsecwt) == type(0.1):
            ddd["xsecwt"]=xsecwt
        elif type(xsecwt) == type(1):
            ddd["xsecwt"]=xsecwt
        else:
            print("CAUTION: xsecwt should be a branch name or a number... Assigning the weight as 1")
            print("CAUTION: xsecwt should be a branch name or a number or a tuple... Assigning the weight as 1")
            print(file)
        return ddd
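# Accepted forms of the 'xsecwt' field in a process dictionary, matching the
# type checks above (the 'genWeight' branch name is hypothetical):
#     ('genWeight', 0.5)   -> tuple: per-event branch scaled by a constant factor
#     'genWeight'          -> string: per-event branch used as-is
#     0.5                  -> float: one constant weight for the whole sample
#     1                    -> int: likewise a constant weight
# Any other type falls through to the CAUTION branch and the weight defaults to 1.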

28 changes: 17 additions & 11 deletions Trainer.ipynb
@@ -333,9 +333,11 @@
"metadata": {},
"outputs": [],
"source": [
"if hasattr(Conf,'modifydf'):\n",
" if callable(getattr(Conf,'modifydf')):\n",
" Conf.modifydf(df_final)"
"try:\n",
" Conf.modifydf(df_final)\n",
" print(\"Dataframe modification is done using modifydf\")\n",
"except:\n",
" print(\"Looks fine\")"
]
},
{
@@ -579,7 +581,7 @@
{
"cell_type": "code",
"execution_count": 26,
"id": "114ef58a",
"id": "9b520a4d",
"metadata": {},
"outputs": [],
"source": [
@@ -1170,6 +1172,10 @@
" df_final.loc[TestIndices,MVA[\"MVAtype\"]+\"_pred\"]=np.sum([modelDNN.predict(X_test,batch_size=5000)[:, 0],modelDNN.predict(X_test,batch_size=5000)[:, 1]],axis=0)\n",
" \n",
" ###############DNN#######################################\n",
" \n",
" plotwt_train=np.asarray(df_final.loc[TrainIndices,'xsecwt'])\n",
" plotwt_test=np.asarray(df_final.loc[TestIndices,'xsecwt'])\n",
" \n",
" from sklearn.metrics import confusion_matrix\n",
" fig, axes = plt.subplots(1, 1, figsize=(len(Conf.Classes)*2, len(Conf.Classes)*2))\n",
" cm = confusion_matrix(Y_test.argmax(axis=1), y_test_pred.argmax(axis=1))\n",
@@ -1198,10 +1204,10 @@
" ax=axes[i]\n",
" for k in range(n_classes):\n",
" axMVA.hist(y_test_pred[:, i][Y_test[:, k]==1],bins=np.linspace(0, 1, 21),label=Conf.Classes[k]+'_test',\n",
" weights=Wt_test[Y_test[:, k]==1]/np.sum(Wt_test[Y_test[:, k]==1]),\n",
" weights=plotwt_test[Y_test[:, k]==1]/np.sum(plotwt_test[Y_test[:, k]==1]),\n",
" histtype='step',linewidth=2,color=Conf.ClassColors[k])\n",
" axMVA.hist(y_train_pred[:, i][Y_train[:, k]==1],bins=np.linspace(0, 1, 21),label=Conf.Classes[k]+'_train',\n",
" weights=Wt_train[Y_train[:, k]==1]/np.sum(Wt_train[Y_train[:, k]==1]),\n",
" weights=plotwt_train[Y_train[:, k]==1]/np.sum(plotwt_train[Y_train[:, k]==1]),\n",
" histtype='stepfilled',alpha=0.3,linewidth=2,color=Conf.ClassColors[k])\n",
" axMVA.set_title(MVA[\"MVAtype\"]+' Score: Node '+str(i+1),fontsize=10)\n",
" axMVA.set_xlabel('Score',fontsize=10)\n",
@@ -1210,8 +1216,8 @@
" if Conf.MVAlogplot:\n",
" axMVA.set_xscale('log')\n",
"\n",
" fpr, tpr, th = roc_curve(Y_test[:, i], y_test_pred[:, i])\n",
" fpr_tr, tpr_tr, th_tr = roc_curve(Y_train[:, i], y_train_pred[:, i])\n",
" fpr, tpr, th = roc_curve(Y_test[:, i], y_test_pred[:, i],sample_weight=plotwt_test)\n",
" fpr_tr, tpr_tr, th_tr = roc_curve(Y_train[:, i], y_train_pred[:, i],sample_weight=plotwt_train)\n",
" mask = tpr > 0.0\n",
" fpr, tpr = fpr[mask], tpr[mask]\n",
"\n",
@@ -1270,8 +1276,8 @@
" plot_single_roc_point(df_final.query('TrainDataset==0'), var=OverlayWpi, ax=axes, color=color, marker='o', markersize=8, label=OverlayWpi+\" Test dataset\", cat=cat,Wt=weight)\n",
" if len(Conf.MVAs)>0:\n",
" for MVAi in Conf.MVAs:\n",
" plot_roc_curve(df_final.query('TrainDataset==0'),MVAi[\"MVAtype\"]+\"_pred\", tpr_threshold=0.0, ax=axes, color=MVAi[\"Color\"], linestyle='--', label=MVAi[\"Label\"]+' Testing',cat=cat,Wt=weight)\n",
" plot_roc_curve(df_final.query('TrainDataset==1'),MVAi[\"MVAtype\"]+\"_pred\", tpr_threshold=0.0, ax=axes, color=MVAi[\"Color\"], linestyle='-', label=MVAi[\"Label\"]+' Training',cat=cat,Wt=weight)\n",
" plot_roc_curve(df_final.query('TrainDataset==0'),MVAi[\"MVAtype\"]+\"_pred\", tpr_threshold=0.0, ax=axes, color=MVAi[\"Color\"], linestyle='--', label=MVAi[\"Label\"]+' Testing',cat=cat,Wt='xsecwt')\n",
" plot_roc_curve(df_final.query('TrainDataset==1'),MVAi[\"MVAtype\"]+\"_pred\", tpr_threshold=0.0, ax=axes, color=MVAi[\"Color\"], linestyle='-', label=MVAi[\"Label\"]+' Training',cat=cat,Wt='xsecwt')\n",
" axes.set_ylabel(\"Background efficiency (%)\")\n",
" axes.set_xlabel(\"Signal efficiency (%)\")\n",
" axes.set_title(\"Final\")\n",
@@ -1528,7 +1534,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
"version": "3.9.5"
}
},
"nbformat": 4,
24 changes: 15 additions & 9 deletions Trainer.py
@@ -181,9 +181,11 @@ def modify(df):
# In[16]:


if hasattr(Conf,'modifydf'):
    if callable(getattr(Conf,'modifydf')):
        Conf.modifydf(df_final)
try:
    Conf.modifydf(df_final)
    print("Dataframe modification is done using modifydf")
except:
    print("Looks fine")


# In[17]:
@@ -490,6 +492,10 @@ def corre(df,Classes=[''],MVA={}):
df_final.loc[TestIndices,MVA["MVAtype"]+"_pred"]=np.sum([modelDNN.predict(X_test,batch_size=5000)[:, 0],modelDNN.predict(X_test,batch_size=5000)[:, 1]],axis=0)

###############DNN#######################################

plotwt_train=np.asarray(df_final.loc[TrainIndices,'xsecwt'])
plotwt_test=np.asarray(df_final.loc[TestIndices,'xsecwt'])
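# The 'xsecwt' weights extracted above serve as plot weights: they weight the
# MVA score histograms and the ROC curves below.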

from sklearn.metrics import confusion_matrix
fig, axes = plt.subplots(1, 1, figsize=(len(Conf.Classes)*2, len(Conf.Classes)*2))
cm = confusion_matrix(Y_test.argmax(axis=1), y_test_pred.argmax(axis=1))
@@ -518,10 +524,10 @@ def corre(df,Classes=[''],MVA={}):
ax=axes[i]
for k in range(n_classes):
axMVA.hist(y_test_pred[:, i][Y_test[:, k]==1],bins=np.linspace(0, 1, 21),label=Conf.Classes[k]+'_test',
weights=Wt_test[Y_test[:, k]==1]/np.sum(Wt_test[Y_test[:, k]==1]),
weights=plotwt_test[Y_test[:, k]==1]/np.sum(plotwt_test[Y_test[:, k]==1]),
histtype='step',linewidth=2,color=Conf.ClassColors[k])
axMVA.hist(y_train_pred[:, i][Y_train[:, k]==1],bins=np.linspace(0, 1, 21),label=Conf.Classes[k]+'_train',
weights=Wt_train[Y_train[:, k]==1]/np.sum(Wt_train[Y_train[:, k]==1]),
weights=plotwt_train[Y_train[:, k]==1]/np.sum(plotwt_train[Y_train[:, k]==1]),
histtype='stepfilled',alpha=0.3,linewidth=2,color=Conf.ClassColors[k])
axMVA.set_title(MVA["MVAtype"]+' Score: Node '+str(i+1),fontsize=10)
axMVA.set_xlabel('Score',fontsize=10)
@@ -530,8 +536,8 @@ def corre(df,Classes=[''],MVA={}):
if Conf.MVAlogplot:
axMVA.set_xscale('log')

fpr, tpr, th = roc_curve(Y_test[:, i], y_test_pred[:, i])
fpr_tr, tpr_tr, th_tr = roc_curve(Y_train[:, i], y_train_pred[:, i])
fpr, tpr, th = roc_curve(Y_test[:, i], y_test_pred[:, i],sample_weight=plotwt_test)
fpr_tr, tpr_tr, th_tr = roc_curve(Y_train[:, i], y_train_pred[:, i],sample_weight=plotwt_train)
mask = tpr > 0.0
fpr, tpr = fpr[mask], tpr[mask]

@@ -582,8 +588,8 @@ def corre(df,Classes=[''],MVA={}):
plot_single_roc_point(df_final.query('TrainDataset==0'), var=OverlayWpi, ax=axes, color=color, marker='o', markersize=8, label=OverlayWpi+" Test dataset", cat=cat,Wt=weight)
if len(Conf.MVAs)>0:
for MVAi in Conf.MVAs:
plot_roc_curve(df_final.query('TrainDataset==0'),MVAi["MVAtype"]+"_pred", tpr_threshold=0.0, ax=axes, color=MVAi["Color"], linestyle='--', label=MVAi["Label"]+' Testing',cat=cat,Wt=weight)
plot_roc_curve(df_final.query('TrainDataset==1'),MVAi["MVAtype"]+"_pred", tpr_threshold=0.0, ax=axes, color=MVAi["Color"], linestyle='-', label=MVAi["Label"]+' Training',cat=cat,Wt=weight)
plot_roc_curve(df_final.query('TrainDataset==0'),MVAi["MVAtype"]+"_pred", tpr_threshold=0.0, ax=axes, color=MVAi["Color"], linestyle='--', label=MVAi["Label"]+' Testing',cat=cat,Wt='xsecwt')
plot_roc_curve(df_final.query('TrainDataset==1'),MVAi["MVAtype"]+"_pred", tpr_threshold=0.0, ax=axes, color=MVAi["Color"], linestyle='-', label=MVAi["Label"]+' Training',cat=cat,Wt='xsecwt')
axes.set_ylabel("Background efficiency (%)")
axes.set_xlabel("Signal efficiency (%)")
axes.set_title("Final")