Tackling the Kaggle Histopathologic Cancer Detection Challenge to evaluate different machine-learning algorithms for identifying metastatic cancer in small image patches taken from larger digital pathology scans.
For this project, to understand how far the field has come from traditional Computer-Vision techniques to modern Deep Learning, I start with the most basic algorithms and iteratively improve each model's performance one small step at a time. Not every step is guaranteed to improve performance, but trying them is necessary to build a working intuition of what might work.
I start off with hand-engineered CV features (Color-Space Transforms, LBP, Gabor, Scharr, Laplacian, Harris, etc.) that work well with Shallow-ML models, and compare their performance against the automatic feature-extraction of large DL models.
Validation accuracy of the baseline model started at 53.2%. The best Shallow-ML model topped out at 87.2% using 60 hand-engineered features; the best CNN model topped out at 97.6%.
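To give a feel for these hand-engineered features, here is a minimal sketch of a rotation-invariant LBP histogram using scikit-image; the parameters (P=8, R=1, 'uniform' coding) are illustrative rather than the exact notebook settings. A few more sketches (HED deconvolution, landmark distances, Gabor/Scharr filters, the GBT evaluation pattern, and the baseline CNN) follow the step table.

```python
# Minimal sketch: rotation-invariant 'uniform' LBP histogram for one grayscale patch.
# P=8 neighbours at radius R=1 are illustrative choices, not the notebook's exact settings.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_patch, P=8, R=1):
    """Return a normalized histogram of 'uniform' LBP codes for a 2-D grayscale patch."""
    codes = local_binary_pattern(gray_patch, P, R, method="uniform")  # code values in [0, P+1]
    hist, _ = np.histogram(codes, bins=np.arange(P + 3), density=True)  # P+2 bins
    return hist.astype(np.float32)

# A random uint8 patch stands in for one 96x96 histopathology tile.
patch = (np.random.rand(96, 96) * 255).astype(np.uint8)
print(lbp_histogram(patch))  # 10-dimensional texture descriptor
```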
Step | Notebook | Description |
---|---|---|
1 | Data_Exploration | Exploratory Data Analysis |
2 | Data_HDF5 | Generate Grayscale+HED HDF5 dataset volume (sketch below) |
3 | Data_1D | Generate Naïve-1D flattened .npz from HDF5 for Shallow-ML |
4 | LogReg | Baseline Naïve-1D with Logistic Regression |
5 | Create_LBP_Feat | Generate LBP features and Evaluate on GBT classifier |
6 | LBP_Euclidean_vs_KLD | LBP histogram Dissimilarity metrics: Euclidean vs KL-Divergence |
7 | LogReg | Baseline LBP features with Logistic Regression |
8 | Find_Landmarks | Develop/Test algorithm for finding a set of 'Landmarks' |
9 | Generate_Landmarks | Generate Landmarks on Histopathology dataset LBP features |
10 | Create_LDist_Feat | Generate Distance-to-Landmarks (identified above) features (sketch below) |
11 | LogReg | Baseline Landmark features with Logistic Regression |
12 | PCA | Evaluate effect of PCA transformation on i. LBP and ii. Landmark features |
13 | SVM | Evaluate SVM model with i. LBP and ii. Landmark features |
14 | Create_5LBP_Feat | Generate 5-cell overlapping LBPs: 64x64px centered and 32x32px on four corners |
15 | GBT | Evaluate GBT model with 'Double-LBP' (scaling-pyramid: full-size 96x96px, half-size 48x48px) features |
16 | Create_COPOD_Feat | Classification using COPOD scores on LBP features |
17 | Create_2x2LBP_Feat | Add 2nd set of Rotation-Invariant LBP texture features |
18 | Create_Gabor_Feat | Add Gabor Filters (16x 2-D kernels) features |
19 | Create_Gabor_Scharr_Feat | Add Gabor+Scharr Gradient Filter features (sketch below) |
20 | Create_Laplacian_Feat | Add Laplacian Edge-Detection Filter features |
21 | Create_Harris_Feat | Add Harris Corner-Detection Filter features |
22 | GBT | Re-evaluate GBT model on aggregation of best Shallow-ML features (sketch below) |
23 | TPOT | Evaluate TPOT Auto-ML on Shallow-ML features (Last of the Shallow Models) |
24 | NN | Evaluate Neural Network with Shallow-ML (LBP, Gabor, Scharr) features |
25 | CNN_ModelA | Sequential CNN with Increasing # Conv2D filters (sketch below) |
26 | CNN_ModelB | Sequential CNN with Decreasing # Conv2D filters |
27 | CNN_ModelA-BD | CNN_ModelA on full 200k Train set |
28 | CNN_ModelD1-BD-AUG-N | Added Augmentations, Gaussian Noise, more Dropout |
29 | CNN_ModelF | Replace last AvgPooling2D with Conv2D, Reduce Learning-Rate |
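Below are a few more minimal sketches of the techniques referenced in the table; they are illustrative stand-ins with placeholder parameters and file names, not the exact notebook code. First, the Grayscale+HED volume of Step 2 (Data_HDF5), assuming scikit-image's rgb2gray/rgb2hed colour deconvolution:

```python
# Sketch of the Grayscale + HED channel stack written to HDF5 in Step 2 (Data_HDF5).
# The HDF5 file and dataset names are placeholders; rgb2gray/rgb2hed are real scikit-image calls.
import h5py
import numpy as np
from skimage.color import rgb2gray, rgb2hed

def gray_hed_stack(rgb_patch):
    """Stack grayscale + Haematoxylin/Eosin/DAB channels into an (H, W, 4) float array."""
    gray = rgb2gray(rgb_patch)   # (H, W)
    hed = rgb2hed(rgb_patch)     # (H, W, 3) colour-deconvolved stain channels
    return np.dstack([gray, hed]).astype(np.float32)

patch = np.random.rand(96, 96, 3)            # stand-in for one 96x96 RGB tile
volume = np.stack([gray_hed_stack(patch)])   # (N, 96, 96, 4) for N patches

with h5py.File("patches_gray_hed.h5", "w") as f:  # placeholder file name
    f.create_dataset("images", data=volume, compression="gzip")
```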
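The Distance-to-Landmarks features of Steps 8-10 boil down to: select a small set of representative LBP histograms ('Landmarks'), then describe every patch by its distance to each landmark. In this sketch, KMeans centroids stand in for whatever selection rule Find_Landmarks actually uses, and plain Euclidean distance stands in for the Euclidean-vs-KL-Divergence comparison of Step 6.

```python
# Sketch of Distance-to-Landmarks features: distances from each patch's LBP histogram
# to a small set of representative "landmark" histograms.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
lbp_features = rng.random((1000, 10))  # stand-in for per-patch LBP histograms

# KMeans centroids as illustrative landmarks (the notebooks may select them differently).
landmarks = KMeans(n_clusters=16, n_init=10, random_state=0).fit(lbp_features).cluster_centers_

# One distance per landmark -> a 16-dimensional feature vector per patch.
landmark_dist_features = pairwise_distances(lbp_features, landmarks, metric="euclidean")
print(landmark_dist_features.shape)  # (1000, 16)
```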
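A sketch of the Gabor-bank and Scharr features from Steps 18-19: summary statistics of 16 Gabor responses (4 frequencies x 4 orientations) plus the Scharr gradient magnitude. The frequencies and the mean/std summaries are illustrative choices rather than the notebooks' exact parameters.

```python
# Sketch of Gabor-bank + Scharr gradient features for one grayscale patch.
import numpy as np
from skimage.filters import gabor, scharr

def gabor_scharr_features(gray_patch):
    feats = []
    # 4 frequencies x 4 orientations = 16 Gabor kernels (illustrative bank).
    for frequency in (0.1, 0.2, 0.3, 0.4):
        for theta in np.arange(4) * np.pi / 4:
            real, _ = gabor(gray_patch, frequency=frequency, theta=theta)
            feats += [real.mean(), real.std()]
    grad = scharr(gray_patch)  # Scharr gradient magnitude
    feats += [grad.mean(), grad.std()]
    return np.asarray(feats, dtype=np.float32)  # 16*2 + 2 = 34 values per patch

patch = np.random.rand(96, 96)
print(gabor_scharr_features(patch).shape)
```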
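All of the shallow models are evaluated the same way: fit a classifier on the concatenated hand-engineered features and report hold-out validation accuracy, as in the GBT re-run of Step 22. scikit-learn's GradientBoostingClassifier and random arrays stand in here for the actual GBT implementation and the real 60-feature matrix.

```python
# Sketch of the shallow-model evaluation pattern: GBT on aggregated hand-engineered features.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((2000, 60))             # stand-in for the 60 aggregated features
y = rng.integers(0, 2, size=2000)      # stand-in for tumour / no-tumour labels

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
gbt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, random_state=0)
gbt.fit(X_tr, y_tr)
print("validation accuracy:", accuracy_score(y_va, gbt.predict(X_va)))
```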
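Finally, a sketch of the 'increasing-filters' Sequential CNN shape from Step 25 (CNN_ModelA), taking the 96x96 RGB competition patch as input; filter counts, dropout rate, and optimizer settings are illustrative.

```python
# Sketch of a Sequential CNN with increasing Conv2D filter counts (CNN_ModelA style).
from tensorflow import keras
from tensorflow.keras import layers

def build_model_a(input_shape=(96, 96, 3)):
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(128, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # binary: metastatic vs. not
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

build_model_a().summary()
```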