Using machine learning to predict/classify species of beetles

Data

Explanation of variables in the data:

  • ID: ID of the beetle specimen.

  • morphotype: The species of beetle. All within the genus Hydrobius. This is the outcome I want to predict.

  • 15 morphological features: Includes body size, length and width of the wings and pronotum, and smaller structures on the males. Most measured in mm or μm. Some features are ratios between two lengths. One feature characterizing the shape of a structure (mesoShape) is an angle that is measured in degrees. See publication for details.

Looking at the number of samples per species.

table(beetle_data$morphotype)
    arcticus     fuscipes rottenbergii  subrotundus 
          22           39           27           35 

Next, looking at how much missing data there is per variable:

# Missing per variables
naniar::vis_miss(beetle_data)

# Number of specimens with complete data
table(beetle_data[complete.cases(beetle_data), "morphotype"])
morphotype
    arcticus     fuscipes rottenbergii  subrotundus 
           4            3            3            3 

As seen, there are very few complete cases, primarily due to high missingness in variables that were difficult or time-consuming to measure. Not ideal, but imputation can be used to fill in likely values.

Split data into train and test set

Before doing any imputation, the data needs to be split into train and test sets. That way data leakage is avoided.

The split is 60% train and 40% test, using stratified splitting with species as the strata. Multiple splits could be done to test the sensitivity to the split, but that is not done here; bootstrapping methods are also an option for validation later, instead of a single data split. Since the data is highly informative, a 25% train / 75% test split is also made, to see whether a good model is possible with very little training data.

# set the seed for reproducibility
set.seed(200)

# Get train index: y is the outcome variable to stratify with, p is the
# proportion in train
trainIndex_60_40 <- createDataPartition(beetle_data$morphotype, p = 0.6, list = FALSE,
    times = 1)

trainIndex_25_75 <- createDataPartition(beetle_data$morphotype, p = 0.25, list = FALSE,
    times = 1)

# Split the data in 2:
beetle_train_60_40 <- beetle_data[trainIndex_60_40, ]
beetle_test_60_40 <- beetle_data[-trainIndex_60_40, ]

beetle_train_25_75 <- beetle_data[trainIndex_25_75, ]
beetle_test_25_75 <- beetle_data[-trainIndex_25_75, ]

Check the distribution of species per dataset.

# Train
table(beetle_train_60_40$morphotype)
    arcticus     fuscipes rottenbergii  subrotundus 
          14           24           17           21 
table(beetle_train_25_75$morphotype)
    arcticus     fuscipes rottenbergii  subrotundus 
           6           10            7            9 
# Test
table(beetle_test_60_40$morphotype)
    arcticus     fuscipes rottenbergii  subrotundus 
           8           15           10           14 
table(beetle_test_25_75$morphotype)
    arcticus     fuscipes rottenbergii  subrotundus 
          16           29           20           26 

Imputation of missing data

In an ideal world I would use Multiple Imputation by Chained Equations (MICE) to get reasonable imputed datapoints. However, the missingness pattern here makes this impossible to do perfectly. First, MICE assumes linear relationships and no interactions between the variables being imputed. We likely have both here, including clear interactions between predictors and the outcome/species (i.e. a two-way interaction of outcome × predictor Z on predictor Y), meaning the effect of predictor Z on predictor Y depends on the species. MICE treats the effect as the same for all species, i.e. no interaction, which would severely weaken the correlations between species and predictors. Secondly, key variables have high missingness percentages.

The best option for handling the interactions is to do the imputation within each species. While this gives good imputations overall, it uses information (the species identity) that would be unknown for real new data. In practice this means key variables would need to be collected for any new individual, as the validation is only valid for individuals with low missingness. Since this is a toy example, this limitation is accepted and, for simplicity, only one imputed dataset is created, although ideally several should be. In some cases a variable was entirely collinear with the others and MICE did not impute it; the species mean was used as the imputation instead.

To reduce data leakage, imputation was done on the train data, and the train-data imputation rules were then used to impute the test data. The code is not shown in the markdown; see the script if interested.
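The species-mean fallback rule described above can be sketched in a train-to-test fashion (the full per-species MICE code lives in the script, not here). This is a minimal illustration, assuming train/test data frames named `beetle_train` and `beetle_test` with a `morphotype` column and numeric feature columns:

```r
library(dplyr)

# Per-species means learned on the TRAIN data only
train_means <- beetle_train %>%
    group_by(morphotype) %>%
    summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

# Apply the train-derived species means to fill NAs in any dataset
impute_with_train_means <- function(df, means) {
    df %>%
        group_by(morphotype) %>%
        mutate(across(where(is.numeric),
                      ~ ifelse(is.na(.x),
                               means[[cur_column()]][means$morphotype == cur_group()$morphotype],
                               .x))) %>%
        ungroup()
}

beetle_train_imp <- impute_with_train_means(beetle_train, train_means)
beetle_test_imp  <- impute_with_train_means(beetle_test, train_means)
```

The key point is that `train_means` is computed once from the train data and reused for the test data, so no test-set information leaks into the imputation.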

Visualize the data with PCA

Start by getting the matrix with predictors and scaling the variables.

# Obtain matrix with only numeric values; drop morphotype and data_type
beetle_matrix_60_40 <- beetle_data_imp_60_40 %>%
    select(-c(morphotype, data_type))
beetle_matrix_25_75 <- beetle_data_imp_25_75 %>%
    select(-c(morphotype, data_type))

# Scale the variables
beetle_matrix_scaled_60_40 <- beetle_matrix_60_40 %>%
    scale(.)
beetle_matrix_scaled_25_75 <- beetle_matrix_25_75 %>%
    scale(.)

Obtain the correlation matrix and plot this.

# Obtain correlation metrix
beetle_matrix_corr_60_40 <- cor(beetle_matrix_scaled_60_40)
beetle_matrix_corr_25_75 <- cor(beetle_matrix_scaled_25_75)

# Plot correlation
ggcorrplot(beetle_matrix_corr_60_40)

ggcorrplot(beetle_matrix_corr_25_75)

Next, calculate the components. Since the variables are already scaled, running princomp with cor = FALSE on the scaled matrix is equivalent to a correlation-based PCA.

# Do PCA
beetle_pca_60_40 <- princomp(beetle_matrix_scaled_60_40, cor = F)
beetle_pca_25_75 <- princomp(beetle_matrix_scaled_25_75, cor = F)

# Summary of components
summary(beetle_pca_60_40)
Importance of components:
                          Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
Standard deviation     2.5491685 1.7429645 1.17755273 1.03562701 0.82587496
Proportion of Variance 0.4367683 0.2041884 0.09319975 0.07208763 0.04584401
Cumulative Proportion  0.4367683 0.6409567 0.73415648 0.80624411 0.85208813
                           Comp.6     Comp.7     Comp.8     Comp.9    Comp.10
Standard deviation     0.79398107 0.69803116 0.60439262 0.48952698 0.41500606
Proportion of Variance 0.04237155 0.03274942 0.02455231 0.01610673 0.01157612
Cumulative Proportion  0.89445967 0.92720909 0.95176140 0.96786813 0.97944425
                         Comp.11     Comp.12     Comp.13    Comp.14
Standard deviation     0.3879400 0.268567482 0.229002153 0.16923607
Proportion of Variance 0.0101154 0.004847981 0.003524789 0.00192504
Cumulative Proportion  0.9895596 0.994407626 0.997932415 0.99985746
                            Comp.15
Standard deviation     0.0460519786
Proportion of Variance 0.0001425445
Cumulative Proportion  1.0000000000
summary(beetle_pca_25_75)
Importance of components:
                          Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
Standard deviation     2.5328448 1.7572287 1.14780618 1.13415186 0.83776193
Proportion of Variance 0.4311925 0.2075442 0.08855053 0.08645626 0.04717319
Cumulative Proportion  0.4311925 0.6387367 0.72728722 0.81374348 0.86091667
                           Comp.6     Comp.7     Comp.8     Comp.9    Comp.10
Standard deviation     0.78098378 0.67367279 0.59457160 0.44385191 0.40577794
Proportion of Variance 0.04099568 0.03050367 0.02376087 0.01324129 0.01106702
Cumulative Proportion  0.90191235 0.93241601 0.95617688 0.96941817 0.98048520
                           Comp.11     Comp.12     Comp.13     Comp.14
Standard deviation     0.372321379 0.256553466 0.220735096 0.184515579
Proportion of Variance 0.009317298 0.004423946 0.003274891 0.002288338
Cumulative Proportion  0.989802493 0.994226439 0.997501329 0.999789667
                            Comp.15
Standard deviation     0.0559405587
Proportion of Variance 0.0002103331
Cumulative Proportion  1.0000000000
# Loadings for the first 2 components
beetle_pca_60_40$loadings[, 1:2]
                  Comp.1      Comp.2
lbody         0.33899569  0.21755956
lpara         0.26533175 -0.27649837
wparaDor     -0.30305787 -0.14398110
lpen          0.06412485 -0.12384238
robust        0.32990159 -0.04701433
LparaLpen     0.25372311 -0.20541436
wparaLat      0.09404319 -0.44508854
curvpara      0.17484255 -0.38178486
posSetipunct  0.23599179 -0.34167762
mesoShape    -0.21206020 -0.28837259
lpron         0.24441816  0.01586591
lely          0.32435685  0.25681398
wpron         0.34662768  0.03994501
wely          0.33791782  0.07539477
EI            0.07843437  0.42248387

Need 7 components to explain >90% of the variance. First 2 components explain 64% of the variance.
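The variance explained per component can also be read off a scree plot; a quick sketch using factoextra (already used below for the individuals plot):

```r
# Scree plot: percentage of variance explained by each component
factoextra::fviz_eig(beetle_pca_60_40, addlabels = TRUE)
```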

Lastly, make the PCA plot.

factoextra::fviz_pca_ind(beetle_pca_60_40,
             col.ind = beetle_data_imp_60_40$morphotype, # color by groups
             geom = "point", # only show points
             addEllipses = T, 
             ellipse.type = "norm", # normal-probability ellipse; "convex" would give a convex hull
             legend.title = "Species",
             repel = TRUE
             )

factoextra::fviz_pca_ind(beetle_pca_25_75,
             col.ind = beetle_data_imp_25_75$morphotype, # color by groups
             geom = "point", # only show points
             addEllipses = T, 
             ellipse.type = "norm", # normal-probability ellipse; "convex" would give a convex hull
             legend.title = "Species",
             repel = TRUE
             )

Train and test a random forest model: 60-40 split

Start by training and evaluating on the 60-40 split. Parameter tuning is done for the number of randomly selected predictors at each split (mtry), while keeping the number of trees constant (ntree = 1000). nodesize = 1, i.e. terminal nodes may contain a single individual. maxnodes is not set, so trees can grow to maximal size. The number of trees could also be tuned, but this seems fine for a toy example.

# Tuning parameters: using 10-fold cross-validation with a grid search 
train_ctrl <- trainControl(method="cv", # type of resampling in this case Cross-Validated
                           number=10, # number of folds
                           search = "grid", # we are performing a "grid search" for tuning
                           )
set.seed(874)
model_rf_60_40 <- train(morphotype ~ .,
                       data = beetle_data_train_60_40,
                       method = "rf", # this will use the randomForest::randomForest function
                       metric = "Accuracy", # which metric should be optimized for 
                       trControl = train_ctrl,
                        tuneLength  = 10, # number of mtry values to test (10 here)
                       # options to be passed to randomForest
                       ntree = 1000,
                       nodesize=1,
                       keep.forest=TRUE,
                       importance=TRUE) 
model_rf_60_40
Random Forest 

76 samples
15 predictors
 4 classes: 'arcticus', 'fuscipes', 'rottenbergii', 'subrotundus' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 69, 66, 68, 69, 69, 68, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    1.0000000  1.0000000
   3    1.0000000  1.0000000
   4    1.0000000  1.0000000
   6    0.9857143  0.9805556
   7    0.9857143  0.9805556
   9    0.9857143  0.9805556
  10    0.9857143  0.9805556
  12    0.9857143  0.9805556
  13    0.9857143  0.9805556
  15    0.9857143  0.9805556

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
plot(model_rf_60_40)

Using 2 randomly selected predictors (mtry=2) during tree splits gives best accuracy.
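The chosen tuning value can also be pulled out programmatically from the caret train object; a small sketch:

```r
# Best tuning parameters and the full cross-validated results table
model_rf_60_40$bestTune   # data frame with the selected mtry
model_rf_60_40$results    # accuracy/kappa per mtry value

# The final randomForest model, with its out-of-bag error, as an extra check
model_rf_60_40$finalModel
```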

Check variable importance next:

varImpPlot(model_rf_60_40$finalModel)

Nothing too surprising here.
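As a cross-check on varImpPlot, caret's own importance summary (scaled 0-100) can be printed from the same model:

```r
# caret's scaled variable importance for the trained model
caret::varImp(model_rf_60_40)
```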

Next, how are the performance and confusion matrix for train and test set:

# Train data
p_train <- predict(model_rf_60_40, beetle_data_train_60_40)
confusionMatrix(p_train, factor(beetle_data_train_60_40$morphotype))
Confusion Matrix and Statistics

              Reference
Prediction     arcticus fuscipes rottenbergii subrotundus
  arcticus           14        0            0           0
  fuscipes            0       24            0           0
  rottenbergii        0        0           17           0
  subrotundus         0        0            0          21

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.9526, 1)
    No Information Rate : 0.3158     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: arcticus Class: fuscipes Class: rottenbergii
Sensitivity                   1.0000          1.0000              1.0000
Specificity                   1.0000          1.0000              1.0000
Pos Pred Value                1.0000          1.0000              1.0000
Neg Pred Value                1.0000          1.0000              1.0000
Prevalence                    0.1842          0.3158              0.2237
Detection Rate                0.1842          0.3158              0.2237
Detection Prevalence          0.1842          0.3158              0.2237
Balanced Accuracy             1.0000          1.0000              1.0000
                     Class: subrotundus
Sensitivity                      1.0000
Specificity                      1.0000
Pos Pred Value                   1.0000
Neg Pred Value                   1.0000
Prevalence                       0.2763
Detection Rate                   0.2763
Detection Prevalence             0.2763
Balanced Accuracy                1.0000
## 100% accurate on train data

# Test data
p_test <- predict(model_rf_60_40, beetle_data_test_60_40)
confusionMatrix(p_test, factor(beetle_data_test_60_40$morphotype))
Confusion Matrix and Statistics

              Reference
Prediction     arcticus fuscipes rottenbergii subrotundus
  arcticus            8        0            0           0
  fuscipes            0       15            0           1
  rottenbergii        0        0           10           0
  subrotundus         0        0            0          13

Overall Statistics
                                          
               Accuracy : 0.9787          
                 95% CI : (0.8871, 0.9995)
    No Information Rate : 0.3191          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.971           
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: arcticus Class: fuscipes Class: rottenbergii
Sensitivity                   1.0000          1.0000              1.0000
Specificity                   1.0000          0.9688              1.0000
Pos Pred Value                1.0000          0.9375              1.0000
Neg Pred Value                1.0000          1.0000              1.0000
Prevalence                    0.1702          0.3191              0.2128
Detection Rate                0.1702          0.3191              0.2128
Detection Prevalence          0.1702          0.3404              0.2128
Balanced Accuracy             1.0000          0.9844              1.0000
                     Class: subrotundus
Sensitivity                      0.9286
Specificity                      1.0000
Pos Pred Value                   1.0000
Neg Pred Value                   0.9706
Prevalence                       0.2979
Detection Rate                   0.2766
Detection Prevalence             0.2766
Balanced Accuracy                0.9643
## 1 subrotundus misclassified as fuscipes, else all correct

Visualize the performance/predictions

# Obtain probabilities and predicted classes
probs_60_40 <- predict(model_rf_60_40, beetle_data_test_60_40, "prob")
class_60_40 <- predict(model_rf_60_40, beetle_data_test_60_40, "raw")
TEST_scored_60_40 <- cbind(beetle_data_test_60_40, probs_60_40, class_60_40)

# plot of probabilities
p1 <- ggplot(TEST_scored_60_40, aes(x = 1:nrow(TEST_scored_60_40), y = arcticus,
    color = morphotype)) + geom_point() + labs(color = "Species", x = "index", y = "Predicted probability of arcticus")
p2 <- ggplot(TEST_scored_60_40, aes(x = 1:nrow(TEST_scored_60_40), y = rottenbergii,
    color = morphotype)) + geom_point() + labs(color = "Species", x = "index", y = "Predicted probability of rottenbergii")
p3 <- ggplot(TEST_scored_60_40, aes(x = 1:nrow(TEST_scored_60_40), y = fuscipes,
    color = morphotype)) + geom_point() + labs(color = "Species", x = "index", y = "Predicted probability of fuscipes")
p4 <- ggplot(TEST_scored_60_40, aes(x = 1:nrow(TEST_scored_60_40), y = subrotundus,
    color = morphotype)) + geom_point() + labs(color = "Species", x = "index", y = "Predicted probability of subrotundus")

ggpubr::ggarrange(p1, p2, p3, p4, ncol = 2, nrow = 2)

# Plot classes
ggplot(TEST_scored_60_40, aes(x = 1:nrow(TEST_scored_60_40), y = class_60_40, color = morphotype)) +
    geom_point() + labs(color = "True species", x = "index", y = "Predicted species") +
    theme_bw()

Train and test a random forest model: 25-75 split

Next, I train and evaluate on the 25-75 split. As before, parameter tuning is done for the number of randomly selected predictors at each split (mtry), while keeping the number of trees constant (ntree = 1000). nodesize = 1, i.e. terminal nodes may contain a single individual. maxnodes is not set, so trees can grow to maximal size. The number of trees could also be tuned, but this seems fine for a toy example.

# Tuning parameters: using 10-fold cross-validation with a grid search 
train_ctrl <- trainControl(method="cv", # type of resampling in this case Cross-Validated
                           number=10, # number of folds
                           search = "grid", # we are performing a "grid search" for tuning
                           )
set.seed(35)
model_rf_25_75 <- train(morphotype ~ .,
                       data = beetle_data_train_25_75,
                       method = "rf", # this will use the randomForest::randomForest function
                       metric = "Accuracy", # which metric should be optimized for 
                       trControl = train_ctrl,
                        tuneLength  = 10, # number of mtry values to test (10 here)
                       # options to be passed to randomForest
                       ntree = 1000,
                       nodesize=1,
                       keep.forest=TRUE,
                       importance=TRUE) 
model_rf_25_75
Random Forest 

32 samples
15 predictors
 4 classes: 'arcticus', 'fuscipes', 'rottenbergii', 'subrotundus' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 29, 30, 30, 28, 28, 28, ... 
Resampling results across tuning parameters:

  mtry  Accuracy  Kappa    
   2    0.975     0.9666667
   3    0.975     0.9666667
   4    0.975     0.9666667
   6    0.975     0.9666667
   7    0.975     0.9666667
   9    0.975     0.9666667
  10    0.975     0.9666667
  12    0.975     0.9666667
  13    0.975     0.9666667
  15    0.975     0.9666667

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
plot(model_rf_25_75)

Using 2 randomly selected predictors (mtry=2) during tree splits gives best accuracy.

Check variable importance next:

varImpPlot(model_rf_25_75$finalModel)

Slightly different from the previous split, but ok.

Next, how are the performance and confusion matrix for train and test set:

# Train data
p_train <- predict(model_rf_25_75, beetle_data_train_25_75)
confusionMatrix(p_train, factor(beetle_data_train_25_75$morphotype))
Confusion Matrix and Statistics

              Reference
Prediction     arcticus fuscipes rottenbergii subrotundus
  arcticus            6        0            0           0
  fuscipes            0       10            0           0
  rottenbergii        0        0            7           0
  subrotundus         0        0            0           9

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.8911, 1)
    No Information Rate : 0.3125     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: arcticus Class: fuscipes Class: rottenbergii
Sensitivity                   1.0000          1.0000              1.0000
Specificity                   1.0000          1.0000              1.0000
Pos Pred Value                1.0000          1.0000              1.0000
Neg Pred Value                1.0000          1.0000              1.0000
Prevalence                    0.1875          0.3125              0.2188
Detection Rate                0.1875          0.3125              0.2188
Detection Prevalence          0.1875          0.3125              0.2188
Balanced Accuracy             1.0000          1.0000              1.0000
                     Class: subrotundus
Sensitivity                      1.0000
Specificity                      1.0000
Pos Pred Value                   1.0000
Neg Pred Value                   1.0000
Prevalence                       0.2812
Detection Rate                   0.2812
Detection Prevalence             0.2812
Balanced Accuracy                1.0000
## 100% accurate on train data

# Test data
p_test <- predict(model_rf_25_75, beetle_data_test_25_75)
confusionMatrix(p_test, factor(beetle_data_test_25_75$morphotype))
Confusion Matrix and Statistics

              Reference
Prediction     arcticus fuscipes rottenbergii subrotundus
  arcticus           16        0            0           0
  fuscipes            0       28            0           0
  rottenbergii        0        0           20           0
  subrotundus         0        1            0          26

Overall Statistics
                                          
               Accuracy : 0.989           
                 95% CI : (0.9403, 0.9997)
    No Information Rate : 0.3187          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9851          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: arcticus Class: fuscipes Class: rottenbergii
Sensitivity                   1.0000          0.9655              1.0000
Specificity                   1.0000          1.0000              1.0000
Pos Pred Value                1.0000          1.0000              1.0000
Neg Pred Value                1.0000          0.9841              1.0000
Prevalence                    0.1758          0.3187              0.2198
Detection Rate                0.1758          0.3077              0.2198
Detection Prevalence          0.1758          0.3077              0.2198
Balanced Accuracy             1.0000          0.9828              1.0000
                     Class: subrotundus
Sensitivity                      1.0000
Specificity                      0.9846
Pos Pred Value                   0.9630
Neg Pred Value                   1.0000
Prevalence                       0.2857
Detection Rate                   0.2857
Detection Prevalence             0.2967
Balanced Accuracy                0.9923
## 1 fuscipes misclassified as subrotundus, else all correct

Visualize the performance/predictions

# Obtain probabilities and predicted classes
probs_25_75 <- predict(model_rf_25_75, beetle_data_test_25_75, "prob")
class_25_75 <- predict(model_rf_25_75, beetle_data_test_25_75, "raw")
TEST_scored_25_75 <- cbind(beetle_data_test_25_75, probs_25_75, class_25_75)

# plot of probabilities
p1 <- ggplot(TEST_scored_25_75, aes(x = 1:nrow(TEST_scored_25_75), y = arcticus,
    color = morphotype)) + geom_point() + labs(color = "Species", x = "index", y = "Predicted probability of arcticus")
p2 <- ggplot(TEST_scored_25_75, aes(x = 1:nrow(TEST_scored_25_75), y = rottenbergii,
    color = morphotype)) + geom_point() + labs(color = "Species", x = "index", y = "Predicted probability of rottenbergii")
p3 <- ggplot(TEST_scored_25_75, aes(x = 1:nrow(TEST_scored_25_75), y = fuscipes,
    color = morphotype)) + geom_point() + labs(color = "Species", x = "index", y = "Predicted probability of fuscipes")
p4 <- ggplot(TEST_scored_25_75, aes(x = 1:nrow(TEST_scored_25_75), y = subrotundus,
    color = morphotype)) + geom_point() + labs(color = "Species", x = "index", y = "Predicted probability of subrotundus")

ggpubr::ggarrange(p1, p2, p3, p4, ncol = 2, nrow = 2)

# Plot classes
ggplot(TEST_scored_25_75, aes(x = 1:nrow(TEST_scored_25_75), y = class_25_75, color = morphotype)) +
    geom_point() + labs(color = "True species", x = "index", y = "Predicted species") +
    theme_bw()