# Case Study - Wisconsin Breast Cancer {#breastcancer}
This is another classification example: we have to classify breast tumors as malignant or benign.
The dataset is available on the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)) as well as on [Kaggle](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data).
We have taken ideas from several blogs listed in the reference section at the end of this chapter.
## Import the data
\index{Breast cancer dataset}
```{r breastcancer01, message=FALSE}
library(tidyverse)
df <- read_csv("dataset/BreastCancer.csv")
# This is definitely one of the most important steps:
# check that each variable has the appropriate class.
glimpse(df)
```
So we have `r nrow(df)` observations of `r ncol(df)` variables. Ideally, with so many variables, we would like to have quite a few more observations.
## Tidy the data
The tidying here is minimal: we change the type of the outcome variable to a factor and, if needed, rename badly encoded variables.
```{r breastcancer02}
df$diagnosis <- as.factor(df$diagnosis)
#df <- df %>% rename(concave_points_mean = `concave points_mean`,
# concave_points_se = `concave points_se`,
# concave_points_worst = `concave points_worst`)
```
As you might have noticed, in this case as in the previous one, there was very little to do here. This is not usually the case.
## Understand the data
This is the circular phase of our work with the data, where the transforming, visualizing and modeling stages reinforce each other to create a better understanding.
Let's first check for missing values.
```{r breastcancer03}
map_int(df, function(.x) sum(is.na(.x)))
```
Good news: there are no missing values.
If there were many missing values, we would go back to the transformation step for some appropriate imputation.
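As an illustration only (this dataset has no missing values), here is a minimal sketch of median imputation with `caret::preProcess()`; the chunk is not evaluated and the object names are hypothetical.
```{r imputation_sketch, eval=FALSE}
library(caret)
# Learn median imputation rules on the numeric predictors only.
imputer <- preProcess(df %>% select(-id, -diagnosis), method = "medianImpute")
# Apply those rules to fill in any missing values.
df_imputed <- predict(imputer, df %>% select(-id, -diagnosis))
```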
Let's now check how balanced our response variable is.
```{r breastcancer04}
round(prop.table(table(df$diagnosis)), 2)
```
The response variable is slightly unbalanced.
Let's now look at correlations between the variables. Most ML algorithms assume that the predictor variables are independent of each other.
For an analysis to be robust, it is good to remove multicollinearity, that is, to remove highly correlated predictors.
\index{Correlation plot}
```{r correlation_plot, fig.height=9, fig.width=9}
df_corr <- cor(df %>% select(-id, -diagnosis))
corrplot::corrplot(df_corr, order = "hclust", tl.cex = 1, addrect = 8)
```
Indeed, quite a few variables are correlated with each other. In the next step, we will remove the highly correlated ones using the `caret` package.
### Transform the data
```{r breastcancer05, message=FALSE}
library(caret)
# The findCorrelation() function from the caret package flags highly correlated predictors,
# here those whose pairwise correlation is above 0.9. It uses a heuristic algorithm to decide
# which variable of each correlated pair should be removed, instead of selecting blindly.
df2 <- df %>% select(-id, -diagnosis) %>%
  select(-findCorrelation(df_corr, cutoff = 0.9))
# Number of columns of our new data frame
ncol(df2)
```
So our new data frame `df2`, which keeps only the predictors, is down to `r ncol(df2)` variables.
### Pre-process the data
#### Using PCA
Let's first do an unsupervised analysis with PCA.
To do so, we remove the `id` and `diagnosis` variables, then scale and center the remaining variables.
```{r cumulative_variance}
preproc_pca_df <- prcomp(df %>% select(-id, -diagnosis), scale = TRUE, center = TRUE)
summary(preproc_pca_df)
# Calculate the proportion of variance explained
pca_df_var <- preproc_pca_df$sdev^2
pve_df <- pca_df_var / sum(pca_df_var)
cum_pve <- cumsum(pve_df)
pve_table <- tibble(comp = seq(1:ncol(df %>% select(-id, -diagnosis))), pve_df, cum_pve)
ggplot(pve_table, aes(x = comp, y = cum_pve)) +
geom_point() +
geom_abline(intercept = 0.95, color = "red", slope = 0) +
labs(x = "Number of components", y = "Cumulative Variance")
```
With the original dataset, 95% of the variance is explained by the first 10 PCs; we can also confirm this programmatically, as sketched below.
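A quick check of that number, assuming the `cum_pve` vector computed in the previous chunk:
```{r pc95_check, eval=FALSE}
# Smallest number of components whose cumulative proportion of variance reaches 95%.
which(cum_pve >= 0.95)[1]
```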
Let's now plot the observations on the first 2 components.
```{r breastcancer06}
pca_df <- as_tibble(preproc_pca_df$x)
ggplot(pca_df, aes(x = PC1, y = PC2, col = df$diagnosis)) + geom_point()
```
It does look like the first 2 components manage to separate the diagnoses quite well. Lots of potential here.
If we want a more detailed look at which variables are the most influential in the first 2 components, we can use the `ggfortify` package.
```{r pc1vspc2, message=FALSE, warning=FALSE}
library(ggfortify)
autoplot(preproc_pca_df, data = df, colour = 'diagnosis',
loadings = FALSE, loadings.label = TRUE, loadings.colour = "blue")
```
Let's visualize the first 3 components.
```{r pc123_in_pairs}
df_pcs <- cbind(tibble(value = df$diagnosis), as_tibble(preproc_pca_df$x))
GGally::ggpairs(df_pcs, columns = 2:4, ggplot2::aes(color = value))
```
Let's do the same exercise with our second data frame, `df2`, the one where we removed the highly correlated predictors.
```{r cumulative_variance2}
preproc_pca_df2 <- prcomp(df2, scale = TRUE, center = TRUE)
summary(preproc_pca_df2)
pca_df2_var <- preproc_pca_df2$sdev^2
# proportion of variance explained
pve_df2 <- pca_df2_var / sum(pca_df2_var)
cum_pve_df2 <- cumsum(pve_df2)
pve_table_df2 <- tibble(comp = seq(1:ncol(df2)), pve_df2, cum_pve_df2)
ggplot(pve_table_df2, aes(x = comp, y = cum_pve_df2)) +
geom_point() +
geom_abline(intercept = 0.95, color = "red", slope = 0) +
labs(x = "Number of components", y = "Cumulative Variance")
```
In this case, around 8 PCs explain 95% of the variance.
#### Using LDA
The advantage of LDA is that, unlike PCA, it is supervised: it takes the different classes of the outcome into consideration.
```{r breastcancer07}
preproc_lda_df <- MASS::lda(diagnosis ~., data = df, center = TRUE, scale = TRUE)
preproc_lda_df
# Make a data frame out of the LDA predictions for visualization purposes.
predict_lda_df <- predict(preproc_lda_df, df)$x %>%
  as_tibble() %>%
  cbind(diagnosis = df$diagnosis)
glimpse(predict_lda_df)
```
### Model the data
Let's first create a training and a testing set using `caret`.
```{r breastcancer08}
set.seed(1815)
df3 <- cbind(diagnosis = df$diagnosis, df2)
df_sampling_index <- createDataPartition(df3$diagnosis, times = 1, p = 0.8, list = FALSE)
df_training <- df3[df_sampling_index, ]
df_testing <- df3[-df_sampling_index, ]
df_control <- trainControl(method="cv",
number = 15,
classProbs = TRUE,
summaryFunction = twoClassSummary)
```
#### Logistic regression
Our first model is doing logistic regression on `df2`, the data frame where we took away the highly correlated variables.
```{r breastcancer09, message=FALSE, warning=FALSE}
model_logreg_df <- train(diagnosis ~., data = df_training, method = "glm",
metric = "ROC", preProcess = c("scale", "center"),
trControl = df_control)
prediction_logreg_df <- predict(model_logreg_df, df_testing)
cm_logreg_df <- confusionMatrix(prediction_logreg_df, df_testing$diagnosis, positive = "M")
cm_logreg_df
```
#### Random Forest
Our second model uses a random forest. Similarly, we use the `df2` data frame, the one where we took away the highly correlated variables.
```{r breastcancer10, message=FALSE, warning=FALSE}
model_rf_df <- train(diagnosis ~., data = df_training,
method = "rf",
metric = 'ROC',
trControl = df_control)
prediction_rf_df <- predict(model_rf_df, df_testing)
cm_rf_df <- confusionMatrix(prediction_rf_df, df_testing$diagnosis, positive = "M")
cm_rf_df
```
Let's make some diagnostic plots.
```{r randomforest_model_plot}
plot(model_rf_df)
plot(model_rf_df$finalModel)
varImpPlot(model_rf_df$finalModel, sort = TRUE,
n.var = 10, main = "The 10 variables with the most predictive power")
```
#### KNN
```{r breastcancer11}
model_knn_df <- train(diagnosis ~., data = df_training,
method = "knn",
metric = "ROC",
preProcess = c("scale", "center"),
trControl = df_control,
tuneLength = 31)
plot(model_knn_df)
prediction_knn_df <- predict(model_knn_df, df_testing)
cm_knn_df <- confusionMatrix(prediction_knn_df, df_testing$diagnosis, positive = "M")
cm_knn_df
```
#### Support Vector Machine
```{r breastcancer12, message=FALSE}
set.seed(1815)
model_svm_df <- train(diagnosis ~., data = df_training, method = "svmLinear",
metric = "ROC",
preProcess = c("scale", "center"),
trace = FALSE,
trControl = df_control)
prediction_svm_df <- predict(model_svm_df, df_testing)
cm_svm_df <- confusionMatrix(prediction_svm_df, df_testing$diagnosis, positive = "M")
cm_svm_df
```
This is an OK model.
I wonder, though, if we could achieve better results with the SVM when running it on the PCA-transformed data.
```{r breastcancer13}
set.seed(1815)
df_control_pca <- trainControl(method="cv",
number = 15,
preProcOptions = list(thresh = 0.9), # threshold for pca preprocess
classProbs = TRUE,
summaryFunction = twoClassSummary)
model_svm_pca_df <- train(diagnosis~.,
df_training, method = "svmLinear", metric = "ROC",
preProcess = c('center', 'scale', "pca"),
trControl = df_control_pca)
prediction_svm_pca_df <- predict(model_svm_pca_df, df_testing)
cm_svm_pca_df <- confusionMatrix(prediction_svm_pca_df, df_testing$diagnosis, positive = "M")
cm_svm_pca_df
```
That's already better. The `thresh` parameter, the cumulative-variance cutoff of the PCA pre-processing, is what we needed to play with.
#### Neural Network with LDA
To use the LDA pre-processed data, we also need to create the corresponding training and testing sets, using the same partition index.
```{r breastcancer14}
lda_training <- predict_lda_df[df_sampling_index, ]
lda_testing <- predict_lda_df[-df_sampling_index, ]
model_nnetlda_df <- train(diagnosis ~., lda_training,
method = "nnet",
metric = "ROC",
preProcess = c("center", "scale"),
tuneLength = 10,
trace = FALSE,
trControl = df_control)
prediction_nnetlda_df <- predict(model_nnetlda_df, lda_testing)
cm_nnetlda_df <- confusionMatrix(prediction_nnetlda_df, lda_testing$diagnosis, positive = "M")
cm_nnetlda_df
```
#### Models evaluation
```{r model_evaluation_plot}
model_list <- list(logistic = model_logreg_df, rf = model_rf_df,
svm = model_svm_df, SVM_with_PCA = model_svm_pca_df,
Neural_with_LDA = model_nnetlda_df)
results <- resamples(model_list)
summary(results)
bwplot(results, metric = "ROC")
#dotplot(results)
```
The logistic regression has too much variability for it to be reliable. The random forest and the neural network with LDA pre-processing give the best results. The ROC metric used here measures the AUC of the ROC curve of each model and is independent of any probability threshold.
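Beyond the resampling summary, we could also look at the ROC curve of a given model on the testing set. A minimal sketch, assuming the `pROC` package is available (it is not used elsewhere in this chapter):
```{r roc_curve_sketch, eval=FALSE}
library(pROC)
# Class probabilities of the random forest on the testing set.
probs_rf <- predict(model_rf_df, df_testing, type = "prob")
# ROC curve for the malignant class, with the AUC printed on the plot.
roc_rf <- roc(response = df_testing$diagnosis, predictor = probs_rf$M, levels = c("B", "M"))
plot(roc_rf, print.auc = TRUE)
```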
Let's recall how these models performed on the testing dataset. Prediction classes are obtained by default with a cutoff of 0.5, which may not be the best choice with an unbalanced dataset like this one.
```{r breastcancer15}
cm_list <- list(cm_rf = cm_rf_df, cm_svm = cm_svm_df,
cm_logisic = cm_logreg_df, cm_nnet_LDA = cm_nnetlda_df)
results <- map_df(cm_list, function(x) x$byClass) %>% as_tibble() %>%
mutate(stat = names(cm_rf_df$byClass))
results
```
The best result for sensitivity (the detection of malignant cases) is obtained with the neural network on the LDA pre-processed data, which also has a great F1 score. Since the default 0.5 cutoff may not be ideal for an unbalanced outcome, we could also experiment with other cutoffs, as sketched below.
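A minimal sketch of how a different cutoff could be applied, using class probabilities from `caret`; the 0.3 value is only an illustration, not a tuned choice.
```{r cutoff_sketch, eval=FALSE}
# Class probabilities of the random forest on the testing set.
probs_rf <- predict(model_rf_df, df_testing, type = "prob")
# Classify as malignant when P(M) exceeds 0.3 instead of the default 0.5.
pred_rf_03 <- factor(ifelse(probs_rf$M > 0.3, "M", "B"),
                     levels = levels(df_testing$diagnosis))
confusionMatrix(pred_rf_03, df_testing$diagnosis, positive = "M")
```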
## References
A useful and popular kernel on this dataset on [Kaggle](https://www.kaggle.com/lbronchal/breast-cancer-dataset-analysis).
Another one, also on [Kaggle](https://www.kaggle.com/sonicboom8/breast-cancer-data-with-logistic-randomforest).
And [another one](https://www.kaggle.com/murnix/cluster-rf-boosting-svm-accuracy-97-auc-0-96/notebook), especially nice for comparing models.
\printindex{}