# Case Study - Mushrooms Classification {#mushroom}
This example demonstrates how to classify mushrooms as edible or not. It also answers the question: what are the main characteristics of an edible mushroom?
[This blog post](https://stoltzmaniac.com/random-forest-classification-of-mushrooms/) first gave us the idea, and we followed most of it. We also noticed that Kaggle has put the same data set and classification exercise online. We have taken inspiration from some posts [here](https://www.kaggle.com/abhishekheads/d/uciml/mushroom-classification/walk-through-of-different-classification-models) and [here](https://www.kaggle.com/jhuno137/d/uciml/mushroom-classification/classification-tree-using-rpart-100-accuracy).
The data set is available on the [Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Mushroom) of the UC Irvine website.
## Import the data
The data set is given to us in a rough form and quite a bit of editing is necessary.
```{r message=FALSE}
library(tidyverse)  # in case it is not loaded already: readr, dplyr, purrr, tibble

# Load the data - we downloaded the data from the website and saved it into a .csv file
mushroom <- read_csv("dataset/Mushroom.csv", col_names = FALSE)
glimpse(mushroom)
```
We have `r nrow(mushroom)` mushrooms in the data set, and each observation consists of `r ncol(mushroom)` variables. As it stands, the data frame doesn't look very meaningful. We have to go back to the source to bring meaning to each of the variables and to the various levels of the categorical variables.
## Tidy the data
This is the least fun part of the workflow.
We'll start by giving names to each of the variables, then we'll specify the levels of each categorical variable.
This is not strictly necessary, but it does add meaning to what we do.
```{r message=FALSE}
# Rename the variables
colnames(mushroom) <- c("edibility", "cap_shape", "cap_surface",
                        "cap_color", "bruises", "odor",
                        "gill_attachement", "gill_spacing", "gill_size",
                        "gill_color", "stalk_shape", "stalk_root",
                        "stalk_surface_above_ring", "stalk_surface_below_ring",
                        "stalk_color_above_ring", "stalk_color_below_ring",
                        "veil_type", "veil_color", "ring_number", "ring_type",
                        "spore_print_color", "population", "habitat")

# Defining the levels for the categorical variables
## We make each variable a factor
mushroom <- mushroom %>% map_df(function(.x) as.factor(.x))

## We redefine the levels of each of the variables
levels(mushroom$edibility) <- c("edible", "poisonous")
levels(mushroom$cap_shape) <- c("bell", "conical", "flat", "knobbed", "sunken", "convex")
levels(mushroom$cap_color) <- c("buff", "cinnamon", "red", "gray", "brown", "pink",
                                "green", "purple", "white", "yellow")
levels(mushroom$cap_surface) <- c("fibrous", "grooves", "scaly", "smooth")
levels(mushroom$bruises) <- c("no", "yes")
levels(mushroom$odor) <- c("almond", "creosote", "foul", "anise", "musty", "none",
                           "pungent", "spicy", "fishy")
levels(mushroom$gill_attachement) <- c("attached", "free")
levels(mushroom$gill_spacing) <- c("close", "crowded")
levels(mushroom$gill_size) <- c("broad", "narrow")
levels(mushroom$gill_color) <- c("buff", "red", "gray", "chocolate", "black", "brown",
                                 "orange", "pink", "green", "purple", "white", "yellow")
levels(mushroom$stalk_shape) <- c("enlarging", "tapering")
levels(mushroom$stalk_root) <- c("missing", "bulbous", "club", "equal", "rooted")
levels(mushroom$stalk_surface_above_ring) <- c("fibrous", "silky", "smooth", "scaly")
levels(mushroom$stalk_surface_below_ring) <- c("fibrous", "silky", "smooth", "scaly")
levels(mushroom$stalk_color_above_ring) <- c("buff", "cinnamon", "red", "gray", "brown",
                                             "pink", "green", "purple", "white", "yellow")
levels(mushroom$stalk_color_below_ring) <- c("buff", "cinnamon", "red", "gray", "brown",
                                             "pink", "green", "purple", "white", "yellow")
levels(mushroom$veil_type) <- "partial"
levels(mushroom$veil_color) <- c("brown", "orange", "white", "yellow")
levels(mushroom$ring_number) <- c("none", "one", "two")
levels(mushroom$ring_type) <- c("evanescent", "flaring", "large", "none", "pendant")
levels(mushroom$spore_print_color) <- c("buff", "chocolate", "black", "brown", "orange",
                                        "green", "purple", "white", "yellow")
levels(mushroom$population) <- c("abundant", "clustered", "numerous", "scattered",
                                 "several", "solitary")
levels(mushroom$habitat) <- c("wood", "grasses", "leaves", "meadows", "paths",
                              "urban", "waste")
```
Let's check our changes one last time before diving into the next phase of our data analysis workflow.
```{r}
glimpse(mushroom)
```
As each variable is categorical, let's see how many categories we are dealing with.
```{r}
number_class <- function(x) {
  length(levels(x))
}

x <- mushroom %>%
  map_dbl(number_class) %>%
  # enframe() keeps the variable names, which as_tibble() would drop
  enframe(name = "Variable name", value = "Number of levels") %>%
  arrange(desc(`Number of levels`))
print(x)
```
## Understand the data
This is the circular phase of working with data, where the transforming, visualizing and modeling stages reinforce each other to create a better understanding.
![data workflow](otherpics/data_workflow.png)
### Transform the data
We noticed in the previous section an issue with the `veil_type` variable: it has only one level, so it does not bring any information. Furthermore, factor variables with only one level create issues later on at the modeling stage, where R will throw an error.
So let's take that column away.
```{r}
mushroom <- mushroom %>% select(- veil_type)
```
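To be safe, we can also check programmatically that no other single-level factor remains; this quick sketch (not part of the original analysis) flags any such column:
```{r}
# Any factor with fewer than two levels? An empty result means none remain.
mushroom %>% map_lgl(~ nlevels(.x) < 2) %>% which()
```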
Do we have any missing data? Most ML algorithms won't work with missing values.
```{r}
map_dbl(mushroom, function(.x) {sum(is.na(.x))})
```
Lucky us! We have no missing data.
### Visualize the data
This is one of the most important steps in the data science process. This stage can give us unexpected insights and often allows us to ask the right questions.
```{r mushroom_pic1, message=FALSE}
library(ggplot2)
ggplot(mushroom, aes(x = cap_surface, y = cap_color, col = edibility)) +
geom_jitter(alpha = 0.5) +
scale_color_manual(breaks = c("edible", "poisonous"),
values = c("green", "red"))
```
If we want to stay safe, we had better bet on a *fibrous* surface. Stay especially away from *smooth* surfaces, unless they are purple or green.
```{r mushroom_pic2}
ggplot(mushroom, aes(x = cap_shape, y = cap_color, col = edibility)) +
geom_jitter(alpha = 0.5) +
scale_color_manual(breaks = c("edible", "poisonous"),
values = c("green", "red"))
```
Again, if one doesn't know much about mushrooms, it is better to stay away from all cap shapes except maybe *bell* shaped mushrooms.
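As a quick sanity check (not in the original analysis), a cross-tabulation of cap shape against edibility quantifies this visual impression:
```{r}
# Counts of edible vs poisonous mushrooms for each cap shape
table(mushroom$cap_shape, mushroom$edibility)
```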
```{r mushroom_pic3}
ggplot(mushroom, aes(x = gill_color, y = cap_color, col = edibility)) +
geom_jitter(alpha = 0.5) +
scale_color_manual(breaks = c("edible", "poisonous"),
values = c("green", "red"))
ggplot(mushroom, aes(x = edibility, y = odor, col = edibility)) +
geom_jitter(alpha = 0.5) +
scale_color_manual(breaks = c("edible", "poisonous"),
values = c("green", "red"))
```
Odor is definitely quite an informative predictor. Basically, if it smells *fishy*, *spicy* or *pungent*, just stay away. If it smells like *anise* or *almond*, you can go ahead. If it doesn't smell of anything, the odds are that it is edible rather than not.
TO DO: put a comment on what we see
TO DO: put a mosaic graph
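In the meantime, here is a minimal sketch of such a mosaic graph (not part of the original analysis), using base R's `mosaicplot()` on the odor variable:
```{r mushroom_mosaic}
# Column widths show how common each odor is; the green/red split within
# each column shows the proportion of edible vs poisonous mushrooms
mosaicplot(table(mushroom$odor, mushroom$edibility),
           main = "Edibility by odor", color = c("green", "red"), las = 2)
```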
### Modeling
At this stage, we should have gathered enough information and insights on our data to choose appropriate modeling techniques.
Before we go ahead, we need to split the data into a training and a testing set.
```{r mush01}
set.seed(1810)
mushsample <- caret::createDataPartition(y = mushroom$edibility, times = 1, p = 0.8, list = FALSE)
train_mushroom <- mushroom[mushsample, ]
test_mushroom <- mushroom[-mushsample, ]
```
We can check the quality of the splits with regard to our predicted (dependent) variable.
```{r mush02}
round(prop.table(table(mushroom$edibility)), 2)
round(prop.table(table(train_mushroom$edibility)), 2)
round(prop.table(table(test_mushroom$edibility)), 2)
```
It seems like we have the right splits.
#### Use of a Classification Tree
As we have many categorical variables, a classification tree is an ideal tool for this kind of situation.
We'll use the `rpart` package. Let's give it a try without any customization.
```{r mush03}
library(rpart)
library(rpart.plot)
set.seed(1810)
model_tree <- rpart(edibility ~ ., data = train_mushroom, method = "class")
model_tree
```
```{r mush04}
caret::confusionMatrix(data=predict(model_tree, type = "class"),
reference = train_mushroom$edibility,
positive="edible")
```
We have quite an issue here: 40 mushrooms have been predicted as edible when they were actually poisonous. That should not happen. So we'll set up a penalty for wrongly predicting a mushroom as `edible` when it is in reality `poisonous`. A mistake the other way is not as bad: at worst we miss out on a good recipe! So let's redo our tree with a penalty for wrongly predicting `edible`. To do this, we introduce a penalty matrix that we pass as the `loss` parameter to the `rpart` function.
```{r mush05}
# Rows are the true classes (edible, poisonous), columns the predicted ones:
# predicting "edible" for a truly poisonous mushroom (row 2, column 1)
# costs 10, while the opposite mistake only costs 1
penalty_matrix <- matrix(c(0, 1, 10, 0), byrow = TRUE, nrow = 2)

model_tree_penalty <- rpart(edibility ~ ., data = train_mushroom, method = "class",
                            parms = list(loss = penalty_matrix))

caret::confusionMatrix(data = predict(model_tree_penalty, type = "class"),
                       reference = train_mushroom$edibility,
                       positive = "edible")
```
So introducing a penalty did the job: it gave us a perfect prediction and saved us from a journey to the hospital.
Another way to increase the accuracy of our tree model is to tune the `cp` (complexity) parameter.
We start by building a tree with a very low `cp`, that is, a deep tree; the idea is to prune it back later.
```{r mush06}
model_tree <- rpart(edibility ~ ., data = train_mushroom,
method = "class", cp = 0.00001)
```
To prune a tree, we first have to find the `cp` value that gives the lowest `xerror`, or cross-validation error. We can find it using either the `printcp` or the `plotcp` function.
```{r}
printcp(model_tree)
```
We can see here that the lowest `xerror` happens at the 5th split.
```{r mushroom_xvalplot1}
plotcp(model_tree)
```
```{r}
model_tree$cptable[which.min(model_tree$cptable[, "xerror"]), "CP"]
```
So now we can start pruning our tree with the `cp` that gives the lowest cross-validation error.
```{r}
bestcp <- round(model_tree$cptable[which.min(model_tree$cptable[, "xerror"]), "CP"], 4)
model_tree_pruned <- prune(model_tree, cp = bestcp)
```
Let's have a quick look at the tree as it stands.
```{r mushroom_treeplot1}
rpart.plot(model_tree_pruned, extra = 104, box.palette = "GnBu",
branch.lty = 3, shadow.col = "gray", nn = TRUE)
```
How does the model perform on the training data?
```{r}
#table(train_mushroom$edibility, predict(model_tree, type="class"))
caret::confusionMatrix(data=predict(model_tree_pruned, type = "class"),
reference = train_mushroom$edibility,
positive="edible")
```
It seems we have perfect accuracy on our training set, which is quite rare.
Let's check how the pruned tree fares on the testing set.
```{r}
caret::confusionMatrix(data = predict(model_tree_pruned, newdata = test_mushroom, type = "class"),
                       reference = test_mushroom$edibility,
                       positive = "edible")
```
Perfect prediction here as well.
#### Use of Random Forest
We usually turn to a random forest when a single tree is not enough. In this case, as we have a perfect prediction with a single tree, a Random Forest algorithm is not really necessary; we use it just for learning's sake, without tuning any of the parameters.
```{r rf_errorplot1, message=FALSE}
library(randomForest)
yo <- train_mushroom %>% na.omit()
model_rf <- randomForest(edibility ~ ., data = yo, importance = TRUE)
plot(model_rf)
```
We kept the random forest's default of 500 trees. As we can see on the plot, above roughly 20 trees the error isn't decreasing anymore, and it is in fact 0 or almost 0.
The next step can tell us this more accurately.
```{r}
print(model_rf)
```
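To pinpoint where the out-of-bag error bottoms out, we can inspect the `err.rate` matrix that `randomForest` stores (one row per number of trees); a minimal sketch, not part of the original analysis:
```{r}
# Smallest OOB error, and the first number of trees at which it is reached
min(model_rf$err.rate[, "OOB"])
which.min(model_rf$err.rate[, "OOB"])
```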
Although it is not really necessary here, as we have a perfect prediction, we can use the `confusionMatrix` function from the `caret` package.
```{r}
caret::confusionMatrix(data = model_rf$predicted, reference = yo$edibility ,
positive = "edible")
```
If we want to look at the most important variables for predicting edibility in our model, we can do that using the *Mean Decrease in Gini*.
```{r rf_importance_variable}
varImpPlot(model_rf, sort = TRUE,
n.var = 10, main = "The 10 variables with the most predictive power")
```
Another way to look at the predictive power of the variables is to use the `importance` extractor function.
```{r}
library(tibble)
randomForest::importance(model_rf) %>% data.frame() %>%
rownames_to_column(var = "Variable") %>%
arrange(desc(MeanDecreaseGini)) %>%
head(10)
```
We could compare that with the important variables from the classification tree obtained above.
```{r}
model_tree_penalty$variable.importance %>%
  # enframe() keeps the variable names (as_tibble() would drop them)
  enframe(name = "variable", value = "importance") %>%
  arrange(desc(importance)) %>%
  head(10)
```
Interestingly, `gill_size`, the 5th most important predictor in the random forest, does not appear in the top 10 of our classification tree.
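To make that comparison more direct, here is a minimal sketch (not part of the original analysis) that joins the two rankings; `rf_imp`, `tree_imp`, `rf_rank` and `tree_rank` are ad-hoc names:
```{r}
rf_imp <- randomForest::importance(model_rf) %>%
  data.frame() %>%
  rownames_to_column(var = "variable") %>%
  transmute(variable, rf_rank = rank(-MeanDecreaseGini))

tree_imp <- model_tree_penalty$variable.importance %>%
  enframe(name = "variable", value = "importance") %>%
  transmute(variable, tree_rank = rank(-importance))

# Variables the tree never used show up with an NA tree_rank
full_join(rf_imp, tree_imp, by = "variable") %>% arrange(rf_rank)
```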
Now we apply our model to our testing set.
```{r}
test_rf <- predict(model_rf, newdata = test_mushroom)
# Quick check on our prediction
table(test_rf, test_mushroom$edibility)
```
Perfect Prediction!
#### Use of SVM
Finally, let's fit a support vector machine with the `svm` function from the `e1071` package, using its default radial kernel and hand-picked values for the `cost` and `gamma` parameters.
```{r warning=FALSE, message=FALSE}
library(e1071)
model_svm <- svm(edibility ~ ., data = train_mushroom, cost = 1000, gamma = 0.01)
```
Let's check the prediction.
```{r}
test_svm <- predict(model_svm, newdata = test_mushroom %>% na.omit())
yo <- test_mushroom %>% na.omit()
table(test_svm, yo$edibility)
```
And perfect prediction again!
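The `cost` and `gamma` values above were hand-picked. As a minimal sketch (not evaluated here, since it fits many models), one could instead grid-search them with cross-validation using the `tune` function from `e1071`; the grids below are illustrative values, not the original author's:
```{r eval=FALSE}
# 10-fold cross-validated grid search over cost and gamma (slow on this data set)
svm_tune <- tune(svm, edibility ~ ., data = train_mushroom,
                 ranges = list(cost = c(1, 10, 100, 1000),
                               gamma = c(0.001, 0.01, 0.1)))
summary(svm_tune)
```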
## Communication
With some fine tuning, a classification tree managed to accurately predict the edibility of mushrooms. There were two parameters to look at: the `cp` parameter and the penalty matrix.
Random Forest and SVM achieved similar results out of the box.
The tree-based approach is to be preferred, as it is a lot easier to grasp the results of a tree than of an SVM algorithm.
For sure I will take my little tree picture next time I go shrooming. That said, I will still only go with a good mycologist.