---
title: "Fast and Frugal Trees with the {FFTrees} Package"
author: "Peter Higgins"
output: html_document
editor_options:
  markdown:
    wrap: sentence
---
```{r setup, include=FALSE}
library(tidyverse)
library(FFTrees)
library(medicaldata)
library(broom)
library(janitor)
library(webexercises)
```
# Fast and Frugal Trees with the {FFTrees} Package
Tree-based models are one of the earliest forms of machine learning, and they have some specific advantages over traditional linear or logistic regression models.
As a general rule, tree-based models handle interactions between independent predictor variables well, while traditional models struggle with interactions.
This can be helpful in biology and medicine where things like cations and electrochemical balance have inevitable interactions.
Tree-based models use a series of binary splits to create a tree structure that predicts the outcome.
The tree is built by selecting the best split at each node, based on the predictor variables.
The tree is grown until a stopping criterion is met, such as a minimum number of cases in a node, or a maximum depth of the tree.
Each split is optimized to create a more pure 'branch' of one outcome.
The tree is then pruned to avoid overfitting, and the final tree is used to make predictions on new data.
The original tree-based models were CART (Classification and Regression Trees) models, which were developed by Breiman et al. in the 1980s.
The CART algorithm is a recursive partitioning algorithm that creates binary splits in the data to build a tree structure, which is then grown, pruned, and used for prediction as described above.
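To make the recursive-splitting idea concrete, here is a minimal CART sketch using the {rpart} package (which ships with R and is not used elsewhere in this chapter; the object name `cart_fit` is just for illustration):
```{r}
library(rpart)

# Fit a small classification tree: rpart picks the best binary split at each
# node, and stops according to its default complexity and node-size criteria.
cart_fit <- rpart(Species ~ ., data = iris)
cart_fit # prints each split and the purity of the resulting nodes
```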
We have since developed more complicated tree-based models, like random forests, in which each tree is grown on a random subset of the observations and considers a random subset of the predictor variables at each split.
This ensemble approach uses the available information more completely than a single tree.
A single tree is often not very accurate, but a forest of trees, called a random forest, can be very accurate.
Random forests are an ensemble method that creates many trees, each on a random subset of the data, and averages the predictions of the trees to create a more accurate prediction.
Random forests are one of the most accurate machine learning algorithms available, and are widely used in biology and medicine.
The advantage of a single tree is its transparency.
You can see where and what the splitting nodes are, and gain a better understanding of the reasons for the outcome.
These are also referred to as 'white box' models, as opposed to 'black box' models like neural networks and random forests, which are difficult to interpret.
White box models are also easier to use in clinical practice.
In order to get the best of both worlds, we can use a fast and frugal tree model, which is a simple tree model that is easy to interpret and use in clinical practice.
The {FFTrees} package makes the fast-and-frugal tree (FFT) algorithm easy to use in R.
We will work through three example datasets, using the {FFTrees} package to build simple tree models: to diagnose breast cancer from a digitized biopsy slide, to identify heart disease in at-risk people, and to predict survival from tuberculosis in the `strep_tb` trial.
## Setup
1. Install the "FFTrees" package from CRAN.
2. Load this package with `library(FFTrees)`.
3. Load the "medicaldata" package with `library(medicaldata)`.
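If you are starting from a fresh R installation, a minimal setup sketch (run the `install.packages()` lines only once):
```{r, eval=FALSE}
install.packages("FFTrees") # once, from CRAN
install.packages("medicaldata") # once, from CRAN

library(FFTrees)
library(medicaldata)
```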
## The Breast Cancer Dataset
This dataset is from the UCI Machine Learning Repository, and each observation comes from a fine needle aspirate (FNA) of a breast mass.
The dataset contains 683 complete observations and 10 variables: the diagnosis of breast cancer (TRUE = malignant, FALSE = benign) and 9 features describing characteristics of the cells and cell nuclei in the sample, each scored on a scale of 1-10.
The dataset is available in the {FFTrees} package as `breastcancer`, and details can be found at: [breast_cancer-FNA](https://www.kaggle.com/datasets/salihacur/breastcancerwisconsin).
The variables include:
- diagnosis - TRUE (cancer) or FALSE (not cancer)
- thickness - thickness of clumps (1-10)
- cellsize.unif - uniformity of cell size (1-10)
- cellshape.unif - uniformity of cell shape (1-10)
- adhesion - a score for how much cells are adherent to each other (1-10)
- epithelial - a score for how much epithelium is present (1-10)
- nuclei.bare - a score for how many bare nuclei (without cytoplasm or cell membrane) are found (1-10)
- chromatin - a score for presence of bland chromatin (1-10)
- nucleoli - a score for presence of normal nucleoli (1-10)
- mitoses - a score for presence of mitoses (1-10)
Load the data into RStudio by copying and running the code below.
```{r}
data("breastcancer")
```
### Data Inspection
You can inspect the `breastcancer` data in the Data Viewer by going to the Environment pane and clicking on the dataset to open it.
Take a look, and get a rough sense of which variables (with high scores) are associated with a TRUE diagnosis of breast cancer.
In the Data Viewer, you can click on the variable `diagnosis` to sort observations to look at the predictor variable values for TRUE and FALSE cases.
Clicking on the `diagnosis` variable again will sort the observations in the reverse order.
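If you prefer code to clicking, here is a quick sketch that summarizes each predictor by diagnosis, using {dplyr} (loaded with the tidyverse in the setup chunk):
```{r}
# mean of each 1-10 feature score, split by benign (FALSE) vs. malignant (TRUE)
breastcancer |>
  group_by(diagnosis) |>
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))
```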
### Check Your Progress
```{r, echo=FALSE}
opts1 <- c('200', answer = '683', '811')
opts2 <- c('1-2', answer = '1-10', '0-400')
```
1. How many observations are there in the breast cancer dataset?
`r mcq(opts1)`
2. What is the range of the `mitoses` variable?
`r mcq(opts2)`
## Building a FFTrees Model for Breast Cancer
We will build a simple tree model to predict the diagnosis of breast cancer from the features of the cell nuclei.
Let's walk through the code block below, which builds a tree model using the `FFTrees` function from the `FFTrees` package.
1. We will name the resulting model `breast.fft`, and use the assignment arrow to make this happen.
2. We will use the `FFTrees` function to build the model.
3. We will specify the first argument as the formula for the model, which is `diagnosis ~ .`, meaning we will predict the diagnosis from *all* other variables in the dataset.
4. We will specify the second argument as the data, which is `breastcancer`.
5. We will specify the `train.p` argument as 0.5, which means we will randomly use half of the data to train the model and half to test the model.
6. We will specify the `main` argument as "Breast Cancer", which will be the title of the plot.
7. We will specify the `decision.labels` argument as `c("Healthy", "Disease")`, which will be the labels for the outcome variable in the plot. This works because the underlying value for FALSE is 0, and the underlying value for TRUE is 1, so that these are in the correct order.
```{r, message=FALSE, warning=FALSE}
set.seed(123)
# the seed is used to get reproducible outcomes, as the random split into train and test will give slightly different results each time.
# note we are not using the cost argument - we will use the default for this.
breast.fft <- FFTrees(formula = diagnosis ~ .,
                      data = breastcancer,
                      train.p = 0.5,
                      main = "Breast Cancer",
                      decision.labels = c("Healthy", "Disease"))
```
Now copy this code block and run it in your local RStudio instance.
This should rank the predictor variables (`cues`), train some FFT models on the training data, rank the candidate trees by their performance, and give you a short printout.
You should also see a new assigned object, `breast.fft` in your Environment pane.
To see the fft object, copy and run the code below:
```{r}
breast.fft
```
This prints out the (default) assigned cost of 1 for false positives (fa, for false alarm) and false negatives (mi, for miss); the definition of the best tree model (with 2 nodes, or branch points); the best tree's performance on the training data; and some accuracy statistics.
Now let's plot how this model does on the test data.
Copy and run the code below:
```{r}
plot(breast.fft, data = "test")
```
You may need to click the Zoom button at the top left of your Plots pane to get a good view.
This shows us that among the 341 randomly selected test cases (342 used for training), only 35% had TRUE breast cancer.
Then it shows model 1, which has 2 nodes, splitting on uniformity of cell size below 3, and uniformity of cell shape below 3.
You can see how each split 'purifies' the outcomes - never perfectly, but pretty well.
There is a confusion matrix at the lower left showing hits, misses, false alarms, and correct rejections (true positives, false negatives, false positives, and true negatives).
The model (on the Testing Set) has a sensitivity of 0.89, specificity of 0.96, and accuracy of 0.93.
The `bacc` variable is balanced accuracy, which is `sens * 0.5 + spec * 0.5`.
The mcu is `mean cues used` per observation, and pci is `percent of cues ignored`.
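As a quick check, you can compute the balanced accuracy by hand from the test sensitivity and specificity reported above:
```{r}
# balanced accuracy = sens * 0.5 + spec * 0.5
0.89 * 0.5 + 0.96 * 0.5
```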
There is an ROC curve at the lower right comparing the model to alternative types of model, including CART, logistic regression, random forest, and support vector machine models.
They are all quite good, and sensitivity and specificity are well balanced.
In some cases, sensitivity and specificity are not well balanced.
You may place a greater negative value on missed diagnoses, and want to shift the model to be more sensitive.
You can do this by making the cost of a miss (mi) higher than the cost of a false alarm (fa), via the `cost.outcomes` argument of the `FFTrees()` function.
For example, you could set `cost.outcomes = list(hi = 0, fa = 1, mi = 2, cr = 0)`.
This would mean that a missed diagnosis is twice as costly as a false alarm.
Conversely, setting `fa = 2` and `mi = 1` would make a false alarm twice as costly as a missed diagnosis.
It depends on the clinical situation and your judgement, but you should think about the relative cost/harm of these two types of errors and adjust the cost argument if needed.
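Here is a sketch of such a miss-averse refit, assuming a recent version of {FFTrees} (2.0 or later), in which outcome costs are supplied as a named list; the object name `breast_miss.fft` is just an illustration, and `goal = "cost"` asks the algorithm to rank trees by total cost rather than accuracy:
```{r, eval=FALSE}
set.seed(123)
# weight a miss (mi) twice as heavily as a false alarm (fa);
# hits (hi) and correct rejections (cr) cost nothing
breast_miss.fft <- FFTrees(formula = diagnosis ~ .,
                           data = breastcancer,
                           train.p = 0.5,
                           main = "Breast Cancer (miss-averse)",
                           decision.labels = c("Healthy", "Disease"),
                           cost.outcomes = list(hi = 0, fa = 1, mi = 2, cr = 0),
                           goal = "cost")

plot(breast_miss.fft, data = "test")
```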
You can see the accuracy of each predictor by plotting these with the code below:
```{r}
plot(breast.fft, what = "cues")
```
You can see that predictor (cue) #9 is not great.
You can print these out with:
```{r}
breast.fft$cues$stats$train
```
And see that counting mitoses (#9) is not a great use of your pathologist's time, with a sensitivity of 0.425.
Several of these potential predictors are just not that useful, compared to size and shape uniformity.
You can see the best model in text (words) with:
```{r}
inwords(breast.fft, tree = 1)
```
which gives you a simple decision algorithm.
If you want to see the accuracy for all of the trees generated, you can use
```{r}
breast.fft$trees$stats$test
```
You can see the definitions of all trees with:
```{r}
breast.fft$trees$definitions
```
If sensitivity is really important, you might choose model 5 or 6, which both have a sensitivity of 1.
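If you want to examine one of those more sensitive trees directly, you can pass its number to `plot()` (tree 5 here is just an example; the numbering can change if you change the random seed):
```{r, eval=FALSE}
plot(breast.fft, data = "test", tree = 5)
```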
You can also predict the outcome for new data from the best training tree with:
```{r}
predict(breast.fft,
        newdata = breastcancer[1:10, ],
        type = "prob")
```
This gives you the probability of FALSE (no cancer) and TRUE (cancer) for the first 10 observations in the dataset.
You can also use `type = "class"` to get the predicted class (FALSE or TRUE) for each observation, or use `type = "both"` if you want to see both.
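For instance, to get just the predicted classes for the same rows:
```{r, eval=FALSE}
predict(breast.fft,
        newdata = breastcancer[1:10, ],
        type = "class")
```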
<br> You can also manually control which predictors go into the model.
<br> For example, you could use only bare nuclei and bland chromatin predictors with the code below:
```{r, message=FALSE, warning=FALSE}
breast.fft <- FFTrees(formula = diagnosis ~ nuclei.bare + chromatin,
                      data = breastcancer,
                      train.p = 0.5,
                      main = "Breast Cancer",
                      decision.labels = c("Healthy", "Disease"))

plot(breast.fft, data = "test")
```
Or you can manually specify a model with the example code below.
Copy and paste into your local RStudio instance.
```{r, error=TRUE, message=FALSE, warning=FALSE}
breast.fft <- FFTrees(diagnosis ~ .,
                      data = breastcancer,
                      my.tree = "If thickness > 6, predict TRUE. If chromatin > 3, predict TRUE. Otherwise, predict FALSE.")

plot(breast.fft, data = "train")
```
## Your Turn with Heart Disease Data
We will now use one of the other datasets in the FFTrees package, heartdisease.
Load this dataset with the code below.
Copy and paste into your local RStudio instance.
Then click on the dataset in your Environment pane (it will appear as a `Promise` under Values) to see the data.
```{r}
data(heartdisease)
```
This dataset from the Cleveland Clinic has 14 variables:
- `diagnosis` - 0=healthy, 1=disease
- `age` - age in years
- `sex` - sex, 0=female, 1=male
- `cp` - chest pain type: 1=typical angina, 2=atypical angina, 3=non-anginal pain, 4=asymptomatic
- `trestbps` - resting blood pressure systolic in mmHg
- `chol` - serum cholesterol in mg/dL
- `fbs` - fasting blood sugar \> 120, 0=no, 1=yes
- `restecg` - resting electrocardiographic results, 0=normal, 1=ST-T wave abnormality, 2=left ventricular hypertrophy
- `thalach` - maximum heart rate achieved during thallium study
- `exang` - exercise induced angina, 0=no, 1=yes
- `oldpeak` - ST depression induced by exercise relative to rest
- `slope` - the slope of the peak exercise ST segment, 1=upsloping, 2=flat, 3=downsloping
- `ca` - number of major vessels open during catheterization
- `thal` - thallium study results: 3=normal, 6=fixed defect, 7=reversible defect
Take a look at which variables seem to correlate with heart disease.
You can sort by the diagnosis variable and scroll a bit to get a first impression of the data.
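For a quick numeric first impression, here is a sketch cross-tabulating chest pain type against diagnosis, using {janitor} (loaded in the setup chunk):
```{r}
# counts of each chest pain type among healthy (0) and diseased (1) patients
heartdisease |>
  janitor::tabyl(cp, diagnosis)
```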
Now, let's build a model to predict heart disease.
We will use all of the variables in the dataset.
Copy and paste the code below into your local RStudio instance.
I have **left several blank spaces** for you to fill in.
```{r, error=TRUE, message=FALSE, warning=FALSE}
set.seed(111) # note that the actual seed number does not matter, but if you change it, the randomization of train and test (and your results) will change slightly
heart.fft <- FFTrees(--- ~ .,
                     data = '---',
                     train.p = 0.5,
                     main = "Heart Disease",
                     decision.labels = c("---", "---"))
```
Fix the code, then run it.
If the model is not running, peek at the solution below.
`r hide("Click here to see the Solution")`
```{r, message=FALSE}
set.seed(111)
heart.fft <- FFTrees(diagnosis ~ .,
                     data = heartdisease,
                     train.p = 0.5,
                     main = "Heart Disease",
                     decision.labels = c("Healthy", "Heart Disease"))
```
`r unhide()`
You can then plot the model.
```{r}
plot(heart.fft, data = "test")
```
You can see the accuracy of each predictor by plotting the cues with the code below:
```{r}
plot(heart.fft, what = "cues")
```
You can see that predictor (cue) #13, when used in isolation, is almost a coin flip.
You can print these out with:
```{r}
heart.fft$cues$stats$train
```
Look at the sens (0.84) and spec (0.16) numbers to find it (the rows are rearranged in this printout).
This cue is an isolated elevated fasting blood sugar, without considering symptoms, demographics, or thallium results - sensitive, but not at all specific.
(Note that cue #12 is an isolated EKG, and that this dataset is from the pre-Troponin, or even pre-CK-MB, era.)
### Test what you have learned
```{r, echo=FALSE}
opts1 <- c('aa (atypical angina) = Healthy', answer = 'a (typical angina) = Heart Disease', 'ta (totally asymptomatic) = Heart Disease', 'np (non-anginal pain) = Heart Disease')
opts2 <- c(answer = '16', '23', '53')
```
1. What is the decision criterion for the first node, cp?
`r mcq(opts1)`
2. How many people of the 151 in the testing set predicted to be healthy by this model actually had heart disease?
`r mcq(opts2)`
## Your Turn to Build and Interpret a Model
We will now look at the `strep_tb` dataset from the {medicaldata} package.
This dataset has 13 variables, including:
- `patient_id` - a unique identifier for each patient
- `arm` - treatment arm, Control or Strep
- `dose_strep_g` - dose of Streptomycin in grams
- `gender` - M or F
- `baseline_condition` - condition at baseline: 1=Good, 2=Fair, 3=Poor
- `baseline_temp` - coded as 1-4
- `baseline_esr` - erythrocyte sedimentation rate - levels 1-4
- `baseline_cavitation` - cavitation on chest x-ray as yes or no.
- `improved` - FALSE (died) or TRUE (improved)
There are some other outcome variables, like `strep_resistance` and `radiologic_6m`, but we will focus on `improved`.
However, we do not want to use these as predictors, so we will clean up the dataset a bit.
Copy and run the code below to remove some extraneous variables for our purposes today.
```{r}
strep_tb <- medicaldata::strep_tb |>
  dplyr::select(-strep_resistance, -rad_num, -radiologic_6m,
                -dose_strep_g, -dose_PAS_g, -patient_id)
```
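It is worth a quick check that only the intended predictors and the `improved` outcome remain:
```{r}
names(strep_tb)
```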
## Now build your FFTrees model to predict improved status (vs. death)
You will get just a starter bit of code to get you going.
Pull up the {FFTrees} package [website](https://cran.r-project.org/web/packages/FFTrees/vignettes/FFTrees_examples.html). Fix up this code and run it in your local RStudio to answer the questions below.
```{r, error=TRUE, message=FALSE, warning=FALSE}
set.seed(99)
tb.fft <- FFTrees(outcome ~ ,
                  data = name,
                  train.p = ,
                  main = "Streptomycin TB",
                  decision.labels = )
```
Fix the code, then run it.
If the model is not running, peek at the solution below.
`r hide("Click here to see the Solution")`
```{r, message=FALSE, warning=FALSE}
set.seed(99)
tb.fft <- FFTrees(improved ~ .,
                  data = strep_tb,
                  train.p = 0.5,
                  main = "TB Outcomes",
                  decision.labels = c("Died", "Improved"))
```
`r unhide()`
You can then print the model:
```{r}
# code here
```
`r hide("Show Code")`
```{r}
print(tb.fft)
```
`r unhide()`
```{r, echo=FALSE}
opts1 <- c('26', answer = '2', '5', '21')
opts2 <- c('Did not know', answer = 'Improved', 'Died')
opts3 <- c('Did not know', answer = 'Died', 'Improved')
```
1. For the best model (#1), how many patients were predicted to die, but still lived?\
`r mcq(opts1)`
2. What did the best model predict would happen to patients with very low or very high ESR?\
`r mcq(opts2)`
3. What did the best model predict would happen to patients with very high (level 4, over 38.2C) temperatures?\
`r mcq(opts3)`
You can then plot the model on the test data set:
```{r}
# code here
```
`r hide("Show Code")`
```{r}
plot(tb.fft, data = "test")
```
`r unhide()`
```{r, echo=FALSE}
opts1 <- c('baseline_temp = 4 -> Died', answer = 'baseline condition = not 1 or 1 -> Died', 'baseline_esr = 4 -> Died')
opts2 <- c('9', '18', answer = '4', '22')
```
1. For the best model (#1), what is the decision criterion at the first node?\
`r mcq(opts1)`
2. By the confusion matrix on the test set, how many people predicted to improve actually died?\
`r mcq(opts2)`
<br>
You can see the accuracy of each predictor by plotting the cues with the code below:
```{r}
# code here
```
`r hide("Show Code")`
```{r, message=FALSE, warning=FALSE}
plot(tb.fft, what = "cues")
```
`r unhide()`
```{r, echo=FALSE}
opts1 <- c('baseline_temp & baseline_esr', answer = 'baseline_esr & baseline_condition', 'baseline_esr & arm')
opts2 <- c('baseline_cavitation & baseline_condition', answer = 'gender & baseline_cavitation', 'baseline_esr & baseline_temp')
```
1. What are the two best single predictors of TB outcome?\
`r mcq(opts1)`
2. What are the two worst single predictors of TB outcome?\
`r mcq(opts2)`