Supervised learning is the branch of Machine Learning (ML) that involves predicting labels, such as 'Survived' or 'Not'. Such models learn from labelled data, which is data that includes whether a passenger survived (called "model training"), and then predict on unlabeled data.
These are generally called train and test sets because
- You want to build a model that learns patterns in the training set, and
- You then use the model to make predictions on the test set.
We can then calculate the percentage that you got correct: this is known as the accuracy of your model.
As you might already know, a good way to approach supervised learning is the following:
- Perform an Exploratory Data Analysis (EDA) on your data set;
- Build a quick and dirty model, or a baseline model, which can serve as a comparison against later models that you will build;
- Iterate this process. You will do more EDA and build another model;
- Engineer features: take the features that you already have and combine them or extract more information from them to eventually come to the last point, which is
- Get a model that performs better.
A common practice in all supervised learning is the construction and use of the train- and test- datasets. This process takes all of the input randomly splits into the two datasets (training and test); the ratio of the split is usually up to the researcher, and can be anything: 80/20, 70/30, 60/40...
There are various classifiers available:
- Decision Trees – These are organized in the form of sets of questions and answers in the tree structure.
- Naive Bayes Classifiers – A probabilistic machine learning model that is used for classification.
- K-NN Classifiers – Based on the similarity measures like distance, it classifies new cases.
- Support Vector Machines – It is a non-probabilistic binary linear classifier that builds a model to classify a case into one of the two categories.
It is a type of supervised learning algorithm. We use it for classification problems. It works for both types of input and output variables. In this technique, we split the population into two or more homogeneous sets. Moreover, it is based on the most significant splitter/differentiator in input variables.
The Decision Tree is a powerful non-linear classifier. A Decision Tree makes use of a tree-like structure to generate relationship among the various features and potential outcomes. It makes use of branching decisions as its core structure.
There are two types of decision trees:
- Categorical (classification) Variable Decision Tree: Decision Tree which has a categorical target variable.
- Continuous (Regression) Variable Decision Tree: Decision Tree has a continuous target variable.
Regression trees are used when the dependent variable is continuous while classification trees are used when the dependent variable is categorical. In continuous, a value obtained is a mean response of observation. In classification, a value obtained by a terminal node is a mode of observations.
Here, we will use the rpart
and the rpart.plot
package in order to produce and visualize a decision tree. First of all, we'll create the train and test datasets using a 70/30 ratio and a fixed seed so that we can reproduce the results.
# split into training and test subsets
set.seed(5)
ind <- sample(2, nrow(breastCancerData), replace=TRUE, prob=c(0.7, 0.3))
breastCancerData.train <- breastCancerData[ind==1,]
breastCancerData.test <- breastCancerData[ind==2,]
Now, we will load the library and create our model. We would like to create a model that predicts the Diagnosis
based on the mean of the radius and the area, as well as the SE of the texture. For ths reason we'll use the notation of myFormula <- Diagnosis ~ Radius.Mean + Area.Mean + Texture.SE
. If we wanted to create a prediction model based on all variables, we will have used myFormula <- Diagnosis ~ .
instead. Finally, minsplit
stands for the the minimum number of instances in a node so that it is split.
library(rpart)
library(rpart.plot)
myFormula <- Diagnosis ~ Radius.Mean + Area.Mean + Texture.SE
breastCancerData.model <- rpart(myFormula,
method = "class",
data = breastCancerData.train,
minsplit = 10,
minbucket = 1,
maxdepth = 3,
cp = -1)
print(breastCancerData.model$cptable)
rpart.plot(breastCancerData.model)
We see the following output and a figure:
CP nsplit rel error xerror xstd
1 0.69930070 0 1.0000000 1.0000000 0.06688883
2 0.02797203 1 0.3006993 0.3006993 0.04330166
3 0.00000000 2 0.2727273 0.3006993 0.04330166
4 -1.00000000 6 0.2727273 0.3006993 0.04330166
The parameters that we used reflect the following aspects of the model:
minsplit
: the minimum number of instances in a node so that it is splitminbucket
: the minimum allowed number of instances in each leaf of the treemaxdepth
: the maximum depth of the treecp
: parameter that controls the complexity for a split and is set intuitively (the larger its value, the more probable to apply pruning to the tree)
As we can observe, this might not be the best model. So we can select the tree with the minimum prediction error:
opt <- which.min(breastCancerData.model$cptable[, "xerror"])
cp <- breastCancerData.model$cptable[opt, "CP"]
# prune tree
breastCancerData.pruned.model <- prune(breastCancerData.model, cp = cp)
# plot tree
rpart.plot(breastCancerData.pruned.model)
table(predict(breastCancerData.pruned.model, type="class"), breastCancerData.train$Diagnosis)
The output now is the following Confusion Matrix and pruned tree:
B M
B 245 34
M 9 109
Question: What does the above "Confusion Matrix" tells you?
Now that we have a model, we should check how the prediction works in our test dataset.
## make prediction
BreastCancer_pred <- predict(breastCancerData.pruned.model, newdata = breastCancerData.test, type="class")
plot(BreastCancer_pred ~ Diagnosis, data = breastCancerData.test,
xlab = "Observed",
ylab = "Prediction")
table(BreastCancer_pred, breastCancerData.test$Diagnosis)
The new Confusion Matrix is the following:
BreastCancer_pred B M
B 102 16
M 1 53
Exercises | |
---|---|
1 | Can we improve the above model? What are the key parameters that have the most impact? |
2 | We have been using only some of the variables in our model. What is the impact of using all variables / features for our prediction? Is this a good or a bad plan? |
Random Forests is an ensemble learning technique, which essentially constructs multiple decision trees. Each tree is trained with a random sample of the training dataset and on a randomly chosen subspace. The final prediction result is derived from the predictions of all individual trees, with mean (for regression) or majority voting (for classification). The advantage is that it has better performance and is less likely to overfit than a single decision tree; however it has lower interpretability.
There are two main libraries in R that provide the functionality for Random Forest creation; the randomForest
and the party: cforest()
.
Package randomForest
- very fast
- cannot handle data with missing values
- a limit of 32 to the maximum number of levels of each categorical attribute
- extensions: extendedForest, gradientForest
Package
party: cforest()
- not limited to the above maximum levels
- slow
- needs more memory
In this exercise, we will be using the randomForest
. First, let's train the model:
library(randomForest)
rf <- randomForest(Diagnosis ~ ., data = breastCancerData.train,
ntree=100,
proximity=T)
table(predict(rf), breastCancerData.train$Diagnosis)
The output is the following:
B M
B 249 12
M 5 131
We can also investigate the content of the model:
print(rf)
The output shows the individual components and internal parameters of the Random Forest model.
Call:
randomForest(formula = Diagnosis ~ ., data = breastCancerData.train, ntree = 100, proximity = T)
Type of random forest: classification
Number of trees: 100
No. of variables tried at each split: 5
OOB estimate of error rate: 4.28%
Confusion matrix:
B M class.error
B 249 5 0.01968504
M 12 131 0.08391608
We can view the overall performance of the model here:
plot(rf, main = "")
We can also review which of the variables has the most "importance" (i.e. impact to the performance of the model):
importance(rf)
varImpPlot(rf)
The output is the table and the figure below:
ID 1.0244803
Radius.Mean 7.8983552
Texture.Mean 1.9614134
Perimeter.Mean 9.3502914
Area.Mean 7.3438007
Smoothness.Mean 0.7228277
Compactness.Mean 2.6595043
Concavity.Mean 11.2341661
Concave.Points.Mean 18.5940046
Symmetry.Mean 0.8989458
Fractal.Dimension.Mean 0.7465322
Radius.SE 3.1941672
Texture.SE 0.6363906
Perimeter.SE 2.4672730
Area.SE 5.3446273
Smoothness.SE 0.6089522
Compactness.SE 0.7785777
Concavity.SE 0.5576146
Concave.Points.SE 1.0314107
Symmetry.SE 0.8839428
Fractal.Dimension.SE 0.6475348
Radius.Worst 18.2035365
Texture.Worst 3.2765864
Perimeter.Worst 25.3605679
Area.Worst 17.1063000
Smoothness.Worst 2.1677456
Compactness.Worst 2.9489506
Concavity.Worst 6.0009637
Concave.Points.Worst 25.6081497
Symmetry.Worst 2.1507714
Fractal.Dimension.Worst 1.1498020
Let's try to do a prediction of the Diagnosis
for the test set, using the new model. The margin of a data point is as the proportion of votes for the correct class minus maximum proportion of votes for other classes. Positive margin means correct classification.
BreastCancer_pred_RD <- predict(rf, newdata = breastCancerData.test)
table(BreastCancer_pred_RD, breastCancerData.test$Diagnosis)
plot(margin(rf, breastCancerData.test$Diagnosis))
The output is the table and figure below:
BreastCancer_pred_RD B M
B 101 6
M 2 63
Exercises | |
---|---|
1 | ToDo |
2 | ToDo |
Linear regression is to predict response with a linear function of predictors. The most common function in R for this is lm
. In our dataset, let's try to investigate the relationship between Radius.Mean
, Concave.Points.Mean
and Area.Mean
. We can get a first impression by looking at the correlation of these variables:
## correlation between Radius.Mean and Concave.Points.Mean / Area.Mean
cor(breastCancerData$Radius.Mean, breastCancerData$Concave.Points.Mean)
## [1] 0.8225285
cor(breastCancerData$Concave.Points.Mean, breastCancerData$Area.Mean)
## [1] 0.8232689
Let's build now a linear regression model with function lm()
:
fit <- lm(breastCancerData$Radius.Mean ~ breastCancerData$Concave.Points.Mean + breastCancerData$Area.Mean)
fit
The output is the following:
Call:
lm(formula = breastCancerData$Radius.Mean ~ breastCancerData$Concave.Points.Mean +
breastCancerData$Area.Mean)
Coefficients:
(Intercept) breastCancerData$Concave.Points.Mean
7.68087 2.72493
breastCancerData$Area.Mean
0.00964
This tells us what are the coefficients of Concave.Points.Mean
and Area.Mean
, in the linear equation that connects them to Radius.Mean
. Let's see if we can predict now the mean radius of a new sample, with Concave.Points.Mean
= 0.1 and Area.Mean
= 511.2.
newRadius <- fit$coefficients[[1]] +
fit$coefficients[[2]] * 0.1 + # Concave.Points.Mean
fit$coefficients[[3]] * 511.2 # Area.Mean
newRadius
[1] 12.88132
We can also have a better look at what the model contains with summary(fit)
:
Call:
lm(formula = breastCancerData$Radius.Mean ~ breastCancerData$Concave.Points.Mean +
breastCancerData$Area.Mean)
Residuals:
Min 1Q Median 3Q Max
-4.8307 -0.1827 0.1497 0.3608 0.7411
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.6808702 0.0505533 151.936 <2e-16 ***
breastCancerData$Concave.Points.Mean 2.7249328 1.0598070 2.571 0.0104 *
breastCancerData$Area.Mean 0.0096400 0.0001169 82.494 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5563 on 566 degrees of freedom
Multiple R-squared: 0.9752, Adjusted R-squared: 0.9751
F-statistic: 1.111e+04 on 2 and 566 DF, p-value: < 2.2e-16
GLM, as the name implies, generalizes linear regression by allowing the linear model to be related to the response variable via a link function and allowing the magnitude of the variance of each measurement to be a function of its predicted value. It unifies various other statistical models, including linear regression, logistic regression and Poisson regression. The corresponding function is glm()
; it fits generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution.
myFormula <-Radius.Mean ~ Area.Mean + Texture.SE
breastCancerData.glm <- glm(myFormula, family=gaussian("log"), data=breastCancerData[3:ncol(breastCancerData)])
summary(breastCancerData.glm)
The output is the following:
Call:
glm(formula = myFormula, family = gaussian("log"), data = breastCancerData[3:ncol(breastCancerData)])
Deviance Residuals:
Min 1Q Median 3Q Max
-9.4940 -0.6293 0.0726 0.7885 2.4417
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.316e+00 9.817e-03 235.921 < 2e-16 ***
Area.Mean 5.307e-04 6.685e-06 79.386 < 2e-16 ***
Texture.SE -2.660e-02 6.262e-03 -4.247 2.53e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 1.281696)
Null deviance: 7053.95 on 568 degrees of freedom
Residual deviance: 725.44 on 566 degrees of freedom
AIC: 1761
Number of Fisher Scoring iterations: 5
We can now make prediction and visualize the result:
pred <- predict(breastCancerData.glm, type = "response")
plot(breastCancerData$Radius.Mean, pred, xlab = "Observed", ylab = "Prediction")
abline(a = 0, b = 1)
Exercises | |
---|---|
1 | Try working with additional columns as targets for the regression. |