---
title: "Illustrating Ensemble Models - Employee Attrition"
output:
html_document:
toc: yes
toc_float: yes
code_folding: hide
---
Given the potential disruption to the work environment and the resources required to attract, hire, and train new talent, understanding the factors that influence employee attrition is important to human resource departments. In this exercise, we'll explore the IBM Human Resources Analytics dataset, which contains data on employee attrition (whether an employee leaves the company). Throughout this exercise, we'll compare a single decision tree against two ensemble methods: random forests and boosting.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r, warning = FALSE, message = FALSE}
library(tidyverse)
library(MLmetrics) # classification performance metrics
library(rsample)   # contains the IBM attrition data set
# Note: in recent rsample releases the attrition data has moved to the modeldata package.
```
```{r}
# Helper function to print the confusion matrix and other performance metrics of the models.
printPerformance = function(pred, actual, positive="Yes") {
  print(table(actual, pred))
  cat("\n")
  print(sprintf("Accuracy: %.3f", Accuracy(y_true=actual, y_pred=pred)))
  print(sprintf("Precision: %.3f", Precision(y_true=actual, y_pred=pred, positive=positive)))
  print(sprintf("Recall: %.3f", Recall(y_true=actual, y_pred=pred, positive=positive)))
  print(sprintf("F1 Score: %.3f", F1_Score(y_true=actual, y_pred=pred, positive=positive)))
  print(sprintf("Sensitivity: %.3f", Sensitivity(y_true=actual, y_pred=pred, positive=positive)))
  print(sprintf("Specificity: %.3f", Specificity(y_true=actual, y_pred=pred, positive=positive)))
}
```
# Read in the data
```{r}
df <- attrition
# Convert any character columns to factors so the modeling functions treat them as categorical.
df <- df %>%
  mutate_if(is.character, as.factor)
head(df)
summary(df)
```
# Splitting the Data
```{r}
set.seed(123) # Set the seed to make the split reproducible
train <- sample_frac(df, 0.8) # sample 80% of the rows for training
test <- setdiff(df, train)    # the remaining rows form the test set
actual <- test$Attrition
formula <- Attrition ~ .
positive <- "Yes"
```
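Note that attrition is imbalanced (far fewer "Yes" than "No" cases), so a plain random split may under-represent leavers in the test set. As a minimal sketch using rsample (already loaded above), a stratified split preserves the class ratio; the object names below (`split_strat`, `train_strat`, `test_strat`) are illustrative and are not used by the later chunks.

```{r}
# Sketch: stratified 80/20 split that preserves the Yes/No ratio in both sets.
set.seed(123)
split_strat <- initial_split(df, prop = 0.8, strata = Attrition)
train_strat <- training(split_strat)
test_strat  <- testing(split_strat)
prop.table(table(train_strat$Attrition)) # class balance in the training set
```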
# Decision Tree
```{r, warning = FALSE}
library(rpart)
library(rpart.plot) # For pretty trees
set.seed(123)
# Fit a single classification tree as a baseline for the ensembles below.
tree <- rpart(formula, method="class", data=train)
rpart.plot(tree, extra=2, type=2)
predicted <- predict(tree, test, type="class")
printPerformance(predicted, actual, positive = positive)
```
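A tree grown with rpart's defaults can overfit. As a hedged sketch (not part of the original exercise), rpart's built-in cross-validation table can guide pruning; selecting the `cp` with the lowest cross-validated error is an assumption here, and `best_cp` and `pruned` are illustrative names.

```{r}
# Sketch: prune at the complexity parameter (cp) with the lowest
# cross-validated error (xerror) from rpart's cptable.
printcp(tree)
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned <- prune(tree, cp = best_cp)
rpart.plot(pruned, extra = 2, type = 2)
printPerformance(predict(pruned, test, type = "class"), actual, positive = positive)
```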
# Random Forests
```{r, warning = FALSE}
library(randomForest)
set.seed(123)
# An ensemble of 100 trees; each split considers 3 randomly chosen predictors.
rf <- randomForest(formula, data=train, mtry=3, ntree=100, importance=TRUE)
rf.predicted <- predict(rf, test, type="class")
printPerformance(rf.predicted, actual, positive = positive)
importance(rf)  # variable importance scores
varImpPlot(rf)  # plot the importance measures
```
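The forest above fixes `mtry = 3`. As a sketch, randomForest's `tuneRF` can search for a better value using the out-of-bag error; the `stepFactor` and `improve` settings below are illustrative assumptions, not tuned choices.

```{r}
# Sketch: double/halve mtry from the default until the out-of-bag (OOB)
# error stops improving by at least 1%.
set.seed(123)
x_train <- select(train, -Attrition)
tuned <- tuneRF(x_train, train$Attrition,
                ntreeTry = 100, stepFactor = 2, improve = 0.01, trace = TRUE)
tuned # OOB error for each mtry tried
```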
# Boosting
```{r, warning = FALSE}
library(fastAdaboost) # Note: archived on CRAN; install from the CRAN archive or GitHub if needed.
set.seed(123)
# AdaBoost ensemble of 1000 weak learners.
boost <- adaboost(formula, data=train, nIter=1000)
boost.predicted <- predict(boost, newdata=test)
printPerformance(boost.predicted$class, actual, positive = positive)
```
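Since fastAdaboost may be unavailable on current CRAN, here is a minimal alternative sketch using the gbm package (a swap-in, not the original method; the package, hyperparameters, and the 0.5 cutoff are assumptions, so the chunk is left unevaluated).

```{r, eval = FALSE}
# Sketch: gradient boosting as a substitute for adaboost. gbm's bernoulli
# loss expects a 0/1 response, so Attrition is recoded first.
library(gbm)
set.seed(123)
train_gbm <- mutate(train, Attrition = as.integer(Attrition == "Yes"))
gb <- gbm(Attrition ~ ., distribution = "bernoulli", data = train_gbm,
          n.trees = 1000, interaction.depth = 3, shrinkage = 0.01)
gb.prob <- predict(gb, newdata = test, n.trees = 1000, type = "response")
gb.pred <- factor(ifelse(gb.prob > 0.5, "Yes", "No"), levels = levels(actual))
printPerformance(gb.pred, actual, positive = positive)
```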