-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathLab1_Naive_Bayes_Classifier.Rmd
306 lines (239 loc) · 9.57 KB
/
Lab1_Naive_Bayes_Classifier.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
---
editor_options:
markdown:
wrap: 72
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Probability and Statistics
# Lab Assignment 1: Naive Bayes Classifier
## Work breakdown
- Viktoria Prokhorova: Data and metrics visualization
- Dmytro Shumskyi: Predict and fit methods
- Mykola Vysotskyi: Predict method, metrics functions, data
pre-processing
## Data description
- **4 - spam** This last data set contains SMS messages classified as
spam or non-spam (ham in the data set). The task is to determine
whether a given message is spam or non-spam.
Each data set consists of two files: *train.csv* and *test.csv*. The
first one you will need find the probabilities distributions for each of
the features, while the second one is needed for checking how well your
classifier works.
## **Outline of the work**
1. **Data pre-processing** (includes removing punctuation marks and
stop words, representing each message as a bag-of-words)
2. **Data visualization**
3. **Classifier implementation** (using the training set, calculate all
the conditional probabilities in formula (1) and then use those to
predict classes for messages in the testing set)
4. **Measurements of effectiveness of your classifier** (accuracy,
precision and recall curves, F1 score metric etc)
5. **Conclusions**
## Data pre-processing
```{r}
# Include all necessary libraries
library(tidytext)
library(readr)
library(dplyr)
library(ggplot2)
library(wordcloud)
library(tm)
```
```{r}
# Loading stop words
stop_words <- read_file("data/stop_words.txt")
splittedStopWords <- strsplit(stop_words, split='\n')
splittedStopWords <- splittedStopWords[[1]]
# Loading datasets
test <- read_csv("data/test.csv", show_col_types = FALSE)
train <- read_csv("data/train.csv", show_col_types = FALSE)
# Function to calculate words frequencies in given dataframe
freqDataframe <- function(dataframe, column, stop_words = NULL)
{
dataframe %>%
unnest_tokens(word, column) %>%
count(word, sort = TRUE) %>%
filter(!word %in% stop_words)
}
```
## Data visualization
```{r}
# Words for visualization
hamFreqVis <- freqDataframe(test[test$Category=="ham", ], "Message", splittedStopWords)
spamFreqVis <- freqDataframe(test[test$Category=="spam", ], "Message", splittedStopWords)
par(mfrow = c(1, 3))
wordcloud(words = hamFreqVis$word, freq=hamFreqVis$n, max.words=30, random.order = FALSE,
colors = 'green')
plot.new()
wordcloud(words = spamFreqVis$word, freq=spamFreqVis$n, max.words=30, random.order=FALSE,
colors = 'red')
```
## Classifier implementation
```{r}
# All used metrics implementations
recall <- function(y_true, y_pred) {
tp <- sum(y_true == y_pred & y_true == TRUE)
fn <- sum(y_true != y_pred & y_true == TRUE)
return (tp / (tp + fn))
}
precision <- function(y_true, y_pred) {
tp <- sum(y_true == y_pred & y_true == TRUE)
fp <- sum(y_true != y_pred & y_true == FALSE)
return (tp / (tp + fp))
}
f1Score <- function(y_true, y_pred) {
recallV <- recall(y_true, y_pred)
precisionV <- precision(y_true, y_pred)
return (2 * precisionV * recallV / (precisionV + recallV))
}
accuracy <- function(y_true, y_pred)
{
return (sum(y_true == y_pred) / length(y_true))
}
```
```{r}
naiveBayes <- setRefClass("naiveBayes",
# here it would be wise to have some vars to store intermediate result
# frequency dict etc. Though pay attention to bag of words!
fields = list(
hamFreq = "data.frame",
spamFreq = "data.frame",
hamWordsCount = "numeric",
spamWordsCount = "numeric",
hamClassProb = "numeric",
spamClassProb = "numeric"
),
methods = list(
# prepare your training data as X - bag of words for each of your
# messages and corresponding label for the message encoded as 0 or 1
# (binary classification task)
fit = function(X, y)
{
hamFreq <<- freqDataframe(X[X$Category=="ham", ], "Message", splittedStopWords)
spamFreq <<- freqDataframe(X[X$Category=="spam", ], "Message", splittedStopWords)
hamWordsCount <<- sum(hamFreq$n)
spamWordsCount <<- sum(spamFreq$n)
hamClassProb <<- nrow(X[X$Category=="ham", ]) / nrow(X)
spamClassProb <<- nrow(X[X$Category=="spam", ]) / nrow(X)
},
# return prediction for a single message
predict = function(message)
{
# Check if message is not empty
if(nchar(message) == 0) {
return (NULL)
}
# Convert message to bag of words
wrapperDf <- data.frame(word = message)
tokens <- unnest_tokens(wrapperDf, word, word) %>% filter(!word %in% splittedStopWords)
hamProb <- hamClassProb
spamProb <- spamClassProb
for (i in 1:nrow(tokens))
{
word <- tokens[i, ]
inHamCount <- ifelse(!is.na(any(hamFreq$word == word)) && any(hamFreq$word == word),
hamFreq[hamFreq$word == word, "n"]$n,
0)
inSpamCount <- ifelse(!is.na(any(spamFreq$word == word)) && any(spamFreq$word == word),
spamFreq[spamFreq$word == word, "n"]$n,
0)
hamProb <- hamProb * (inHamCount + 1) / (hamWordsCount + 2)
spamProb <- spamProb * (inSpamCount + 1) / (spamWordsCount + 2)
}
return (hamProb > spamProb)
},
# score you test set so to get the understanding how well you model
# works.
# look at f1 score or precision and recall
# visualize them
# try how well your model generalizes to real world data!
score = function(X_test, y_test)
{
y_pred <- lapply(X_test$Message, function(message) { predict(message) })
scores = c(
"accuracy" = accuracy(y_test, y_pred),
"precision" = precision(y_test, y_pred),
"recall" = recall(y_test, y_pred),
"f1Score" = f1Score(y_test, y_pred)
)
return (scores)
}
))
# Create and fit model
model = naiveBayes()
model$fit(train, train$Category)
```
## Measure effectiveness of your classifier
```{r}
# Example of score results
model$score(test, test$Category == "ham")
```
```{r}
# Get scores for different sizes
trainSizes <- c(0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9, 1.0)
accuracyScores <- c()
precisionScores <- c()
recallScores <- c()
f1Scores <- c()
tmodel = naiveBayes()
for (trainSize in trainSizes)
{
trainSample <- train[sample(nrow(train), floor(nrow(train) * trainSize)), ]
tmodel$fit(trainSample, trainSample$Category)
scores <- tmodel$score(test, test$Category == "ham")
accuracyScores <- c(accuracyScores, scores["accuracy"])
precisionScores <- c(precisionScores, scores["precision"])
recallScores <- c(recallScores, scores["recall"])
f1Scores <- c(f1Scores, scores["f1Score"])
}
```
```{r}
df_reshaped <- data.frame(x = trainSizes,
y = c(accuracyScores, precisionScores, recallScores, f1Scores),
group = c(rep("Accuracy", length(trainSizes)),
rep("Precision", length(trainSizes)),
rep("Recall", length(trainSizes)),
rep("F1", length(trainSizes))
))
ggplot(df_reshaped, aes(x, y, col = group)) + geom_line()
```
#### Failure cases
```{r}
convert_label <- function(x) { ifelse(x == "ham", TRUE, FALSE) }
y_pred <- lapply(test$Message, function(message) { model$predict(message) })
failureCases <- test[y_pred != convert_label(test$Category), ]
failureCases
```
## Check on real world data
```{r}
model$predict("") # should return NULL
model$predict("Hello, how are you?") # should return TRUE
model$predict("WINNER!! This is the secret code to unlock the money: C3421.") # should return FALSE
```
## Conclusions
Summarize your work by explaining in a few sentences the points listed
below.
- Describe the method implemented in general. Show what are
mathematical foundations you are basing your solution on.
- List pros and cons of the method. This should include the
limitations of your method, all the assumption you make about the
nature of your data etc.
- The method is called **Naive Bayes classifier**. In our case we used
is to filter **spam** messages from **non-spam**(ham). Firstly, we
calculate frequencies of words to be in spam and ham messages.
Forming bag-of-words. After that, for given text message we can
calculate its probability to be spam/ham using **Bayes formula**. We
assume that all features(words) are **independent**. To find
probability of message belong to some class we multiply probability
of class by product of probability of each word given the class and
divide this by product of probabilities of each word(actually, we
can skip this because it is common for each class and is not useful
in comparison). Also we used **Laplace Smoothing** to prevent
probability of some word being 0.
- **Pros** of this method - easy to implement, computationally light.
**Cons** - *naivety* of this method(we assume that all words are
independent and do not consider words order), words that often
appear in one class can be in other and lead to incorrect
classification results.