```{r, echo=FALSE, purl=FALSE, message = FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE, comment = "#>", purl = FALSE)
```
# Analyzing Texts {#textanalysis}
> Learning Objectives
>
> - perform frequency counts and generate plots
> - use the `widyr` package to calculate co-occurrence
> - use `igraph` and `ggraph` to plot a co-occurrence graph
> - import and export a Document-Term Matrix into `tidytext`
> - use the `sentiments` dataset from `tidytext` to perform a sentiment analysis
------------
Now that we've read in our text and metadata, tokenized and cleaned it a little, let's move on to some analysis.
First, we'll make sure we have loaded the libraries we'll need.
```{r load-tidylibs, eval=FALSE}
library(tidyverse)
library(tidytext)
```
Let's remind ourselves of what our data looks like.
```{r}
tidy_sotu_words
```
## Frequencies
Since our unit of analysis at this point is a word, let's count to determine which words occur most frequently in the corpus as a whole.
```{r word-freq}
tidy_sotu_words %>%
  count(word, sort = TRUE)
```
We can pipe this into `ggplot` to make a graph of the words that occur more than 2000 times. We count the words and use `geom_col` to represent the `n` values.
```{r word-freq-plot}
tidy_sotu_words %>%
  count(word) %>%
  filter(n > 2000) %>%
  mutate(word = reorder(word, n)) %>%  # reorder values by frequency
  ggplot(aes(word, n)) +
  geom_col(fill = "gray") +
  coord_flip()  # flip x and y coordinates so we can read the words better
```
Now let's look at a different question: In any given year, how often is the word 'peace' used and how often is the word 'war' used?
```{r word-years}
# steps:
# select only the words 'war' and 'peace'
# count occurrences of each per year
tidy_sotu_words %>%
  filter(word %in% c("war", "peace")) %>%
  count(year, word)
```
Now we can plot this as a bar chart that shows, for each year, the proportion of each of the two words out of the total number of times both words are used.
```{r plot-word-years}
# plot n by year, and use position 'fill' to show the proportion
tidy_sotu_words %>%
  filter(word %in% c("war", "peace")) %>%
  count(year, word) %>%
  ggplot(aes(year, n, fill = word)) +
  geom_col(position = "fill")
```
As another example, let us calculate the average number of words per speech for each president: how long was the average speech of each president, and who were the most 'wordy' presidents?
First we summarize the words per president per speech:
```{r word-president-count}
tidy_sotu_words %>%
  count(president, doc_id)
```
Then we use the output table and group it by president. That allows us to calculate the average number of words per speech.
```{r word-president-avg}
tidy_sotu_words %>%
  count(president, doc_id) %>%
  group_by(president) %>%
  summarize(avg_words = mean(n)) %>%
  arrange(desc(avg_words))
```
## Term frequency
Often a raw count of a word is less important than understanding how often that word appears *relative to the total number* of words in a text. This ratio is called the **term frequency**. We can use `dplyr` to calculate it like this:
```{r termfreq}
tidy_sotu_words %>%
  count(doc_id, word, sort = TRUE) %>%  # count occurrence of word and sort descending
  group_by(doc_id) %>%
  mutate(n_tot = sum(n),        # count total number of words per doc
         term_freq = n/n_tot)
```
Let's plot the distribution of the term frequency for the speeches:
```{r termfreq-plot, message=FALSE}
tidy_sotu_words %>%
  count(doc_id, word) %>%  # count n for each word
  group_by(doc_id) %>%
  mutate(n_tot = sum(n),        # count total number of words per doc
         term_freq = n/n_tot) %>%
  ggplot(aes(term_freq)) +
  geom_histogram()
```
This distribution makes sense. Many words are used relatively rarely in a text. Only a few have a high term frequency.
Assuming that terms with a high relative frequency are an indicator of significance, we can find the term with the highest term frequency for each president:
```{r termfreq-president}
tidy_sotu_words %>%
  count(president, word) %>%  # count n for each word
  group_by(president) %>%
  mutate(n_tot = sum(n),        # count total number of words per president
         term_freq = n/n_tot) %>%
  arrange(desc(term_freq)) %>%  # sort by term frequency
  top_n(1) %>%    # take the top for each president
  print(n = Inf)  # print all rows
```
>>> CHALLENGE: Pick one president. For each of his speeches, which is the term with the highest term frequency? Create a table as output. (Hint: `top_n` might be useful.)
```{r challenge-termfreq, eval=FALSE, echo=FALSE}
tidy_sotu_words %>%
  filter(president == "Ronald Reagan") %>%
  count(doc_id, word) %>%
  group_by(doc_id) %>%
  mutate(n_tot = sum(n),
         term_freq = n/n_tot) %>%
  top_n(1)
```
## Tf-idf
So far we've been looking at term frequency per document. What if we want to know about words that seem more important based on the contents of the *entire* corpus?
For this, we can use **term frequency-inverse document frequency**, also called **tf-idf**. Tf-idf measures how important a word is within a corpus by scaling its term frequency per document by the inverse of its document frequency (the number of documents within the corpus in which the term appears divided by the total number of documents).
The tf-idf value will be:
- lower for words that appear frequently in many documents of the corpus, and lowest when the word occurs in virtually all documents.
- higher for words that appear frequently in just a few documents of the corpus, thus lending high discriminatory power to those few documents.
The intuition here is that if a term appears frequently in a document, we think that it is important but if that word appears in too many other documents, it is not that unique and thus perhaps not that important.
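Concretely, with $N$ documents in the corpus and a term $t$ that appears in $n_t$ of them, the tf-idf score for $t$ in a document $d$ multiplies the term frequency by the log-scaled inverse document frequency (this is also what `bind_tf_idf`, used below, computes):
$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \ln\left(\frac{N}{n_t}\right)$$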
The `tidytext` package includes a function `bind_tf_idf`. It takes a table that contains one-row-per-term-per-document, the name of the column that contains the words (terms), the name of the column which contains the doc-id, and the name of the column that contains the document-term counts.
So below we aggregate our tibble with the word tokens to create the one-row-per-term-per-document table and then pipe it into the `bind_tf_idf` function.
```{r tf-idf}
tidy_sotu_words %>%
  count(doc_id, word, sort = TRUE) %>%  # aggregate to count n for each word
  bind_tf_idf(word, doc_id, n)
```
The function added three columns to the aggregated table, which contain the term frequency (`tf`), the inverse document frequency (`idf`), and tf-idf (`tf_idf`).
Let's look at some of the words in the corpus that have the highest tf-idf scores, which means words that are particularly distinctive for their documents.
```{r tf-idf-sort}
tidy_sotu_words %>%
  count(doc_id, word, sort = TRUE) %>%
  bind_tf_idf(word, doc_id, n) %>%
  arrange(desc(tf_idf))
```
To understand why the years appear as particularly distinctive terms we would need to look more closely at the speeches themselves, and then decide whether the years are significant or whether they should be removed from the text, either permanently in the clean-up or temporarily using `filter()`.
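As a quick sketch of the temporary option (the regular expression below, which catches number-only tokens such as years, is just one possible choice):
```{r tfidf-no-years, eval=FALSE}
# sketch: temporarily drop purely numeric tokens (such as years) before ranking by tf-idf
tidy_sotu_words %>%
  filter(!grepl("^[0-9]+$", word)) %>%  # keep only words that are not all digits
  count(doc_id, word, sort = TRUE) %>%
  bind_tf_idf(word, doc_id, n) %>%
  arrange(desc(tf_idf))
```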
>>> CHALLENGE: Pick the same president you chose above. For each of his speeches, which is the term with the highest tf-idf? Create a table as output. (Hint: Remember to group by `doc_id` before you use `top_n`.)
```{r challenge-tfidf, eval=FALSE, echo=FALSE}
tidy_sotu_words %>%
  filter(president == "Ronald Reagan") %>%
  count(doc_id, word) %>%
  bind_tf_idf(word, doc_id, n) %>%
  group_by(doc_id) %>%
  top_n(1)
```
## N-Grams
We mentioned n-grams in the intro, but let's revisit them here and take a look at the most common bigrams in the speeches. Remember that we can use the `unnest_tokens()` function on our texts and explicitly tell it to generate bigrams:
```{r bigrams-unnest}
sotu_whole %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)  # create bigrams
```
Let's see the most common bigrams:
```{r bigrams-count}
sotu_whole %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)  # count occurrences and sort descending
```
OK, so we again need to remove the stopwords. First, let us separate the two words into two columns, "word1" and "word2", with `separate()` from the `tidyr` package:
```{r bigrams-sep-words}
sotu_whole %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ")
```
Now we use dplyr's `filter()` function to keep only the rows in which neither word is a stopword.
```{r bigrams-remove-stop}
sotu_whole %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%  # separate into cols
  filter(!word1 %in% stop_words$word,   # remove stopwords
         !word2 %in% stop_words$word)
```
Lastly, we re-unite the two word columns back into bigrams and save the result in a new table, `sotu_bigrams`.
```{r bigrams-unite}
sotu_bigrams <- sotu_whole %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%  # separate into cols
  filter(!word1 %in% stop_words$word,   # remove stopwords
         !word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ")  # combine columns

sotu_bigrams %>%
  count(bigram, sort = TRUE)
```
A bigram can also be treated as a term in a document in the same way that we treated individual words, which means we can look at tf-idf values in the same way. For example, we can find the most distinctive bigrams that each president used across all of their respective speeches taken together.
We count per president and bigram and then compute the tf-idf value with the `bind_tf_idf` function. To get the top bigram for each president, we group by president, sort, and retrieve the highest value for each.
```{r bigram-tf-idf}
sotu_bigrams %>%
  count(president, bigram) %>%
  bind_tf_idf(bigram, president, n) %>%
  group_by(president) %>%
  arrange(desc(tf_idf)) %>%
  top_n(1)
```
>>> CHALLENGE: Again, pick the same president you chose above. For each of his speeches, which is the bigram with highest tf-idf? Create a table as output.
```{r challenge-bigram-tfidf, eval=FALSE, echo=FALSE}
sotu_bigrams %>%
  filter(president == "Ronald Reagan") %>%
  count(doc_id, bigram) %>%
  bind_tf_idf(bigram, doc_id, n) %>%
  group_by(doc_id) %>%
  top_n(1)
```
## Co-occurrence
Co-occurrences give us a sense of words that appear in the same text, but not necessarily next to each other.
For this section we will make use of the `widyr` package. Its `pairwise_count()` function lets us count how often pairs of words co-occur within the same speech.
Behind the scenes, this function first turns our table into a wide matrix. In our case the rows and columns of that matrix are the individual words, and the cell values are the counts of how many speeches each pair of words co-occurs in, like this:
```{r wide-matrix, echo=FALSE}
data.frame(we = c(NA, 4, 5), thus = c(4, NA, 2), have = c(5, 2, NA), row.names = c("we", "thus", "have"))
```
It then turns the matrix back into a tidy form, where each row contains a word pair and the count of its co-occurrence. Since we don't care about the order of the words, we do not count the upper triangle of the wide matrix, which leaves us with:
```{r tidy-matrix, echo=FALSE}
df <- data.frame(w1 = c("we", "we", "thus"), w2 = c("thus", "have", "have"), n = c(4,5,2)); names(df) <- NULL; print(df, row.names = FALSE)
```
Since processing the entire corpus would take too long here, we will only look at the last 100 words of each speech: which words occur most commonly together at the end of the speeches?
```{r pairwise-count}
library(widyr)
sotu_word_pairs <- sotu_whole %>%
  mutate(speech_end = word(text, -100, end = -1)) %>%  # extract last 100 words
  unnest_tokens(word, speech_end) %>%                  # tokenize
  filter(!word %in% stop_words$word) %>%               # remove stopwords
  pairwise_count(word, doc_id, sort = TRUE, upper = FALSE)  # don't include upper triangle of matrix
sotu_word_pairs
```
To visualize the co-occurrence network of words that occur together at the end of 10 or more speeches, we use the `igraph` package to convert our table into a network graph and the `ggraph` package, which adds functionality to `ggplot2` and makes it easier to plot a network.
```{r plot-network}
library(igraph)
library(ggraph)
sotu_word_pairs %>%
  filter(n >= 10) %>%          # only word pairs that occur 10 or more times
  graph_from_data_frame() %>%  # convert to graph
  ggraph(layout = "fr") +      # place nodes according to the force-directed algorithm of Fruchterman and Reingold
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "tomato") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.2, "lines")) +
  theme_void()
```
There are alternative approaches for this as well. See for example the `findAssocs` function in the `tm` package.
## Document-Term Matrix
A [document-term matrix (DTM)](https://en.wikipedia.org/wiki/Document-term_matrix) is a format which is frequently used in text analysis. It is a matrix where we can see the counts of each term per document. In a DTM each row represents a document, each column represents a term, and the cell values are the counts of the occurrences of the term for the particular document.
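For example, a small made-up DTM with two documents and three terms might look like this (the counts below are invented purely for illustration):
```{r dtm-example, echo=FALSE, purl=FALSE}
# a tiny made-up document-term matrix, for illustration only
data.frame(government = c(12, 5), peace = c(3, 9), war = c(7, 2),
           row.names = c("doc_1", "doc_2"))
```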
`tidytext` provides functionality to convert to and from DTMs, if for example, your analysis requires specific functions from a different R package which only works with DTM object types.
The `cast_dtm` function can be used to create a DTM object from a tidy table.
Let's assume that for some reason we want to use the `findAssocs()` function from the `tm` package.
First we use dplyr to create a table with the document name, the term, and the count.
```{r term-count}
# make a table with document, term, count
tidy_sotu_words %>%
  count(doc_id, word)
```
Now we cast it as a DTM.
```{r cast-dtm}
sotu_dtm <- tidy_sotu_words %>%
  count(doc_id, word) %>%
  cast_dtm(doc_id, word, n)
class(sotu_dtm)
```
Finally, let's use it in the `tm` package:
```{r tm-dtm}
library(tm)
# look at the terms with tm function
Terms(sotu_dtm) %>% tail()
# most frequent terms
findFreqTerms(sotu_dtm, lowfreq = 5000)
# find terms associated with "citizen"
findAssocs(sotu_dtm, "citizen", corlimit = 0.5)
```
Conversely, `tidytext` implements the `tidy` function (originally from the `broom` package) to import DocumentTermMatrix objects. Note that it only takes the cells from the DTM that are not 0, so there will be no rows with 0 counts.
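As a quick sketch, going back from the DTM we just created to a tidy table would look like this:
```{r tidy-dtm, eval=FALSE}
# convert the DTM back into a tidy table with one row per term per document
tidy(sotu_dtm)
```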
## Sentiment analysis
`tidytext` gives access to several sentiment lexicons and comes with the dataset `sentiments`, in which each word is assigned a certain sentiment, like this:
```{r sentiments}
sentiments
```
Here we will take a look at how the sentiment of the speeches change over time. We will use the lexicon from [Bing Liu and collaborators](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html), which assigns positive/negative labels for each word:
```{r bing}
bing_lex <- get_sentiments("bing")
bing_lex
```
We can attach these sentiments to the words of our speeches by joining the two tables. We will use `inner_join` from `dplyr`: it keeps all rows from `tidy_sotu_words` whose words also appear in `bing_lex` and eliminates the rows whose word cannot be found in the lexicon. Since the columns we join on have the same name (`word`), we don't need to name it explicitly.
```{r join-sentiments}
sotu_sentiments <- tidy_sotu_words %>%
  inner_join(bing_lex)  # join to add sentiment column
sotu_sentiments
```
Finally we can visualize the proportion of positive sentiment (out of the total of positive and negative) in US State of the Union Addresses over time like this:
```{r sentiment-plot, fig.width = 6, fig.height = 4, message=FALSE}
sotu_sentiments %>%
  count(year, sentiment) %>%  # count by year and sentiment
  pivot_wider(names_from = "sentiment", values_from = "n") %>%  # create column for positive
                                                                # and negative sentiment
  mutate(positive_ratio = positive/(negative + positive)) %>%   # calculate positive ratio
  # plot
  ggplot(aes(year, positive_ratio)) +
  geom_line(color = "gray") +
  geom_smooth(span = 0.3, se = FALSE) +  # smooth for easier viewing
  geom_hline(yintercept = .5, linetype = "dotted", color = "orange", size = 1) +  # .5 as reference
  scale_x_continuous(breaks = seq(1790, 2016, by = 10)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```