-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathtopicmodeling_overview.Rmd
61 lines (53 loc) · 2.53 KB
/
topicmodeling_overview.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
# Topic modeling overview
Topic modeling is a way to automatically analyze a set of documents and uncover themes or topics within them.
It's not as clever as a human reader, of course, but I think you'll find that topic modeling produces surprisingly interpretable results!
## What are topics?
To run a topic modeling analysis, you need a collection of documents.
Typically, you'll do some pre-processing of those documents to facilitate the discovery of topics.
There are often three steps:
1) removes some common, uninformative words (e.g. "the", "and"), called *stop words*,
2) remove words that only occur in very few documents (a typical threshold is that each word must in at least 5 documents, but you can move that up or down depending on your analysis), and 3) make different forms of the same word the same (e.g. "see", "sees", "seeing"), a process called *stemming*.
This leaves you with a set of cleaned documents, and a list of all of the words left in those documents (the "vocabulary"). Each topic is a distribution over the vocabulary. Here's a toy example, so you can visualize what I mean:
```{r topic_viz}
library(tidyr); library(ggplot2)
vocab <- c("garlic", "hard", "cook", "collar", "measure", "write", "cup", "knot", "suit", "edit")
topic1 <- data.frame(topic=1, vocabulary=c(rep(1,5),
rep(2, 1),
rep(3, 4),
rep(4, 0),
rep(5, 3),
rep(6, 0),
rep(7, 4),
rep(8, 0),
rep(9, 0),
rep(10, 0)))
topic2 <- data.frame(topic=2, vocabulary=c(rep(1,0),
rep(2, 0),
rep(3, 0),
rep(4, 4),
rep(5, 3),
rep(6, 0),
rep(7, 0),
rep(8, 4),
rep(9, 4),
rep(10, 0)))
topic3 <- data.frame(topic=3, vocabulary=c(rep(1,0),
rep(2, 4),
rep(3, 0),
rep(4, 0),
rep(5, 0),
rep(6, 4),
rep(7, 0),
rep(8, 1),
rep(9, 0),
rep(10, 3)))
topic.data <- rbind(topic1, topic2, topic3)
topic.data$topic <- as.factor(topic.data$topic)
ggplot(topic.data, aes(x=vocabulary)) +
geom_density(aes(fill=topic, color=topic), alpha=.5, adjust = 1/3) +
scale_x_continuous(breaks=1:10, labels=vocab) +
theme(axis.text.y=element_blank(), axis.ticks=element_blank()) +
labs(y=NULL)
```
How to use the stm package (from the [stm vignette](https://cran.r-project.org/web/packages/stm/vignettes/stmVignette.pdf)):
![stm workflow](/Users/TARDIS/Dropbox/RClub/stm_package_workflow.png)