---
title: "Sample - COVID & Religion in Brazil"
author: "Gustavo Arruda"
date: "`r lubridate::today()`"
output: github_document
---
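```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
# load the plotting libraries before calling any of their functions
library(ggplot2)
library(RColorBrewer)
# run the modelling script in the knit environment so its objects are available to the plots
sys.source('COVID_religion_topics.R', envir = knitr::knit_global())
theme_set(theme_minimal())
```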
## Introduction
This paper presents a probabilistic topic model of online religious responses to the COVID-19 pandemic, focusing on Brazilian Pentecostal-charismatic groups during the first half of 2020. I wrote it for "Computation for the Social Sciences", a University of Chicago course taught in the fall of 2020. I collected the data myself during the summer of 2020 while participating in the [Preaching Goes Viral](https://blogs.miamioh.edu/critical-distance/preaching-goes-viral-responses-to-the-pandemic/) project.
- **Appendix 1** displays the code in **COVID_religion_topics.R**. Running the [COVID_religion_topics.R](https://raw.githubusercontent.com/arrudafranco/Homework-9/master/COVID_religion_topics.R) script loads the data sets and fits the topic model. The script must also be run to produce the graphs in this paper.
- All data files and scripts mentioned in this paper are available in this [GitHub Repository](https://github.com/arrudafranco/Homework-9).
Libraries used:
- The code in this repository requires the following libraries:
```r
library(readxl)
library(tidyverse)
library(tidytext)
library(topicmodels)
library(here)
library(tm)
library(tictoc)
library(RColorBrewer)
```
## Background
I collected the data set used for this analysis during the summer of 2020 for the [Preaching Goes Viral](https://blogs.miamioh.edu/critical-distance/preaching-goes-viral-responses-to-the-pandemic/) (PGV) project. PGV focused on archiving online religious responses to the pandemic; this data set in particular focused on Brazilian Pentecostal-charismatic churches. Pentecostalism is a strand of Protestantism that emerged in California in the early 1900s and emphasizes spontaneous present-day gifts of the Spirit, like speaking in tongues, exorcisms and prophesying. Today, Pentecostals make up around a third of the Brazilian population and hold around a third of the seats in the Brazilian Congress. The objective of this analysis is to uncover the thematic structure of this corpus, which was drawn from the official websites and social media profiles of large denominations. To achieve that, I use Latent Dirichlet Allocation (LDA), an algorithm that produces a probabilistic topic model: each document is treated as a mixture of topics, and each topic as a probability distribution over words.
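As a minimal sketch of what such a model estimates, the toy example below fits LDA to a tiny hand-made document-term matrix with `topicmodels::LDA()`. The counts, the English placeholder terms and `k = 2` are assumptions chosen for illustration and have nothing to do with the actual corpus. The `beta` values plotted in the Analysis section are per-topic word probabilities of the same kind as `toy_beta` here.

```r
library(topicmodels)

# Toy document-term matrix (hypothetical counts, not the real corpus):
# rows are documents, columns are terms.
toy_counts <- matrix(
  c(3L, 2L, 0L, 0L,
    2L, 3L, 1L, 0L,
    0L, 0L, 3L, 2L,
    0L, 1L, 2L, 3L),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4),
                  c("faith", "prayer", "health", "vaccine"))
)

# Fit a two-topic model; the seed makes the fit reproducible
toy_lda <- LDA(toy_counts, k = 2, control = list(seed = 1234))

# beta: per-topic word probabilities; each row (topic) sums to 1
toy_beta <- posterior(toy_lda)$terms
# gamma: per-document topic proportions; each row (document) sums to 1
toy_gamma <- posterior(toy_lda)$topics

round(toy_beta, 2)
round(toy_gamma, 2)
```

With a corpus this small the topics themselves are not meaningful; the point is only the shape of the output, one probability distribution over words per topic and one distribution over topics per document.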
## Analysis
```{r plot1}
ggplot(top_terms, aes(term, beta, fill = topic)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  scale_fill_gradientn(colors = brewer.pal(6, "Dark2")) +
  scale_x_reordered() +
  facet_wrap(~ topic, scales = "free", ncol = 3) +
  coord_flip()
```
The first model shows that removing general Portuguese stop words is not enough: stop words particular to this field also need to be removed. The following model takes that into account, eliminating words like "senhor" (lord), "deus" (god) and "aleluia" (hallelujah).
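The filtering idea can be sketched in base R as follows. The token table and both stop-word vectors are made-up toy data; the actual script in Appendix 1 works on the full corpus and uses tidyverse joins instead.

```r
# Hypothetical unigram tokens (toy data; the real tokens come from the corpus)
tokens <- data.frame(
  id   = c(1, 1, 1, 2, 2, 2),
  word = c("deus", "saúde", "de", "aleluia", "ansiedade", "senhor"),
  stringsAsFactors = FALSE
)

general_stop_words <- c("de", "a", "o")               # generic Portuguese
domain_stop_words  <- c("senhor", "deus", "aleluia")  # field-specific

# Combine both lists and drop every token that appears in either one
all_stop_words  <- union(general_stop_words, domain_stop_words)
tokens_filtered <- tokens[!tokens$word %in% all_stop_words, ]

tokens_filtered$word
```

Only the content-bearing words remain, so very frequent devotional terms no longer crowd out the words that distinguish one topic from another.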
```{r plot2}
ggplot(top_terms_filtered, aes(term, beta, fill = topic)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  scale_fill_gradientn(colors = brewer.pal(6, "Dark2")) +
  scale_x_reordered() +
  facet_wrap(~ topic, scales = "free", ncol = 3) +
  coord_flip()
```
After eliminating these field-specific stop words, we have a more useful topic model of the corpus.
- **Topic 3** was the first to stand out to me. Its top terms are *"psychology"*, *"patient"*, *"world"*, *"faith"* and *"anxiety"*. The topic seems to refer to mental health, with "faith" and "anxiety" carrying most of the weight. We cannot infer with certainty how the two words relate to each other within this topic, but we now have reason to believe that faith and anxiety were discursively presented in some proximity, whether by contrast, association or even causation.
- **Topic 6** also stands out immediately, with *"health"*, *"problem"*, *"president"*, *"country"* and *"opportunism"*. Topic 6 seems to be largely associated with national politics. It is reasonable to read a sense of dismay about politics into it, but again this method cannot capture the discursive mechanics. For example, "opportunism" and "president" could be conflated, in a criticism of Bolsonaro, or opposed, in a criticism of challenges against him and a call for more national unity.
- **Topic 8** shows us *"temple"*, *"glory"*, *"service"*, *"heart"* and *"house"*. What ties these words together seems to be in-person services and social distancing.
- **Topic 9** mentions the biblical king *Jehoshaphat* and *"storm"*, which seems to reference the story in which that king won a war through deference to God, fasting and prayer. This topic seems to point to a particular biblical exegesis emphasizing unity and deference to authority. It is unclear, though, which authorities one should defer to: public health specialists, Bolsonaro or the pastors?
- I was not able to make sense of the remaining topics uncovered by this model: **Topic 1**, **Topic 2**, **Topic 4**, **Topic 5** and **Topic 7**.
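Because the topic model only reports which words tend to appear together, one way to probe the ambiguities noted above is to read the passages in which two top terms co-occur. The sketch below does this in base R on made-up English sentences; `co_occurring()` is a hypothetical helper written for this illustration and is not part of the analysis script.

```r
# Made-up example passages (not taken from the corpus)
passages <- c(
  "Faith is the answer to anxiety in these times.",
  "Anxiety grows when the temple doors are closed.",
  "The president spoke about health and the country.",
  "Keep your faith and care for your health."
)

# Return the passages that contain every one of the given terms
# (case-insensitive, matched as literal substrings)
co_occurring <- function(texts, terms) {
  hits <- vapply(
    terms,
    function(term) grepl(tolower(term), tolower(texts), fixed = TRUE),
    logical(length(texts))
  )
  # force a texts-by-terms matrix even when there is a single text
  hits <- matrix(hits, nrow = length(texts))
  texts[rowSums(hits) == length(terms)]
}

co_occurring(passages, c("faith", "anxiety"))
```

Reading the returned passages by hand is what would tell contrast apart from association or causation; the code only narrows down where to look.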
## Appendix 1 - COVID_religion_topics.R
``` r
library(readxl)
library(tidyverse)
library(tidytext)
library(topicmodels)
library(here)
library(tm)
library(tictoc)
set.seed(1234)
PT_stop_words <- read_excel("PT_stop_words.xlsx") # load Portuguese stop words
COVID_Religion_Data <- read_excel("COVID_religion_data.xlsx",
                                  col_types = c("text", "numeric", "date",
                                                "text", "text", "text", "text", "text",
                                                "text", "date", "text", "text"))
n_grams <- 1:5 # extract n-grams for n = 1, 2, 3, 4, 5
corpus_tokens <- map_df(n_grams, ~ COVID_Religion_Data %>%
                          # combine title and body
                          unite(col = title_body, Title, `Text Data`, sep = " ") %>%
                          # tokenize
                          unnest_tokens(output = word,
                                        input = title_body,
                                        token = "ngrams",
                                        n = .x) %>%
                          mutate(ngram = .x,
                                 token_id = row_number()) %>%
                          # remove tokens that are missing values
                          drop_na(word))
# remove stop words or n-grams beginning or ending with stop word
corpus_stop_words <- corpus_tokens %>%
  # separate ngrams into separate columns; shorter n-grams are padded
  # with NA on the right (fill = "right" avoids a warning per row)
  separate(col = word,
           into = c("word1", "word2", "word3", "word4", "word5"),
           sep = " ",
           fill = "right") %>%
  # find last word
  mutate(last = if_else(ngram == 5, word5,
                        if_else(ngram == 4, word4,
                                if_else(ngram == 3, word3,
                                        if_else(ngram == 2, word2, word1))))) %>%
  # keep the tokens where the first or last word is a stop word;
  # these are removed from the corpus in the anti_join() below
  filter(word1 %in% PT_stop_words$word |
           last %in% PT_stop_words$word) %>%
  select(ngram, token_id)
# convert to dtm
corpus_dtm <- corpus_tokens %>%
  # remove stop word tokens
  anti_join(corpus_stop_words, by = c("ngram", "token_id")) %>%
  # get count of each token in each document
  count(id, word) %>%
  # create a document-term matrix with all features and tf weighting
  cast_dtm(document = id, term = word, value = n) %>%
  removeSparseTerms(sparse = .999)
# remove documents with no terms remaining
corpus_dtm <- corpus_dtm[unique(corpus_dtm$i), ]
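# fit an LDA model with 9 topics (LDA_VEM, the default fitting method)
corpus_lda12 <- LDA(corpus_dtm, k = 9, control = list(seed = 1234))
corpus_lda12_td <- tidy(corpus_lda12)
top_terms <- corpus_lda12_td %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
# reorder terms within each topic for plotting; the result has to be assigned,
# otherwise the reordering is discarded. topic stays numeric because the plots
# map it to a continuous fill scale
top_terms <- top_terms %>%
  mutate(term = reorder_within(term, beta, topic))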
# removing stop words characteristic of this field
topical_stop_words <- data.frame(word = c("senhor", "deus", "igrejas", "igreja", "aleluia",
                                          "glória", "jesus", "verso"))
full_stop_words <- union(PT_stop_words, topical_stop_words)
# remove stop words or n-grams beginning or ending with stop word
corpus_full_stop_words <- corpus_tokens %>%
  # separate ngrams into separate columns; shorter n-grams are padded
  # with NA on the right (fill = "right" avoids a warning per row)
  separate(col = word,
           into = c("word1", "word2", "word3", "word4", "word5"),
           sep = " ",
           fill = "right") %>%
  # find last word
  mutate(last = if_else(ngram == 5, word5,
                        if_else(ngram == 4, word4,
                                if_else(ngram == 3, word3,
                                        if_else(ngram == 2, word2, word1))))) %>%
  # keep the tokens where the first or last word is a stop word;
  # these are removed from the corpus in the anti_join() below
  filter(word1 %in% full_stop_words$word |
           last %in% full_stop_words$word) %>%
  select(ngram, token_id)
# convert to dtm
corpus_filtered_dtm <- corpus_tokens %>%
  # remove stop word tokens
  anti_join(corpus_full_stop_words, by = c("ngram", "token_id")) %>%
  # get count of each token in each document
  count(id, word) %>%
  # create a document-term matrix with all features and tf weighting
  cast_dtm(document = id, term = word, value = n) %>%
  removeSparseTerms(sparse = .999)
# remove documents with no terms remaining
corpus_filtered_dtm <- corpus_filtered_dtm[unique(corpus_filtered_dtm$i), ]
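# fit an LDA model with 9 topics (LDA_VEM, the default fitting method)
corpus_filtered_lda12 <- LDA(corpus_filtered_dtm, k = 9, control = list(seed = 1234))
corpus_filtered_lda12_td <- tidy(corpus_filtered_lda12)
top_terms_filtered <- corpus_filtered_lda12_td %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
# reorder terms within each topic for plotting; the result has to be assigned,
# otherwise the reordering is discarded. topic stays numeric because the plots
# map it to a continuous fill scale
top_terms_filtered <- top_terms_filtered %>%
  mutate(term = reorder_within(term, beta, topic))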
```
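The script drops an n-gram when its first or last word is a stop word, while stop words in the middle of an n-gram are allowed. A minimal base R sketch of that rule on toy data may make it easier to follow; the n-grams and the stop-word vector below are invented for illustration.

```r
# Toy n-grams and stop words (hypothetical, for illustration only)
ngrams <- c("saúde", "de", "saúde mental", "a fé", "fé e ansiedade", "crise de")
stop_words <- c("a", "de", "e", "o")

# Split each n-gram into its words, then find its first and last word
words_list <- strsplit(ngrams, " ", fixed = TRUE)
first_word <- vapply(words_list, function(w) w[1], character(1))
last_word  <- vapply(words_list, function(w) w[length(w)], character(1))

# Keep an n-gram only if neither its first nor its last word is a stop word
keep <- !(first_word %in% stop_words | last_word %in% stop_words)
ngrams[keep]
```

Here "fé e ansiedade" survives even though "e" is a stop word, because the stop word sits inside the n-gram rather than at either edge, whereas "a fé" and "crise de" are dropped.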