---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
library(magrittr)
library(dplyr)
library(ggplot2)
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# tardis: Text Analysis with Rules and Dictionaries for Inferring Sentiment (and more!)
<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version/tardis)](https://CRAN.R-project.org/package=tardis)
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![Codecov test coverage](https://codecov.io/gh/chris31415926535/tardis/branch/main/graph/badge.svg)](https://app.codecov.io/gh/chris31415926535/tardis?branch=main)
[![R-CMD-check](https://github.com/chris31415926535/tardis/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/chris31415926535/tardis/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->
TARDIS uses simple rules and dictionaries to analyze text. By
default it uses built-in dictionaries to measure sentiment, i.e. how happy
or sad text is. It handles negations, so it knows "not happy" means "sad",
and it handles modifiers, so it knows that "very happy" is happier than
"happy". TARDIS also supports Unicode emojis and multi-word tokens, so you
can tell it that "supreme court" is neutral instead of a combination of
"supreme" (positive) and "court" (neutral). TARDIS also supports user-defined
dictionaries and can be used to analyze constructs other than sentiment.
## Features
* Handles ASCII and UTF-8 emojis :) 👍
* Based on simple surveyable rules
* Highly customizable
* Pretty fast, uses cpp11
## Installation
The latest stable CRAN version can be installed as follows:
```{r, eval=FALSE}
install.packages("tardis")
```
You can install the latest development version of tardis from GitHub like so:
``` {r, eval = FALSE}
devtools::install_github("chris31415926535/tardis")
```
## Example
Let's find the sentiment of a few sentences:
```{r example}
library(tardis)
text <- c("I am happy.",
          "I am really happy.",
          "I am really happy!",
          "I am really not happy!")

tardis::tardis(text) %>%
  dplyr::select(sentences, score) %>%
  knitr::kable()
```
Tardis also handles blocks of text differently from most other sentiment-analysis algorithms, which treat a block of text as a single sentence. Instead, Tardis breaks each text into individual sentences, scores each one, and then returns the mean, standard deviation, and range of the sentence scores. This can be helpful for finding large swings in sentiment that could indicate irony or conflict in texts that may be close to neutral overall.
```{r}
text <- "This sentence is neutral. This one is really happy! This one is absolutely miserable."
tardis::tardis(text) %>%
  dplyr::select(sentences, score, score_sd, score_range) %>%
  knitr::kable()
```
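The aggregation step can be sketched in base R. This is a minimal illustration rather than Tardis's internal code: the sentence splitter is a naive regular expression, the per-sentence scores are made-up numbers, `summarise_scores()` is a hypothetical helper, and the range is assumed to be the maximum minus the minimum.

```r
text <- "This sentence is neutral. This one is really happy! This one is absolutely miserable."

# Naive splitter (an assumption): break on whitespace that follows ., ! or ?
sents <- strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]

# Hypothetical per-sentence scores, one per sentence above.
# These numbers are invented for illustration, not produced by tardis.
sentence_scores <- c(0, 0.6, -0.6)

# Hypothetical helper: collapse sentence scores into text-level summaries.
summarise_scores <- function(scores) {
  data.frame(
    score       = mean(scores),
    score_sd    = stats::sd(scores),
    score_range = max(scores) - min(scores)  # assumed definition of "range"
  )
}

s <- summarise_scores(sentence_scores)
print(s)
```

A mean near zero alongside a large `score_range` is the pattern described above: a text that looks neutral on average but contains strong swings.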
It can also help flag passive-aggressive hostility, like this message that's neutral overall but still clearly hostile:
```{r}
text <- "Die in a fire 😘"
tardis::tardis(text) %>%
  dplyr::select(sentences, score, score_sd, score_range) %>%
  knitr::kable()
```
Tardis also makes it easy to use custom dictionaries, which means it can be used to measure other constructs like emotion, rank texts based on their similarity to a custom dictionary derived from a cluster analysis or LDA, or many other text-based natural language analyses.
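To make the idea concrete, here is a toy base R sketch of scoring texts against a custom dictionary for a construct other than sentiment. It does not use Tardis itself: the `urgency_dict` table, its `token` and `score` column names, and the `score_text()` helper are all hypothetical, and the matching is a plain word-boundary search with none of Tardis's negation or modifier handling.

```r
# Hypothetical "urgency" dictionary, including one multi-word token.
# Tokens are assumed to contain no regex metacharacters.
urgency_dict <- data.frame(
  token = c("asap", "urgent", "right now", "whenever"),
  score = c(1, 1, 1, -1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sum the weights of every dictionary token found in a text.
score_text <- function(text, dict) {
  text <- tolower(text)
  hits <- vapply(dict$token, function(tok) {
    m <- gregexpr(paste0("\\b", tok, "\\b"), text, perl = TRUE)[[1]]
    sum(m > 0)  # gregexpr returns -1 when there is no match
  }, integer(1))
  sum(hits * dict$score)
}

texts <- c("Please reply ASAP, this is urgent.",
           "Get to it whenever you like.",
           "I need this right now!")

scores <- vapply(texts, score_text, numeric(1), dict = urgency_dict,
                 USE.NAMES = FALSE)

# Rank texts from most to least "urgent" under this toy dictionary.
ranking <- order(scores, decreasing = TRUE)
print(data.frame(text = texts[ranking], score = scores[ranking]))
```

The same pattern applies to any construct that can be expressed as a table of weighted tokens, such as terms taken from a topic model.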
## The algorithm in brief
Tardis first decomposes texts into tokens (words, emojis, or multi-word strings). Each token is scored based on the input dictionary, whether it is in ALL CAPS, and the three preceding tokens. Negations like "not" reverse and reduce a token's score, and modifiers either increase (e.g. "very") or decrease (e.g. "slightly") it. Sentences are scored by summing token scores, adjusting for punctuation, and scaling the results (nonlinearly) so they fall between -1 and 1. Text scores are means of sentence scores. Each of these steps can be tweaked or disabled by user-supplied parameters.

Tardis's algorithm is inspired by other approaches, notably VADER, but differs from it in three key respects: first, it is much more customizable; second, token score adjustments are all multiplicative, making the order of operations unimportant; and third, there are no special cases or exceptions, making the rules simpler and more intuitive.
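The token-scoring idea can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not Tardis's actual implementation: the dictionaries and every numeric factor below are invented, punctuation adjustments are left out, and the final scaling uses the VADER-style formula `x / sqrt(x^2 + 15)` as one possible nonlinear squashing function. The helper name `score_sentence()` is hypothetical.

```r
# Toy dictionaries; every value here is made up for illustration.
sentiments      <- c(happy = 1, sad = -1)
modifiers       <- c(very = 1.5, slightly = 0.5)  # multiply the token score
negations       <- c("not", "never")
negation_factor <- -0.75                          # reverse and reduce
allcaps_factor  <- 1.25                           # boost for ALL CAPS tokens

score_sentence <- function(sentence) {
  raw    <- strsplit(gsub("[[:punct:]]", "", sentence), "\\s+")[[1]]
  raw    <- raw[nzchar(raw)]
  tokens <- tolower(raw)
  total  <- 0
  for (i in seq_along(tokens)) {
    base <- sentiments[tokens[i]]
    if (is.na(base)) next                         # token not in the dictionary
    score <- unname(base)
    # ALL CAPS: the token equals its uppercase form and contains letters
    if (raw[i] == toupper(raw[i]) && raw[i] != tolower(raw[i])) {
      score <- score * allcaps_factor
    }
    # Look back at up to three preceding tokens
    window <- utils::tail(tokens[seq_len(i - 1)], 3)
    for (w in window) {
      if (w %in% names(modifiers)) score <- score * modifiers[[w]]
      if (w %in% negations)        score <- score * negation_factor
    }
    total <- total + score
  }
  total / sqrt(total^2 + 15)                      # squash into (-1, 1)
}

score_sentence("I am happy.")  # 0.25 with these toy values
```

Because every adjustment is a multiplication, "not very happy" and "very not happy" receive the same score in this sketch, which is the order-independence property described above.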
## Benchmarking
The major bottlenecks have been addressed using `cpp11` so the function is reasonably fast, handling over 10,000 sentences/second using test data from `stringr::sentences`:
```{r benchmark_plot, echo = FALSE, warning=FALSE, message=FALSE}
library(ggplot2)

# Time tardis::tardis() on increasingly large samples of stringr::sentences
iters <- c(250, 500, 1000, 2500, 5000, 10000)
input_text <- sample(stringr::sentences, size = 100000, replace = TRUE)
len <- numeric(length(iters))
t_median <- numeric(length(iters))

for (i in seq_along(iters)) {
  iter <- iters[[i]]
  r <- bench::mark(z <- tardis::tardis(input_text[1:iter], "body"))
  len[[i]] <- iter
  t_median[[i]] <- as.numeric(r$median)  # median run time in seconds
}

benchmark <- dplyr::tibble(length = len, time = t_median)

benchmark %>%
  ggplot(aes(x = length, y = time)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  labs(title = "tardis::tardis() Sentences/Second",
       subtitle = "Input data is random samples from stringr::sentences",
       y = "Seconds",
       x = "Sentences")
```
## Known issues / Possible future directions
* ASCII emojis are slow to process, so the default dictionary includes only some of them.
* The default dictionary merges data from two sources, one for text and ASCII emojis and another for UTF-8 emojis, and while I've tried to normalize them it's likely possible to improve on this.
* It would be good to do more testing/validation of the default settings.
* It would be good to have suggestions for threshold positive/negative values in various scenarios.
## Similar projects and packages
* Tardis was directly inspired by [VADER](https://github.com/cjhutto/vaderSentiment), which has an R implementation on CRAN in the package [vader](https://cran.r-project.org/package=vader), and an implementation I wrote that's not on CRAN called [tidyvader](https://github.com/chris31415926535/tidyvader). Tardis also incorporates sentiment data from the VADER project.
* [Tidytext](https://github.com/juliasilge/tidytext) is a wonderful package for text mining in R. Tardis incorporates some sentiment data from Tidytext.
## References
Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

Kralj Novak, P., Smailović, J., Sluban, B. & Mozetič, I. (2015). Sentiment of Emojis. PLoS ONE 10(12): e0144296. https://doi.org/10.1371/journal.pone.0144296

Hu, M. & Liu, B. (2004). Mining and summarizing customer reviews. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2004). Seattle, Washington, USA, Aug 22-25, 2004.