forked from trinker/textstem
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
192 lines (132 loc) · 6.49 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
---
title: "textstem"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
md_document:
toc: true
---
```{r, echo=FALSE, warning=FALSE}
desc <- suppressWarnings(readLines("DESCRIPTION"))
regex <- "(^Version:\\s+)(\\d+\\.\\d+\\.\\d+)"
loc <- grep(regex, desc)
ver <- gsub(regex, "\\2", desc[loc])
verbadge <- sprintf('<a href="https://img.shields.io/badge/Version-%s-orange.svg"><img src="https://img.shields.io/badge/Version-%s-orange.svg" alt="Version"/></a></p>', ver, ver)
verbadge <- ''
pacman::p_load(textstem)
pacman::p_load_current_gh('trinker/numform')
nr <- numform::f_comma(length(presidential_debates_2012$dialogue))
nw <- numform::f_comma(sum(stringi::stri_count_words(presidential_debates_2012$dialogue), na.rm = TRUE))
````
[![Project Status: Active - The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/0.1.0/active.svg)](https://www.repostatus.org/#active)
[![Build Status](https://travis-ci.org/trinker/textstem.svg?branch=master)](https://travis-ci.org/trinker/textstem)
[![Coverage Status](https://coveralls.io/repos/trinker/textstem/badge.svg?branch=master)](https://coveralls.io/r/trinker/textstem?branch=master)
[![](https://cranlogs.r-pkg.org/badges/textstem)](https://cran.r-project.org/package=textstem)
`r verbadge`
![](tools/textstem_logo/r_textstem.png)
**textstem** is a tool-set for stemming and lemmatizing words. Stemming is a process that removes affixes. Lemmatization is the process of grouping inflected forms together as a single base form.
# Functions
The main functions, task category, & descriptions are summarized in the table below:
| Function | Task | Description |
|-------------------------------|-------------|--------------------------------------------|
| `stem_words` | stemming | Stem words |
| `stem_strings` | stemming | Stem strings |
| `lemmatize_words` | lemmatizing | Lemmatize words |
| `lemmatize_strings` | lemmatizing | Lemmatize strings |
| `make_lemma_dictionary_words` | lemmatizing | Generate a dictionary of lemmas for a text |
# Installation
To download the development version of **textstem**:
Download the [zip ball](https://github.com/trinker/textstem/zipball/master) or [tar ball](https://github.com/trinker/textstem/tarball/master), decompress and run `R CMD INSTALL` on it, or use the **pacman** package to install the development version:
```r
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/textstem")
```
# Contact
You are welcome to:
- submit suggestions and bug-reports at: <https://github.com/trinker/textstem/issues>
- send a pull request on: <https://github.com/trinker/textstem/>
- compose a friendly e-mail to: <[email protected]>
# Examples
The following examples demonstrate some of the functionality of **textstem**.
## Load the Tools/Data
```{r, message=FALSE, warning=FALSE}
if (!require("pacman")) install.packages("pacman")
pacman::p_load(textstem, dplyr)
data(presidential_debates_2012)
```
## Stemming Versus Lemmatizing
Before moving into the meat these two examples let's highlight the difference between stemming and lemmatizing.
### "Drive" Stemming vs. Lemmatizing
```{r}
dw <- c('driver', 'drive', 'drove', 'driven', 'drives', 'driving')
stem_words(dw)
lemmatize_words(dw)
```
### "Be" Stemming vs. Lemmatizing
```{r}
bw <- c('are', 'am', 'being', 'been', 'be')
stem_words(bw)
lemmatize_words(bw)
```
## Stemming
Stemming is the act of removing inflections from a word not necessarily ["identical to the morphological root of the word" (wikipedia)](https://en.wikipedia.org/wiki/Stemming). Below I show stemming of several small strings.
```{r}
y <- c(
'the dirtier dog has eaten the pies',
'that shameful pooch is tricky and sneaky',
"He opened and then reopened the food bag",
'There are skies of blue and red roses too!',
NA,
"The doggies, well they aren't joyfully running.",
"The daddies are coming over...",
"This is 34.546 above"
)
stem_strings(y)
```
## Lemmatizing
### Default Lemma Dictionary
Lemmatizing is the ["grouping together the inflected forms of a word so they can be analysed as a single item" (wikipedia)](https://en.wikipedia.org/wiki/Lemmatisation). In the example below I reduce the strings to their lemma form. `lemmatize_strings` uses a lookup dictionary. The default uses [Mechura's (2016)](http://www.lexiconista.com) English lemmatization list available from the [**lexicon**](https://cran.r-project.org/package=lexicon) package. The `make_lemma_dictionary` function contains two additional engines for generating a lemma lookup table for use in `lemmatize_strings`.
```{r}
y <- c(
'the dirtier dog has eaten the pies',
'that shameful pooch is tricky and sneaky',
"He opened and then reopened the food bag",
'There are skies of blue and red roses too!',
NA,
"The doggies, well they aren't joyfully running.",
"The daddies are coming over...",
"This is 34.546 above"
)
lemmatize_strings(y)
```
### Hunspell Lemma Dictionary
This lemmatization uses the [**hunspell**](https://CRAN.R-project.org/package=hunspell) package to generate lemmas.
```{r}
lemma_dictionary_hs <- make_lemma_dictionary(y, engine = 'hunspell')
lemmatize_strings(y, dictionary = lemma_dictionary_hs)
```
### koRpus Lemma Dictionary
This lemmatization uses the [**koRpus**](https://CRAN.R-project.org/package=koRpus) package and the [TreeTagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) program to generate lemmas. You'll have to get TreeTagger set up, preferably in your machine's root directory.
```{r}
lemma_dictionary_tt <- make_lemma_dictionary(y, engine = 'treetagger')
lemmatize_strings(y, lemma_dictionary_tt)
```
### Lemmatization Speed
It's pretty fast too. Observe:
```{r}
tic <- Sys.time()
presidential_debates_2012$dialogue %>%
lemmatize_strings() %>%
head()
(toc <- Sys.time() - tic)
```
That's `r nr` rows of text, or `r nw` words, in `r round(as.numeric(toc), 2)` seconds.
## Combine With Other Text Tools
This example shows how stemming/lemmatizing might be complemented by other text tools such as `replace_contraction` from the **textclean** package.
```{r}
library(textclean)
'aren\'t' %>%
lemmatize_strings()
'aren\'t' %>%
textclean::replace_contraction() %>%
lemmatize_strings()
```