---
title: "R Data Structures"
---
As in many programming languages, understanding how data are stored and manipulated is important to getting the most out of the experience. In these next few sections, we will introduce some basic R data types and structures as well as some general approaches for working with them.
# Vectors
In R, even a single value is a vector with length=1.
```{r}
z = 1
z
length(z)
```
In the code above, we "assigned" the value 1 to the variable named `z`. Typing `z` by itself is an "expression" that returns a result which is, in this case, the value that we just assigned. The `length` function takes an R object and returns the number of elements it contains. There are numerous ways of asking R about what an object represents, and `length` is one of them.
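
As a brief illustration of a few other ways to ask R about an object, the sketch below applies the base R functions `class`, `typeof`, and `is.vector` to the same single value. It is only meant as a quick look, not a complete tour of R's inspection tools.

```{r}
z = 1
# class() reports the high-level class of the object
class(z)
# typeof() reports the internal storage type
typeof(z)
# a single value is still a vector
is.vector(z)
```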
Vectors can contain numbers, strings (character data), logical values (`TRUE` and `FALSE`), or other "atomic" data types (Table \@ref(tab:simpletypes)). *Vectors cannot contain a mix of types!* We will introduce another data structure, the R `list`, for situations when we need to store a mix of base R data types.
Table: (\#tab:simpletypes) Atomic (simplest) data types in R.

Data type    Stores
-----------  ------------------------
numeric      floating point numbers
integer      integers
complex      complex numbers
factor       categorical data
character    strings
logical      TRUE or FALSE
NA           missing
NULL         empty
function     function type

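
To make the "no mixing" rule concrete, here is a small sketch contrasting `c()`, which coerces its inputs to one common type, with `list()`, which keeps each element's own type. The variable names `mixed_vec` and `mixed_list` are chosen only for illustration.

```{r}
# c() coerces everything to a single type (here, character)
mixed_vec = c(1, "a", TRUE)
class(mixed_vec)
# list() keeps each element's own type
mixed_list = list(1, "a", TRUE)
sapply(mixed_list, class)
```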
## Creating vectors
Character vectors (also sometimes called "string" vectors) are entered with each value
surrounded by single or double quotes; either is acceptable, but they
must match. They are always displayed by R with double quotes. Here are some examples of creating vectors:
```{r}
# examples of vectors
c('hello','world')
c(1,3,4,5,1,2)
c(1.12341e7,78234.126)
c(TRUE,FALSE,TRUE,TRUE)
# note how in the next case the TRUE is converted to "TRUE"
# with quotes around it.
c(TRUE,'hello')
```
We can also create vectors as "regular sequences" of numbers. For example:
```{r}
# create a vector of integers from 1 to 10
x = 1:10
# and backwards
x = 10:1
```
The `seq` function can create more flexible regular sequences.
```{r}
# create a vector of numbers from 1 to 4 skipping by 0.3
y = seq(1,4,0.3)
```
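As a further sketch of regular sequences, `seq` also accepts a `length.out` argument that fixes the number of values rather than the step size, and the base R function `rep` repeats values. The variable names below are illustrative only.

```{r}
# five evenly spaced values from 0 to 1
s = seq(0, 1, length.out = 5)
s
# repeat the whole vector twice
r1 = rep(c(1, 2), times = 2)
r1
# repeat each element twice
r2 = rep(c(1, 2), each = 2)
r2
```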
We can also create a new vector by concatenating existing vectors.
```{r}
# create a sequence by concatenating two other sequences
z = c(y,x)
z
```
## Vector Operations
Operations on a single vector are typically done element-by-element. For example, when we add `2` to a vector, `2` is added to each element and a new vector of the same length is returned.
```{r}
x = 1:10
x + 2
```
If the operation involves two vectors, the following rules apply. If the vectors are the same length, R simply applies the operation to each pair of elements.
```{r}
x + x
```
If the vectors are different lengths, but one length is a multiple of the other, R
reuses ("recycles") the shorter vector as needed.
```{r}
x = 1:10
y = c(1,2)
x * y
```
If the vectors are different lengths and one length is *not* a multiple of the
other, R still reuses the shorter vector as needed *and* issues a
warning.
```{r}
x = 1:10
y = c(2,3,4)
x * y
```
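One way to see what this reuse of the shorter vector amounts to is to spell it out by hand. The sketch below uses `rep` with `length.out` to stretch the shorter vector to the length of the longer one; the names `long_vec`, `short_vec`, and `recycled` are hypothetical and used only here.

```{r}
long_vec = 1:10
short_vec = c(2, 3, 4)
# repeat short_vec until it is as long as long_vec
recycled = rep(short_vec, length.out = length(long_vec))
recycled
# same values as long_vec * short_vec, but without the warning
long_vec * recycled
```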
Typical operations include multiplication (`*`), addition,
subtraction, division, and exponentiation (`^`), but many other functions
in R also operate on whole vectors; such functions are called "vectorized".
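
As a small sketch of vectorized behavior beyond basic arithmetic, the base R function `sqrt` is applied to every element of its input, while `sum` reduces a vector to a single value. The vector `v` is just an example.

```{r}
v = c(1, 4, 9, 16)
# sqrt() is applied to every element
sqrt(v)
# so is exponentiation
v^2
# some functions instead reduce a vector to a single value
sum(v)
```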
## Logical Vectors
Logical vectors are vectors composed of only the values `TRUE` and
`FALSE`. Note the all-upper-case and no quotation marks.
```{r}
a = c(TRUE,FALSE,TRUE)
# we can also create a logical vector from a numeric vector
# 0 becomes FALSE, every other number becomes TRUE
b = c(1,0,217)
d = as.logical(b)
d
# test if a and d are the same at every element
all.equal(a,d)
# We can also convert from logical to numeric
as.numeric(a)
```
### Logical Operators
Some operators like `<, >, ==, >=, <=, !=` can be used to create logical
vectors.
```{r}
# create a numeric vector
x = 1:10
# testing whether x > 5 creates a logical vector
x > 5
x <= 5
x != 5
x == 5
```
We can also assign the results to a variable:
```{r}
y = (x == 5)
y
```
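Logical vectors can also be combined with one another. The sketch below, using the illustrative names `n`, `in_range`, and `outside`, shows the element-wise operators `&` (and) and `|` (or), along with `any` and `all`, which summarize a logical vector as a single value.

```{r}
n = 1:10
# & is element-wise AND
in_range = (n > 3) & (n < 8)
in_range
# | is element-wise OR
outside = (n <= 3) | (n >= 8)
outside
# any() and all() summarize a logical vector as a single value
any(n > 9)
all(n > 0)
```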
## Indexing Vectors
In R, an index is used to refer to a specific element or
set of elements in a vector (or other data structure). R uses `[` and `]` to perform indexing,
although other approaches to getting subsets of larger data
structures are common in R.
```{r}
x = seq(0,1,0.1)
# create a new vector from the 4th element of x
x[4]
```
We can even use other vectors to perform the "indexing".
```{r}
x[c(3,5,6)]
y = 3:6
x[y]
```
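Two further indexing idioms are worth a quick sketch: a negative index drops elements rather than selecting them, and an index on the left-hand side of an assignment replaces elements. The names `idx_demo` and `dropped` are used only for this example.

```{r}
idx_demo = seq(0, 1, 0.1)
# a negative index drops elements instead of selecting them
dropped = idx_demo[-1]
length(dropped)
# indexing on the left of an assignment replaces elements
idx_demo[c(1, 2)] = c(100, 200)
idx_demo[1:3]
```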
Combining the concept of indexing with the concept of logical vectors
results in a very powerful combination.
```{r}
# use help('rnorm') to figure out what is happening next
myvec = rnorm(10)
# create logical vector that is TRUE where myvec is >0.25
gt1 = (myvec > 0.25)
sum(gt1)
# and use our logical vector to create a vector of myvec values that are >0.25
myvec[gt1]
# or <=0.25 using the logical "not" operator, "!"
myvec[!gt1]
# shorter, one line approach
myvec[myvec > 0.25]
```
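Because `rnorm` draws random numbers, the results above change on every run. The sketch below assumes we want repeatable output and so calls `set.seed` first (the seed value 42 is arbitrary). It also shows `which`, which turns a logical vector into positions, and `mean`, which gives the proportion of `TRUE` values. The names `draws` and `big` are illustrative.

```{r}
# set.seed() makes the random draws repeatable; 42 is an arbitrary choice
set.seed(42)
draws = rnorm(10)
big = draws > 0.25
# which() converts a logical vector into the positions that are TRUE
which(big)
# the mean of a logical vector is the proportion of TRUE values
mean(big)
```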
## Character Vectors, A.K.A. Strings
R uses the `paste` function to concatenate strings.
```{r}
paste("abc","def")
paste("abc","def",sep="THISSEP")
paste0("abc","def")
## [1] "abcdef"
paste(c("X","Y"),1:10)
paste(c("X","Y"),1:10,sep="_")
```
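In addition to `sep`, `paste` has a `collapse` argument that joins the elements of the result into a single string. A minimal sketch, with illustrative variable names:

```{r}
letters3 = c("a", "b", "c")
# sep joins the arguments element by element
pasted = paste(letters3, 1:3, sep = "_")
pasted
# collapse joins the elements of the result into one string
collapsed = paste(letters3, collapse = "-")
collapsed
```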
We can count the number of characters in a string.
```{r}
nchar('abc')
nchar(c('abc','d',123456))
```
Pulling out parts of strings is also sometimes useful.
```{r}
substr('This is a good sentence.',start=10,stop=15)
```
Another common operation is to replace part of a string with something else (a find-and-replace).
```{r}
sub('This','That','This is a good sentence.')
```
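Note that `sub` replaces only the first match in each string. The related base R function `gsub` replaces every match, as this short sketch shows (the variable names are illustrative):

```{r}
sentence = 'This is a good sentence.'
# sub() replaces only the first match
first_only = sub('s', 'S', sentence)
first_only
# gsub() replaces every match
every_match = gsub('s', 'S', sentence)
every_match
```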
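When we want to find all strings that match some pattern, we can use `grep`, whose name comes from "globally search for a regular expression and print". By default, `grep` returns the positions of the matching elements; with `value=TRUE`, it returns the matching strings themselves.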
```{r}
grep('bcd',c('abcdef','abcd','bcde','cdef','defg'))
grep('bcd',c('abcdef','abcd','bcde','cdef','defg'),value=TRUE)
```
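A closely related base R function, `grepl`, returns a logical vector with one `TRUE` or `FALSE` per element, which fits naturally with logical indexing. A small sketch, with illustrative names `words` and `has_bcd`:

```{r}
words = c('abcdef','abcd','bcde','cdef','defg')
# grepl() returns TRUE/FALSE for every element instead of positions
has_bcd = grepl('bcd', words)
has_bcd
# so it can be used directly for logical indexing
words[!has_bcd]
```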
## Missing Values, AKA “NA”
R has a special value, “NA”, that represents a “missing” value, or *Not Available*, in a
vector or other data structure. Here, we just create a vector to experiment.
```{r}
x = 1:5
x
length(x)
```
```{r}
is.na(x)
x[2] = NA
x
```
The length of `x` is unchanged, but there is one value that is marked as "missing" by virtue of being `NA`.
```{r}
length(x)
is.na(x)
```
We can remove `NA` values by using indexing. In the following, `is.na(x)` returns a logical vector the
length of `x`. The `!` is the logical _NOT_ operator and converts `TRUE` to `FALSE` and vice-versa.
```{r}
x[!is.na(x)]
```
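Many summary functions return `NA` when any value is missing. As an alternative to removing the `NA` values by indexing, functions such as `sum` and `mean` accept an `na.rm` argument. The vector `with_na` below is only an example.

```{r}
with_na = c(1, NA, 3, 4, 5)
# most summaries return NA if any value is missing
sum(with_na)
# na.rm = TRUE drops the missing values before computing
sum(with_na, na.rm = TRUE)
mean(with_na, na.rm = TRUE)
```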
## Factors
A factor is a special type of vector, normally used to hold a
categorical variable--such as smoker/nonsmoker, state of residency, zipcode--in many statistical functions. Such vectors have class “factor”. Factors are primarily used in Analysis of Variance (ANOVA) or other situations when "categories" are needed. When a factor is used as a predictor variable, the corresponding indicator variables are created (more later).
A note of caution: factors in R often *appear* to be character vectors
when printed, but you will notice that they do not have double quotes
around them. They are stored in R as integer codes with a set of labels (the "levels"), so
sometimes you will note that the factor *behaves* like a numeric vector.
```{r}
# create the character vector
citizen<-c("uk","us","no","au","uk","us","us","no","au")
# convert to factor
citizenf<-factor(citizen)
citizen
citizenf
# convert factor back to character vector
as.character(citizenf)
# convert to numeric vector
as.numeric(citizenf)
```
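The numeric behavior matters most when the labels themselves look like numbers. In the sketch below, the hypothetical factor `dose` shows that `as.numeric` returns the integer codes rather than the labels; converting through `as.character` first recovers the original values.

```{r}
dose = factor(c("10", "20", "10", "50"))
levels(dose)
# as.numeric() returns the integer codes, not the labels
as.numeric(dose)
# convert via character to recover the original numbers
as.numeric(as.character(dose))
```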
R stores many data structures as vectors with "attributes" and "class" (just so you have seen this).
```{r}
attributes(citizenf)
class(citizenf)
# note that after unclassing, we can see the
# underlying numeric structure again
unclass(citizenf)
```
Tabulating factors is a useful way to get a sense of the "sample" set available.
```{r}
table(citizenf)
```
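A table of counts can be processed further. As a minimal sketch, `sort` orders the categories by frequency and `prop.table` converts counts to proportions. The factor is rebuilt here under the illustrative name `citizen_demo` so the chunk stands on its own.

```{r}
citizen_demo = factor(c("uk","us","no","au","uk","us","us","no","au"))
counts = table(citizen_demo)
# sort the categories from most to least common
sort(counts, decreasing = TRUE)
# convert counts to proportions
prop.table(counts)
```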