-
Notifications
You must be signed in to change notification settings - Fork 9
/
Copy pathimport-export.Rmd
460 lines (333 loc) · 15.3 KB
/
import-export.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
---
title: 'Import/export'
---
```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse=TRUE, cache=F, message=F, warning=F)
library(tidyverse)
library(broom)
library(pander)
```
## Importing data {- #importing-data}
If you have data outside of R, _the simplest way to import it is to first save
it as a comma or tab-separated text file_, normally with the file extension
`.csv` or `.txt`^[This is easy to achieve in Excel and most other stats packages
using the `Save As...` menu item].
Let's say we have file called `angry_moods.csv` in the same directory as our
`.Rmd` file. We can read this data using the `read_csv()` function from the
`readr` package[^readr]:
[^readr]:
There are also standard functions built into R, such as `read.csv()` or
`read.table()` for importing data. These are fine if you can't install the
`readr` package for some reason, but they are quite old and the default
behaviour is sometimes counterintuitive. I recommend using the `readr`
equivalents: `read_csv()` or `read_tsv()`.
```{r}
angry.moods <- readr::read_csv('data/angry_moods.csv')
head(angry.moods)
```
As you can see, when loading the `.csv` file the `read_csv()` makes some
assumptions about the _type_ of data the file contains. In this case, all the
columns contain integer values. It's worth checking this message to make sure
that stray cells in the file you are importing don't cause problems when
importing. Excel won't complain about this sort of thing, but R is more strict
and won't mix text and numbers in the same column.
A common error is for stray notes or text values in a spreadsheet to cause a
column which should be numeric to be converted to the `character` type.
Once it's loaded, you can use this new dataset like any other:
```{r}
pairs(angry.moods)
```
### Importing data over the web {- #importing-data-from-the-web}
One neat feature of the `readr` package is that you can import data from the
web, using a URL rather than a filename on your local computer. This can be
really helpful when sharing data and code with colleagues. For example, we can
load the `angry_moods.csv` file from a URL:
```{r, message=F, cache=T, eval=F}
angry.moods.from.url <- readr::read_csv(
"https://raw.githubusercontent.com/benwhalley/just-enough-r/master/angry_moods.csv")
head(angry.moods.from.url)
```
### Importing from SPSS and other packages {- #importing-proprietary-formats}
This is often more trouble than it's worth. If using Excel for example, it's
best just to save your data a csv file first and import that.
But if you really must use other formats see
<https://www.datacamp.com/community/tutorials/r-data-import-tutorial>.
## Saving and exporting {-}
Before you start saving data in csv or any other format, as yourself: "Do I need
to save _this_ dataset, or should I simply save my raw data and code?"
Oftentimes it's best to keep only your raw datafiles, with the R code you used
to process them. This keeps your disk tidier, and avoids confusion with multiple
versions of files.
If it takes a long time to process your data though you might want to save
interim steps. And if you share your data (which you should) you might also want
to save simplified or anonymised versions of it, in widely-accessible formats.
### Use CSV files {- #use-csv}
Comma-separated-values files are a plain text format which are idea for storing
and sharing your data. They are:
- Understood by almost every piece of software, ever
- Will be readable in future
- Perfect for storing 2D data (like dataframes)
- Readable by humans (just open them in Notepad)
Commercial formats like Excel, SPSS (.sav) and Stata (.dta) don't have these
properties.
Although CSV has some disadvantages, they are all easily overcome if you
[save the steps of your data processing and analysis in your R code](#save-intermediate-steps),
see below.
Saving a dataframe to .csv is as simple as:
```{r, eval=F}
readr::write_csv(mtcars, 'mtcars.csv')
```
If you run this within an RMarkdown document, this will create the new csv file
in the same directory as your `.Rmd` file.
[You can also use the `write.csv()` function in base R, but this version from
`readr` is faster and has more sensible defalts (e.g. it doesn't write rownames,
but does save column names in the first row)]{.tip}
#### Save processes, not just outcomes {- #save-intermediate-steps}
Many students (and academics) make errors in their analyses because they process
data by hand (e.g. editing files in Excel) or use GUI tools to run analyses.
In both cases these errors are hard to identify or rectify because only the
outputs of the analysis can be saved, and _no record has been made of how these
outputs were produced_.
In contrast, if you do your data processing and analysis in R/RMarkdown you
benefit from a concrete, repeatable series of steps which can be
checked/verified by others. This can also save lots of time if you need to
processing additional data later on (e.g. if you run more participants).
Some principles to follow when working:
- Save your raw data in the simplest possible format, in CSV
- Always include column names in the file
- Use descriptive names, but with a regular strucuture.
- Never include spaces or special characters in the column names. Use
underscores (`_`) if you want to make things more readable.
- Make names <20 characters in length if possible
#### Saving interim steps {-}
If you are saving data to use again later in R, the best format is RDS. Saving
files to RDS [is covered in a later section (click to see)](#rds-files).
If you are saving interim steps but think you might possibly want to access it
from other programmes in future use csv though.
To save something using RDS:
```{r, eval=F}
# create a huge df of random numbers...
massive.df <- data_frame(nums = rnorm(1:1e6))
saveRDS(massive.df, file="massive.RDS")
```
Then later on you can load it like this:
```{r, eval=F}
restored.massive.df <- readRDS('massive.RDS')
```
[If you do this in RMarkdown, by default the RDS files will be saved in the same
directory as your .Rmd file.]{.tip}
### Archiving, publication and sharing {-}
If you want to share data with someone else, or open it in a different software
package, using '.csv' format is strongly recommended unless some other format is
common in your field.
When archiving data, or sharing with others, you must document what each column
measures, and any processing steps used to create the file. RMarkdown is a good
way of doing this because it can combine the processing with narrative
explaining what is being done, and why.
## Dealing with multiple files {- #multiple-raw-data-files}
Often you will have multiple data files files - for example, those produced by
[experimental software](http://www.psychopy.org).
This is one of the few times when you might have to do something resembling
'real programming', but it's still fairly straightforward.
In the [repeated measures Anova example later on in this guide](#trad-rm-anova)
we encounter some data from an experiment where reaction times were recorded in
25 trials (`Trial`) before and after (`Time`) one of 4 experimental
manipulations (`Condition = {1,2,3,4}`). There were 48 participants in total:
```{r, include=F}
expt.data <- readRDS("data/expt.data.RDS")
```
```{r, include=F, eval=F}
# export the expt.data to multiple files as an example
filterandsave <- function(df){
person = first(df$person)
filename <- paste0("data/mutliple-file-example/person", person, ".csv" )
write.csv(df, file=filename, row.names=F)
data_frame(person=person, filename=filename)
}
expt.data %>%
group_by(person) %>%
do(filterandsave(.))
```
Let's say that we have saved all the files are in a single directory, and these
are numbered sequentially: `person01.csv`, `person02.csv` and so on.
Using the `list.files()` function we can list the contents of a directory on the
hard drive:
```{r}
list.files('data/multiple-file-example/')
```
The `list.files()` function creates a [vector](#vectors) of the names of all the
files in the directory.
At this point, there are many, many ways of importing the contents of these
files, but below we use a technique which is concise, reliable, and less
error-prone than many others. It also continues to use the `dplyr` library.
This approach has 3 steps:
1. Put all the names of the .csv files into a dataframe.
2. For each row in the dataframe, run a function which imports the file as a
dataframe.
3. Combine all these dataframes together.
###### Putting the filenames into a dataframe {-}
Because `list.files` produces a vector, we can make them a column in a new
dataframe:
```{r}
raw.files <- data_frame(filename = list.files('data/multiple-file-example/'))
```
And we can make a new column with the complete path (i.e. including the
directory holding the files), using the [`paste0`](#paste) which combines
strings of text. We wouldn't have to do this if the raw files were in the same
directory as our RMarkdown file, but that would get messy.
```{r}
raw.file.paths <- raw.files %>%
mutate(filepath = paste0("data/multiple-file-example/", filename))
raw.file.paths %>%
head(3)
```
###### Using `do()` {- #dplyr-do}
We can then use the `do()` function in `dplyr::` to import the data for each
file and combine the results in a single dataframe.
The `do()` function allows us to run any R function for each group or row in a
dataframe.
The means that our original dataframe is broken up into chunks (either groups of
rows, if we use `group_by()`, or individual rows if we use `rowwise()`) and each
chunk is fed to the function we specify. This function must do it's work and
return a new dataframe, and these are then combined into a single larger
dataframe.
So in this example, we break our dataframe of filenames up into individual rows
using `rowwise` and then specify the `read_csv` function which takes the name of
a csv file, and returns the content as a dataframe
([see the importing data section](#importing-data)).
For example:
```{r, message=F}
raw.data <- raw.file.paths %>%
# 'do' the function for each row in turn
rowwise() %>%
do(., read_csv(file=.$filepath))
```
We can check these data look OK by sampling 10 rows at random:
```{r}
raw.data %>%
sample_n(10) %>%
pander()
```
##### Using custom functions with `do()` {-}
In this example, each of the raw data files included the participant number (the
`person` variable). However, this isn't always the case.
This isn't a problem though, if we create our own
[helper function](#helper-functions) to import the data. Writing small functions
in R is very easy, and the example below wraps the `read.csv()` function and
adds a new colum, `filename` to the imported data frame which would enable us to
keep track of where each row in the final combined dataset came from.
This is the helper function:
```{r}
read.csv.and.add.filename <- function(filepath){
read_csv(filepath) %>%
mutate(filepath=filepath)
}
```
In English, you should read this as:
> "Create a new R function called `read.csv.and.add.filename` which expects to
> be passed a path to a csv file as an input. This function reads the csv file
> at the path (converting it to a dataframe), and adds a new column containing
> the original file path it read from. It then returns this dataframe."
We can use our helper function with `do()` in place of the bare `read_csv`
function we used before:
```{r}
raw.data.with.paths <- raw.file.paths %>%
rowwise() %>%
do(., read.csv.and.add.filename(.$filepath))
raw.data.with.paths %>%
sample_n(10) %>%
pander()
```
At this point you might need to
[use the `extract()` or `separate()` functions](#extract-to-split-column-names)
to post-process the filename and re-create the `person` variable from this
(although in this case that's already been done for us).
<!-- TODO1 -->
<!-- - File handling and import -->
<!-- - Writing a function for `do()` which returns a dataframe -->
<!-- - Joins and merges -->
<!-- ## Error checking {-} -->
<!-- - `999`, `666` and `*`: the marks of the beast! -->
## Joining different datasets {- #joining-data}
Many analyses combine data from different sources.
Even within a single study, you may find that you have different datafiles (e.g.
spreadsheets) for different sorts of information. For example you might have
1. Multiple csv files output by experimental software.
2. A spreadsheet of demographic characteristics of participants (e.g. their age,
gender, handedness)
3. Questionnaire data, collected before the reaction time task.
Your main analysis might want to model individual RTs using predictors including
age, handedness or some personality variable. Thus, you probably want data which
look something like this:
```{r, include=F}
set.seed(1234)
df <- data_frame(person=c(1,1,1,1,2,2,2,2),
trial = c(c(1:3, "..."), c(1:3, "...")), condition = c(rep("A",4), rep("B",4)),
female=c(rep(T, 4), rep(F, 4)), left.handed=c(rep(F, 4), rep(F, 4)),
age = c(rep(19, 4), rep(24, 4)), extraversion = c(rep(50, 4), rep(34, 4))) %>%
mutate(RT = rnorm(n(), 300,40))
```
```{r}
df %>%
pander()
```
#### The raw data {-}
In this example, we could imagine our raw data files looking like this:
```{r, include=F}
rts <- df %>%
select(person, trial, RT)
demographics <- df %>%
group_by(person) %>%
summarise_at(funs(first), .vars=vars(age, left.handed, female)) %>%
bind_rows(., data_frame(person=3, age=33, left.handed=T, female=F))
personality <- df %>%
group_by(person) %>%
summarise_at(funs(first), .vars=vars(extraversion)) %>%
bind_rows(., data_frame(person=3, extraversion=47))
```
```{r}
rts %>%
pander()
```
```{r}
demographics %>%
pander()
```
And:
```{r}
personality %>%
pander()
```
#### Joining the parts {-}
To create the combined data file we want, we have to _join_ the different files
together.
As you might have noticed, though, the RT's file has many observations per
participant, whereas the demographics and personality data has one row per
person.
What we need is to smush this wide format data into the RTs file, such that
values get repeated for every row of each participants' data.
To do this, the simplest method is to use the `dplyr::left_join()` function:
```{r}
left_join(rts, demographics, by="person") %>%
pander
```
In this example I explicitly set `id="person"` to let R know which variable to
use to match the rows of data, although you don't _have_ to, and `dplyr` can
normally guess. You _do_ have to use the same variable name in both files though
(so, `person` and `participant` couldn't be matched, for example).
##### Other types of joins {- .tip}
As the dplyr manual states: left joins return for two dataframes, x and y will
return all rows from x, and all columns from both x and y. Rows in x with no
match in y will have NA values in the new columns. If there are multiple matches
between x and y, all combinations of the matches are returned.
Left joins are probably the most useful, but there are other types which can be
useful. To see all of them look at the help files (`help('join', 'dplyr')`).
To show one common example, if we wanted to check whether we were missing RT
data for one of our participants, we could use an `anti_join`:
```{r}
anti_join(demographics, rts, by="person") %>%
pander
```
This table lists the data for _all the people in the demographics file, for whom
we don't have RT data_. This can be a useful way of checking you haven't mislaid
a raw datafile (e.g. forgot to copy it from the lab machine!).