-
Notifications
You must be signed in to change notification settings - Fork 14
/
io05-importing.Rmd
720 lines (518 loc) · 35 KB
/
io05-importing.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
---
title: "Importing Data into R"
author: "Peter Higgins"
output: html_document
editor_options:
markdown:
wrap: sentence
---
```{r setup, include=FALSE}
library(tidyverse)
library(medicaldata)
library(readxl)
library(skimr)
library(here)
library(glue)
library(webexercises)
library(httr)
github_raw_link <- "https://github.com/higgi13425/rmrwr-book/raw/master/data/paulolol.xlsx"
paulolol_xlsx <- tempfile(fileext = ".xlsx")
req <- httr::GET(github_raw_link,
# write result to disk
write_disk(path = paulolol_xlsx))
# add/download files from https://www.cdc.gov/nchs/hus/contents2016.htm
```
# Importing Your Data into R
For most of this book, we will be using datasets from the {medicaldata} package.
These are easy to load.
If you type into the Console pane `medicaldata::scurvy` , you will get James Lind's scurvy dataset (actually, a reconstruction of what it might have looked like for his 12 participants).
If you want to save this data to an object in your work environment, you just need to assign this to a named object, like `scurvy`, like so, with an assignment arrow, composed of a less than sign (<) and a hyphen (-):
```{r}
scurvy <- medicaldata::scurvy
# now print the columns for id and treatment
scurvy %>%
select(study_id:treatment) |> #this selects columns
slice_tail(n = 7) #slices last seven rows
```
There are a number of medical datasets to explore and learn with, within the {medicaldata} package.
However, at some point, you will want to use R to work on your **own** data.
You may already be itching to get started on your own data.
This is a good thing.
Working with your own data, toward your own goals, will be a motivating example, and will help you learn R.
As you go through the different chapters, use the example data and exercises to get you started and to learn the principles, and then try what you have learned on your own data.
::: {.warning}
Reproducibility and Raw Data<br>
It is an important principle to *always* save an untouched copy of your raw data.
You can copy it to a new object, and experiment with modifying it, cleaning it, making plots, etc., but *always* leave the original data file untouched.
You want to create a completely reproducible, step-by-step, scripted trail from your raw data to your finished analysis and final report, and you can only do that if you preserve the original raw data.
That is the cornerstone of your analysis.
It is tempting to fix minor data entry errors, or other aspects of the raw data.
Do not do this - leave all errors intact in your raw data, and explicitly make edits with explanations of
- who made the edit
- when it was made
- what was changed
- and why the edit was made.
You should provide a justification for each change, and identify source documents to support the rationale. Every edit should be documented in your code, with who, when, what, and why.
:::
Now, on to the fun part.
Let's read in some data!
## Reading data with the {readr} package
Many of the standard data formats can be read with functions in the {readr} package. This package is part of the {tidyverse} meta-package.
These functions include:
- read_csv() for comma-separated values (\*.csv) files
- read_tsv() for tab-separated values (\*.tsv) files
- read_delim() for files with a different delimiter that you can specify (instead of commas or tabs, there might be semicolons), or you can let {readr} guess the delimiter in readr 2.0.
- read_fwf() for fixed width files
- read_table() for tabular files where columns are separated by white-space.
- read_log() is specifically for web log files
Let's read a csv file.
First, make sure that you have the {readr} package loaded (or the {tidyverse} meta-package, which includes {readr}).
You can load {readr} with the `library()` function, if you have already installed this package.
```{r, results='hide'}
library(readr)
# or you can use
library(tidyverse) # which will load 8 packages, including readr
# the other 7 packages are ggplot, tibble, tidyr, dplyr, purrr, stringr, and forcats
```
Note that this will *not* work if you do not already have the {readr} package installed on your computer.
You will get an error, like this: `Error in library(readr) : there is no package called 'readr'`
This is not a big problem - but you have to install the package first, before you can load it.
You only need to do this once, like buying a book, and putting it in your personal library.
Each time you use the package, you have to pull the book off the shelf, with `library(packagename)`.
To install the {readr} package, we can install the whole {tidyverse} package, which will come in handy later.
Just enter the following in your Console pane to install the {tidyverse} meta-package:
`install.packages('tidyverse')`
Note that quotes around 'tidyverse' are required, as tidyverse is not yet a known object or package in your working environment.
Once the {tidyverse} package is installed, you can use `library(tidyverse)` without quotes, as it is a known (installed) package.
We will also want to use the {glue} package shortly to access files on the web. Practice installing this package. Copy and run the code chunk below in your RStudio Console panel to install and load the {glue} package.
```{r, results='hide'}
install.packages('glue')
library(glue)
```
OK, after that detour, we should be all caught up - you should be able to run `library(tidyverse)` or `library(readr)` or `library(glue)` without an error.
Now that you have {readr} loaded, you can read in some csv data.
Let's start with a file named `scurvy.csv` in a `data` folder on GitHub. You will need to glue together the url_stem (the first part of the path to the web address) and "data/scurvy.csv" (the folder and file name) to get the full web address. Copy and run the code chunk below to see the url_stem and the dataset.
```{r}
url_stem <- "https://raw.githubusercontent.com/higgi13425/rmrwr-book/master/"
url_stem
read_csv(glue(url_stem, 'data/scurvy.csv'))
```
Let's look at what was extracted from the csv file.
This starts after the url_stem (web address) is printed out.
The Console pane has a
1) the number of rows (12) and columns (8) in the dataset.
2) The Column specification, followed by
3) the data in rectangular format.
In the Column specification, the delimiter between items of data is identified (a comma), and a listing of variables with the `character` (chr) data type is printed. There are no non-character data types in this particular dataset.
The 'guessing' of what data type is appropriate for each variable is done 'automagically', as the read_csv() function reads the first 1000 rows, then guesses what data type is present in each column (variable).
It is often conservative, and in this case, made all of these columns into character variables (also called strings).
You could argue that the col_character() assignment should be numeric for the study_id variable, or that the Likert scales used for outcomes like gum_rot_d6 and skin_sores_d6 should be coded as ordinal categorical variables, known as ordered **factors** in R.
You will learn to control these data types during data import with the `spec()` argument.
The last piece of output is the data rectangle itself.
This is first identified as a 'tibble', which is a type of data table, with 12 rows and 8 columns, in `# A tibble: 12 x 8`.
This is followed by a header row of variable names, and just below that is the data type (`<chr>` for character) for each column.
Then, on the left are gray row numbers (which are not actually part of the data set), followed by (to the right) rows of data.
A tibble, by default, only prints out only as many rows of data as will fill the console, and no more columns than will fill the width of your current console window.
The other columns (variable names) are listed in order at the bottom of the tibble in gray type.
Now, by simply reading in the data, you can look at it, but you can't do anything with it, as you have not saved it as an object in your working Environment.
If you want to do things with this data, and make them last, you have to *assign* the data to an object, and give it a name.
To do this, you need to use an assignment arrow, as below, where the data is assigned to the object, `scurvy_data`.
```{r}
scurvy_data <- read_csv(glue(url_stem, 'data/scurvy.csv'))
```
Now this is saved to `scurvy_data` in your working Environment.
You can now look in the Environment tab (top right pane in RStudio) and see that scurvy_data has now appeared in the Data section, with 12 observations of 8 variables.
This is not a file written (saved) to disk, but this dataset is now available in the working environment as an assigned data object.
You can click on the blue arrow next to `scurvy_data` to show the variable names and variable types for this dataset.
You can also click on the name `scurvy_data` to open up a Viewer window in the top left (Source) pane. You can scroll around to inspect all of the data. You can also get this view by typing `View(scurvy_data)` into the Console pane.
You can also print out the data to the Console pane at any time by typing `scurvy_data` into the Console, or into a script. Try this out in the Console pane.
### Test yourself on scurvy
Inspect the data in the Viewer, and find each answer.
- How many limes did the British seamen in the citrus arm receive each day?
`r webexercises::mcq(c('3', '2', answer = 'zero', '1.5'))`
You can also start with the `scurvy_data` object, and *do* things to this data object, like summarize it, or graph it, or calculate total_symptom_score in a data pipeline.
Once you have assigned your data to an object, it will stick around in that R session for later use. Your changes will disappear when you close the session, unless you save it to disk, with functions like `save()` or `write_csv()`.
The csv (comma separated values) format is a very common data format, and most data management programs have a way to export `*.csv` files.
The csv format is simple, is not owned by anyone, and works across platforms.
However, it can occasionally be tricky if you have commas in the middle of a variable like `degree`, with entries like 'md, phd' in a column that is mostly 'md'.
The read_csv function is pretty smart, and usually gets the number of columns right, but this is something to watch out for.
**Notice** that read_csv had no problem with the dosing of vinegar ("two spoonfuls, three times a day") in the scurvy dataset, despite multiple commas.
If you happen to come across a tab-separated values file, *read_tsv()* works the same way.
Both of these functions have reasonable defaults, so that most of the time, you just have to use the path to your file (usually on your hard drive, rather than on the web) as the only argument.
On occasion, though, you will want to take control of some of the other arguments, and use something other than the defaults.
::: {.warning}
### What is a path?<br>
A **path** is the trail through the folders in your hard drive (or on the web) that the computer needs to follow to find a particupar file. Paths can look something like: <br>
- C:/Documents/Rcode/my_file.R
- ~/User/Documents/Rcode/my_file.R
and can get pretty complicated to keep track of. One particularly nice feature of Projects in RStudio is that the project directory is always your home, or root directory. You can make your life easier by using the {here} package, which memorizes the path to your project, so you can just write `here(my_file.R)`, and not have to worry about making a typo in a long path name.
:::
When your data has no column names (headers), read_csv will (by default) assume that the first row of the data is the column names.
To fix this, add the argument, **col_names** = FALSE.
You can also assign your own **col_names** by setting a vector, like c("patient_id", "treatment", "outcome") [note that `c()` concatenates data items into a vector] equal to to col_names, as below. Copy and run this code chunk in your RStudio Console pane to see how the output is different.
```{r}
read_csv(file = glue(url_stem, 'data/scurvy.csv'),
col_names = c("pat_id", "arm", "dose", "gums", "skin", "weak", "lass", "fit"))
```
In this case, when we set our own **col_names**, there are now **13** rows in our data rectangle, and the original column headers are now listed as the first row of data.
We can fix this with the **skip** argument within the parentheses of the `read_csv()` function, which has a default of 0.
We can skip as many lines as we want, which can be helpful if you have an Excel file with a lot of blank lines or commentary at the top of the spreadsheet.
When we set skip = 1 in this case, we get a cleaner dataset, without variable names as data.
```{r}
read_csv(file = glue(url_stem, 'data/scurvy.csv'),
col_names = c("pat_id", "arm", "dose", "gums", "skin", "weak", "lass", "fit"),
skip = 1)
```
Now we don't have extra column names as data, and we are back to 12 rows.
Also note that in this code chunk, we put each argument to the function on its own line, with commas between them.
This is a good practice, to make your code more readable.
More arguments to `read_csv()`: you can also set **n_max** to a particular number of rows to be read in (the default is infinity, or `Inf`) You might want a smaller number if you have a very large dataset and limited computer memory.
Another important argument (option) for both read_csv and read_tsv is **col_types**, which lets you take control of the column types during the data
What if you want to take more control of the import process with `r`ead_csv()`?
You can add a `col_types` argument to the `read_csv()` function.
You can copy the Column specifications from the first attempt at importing, and then make some edits. You can get the column specifications as guessed by {readr} by running the `spec()` function on the scurvy_data object.
Try this out in the code chunk below.
```{r}
spec(scurvy_data)
```
This sets the data type for each column (variable).
This is helpful if you want to change a few of these.
Take a look at the next code chunk below.
I have added the col_types argument to _read_csv()_, and set it equal to the Column specifications (copied from above).
Then I edited study_id to `col_integer()`, and treatment to `col_factor()`.
Run the code chunk below to see how this works.
The `glimpse` function will give an overview of the new `scurvy_cols` object that I assigned the data to.
```{r}
scurvy_cols <- read_csv(
file = glue(url_stem, 'data/scurvy.csv'),
col_types = cols(
study_id = col_integer(),
treatment = col_factor(),
dosing_regimen_for_scurvy = col_character(),
gum_rot_d6 = col_character(),
skin_sores_d6 = col_character(),
weakness_of_the_knees_d6 = col_character(),
lassitude_d6 = col_character(),
fit_for_duty_d6 = col_character()
)
)
glimpse(scurvy_cols)
```
You can see that study_id is now considered the integer data type (`<int>`), and the treatment variable is now a factor (`<fct>`).
You can choose as data types:
- col_integer ()
- col_character()
- col_number() (handles \#s with commas)
- col_double() (to specify decimal \#s)
- col_logical() (only TRUE and FALSE)
- col_date(format = "") - may need to define format
- col_time(format = "") - col_datetime(format = "")
- col_factor(levels = "", ordered = TRUE) - you may want to set levels and ordered if ordinal.
- col_guess() - is the default
- col_skip() if you want to skip a column
The read_csv() guesses may be fine, but you can take more control if needed.
This *col_types()* approach gives you fine control of each column.
But it is a lot of typing (even with copy-paste from `spec()`).
Sometimes you want to set all the column types with a lot less typing, and you don't need to set levels for factors, or formats for dates.
You can do this by setting col_types to a string, in which each letter specifies the column type for each column.
Run the example below by clicking on the green arrow at the top right of the code chunk, in which I use i for col_integer, c for col_character, and f for col_factor.
```{r}
scurvy_cols2 <- read_csv(
file = glue(url_stem, 'data/scurvy.csv'),
col_types = "ifcffff")
glimpse(scurvy_cols2)
```
### Try it Yourself
Now try this yourself with a *.tsv file.
The file `strep_tb.tsv` is located in the same GitHub folder, and you can use the same url_stem.
In the example code chunk below, there are several blanks.
Copy this code chunk (use the copy button in the top right of the code chunk - hover to find it) to your RStudio Console pane.
Edit it to make the three changes listed, and run the code chunk as directed below.
1) Fill in the second part of the read_xxx() function correctly to read this file
2) Fill in the correct file name to complete the path
This version of the code chunk will read in every column (all 13 variables) as the character data type.
This is OK, but not quite right.
3) Now edit the `col_types` string to make:
- both doses numeric (n or d) (variables 3,4)
- `gender` a factor (f) (var 5)
- all 4 of the `baseline` variables into factors (var 6-9)
- skip over strep_resistance and radiologic_6m - set as hyphens (`-`) (var 10-11)
- `rad_num` into an integer (i) (var 12)
- `improved` into a logical (l) (var 13)
```{r, error=TRUE, eval=FALSE}
strep_tb_cols <- read_---(
file = glue(url_stem, 'data/----.tsv'),
col_types = "ccccccccccccc")
glimpse(strep_tb_cols)
```
`r hide('Show the Solution')`
```{r solution}
strep_tb_cols <- read_tsv(
file = glue(url_stem, 'data/strep_tb.tsv'),
col_types = "ccddfffff--il")
glimpse(strep_tb_cols)
```
`r unhide()`
## Reading Excel Files with readxl
While file types like \*.csv and \*.tsv are common, it is also common to use Microsoft Excel or an equivalent for data entry.
There are a lot of reasons that this is not a great idea (see the video at [link](https://www.youtube.com/watch?v=9f-hpJbjKZo)), but Excel is so ubiquitous that it is often used for data entry.
Fortunately, the {readxl} package provides functions for reading excel files.
The *read_excel()* function works nearly the same as *read_csv()*.
But there are a few bonus arguments (options) in *read_excel()* that are really helpful.
The *read_excel()* function includes helpful arguments like **skip**, **col_names**, **col_types**, and **n_max**, much like *read_csv()*.
In addition, read_excel() has a **sheet** argument, which lets you specify which sheet in an excel workbook you want to read.
The default is the first worksheet, but you can set this to sheet = 4 for the 4th worksheet from the left, or sheet = "raw_data" to get the correct worksheet named "raw_data".
You can also set the range argument to only read in a particular range of cells, like range = "B2:G14".
Let's set up your local RStudio to read in a spreadsheet named **paulolol.xlsx**. Copy and run the code below in your local version of RStudio to prepare the file. This will pull an excel file from the web and write a copy to your local storage.
```{r}
library(httr)
github_raw_link <- "https://github.com/higgi13425/rmrwr-book/raw/master/data/paulolol.xlsx"
paulolol_xlsx <- tempfile(fileext = ".xlsx")
req <- httr::GET(github_raw_link,
# write result to disk
write_disk(path = paulolol_xlsx))
```
Below is an example of how to read in this Excel worksheet. Copy and run this code locally, and see what you get in the Console for output.
```{r, error=TRUE}
library(readxl)
read_excel(paulolol_xlsx,
sheet = 1,
skip = 0)
```
You will notice that
1. a lot of the variable names became `..2` , `..3`, etc.
2. the first two rows have a lot of NAs (??).
3. the last 3 rows have a lot of NAs, and some mean values(!?).
This is because the person who created this spreadsheet did not make this a clean rectangle of data. This is a common problem with Excel. It is easy to add extra, descriptive header lines, and the variable names don't even appear until row 3. Further, the user calculated some mean values for sbp and hr in rows 12-14, which are being read in like regular observations (because while you can figure out what happened, Excel can't).
This is a problem. We want a clean rectangle of data, with the true variable names at the top. There are actually only 8 rows of observations in this Excel spreadsheet.
Try editing the skip and n_max arguments until you can read in a clean rectangle of data below. Copy the code hung
```{r skip, error=TRUE, eval=FALSE}
read_excel(paulolol_xlsx,
sheet = 1,
skip = ___,
n_max = ___)
```
`r hide('Show Solution')`
```{r skip-solution, error=TRUE, eval=FALSE}
read_excel(paulolol_xlsx,
sheet = 1,
skip = 3,
n_max = 8)
```
`r unhide()`
### Test yourself on read_excel()
- which argument in read_excel lets you jump past initial rows of commentary?
`r webexercises::mcq(c("jump", 'sheet', answer = 'skip', 'path'))`
- which argument in read_excel lets you pick which spreadsheet tab to read?
`r webexercises::mcq(c("pick", answer = 'sheet', 'skip', 'path'))`
- How many missing (NA) values are in the paulolol dataset when run with skip = 1?
`r webexercises::fitb('15')`
- what should the **range** argument be to read in these data cleanly (without needing n_max or skip)?
`r webexercises::mcq(c("B2:F8", answer = 'A4:F12', 'A1:L30', 'B4:K15'))`
## Bringing in data from other Statistical Programs (SAS, Stata, SPSS) with the {haven} package
It is common to have the occasional collaborator who still uses one of the older proprietary statistical packages. They will send you files with filenames like `data.sas7bdat` (SAS), `data.dta` (Stata), or `data.sav` (SPSS).
The {haven} package makes reding in these data files straightforward.
** Set up with web files**
Copy and run the code chunk below to import a web file for SAS.
The {haven} package provides functions like *read_sas()*, *read_dta()* and *read_sav()* to enable you to read proprietary file formats into R.
```{r}
url_stem <- "https://raw.githubusercontent.com/higgi13425/rmrwr-book/master/"
haven::read_sas(glue(url_stem, "data/blood_storage.sas7bdat"))
```
Similarly, you can read and write Stata files with *read_dta()* and *write_dta()*. Try out the example below.
```{r, error=TRUE}
haven::read_dta(glue(url_stem, "data/blood_storage.dta"))
```
You can also read and write SPSS files with *read_sav()* and *write_sav()*. Try out the example below.
```{r, error=TRUE}
haven::read_sav(glue(url_stem, "data/strep_tb.sav"))
```
You can learn more about how to read and write other data file types at [haven](https://haven.tidyverse.org)
## Other strange file types with rio
Once in a while, you will run into a strange data file that is not a csv or Excel or or a proprietary format from a common statistical package (SAS, Stata, SPSS). These might include data formats from Systat, Minitab, RDA, feather, or others.
This is when the {rio} package comes to the rescue. The name, rio, stands for R Input and Output. The {rio} package looks at the file extension (like .csv, .xls, .dta) to guess the file type, and then applies the appropriate method to read in the data. The import() function in {rio} makes data import much easier. You don't always have the fine control seen in {readr} or {readxl}, but {rio} is an all-purpose tool that can get nearly any data format into R.
Try this out with the code chunk below.
Just replace the *filename* in the code chunk with one of the files named below.
Try to use the same `import()` function to read
- scurvy.csv
- strep_tb.tsv
- paulolol.xlsx
- blood_storage.sas7bdat
```{r, error = TRUE}
rio::import(glue(url_stem, "data/filename"))
```
It can be very convenient to use {rio} for unusual file types. You can read more about the [rio package](https://www.rdocumentation.org/packages/rio/versions/0.5.29)
## Data exploration with glimpse, str, and head/tail
Once you have a dataset read into your working Environment (see the Environment tab in RStudio), you will want to know more about it.
There are several helpful functions and packages to get you started in exploring your data.
### Taking a glimpse with *glimpse()*
The *glimpse()* function is part of the tidyverse, and is a helpful way to see a bit of all of the variables in a dataset.
Let's try this with the scurvy dataset, which we have already assigned to the object `scurvy` in the working Environment.
Just put the object name as an argument within the *glimpse()* function (inside the parentheses), as below.
Run the code chunk below to get the glimpse() output.
```{r}
glimpse(scurvy)
```
The *glimpse()* function output tells you that there are 12 rows (observations) and 8 columns (variables).
Then it lists each of the 8 variables, followed by the data type, and the first few values (or as much as will fit in the width of your Console pane).
We can see that study_id and dosing_regimen_for_scurvy are both of the `character` (aka string) data type, and the other 6 variables are factors.
### Try this out yourself.
What can you learn about the strep_tb dataset with *glimpse()*?
Copy and edit the code chunk below in your local RStudio and run it to find out about the strep_tb dataset, using *glimpse()*.
```{r, error=TRUE, eval=FALSE}
library(tidyverse)
glimpse(----)
```
### Test yourself on strep_tb
- which variable is the logical data type?
`r webexercises::mcq(c('baseline_esr', answer = 'improved', 'patient_id', 'baseline_condition'))`
- which variable is the dbl numeric data type?
`r webexercises::mcq(c('arm', 'patient_id', answer = 'rad_num', 'baseline_temp'))`
- How many observations are in this dataset?
`r webexercises::fitb('107')`
### Examining Structure with *str()*
The *str()* function is part of the {utils} package in base R, and can tell you the structure of any object in R, whether a list, a dataset, a tibble, or a single variable.
It is very helpful for reality-checking your data, especially when you are getting errors in your code.
A common source of errors is trying to run a function that requires a particular data structure or data type on the wrong data structure or data type.
Sometimes just checking the data structure will reveal the source of an error.
The *str()* function does largely what *glimpse()* does, but provides a bit more detail, with less attractive formatting.
Run the code chunk below to see the output of *str()* on the scurvy dataset.
```{r}
str(scurvy)
```
The str() output starts by telling you that scurvy is a tibble, which is a modern sort of data table.
A tibble will by default only print 10 rows of data, and only the number of columns that will fit in your Console pane.
Then you see [12 x 8], which means that there are 12 rows and 8 columns - the default in R is to always list rows first, then columns (R x C notation).
Then you learn that this is an S3 object, that is a tbl_df (tibble), and also a tbl, and also a data.frame).
Then you get a listing of each variable, data type, and a bit of the data, much like *glimpse()*.
Another extra detail provided by str() is that it tells you some of the levels of each factor variable, and then shows these as integers (note that the data for factors is actually stored as integers [in part to save storage space]).
### Test yourself on the scurvy dataset
- what is the dose of cider?
`r webexercises::mcq(c('25 drops', answer = '1 quart per day', 'one-half rood', '3 gtt'))`
- how many levels of gum_rot are there?
`r webexercises::fitb('4')`
- Which numeric value indicates 'fit for duty'?
`r webexercises::fitb('1')`
Note that you can also use *str()* and *glimpse()* on a single variable.
You often use this approach when you get an error message that tells you that you have the wrong data type.
Try this with the strep_tb dataset variable `patient_id` by running the code chunk below.
Imagine that you wanted to get the mean of `patient_id`.
You got a **warning** that pointed out that the `argument is not numeric or logical`.
So you run *str()* to find out the data structure of this variable.
```{r}
mean(medicaldata::strep_tb$patient_id)
str(medicaldata::strep_tb$patient_id)
```
This shows you that patient_id is actually a character variable.
If you wanted to find the mean value, you would have to change it to numeric (with *as.numeric()*) first.
The glimpse() function provides identical output to str() for a variable in a table.
```{r}
glimpse(medicaldata::strep_tb$patient_id)
```
You can choose whether you prefer the details of *str()* or the nicer formatting of *glimpse()* for yourself.
### Examining a bit of data with *head()* and *tail()*
Oftentimes, you want just a quick peek at your data, especially after a merge or a mutate, to make sure that things have gone as expected.
This is where the base R functions *head()* and *tail()* can be helpful.
As you might have guessed, these functions give you a quick view of the head (top 6 rows) and tail (last 6 rows) of your data.
Try this out with scurvy or strep_tb.
```{r}
head(medicaldata::scurvy)
tail(medicaldata::strep_tb)
```
Note that since these are tibbles, they will only print the columns that will fit into your Console pane.
You can see all variables and the whole width (though it will wrap around to new lines) by either (1) converting these to a data frame first, to avoid tibble behavior, or (2) by using print, which has a width argument that allows you to control the number of columns (it also has an *n* argument that lets you print all rows).
Run the code chunk below to see how this is different.
```{r}
head(as.data.frame(medicaldata::scurvy))
print(tail(medicaldata::strep_tb, width = Inf))
```
It is actually much easier to see the full width and height of a data set by scrolling, which you can do when you *View()* a dataset in the RStudio viewer. Try this out in the Console pane, with `View(strep_tb)`.
### Test yourself on the printing tibbles
- What function (and argument) let you print all the columns of a tibble?
`r webexercises::mcq(c(answer = 'print(x, width=Inf)', 'filter(starts_with("width"))','sum(na.rm=TRUE)', 'print(wide)'))`
- What function (and argument) let you print all the *rows* of a tibble?
`r webexercises::mcq(c('filter(ends_with("all"))','median(na.rm=TRUE)', answer = 'print(x, n=Inf)', 'print(long)'))`
If you just want a quick view of a few critical columns of your data, you can obtain this with the *select()* function, as in the code chunk below.
Note that if you want to look at a random sample of your dataset, rather than the head or tail, you can use *slice_sample()* to do this. This sampling can be helpful if your data are sorted, or the head and tail rows are not representative of the whole dataset.
See how this is used by running the code chunk below, which uses a 10% random sample of strep_tb to check that the mutate steps to generate the variables `rad_num` and `improved` worked correctly.
```{r}
#check that radiologic_6m, rad_num, and improved all match
medicaldata::strep_tb %>%
slice_sample(prop = 0.1) %>%
select(radiologic_6m, rad_num, improved)
```
## More exploration with skimr and DataExplorer
Once you have your data read in, you usually want to get an overview of this new dataset. While there are many ways to explore a dataset, I will introduce two:
- skim() from the {skimr} package
- create_report() from the {DataExplorer} package
You can get a more detailed look at a dataset with the {skimr} package, which has the `skim()` function, gives you a quick look at each variable in the dataset, with simple output in the Console. Try this out with the **strep_tb** dataset.
Run the code chunk below, applying the `skim()` function to the strep_tb dataset.
```{r}
library(skimr)
skimr::skim(medicaldata::strep_tb)
```
### Test yourself on the `skim()` results
- How many females participated in the strep_tb study?
`r webexercises::mcq(c('52', answer = '59', '48', '37'))`
- What proportion of subjects in strep_tb were improved?
`r webexercises::mcq(c('0.71', '0.493', answer = '0.514', '0.55'))`
- What is the mean value for rad_num in strep_tb?
`r webexercises::mcq(c('7.44', answer = '3.93', '1.89', '4.7'))`
A fancier approach is taken by the DataExplorer package, which can create a full html report with correlations and PCA analysis. Copy and run the chunk below to see what the Data Profiling Report looks like.
```{r, eval=FALSE}
DataExplorer::create_report(medicaldata::strep_tb)
```
### Test yourself on the `create_report()` results
- What percentage of the baseline_esr values were missing?
`r webexercises::mcq(c('9.3%', '0.09%', answer = '0.93%', '1%'))`
- How many total data points ('observations') are there in the strep_tb dataset?
`r webexercises::mcq(c('10,000', answer = '107', '1,391', '17'))`
You can choose which you prefer, the simpler approach of {skimr} vs the fancier reports of {DataExplorer.}
## Practice loading data from multiple file types
## Practice saving (writing to disk) data objects in formats including csv, rds, xls, xlsx and statistical program formats
## How do readr and readxl parse columns?
## What are the variable types?
| Variable Type | Long form | Abbreviation|
|---------------|-----------|:-----------:|
|Logical (TRUE/FALSE)| col_logical()| l |
|Integer | col_integer() | i |
|Double | col_double() | d |
|Character | col_character() | c |
|Factor (nominal or ordinal) | col_factor(levels, ordered) | f |
|Date | col_date(format) | D |
|Time | col_time(format) | t |
|Date & Time | col_datetime(format) | T |
|Number | col_number() | n |
|Don't import | col_skip() | - |
|Default Guessing | col_guess() | ? |
## Controlling Parsing
## Chapter Challenges
1. There is a file named "paulolol.xlsx" with the path of 'data/paulolol.xlsx'. A picture of this problematic file is shown below.<br>
![paulolol](images/paulolol.png)
- Read in this file with the {readxl} package. Just for fun, try this see how this turns out with no additional arguments.
- Be sure to skip the problematic non-data first few rows
- Be sure to exclude the problematic non-data calculations below the table.
Solution to Challenge 1: paulolol.xlsx
`r hide("Show Me Solution 1")`
```{r}
read_excel(path = 'data/paulolol.xlsx',
skip = 3,
n_max = 8)
```
`r unhide()`
2. Our intrepid Investigator has inserted a chart in place of his data table on sheet 1, and moved the data table to a 2nd sheet named 'data', and placed in the top left corner, at the suggestion of his harried statistician, in a new file with the path of 'data/paulolol2.xlsx"
- Try reading this file in with read_excel()
- read the help file `help(read_excel)` to figure out how to read in the data from this excel file.
Solution to Challenge 2: paulolol2.xlsx
`r hide("Show Me Solution 2")`
```{r}
read_excel(path = 'data/paulolol2.xlsx',
sheet = 'data')
```
`r unhide()`
## Future forms of data ingestion
- https://www.datacamp.com/community/tutorials/r-data-import-tutorial?utm_source=adwords_ppc&utm_campaignid=1658343521&utm_adgroupid=63833880415&utm_device=c&utm_keyword=%2Bread%20%2Bdata%20%2Br&utm_matchtype=b&utm_network=g&utm_adpostion=&utm_creative=469789579368&utm_targetid=aud-392016246653:kwd-309793905111&utm_loc_interest_ms=&utm_loc_physical_ms=9016851&gclid=Cj0KCQjwxJqHBhC4ARIsAChq4auwh82WzCiJsUDzDOiaABetyowW0CXmTLbUFkQmnl1pn4Op9xcCcdQaAhMWEALw_wcB
- reading data from the web with readr
- reading data from Google Sheets with {googlesheets}
- reading data from REDCap, and tidying
- reading data from web tables with rvest