-
Notifications
You must be signed in to change notification settings - Fork 10
/
slide_loading_data.Rmd
278 lines (201 loc) · 8.12 KB
/
slide_loading_data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
---
title: "Reading (and writing) data in R"
subtitle: "R Foundations for Data Analysis"
author: "Marcin Kierczak"
keywords: "bioinformatics, course, scilifelab, nbis, R"
output:
xaringan::moon_reader:
encoding: 'UTF-8'
self_contained: false
chakra: 'assets/remark-latest.min.js'
css: 'assets/slide.css'
lib_dir: libs
include: NULL
nature:
ratio: '4:3'
highlightLanguage: r
highlightStyle: github
highlightLines: true
countIncrementalSlides: false
slideNumberFormat: "%current%/%total%"
---
exclude: true
count: false
```{r,echo=FALSE,child="assets/header-slide.Rmd"}
```
<!-- ------------ Only edit title, subtitle & author above this ------------ -->
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, width=60)
```
```{r,echo=FALSE,message=FALSE,warning=FALSE}
# load the packages you need
#library(dplyr)
#library(tidyr)
#library(stringr)
#library(ggplot2)
#library(mkteachr)
```
---
name: reading_data
# Reading data
* Can be one of the most consuming and cumbersome aspects of data analysis.
* R provides ways to read and write data stored on different media (e.g.: file, database, url) and in different formats.
* Package `foreign` contains a number of functions to import less common data formats.
---
name: reading_tables
# Reading tables
We can use the `read.table()` function. It is a nice way to read your data into a data frame.
The function is declared in the following way:
```{r, echo=T, eval=F}
read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
row.names, col.names, as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
# or just
read.table(file)
```
---
name: read_table_params
# `read.table()` parameters
You can read all about the `read.table()` function using `?read.table`
The most important arguments are:
* **file** – the path to the file that contains data, e.g. `/path/to/my/file.csv`
* **header** – a logical indicating whether the first line of the file contains variable names,
* **sep** – a character determining variable delimiter, e.g. `","` for csv files,
* **quote** – a character telling R which character surrounds strings,
* **dec** – character determining the decimal separator,
* **row/col.names** – vectors containing row and column names,
* **na.strings** – a character used for missing data,
* **nrows** – how many rows should be read,
* **skip** – how many rows to skip,
* **as.is** – a vector of logicals or numbers indicating which columns shall not be converted to factors,
* **fill** – add NA to the end of shorter rows,
* **stringsAsFactors** – a logical. Rather self explanatory.
---
name: read_table_sibs
# `read.table` and its siblings
The `read.table` function has some siblings, functions with particular arguments pre-set to a specific value to spare some time:
* `read.csv()` and `read.csv2()` with comma and semicolon as default `sep` and dot and comma as `dec` respectively,
* `read.delim()` and `read.delim2()` for reading tab-delimited files.
We, however, most often use the canonical `read.table()` as it is the most flexible one.
---
name: read_table_example
# `read.table` — example use
```{r read.table, echo=T}
tab <- read.table(file = 'data/slide_loading_data/2014-04-07_phenos2.csv',
sep = ' ',
header = T)
tab[1:5, 1:3]
class(tab$reg_no)
```
---
name: handling_errors
# What if you encounter errors?
* R documentation `?` and `??`
* Google – just type R and copy the error you got without your variable names,
* Open the file using a text editor and see if you can spot anything unusual –
* e.g. has the header line the same number of columns as the first line?
--
# Useful terminal commands for debugging (Linux/OsX)
* `cat phenos.txt | awk -F';' '{print NF}'` prints the number of words in each row. `-F';'` says that semicolon is the delimiter,
* `head -n 5 phenos.txt` prints the 5 first lines of the file,
* `tail -n 5 phenos.txt` prints the 5 last lines of the file,
* `head -n 5 phenos.txt | tail -n 2` will print lines 4 and 5...
* `wc -l phenos.txt` will print the number of lines in the file
* `head -n 2 phenos.txt > test.txt` will write the first 2 lines to a new file
--
If it still does not give you a clue — just try to load first line of the file.
--
If this still did not help, split the file in two equal-size parts. Check which part gives the error. Split this part into halves and check which 1/4 gives the error... It is faster than you think!
---
name: writing
# Writing with `write.table()`
`read.table()` has its counterpart, the `write.table()` function (as well ass its siblings, like `write.csv()`). You can read more about it in the documentation, let us show some examples:
```{r write.table, echo=T, eval=F}
vec <- rnorm(10)
write.table(vec, '') # write to screen
write.table(vec, file = 'vector.txt')
# write to the system clipboard, handy!
write.table(vec, 'clipboard', col.names=F, row.names=F)
# or on OsX
clip <- pipe("pbcopy", "w")
write.table(vec, file=clip)
close(clip)
# To use in a spreadsheet
write.csv(vec, file = 'spreadsheet.csv')
```
---
name: write_big_data
# Writing big data
* HINT: `write.table()` is rather slow on big data – it checks types for every column and row and does separate formatting to each. If your data consists of only one type of data, convert it to a matrix using `as.matrix` before you write it!
* You may want to use function `scan()` that reads files as vectors. The content does not have to be in the tabular form. You can also use scan to read data from keyboard: `typed.data <- scan()`
* If data are written as fixed-width fields, use the `read.fwf()` function.
* Also check out the `readLines()` function that enables you to read data from any stream.
---
name: read_xls_matlab
# Read data in xls/xlsx and Matlab
```{r xls, eval=F, echo=T}
library(readxl)
data <- read_xlsx('myfile.xlsx')
```
```{r matlab, eval=F, echo=T}
library(R.matlab)
data <- readMat("mydata.mat")
```
---
name: remote_data
# Working with remote data
```{r url.data, eval=F, cache=T, echo=T}
url <- 'https://en.wikipedia.org/wiki/List_of_countries_by_average_wage'
conn <- url(url, 'r')
raw.data <- readLines(conn)
raw.data[1:3]
```
But data is often tabularized...
```{r url.data.2, eval=F, cache=T, echo=T}
library(rvest)
html <- read_html(url)
tables <- html_nodes(html, 'table')
data <- html_table(tables[4])[[1]]
data[1:5, ]
```
---
name: databases
# Working with databases
It is also relatively easy to work with different databases. We will focus on MySQL and present only one example that uses the *RMySQL* package (check also *RODBC* and *RPostgreSQL*).
```{r db, echo=T, eval=F}
library(RMySQL)
db.conn <- dbConnect(MySQL(), user='me',
password='qwerty123',
dbname='genes',
host='127.0.0.237')
query <- dbSendQuery(db.conn, 'SELECT * FROM table7')
data <- fetch(query, n = - 1)
```
---
name: capabilities
# Capabilities
If you are getting some errors, e.g. trying to connect to a url, you may check whether your system (and R) support particular type of file or connection:
```{r capabil, echo=T, size='tiny'}
capabilities()
```
<!-- --------------------- Do not edit this and below --------------------- -->
---
name: end_slide
class: end-slide, middle
count: false
# See you at the next lecture!
```{r, echo=FALSE,child="assets/footer-slide.Rmd"}
```
```{r,include=FALSE,eval=FALSE}
# manually run this to render this document to HTML
#rmarkdown::render("presentation_demo.Rmd")
# manually run this to convert HTML to PDF
#pagedown::chrome_print("presentation_demo.html",output="presentation_demo.pdf")
```