---
output: html_notebook
---
```{r}
install.packages(c("tidyverse","dslabs"))
library(dslabs)
library(tidyverse)
```
# 3.2 The very basics
Before we get started with the motivating dataset, we need to cover the very basics of R.
### 3.2.1 Objects
Suppose a high school student asks us for help solving several quadratic equations of the form $ax^2+bx+c = 0$.
The quadratic formula gives us the solutions:
$$\frac{-b\pm\sqrt{b^2 - 4ac}}{2a}$$
which of course change depending on the values of _a_, _b_ and _c_. One advantage of programming languages is that we can define variables and write expressions with these variables, similar to how we do so in math, but obtain a numeric solution. We will write out general code for the quadratic equation below, but if we are asked to solve $x^2+x-1=0$, then we define:
```{r}
a <- 1
b <- 1
c <- -1
```
which stores the values for later use. We use <- to assign values to the variables.
We can also assign values using = instead of <-, but we recommend against using = to avoid confusion.
Copy and paste the code above into your console to define the three variables. Note that R does not print
anything when we make this assignment. This means the objects were defined successfully. Had you made
a mistake, you would have received an error message.
To see the value stored in a variable, we simply ask R to evaluate **a** and it shows the stored value:
```{r}
a
```
A more explicit way to ask R to show us the value stored in **a** is using **print** like this:
```{r}
print(a)
```
We use the term _object_ to describe stuff that is stored in R. Variables are examples, but objects can also be more complicated entities such as functions, which are described later.
### 3.2.2 The workspace
As we define objects in the console, we are actually changing the _workspace_. You can see all the variables saved in your workspace by typing:
```{r}
ls()
```
In RStudio, the _Environment_ tab shows the values.
We should see **a**, **b** and **c**. If you try to recover the value of a variable that is not in your workspace, you receive an error. For example, if you type **x** you will receive the following message: **Error: object 'x' not found.**
Now since these values are saved in variables, to obtain a solution to our equation, we use the quadratic formula:
```{r}
(-b + sqrt(b^2 - 4*a*c) ) / (2*a)
(-b - sqrt(b^2 - 4*a*c) ) / (2*a)
```
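As a quick sanity check, here is a minimal sketch that plugs each solution back into $ax^2+bx+c$. The names `root_1` and `root_2` are our own and chosen only for this illustration. Both results should be zero, up to rounding error.

```{r}
# Coefficients of x^2 + x - 1 = 0, as defined earlier
a <- 1
b <- 1
c <- -1
# Store the two solutions (root_1 and root_2 are illustrative names)
root_1 <- (-b + sqrt(b^2 - 4*a*c)) / (2*a)
root_2 <- (-b - sqrt(b^2 - 4*a*c)) / (2*a)
# Plug each solution back into the left-hand side of the equation
a*root_1^2 + b*root_1 + c
a*root_2^2 + b*root_2 + c
```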
### 3.2.3 Functions
Once you define variables, the data analysis process can usually be described as a series of _functions_ applied to the data. R includes several predefined functions and most of the analysis pipelines we construct make extensive use of these.
We already used the **install.packages**, **library**, and **ls** functions. We also used the function **sqrt** to solve the quadratic equation above. There are many more prebuilt functions and even more can be added through packages. These functions do not appear in the workspace because you did not define them, but they are available for immediate use.
In general, we need to use parentheses to evaluate a function. If you type **ls**, the function is not evaluated and instead R shows you the code that defines the function. If you type **ls()** the function is evaluated and, as seen above, we see objects in the workspace.
Unlike **ls**, most functions require one or more _arguments_. Below is an example of how we assign an object to the argument of the function **log**. Remember that we earlier defined **a** to be 1:
```{r}
log(8)
log(a)
```
You can find out what the function expects and what it does by reviewing the very useful manuals included in R. You can get help by using the **help** function like this:
```{r}
help("log")
```
For most functions, we can also use this shorthand:
```{r}
?log
```
The help page will show you what arguments the function is expecting. For example, log needs **x** and **base** to run. However, some arguments are required and others are optional. You can determine which arguments are optional by noting in the help document that a default value is assigned with =. Defining these is optional. For example, the base of the function **log** defaults to **base = exp(1)** making **log** the natural log by default.
If you want a quick look at the arguments without opening the help system, you can type:
```{r}
args(log)
```
You can change the default values by simply assigning another object:
```{r}
log(8, base = 2)
```
Note that we have not been specifying the argument **x** as such:
```{r}
log(x = 8, base = 2)
```
The above code works, but we can save ourselves some typing: if no argument name is used, R assumes you are entering arguments in the order shown in the help file or by **args**. So by not using the names, it assumes the arguments are **x** followed by **base**:
```{r}
log(8,2)
```
If using the arguments’ names, then we can include them in whatever order we want:
```{r}
log(base = 2, x = 8)
```
To specify arguments, we must use =, and cannot use <-.
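The following sketch illustrates why. The names `wrong_base` and `wrong_x` are hypothetical and chosen only for this example. With <-, R does not report an error: it creates new objects in the workspace and passes their values by position, so the result can silently differ from what we intended.

```{r}
# With =, the names match the arguments of log, so the order does not matter
log(base = 2, x = 8)
# With <-, two new objects are created and their values are passed by position,
# so this computes log(2, base = 8) instead
log(wrong_base <- 2, wrong_x <- 8)
# The assignments also left objects behind in the workspace
wrong_base
wrong_x
```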
There are some exceptions to the rule that functions need the parentheses to be evaluated. Among these, the most commonly used are the arithmetic and relational operators. For example:
```{r}
2 ^ 3
```
You can see the arithmetic operators by typing:
```{r}
help("+")
```
or
```{r}
?"+"
```
and the relational operators by typing:
```{r}
help(">")
```
or
```{r}
?">"
```
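Operators are still functions behind the scenes. As a small illustration, wrapping an operator in back quotes lets us call it with parentheses like any other function:

```{r}
# These two lines compute the same thing
2 ^ 3
`^`(2, 3)
# The same works for relational operators
`>`(3, 2)
```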
### 3.2.4 Other prebuilt objects
There are several datasets that are included for users to practice and test out functions. You can see all the available datasets by typing:
```{r}
data()
```
This shows you the object name for these datasets. These datasets are objects that can be used by simply typing the name. For example, if you type:
```{r}
co2
```
R will show you Mauna Loa atmospheric CO2 concentration data.
Other prebuilt objects are mathematical quantities, such as the constant $\pi$ and $\infty$:
```{r}
pi
Inf+1
```
### 3.2.5 Variable names
We have used the letters **a, b** and **c** as variable names, but variable names can be almost anything. Some basic rules in R are that variable names have to start with a letter, can’t contain spaces and should not be variables that are predefined in R. For example, don’t name one of your variables **install.packages** by typing something like **install.packages <- 2**.
A nice convention to follow is to use meaningful words that describe what is stored, use only lower case, and use underscores as a substitute for spaces. For the quadratic equations, we could use something like this:
```{r}
solution_1 <- (-b + sqrt(b^2 - 4*a*c)) / (2*a)
solution_2 <- (-b - sqrt(b^2 - 4*a*c)) / (2*a)
```
For more advice, we highly recommend studying [Hadley Wickham’s style guide](http://adv-r.had.co.nz/Style.html).
### 3.2.6 Saving your workspace
Values remain in the workspace until you end your session or erase them with the function **rm**. But workspaces can also be saved for later use. In fact, when you quit R, the program asks you if you want to save your
workspace. If you do save it, the next time you start R, the program will restore the workspace.
We actually recommend against saving the workspace this way because, as you start working on different projects, it will become harder to keep track of what is saved. Instead, we recommend you assign the workspace a specific name. You can do this by using the function **save** or **save.image**. To load, use the function **load**. When saving a workspace, we recommend the suffix **rda** or **RData**. In RStudio, you can also do this by navigating to the _Session_ tab and choosing _Save Workspace as_. You can later load it using the _Load Workspace_ options in the same tab. You can read the help pages on **save, save.image** and **load** to learn more.
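As a minimal sketch of this workflow, the code below saves one object to a file, removes it from the workspace, and loads it back. The object name `saved_value` is made up for the example, and we write to a temporary file created with **tempfile** so nothing is left in your project folder. In practice you would pick a descriptive file name.

```{r}
saved_value <- 42                       # hypothetical object to save
rda_file <- tempfile(fileext = ".rda")  # temporary file path for the example
save(saved_value, file = rda_file)      # write the object to disk
rm(saved_value)                         # erase it from the workspace
load(rda_file)                          # restore it from the file
saved_value
```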
### 3.2.7 Motivating scripts
To solve another equation such as $3x^2+2x-1=0$, we can copy and paste the code above and then redefine the variables and recompute the solution:
```{r}
a <- 3
b <- 2
c <- -1
(-b + sqrt(b^2 - 4*a*c)) / (2*a)
(-b - sqrt(b^2 - 4*a*c)) / (2*a)
```
By creating and saving a script with the code above, we would not need to retype everything each time and, instead, simply change the values assigned to the variables. Try writing the script above into an editor and notice how easy it is to change the values and receive an answer.
### 3.2.8 Commenting your code
If a line of R code starts with the symbol #, it is not evaluated. We can use this to write reminders of why we wrote particular code. For example, in the script above we could add:
```{r}
## Code to compute solution to quadratic equation of the form ax^2 + bx + c
## define the variables
a <- 3
b <- 2
c <- -1
## now compute the solution
(-b + sqrt(b^2 - 4*a*c)) / (2*a)
(-b - sqrt(b^2 - 4*a*c)) / (2*a)
```
# 3.3 Exercises
1. What is the sum of the first 100 positive integers? The formula for the sum of integers 1 through $n$ is $n(n + 1)/2$. Define $n = 100$ and then use R to compute the sum of 1 through 100 using the formula.
What is the sum?
```{r}
n <- 100
n*(n+1)/2
```
2. Now use the same formula to compute the sum of integers from 1 through 1,000.
```{r}
n <- 1000
n*(n+1)/2
```
3. Look at the result of typing the following code into R:
```{r}
n <- 1000
x <- seq(1, n)
sum(x)
```
Based on the result, what do you think the functions **seq** and **sum** do? You can use the **help** system:
A. **sum** creates a list of numbers and **seq** adds them up.
**B**. **seq** creates a list of numbers and **sum** adds them up.
C. **seq** computes the difference between two arguments and **sum** computes the sum of 1 through 1000.
D. **sum** always returns the same number.
4. In math and programming, we say that we evaluate a function when we replace the argument with a given number. So if we type sqrt(4), we evaluate the sqrt function. In R, you can evaluate a function inside another function. The evaluations happen from the inside out. Use one line of code to compute the log, in base 10, of the square root of 100.
```{r}
log(sqrt(100),10)
```
5. Which of the following will always return the numeric value stored in **x**? You can try out examples and use the help system if you want.
A. log(10^x)
B. log10(x^10)
**C**. log(exp(x))
D. exp(log(x, base = 2))
```{r}
x <- 2
log(10^x)
log10(x^10)
log(exp(x))
exp(log(x, base = 2))
```
# 3.4 Data types
Variables in R can be of different types. For example, we need to distinguish numbers from character strings and tables from simple lists of numbers. The function **class** helps us determine what type of object we
have:
```{r}
a <- 2
class(a)
```
To work efficiently in R, it is important to learn the different types of variables and what we can do with these.
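As a brief illustration, here is what **class** returns for a few kinds of objects we have already met:

```{r}
class(2)        # "numeric"
class("two")    # "character"
class(TRUE)     # "logical"
class(sqrt)     # "function"
```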
### 3.4.1 Data frames
Up to now, the variables we have defined are just one number. This is not very useful for storing data. The most common way of storing a dataset in R is in a _data frame_. Conceptually, we can think of a data frame as a table with rows representing observations and the different variables reported for each observation defining the columns. Data frames are particularly useful for datasets because we can combine different data types into one object.
A large proportion of data analysis challenges start with data stored in a data frame. For example, we stored the data for our motivating example in a data frame. You can access this dataset by loading the **dslabs** library and loading the **murders** dataset using the **data** function:
```{r}
library(dslabs)
data(murders)
```
To see that this is in fact a data frame, we type:
```{r}
class(murders)
```
### 3.4.2 Examining an object
The function **str** is useful for finding out more about the structure of an object:
```{r}
str(murders)
```
This tells us much more about the object. We see that the table has 51 rows (50 states plus DC) and five variables. We can show the first six lines using the function **head**:
```{r}
head(murders)
```
In this dataset, each state is considered an observation and five variables are reported for each state.
Before we go any further in answering our original question about different states, let’s learn more about the components of this object.
### 3.4.3 The accessor: $
For our analysis, we will need to access the different variables represented by columns included in this data frame. To do this, we use the accessor operator $ in the following way:
```{r}
murders$population
```
But how did we know to use population? Previously, by applying the function **str** to the object murders, we revealed the names for each of the five variables stored in this table. We can quickly access the variable
names using:
```{r}
names(murders)
```
It is important to know that the order of the entries in **murders$population** preserves the order of the rows in our data table. This will later permit us to manipulate one variable based on the results of another. For example, we will be able to order the state names by the number of murders.
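To make these ideas concrete without relying on the **murders** data, here is a small made-up data frame built by hand with the base R function **data.frame**. The object `toy_df` and its values are invented for illustration. Each column keeps its own data type, and the entries of every column follow the same row order.

```{r}
# toy_df is a hypothetical three-row table, not part of dslabs
toy_df <- data.frame(name = c("x_state", "y_state", "z_state"),
                     population = c(100, 250, 75),
                     coastal = c(TRUE, FALSE, TRUE),
                     stringsAsFactors = FALSE)
class(toy_df)
# Each column keeps its own data type
class(toy_df$name)
class(toy_df$population)
class(toy_df$coastal)
# The entries appear in the same order as the rows of the table
toy_df$name
toy_df$population
```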
**Tip**: R comes with a very nice auto-complete functionality that saves us the trouble of typing out all the names. Try typing **murders$p** then hitting the _tab_ key on your keyboard. This functionality and many other useful auto-complete features are available when working in RStudio.
### 3.4.4 Vectors: numerics, characters, and logical
The object **murders$population** is not one number but several. We call these types of objects _vectors_. A single number is technically a vector of length 1, but in general we use the term vectors to refer to objects with several entries. The function **length** tells you how many entries are in the vector:
```{r}
pop <- murders$population
length(pop)
```
This particular vector is _numeric_ since population sizes are numbers:
```{r}
class(pop)
```
In a numeric vector, every entry must be a number.
To store character strings, vectors can also be class _character_. For example, the state names are characters:
```{r}
class(murders$state)
```
As with numeric vectors, all entries in a character vector need to be a character.
Another important type of vector is the _logical vector_. Its entries must be either **TRUE** or **FALSE**.
```{r}
z <- 3 == 2
z
class(z)
```
Here the == is a relational operator asking if 3 is equal to 2. In R, if you just use one =, you actually assign a variable, but if you use two == you test for equality.
You can see the other _relational operators_ by typing:
```{r}
?Comparison
```
In future sections, you will see how useful relational operators can be.
We discuss more important features of vectors after the next set of exercises.
**Advanced**: Mathematically, the values in **pop** are integers and there is an integer class in R. However, by default, numbers are assigned class numeric even when they are round integers. For example, **class(1)** returns numeric. You can turn them into class integer with the **as.integer()** function or by adding an L like this: **1L**. Note the class by typing: **class(1L)**
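A short sketch of the difference:

```{r}
class(1)              # "numeric"
class(1L)             # "integer"
class(as.integer(1))  # "integer"
# The two classes still compare as equal in value
1 == 1L
```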
### 3.4.5 Factors
In the **murders** dataset, we might expect the region to also be a character vector. However, it is not:
```{r}
class(murders$region)
```
It is a _factor_. Factors are useful for storing categorical data. We can see that there are only 4 regions by using the **levels** function:
```{r}
levels(murders$region)
```
In the background, R stores these _levels_ as integers and keeps a map to keep track of the labels. This is more memory efficient than storing all the characters.
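We can see this mapping on a small made-up factor. The object `toy_region` is invented for this sketch. **levels** shows the labels and **as.integer** shows the integer codes stored for each entry:

```{r}
toy_region <- factor(c("West", "South", "West", "Northeast"))
levels(toy_region)      # the labels
as.integer(toy_region)  # integer codes pointing into the levels
```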
Note that the levels have an order that is different from the order of appearance in the factor object. The default is for the levels to follow alphabetical order. However, often we want the levels to follow a different order. We will see several examples of this in the Data Visualization part of the book. The function **reorder** lets us change the order of the levels of a factor variable based on a summary computed on a numeric vector. We will demonstrate this with a simple example.
Suppose we want the levels of the region ordered by the total number of murders rather than in alphabetical order. If there are values associated with each level, we can use **reorder** and specify a data summary to determine the order. The following code takes the sum of the total murders in each region, and reorders the factor following these sums.
```{r}
region <- murders$region
value <- murders$total
region <- reorder(region, value, FUN = sum)
levels(region)
```
The new order is in agreement with the fact that the Northeast has the fewest murders and the South has the most.
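To see what **reorder** is doing, here is the same idea on a tiny made-up example. The objects `toy_group` and `toy_value` are hypothetical, and we use the base R function **tapply** only to display the per-level sums.

```{r}
toy_group <- factor(c("a", "b", "a", "c"))
toy_value <- c(5, 1, 5, 3)
# Sum of toy_value within each level: a = 10, b = 1, c = 3
tapply(toy_value, toy_group, sum)
# reorder sorts the levels by those sums, from smallest to largest
levels(reorder(toy_group, toy_value, FUN = sum))
```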
**Warning**: Factors can be a source of confusion since sometimes they behave like characters and sometimes they do not. As a result, confusing factors and characters is a common source of bugs.
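One typical instance of this kind of bug, sketched with a made-up factor called `digits_factor`: converting a factor whose labels look like numbers returns the internal integer codes rather than the numbers themselves.

```{r}
digits_factor <- factor(c("10", "20", "10"))
# Converting directly returns the integer codes, not the numbers
as.numeric(digits_factor)
# Converting to character first recovers the intended values
as.numeric(as.character(digits_factor))
```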
### 3.4.6 Lists
Data frames are a special case of _lists_. We will cover lists in more detail later, but know that they are useful because you can store any combination of different types. Below is an example of a list we created for you:
```{r}
record <- list("John Doe", 1234, c(95,82,91,97,93), "A")
names(record) <- c("name", "student_id", "grades", "final_grade")
record2 <- list("John Doe", 1234, c(95,82,91,97,93), "A")
```
```{r}
record
class(record)
```
As with data frames, you can extract the components of a list with the accessor $. In fact, data frames are a type of list.
```{r}
record$student_id
```
We can also use double square brackets ([[) like this:
```{r}
record[["student_id"]]
```
You should get used to the fact that in R, there are often several ways to do the same thing, such as accessing entries.
You might also encounter lists without variable names:
```{r}
record2
```
If a list does not have names, you cannot extract the elements with $, but you can still use the brackets method and instead of providing the variable name, you provide the list index, like this:
```{r}
record2[[1]]
```
We won’t be using lists until later, but you might encounter one in your own exploration of R. For this reason, we show you some basics here.
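One detail worth knowing when you do meet a list: single and double square brackets behave differently. In this small sketch, `toy_record` is a made-up list. Double brackets return the component itself, while single brackets return a shorter list that contains it.

```{r}
toy_record <- list(name = "Jane Roe", student_id = 5678)
# Double brackets return the component itself
class(toy_record[["student_id"]])
# Single brackets return a list of length one that contains the component
class(toy_record["student_id"])
length(toy_record["student_id"])
```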
### 3.4.7 Matrices
Matrices are another type of object that are common in R. Matrices are similar to data frames in that they are two-dimensional: they have rows and columns. However, like numeric, character and logical vectors, entries in matrices have to be all the same type. For this reason data frames are much more useful for
storing data, since we can have characters, factors and numbers in them.
Yet matrices have a major advantage over data frames: we can perform matrix algebra operations, a powerful type of mathematical technique. We do not describe these operations in this book, but much of what happens in the background when you perform a data analysis involves matrices. We cover matrices in more detail in Chapter 34.1 but describe them briefly here since some of the functions we will learn return matrices.
We can define a matrix using the **matrix** function. We need to specify the number of rows and columns.
```{r}
mat <- matrix(1:12, 4, 3)
mat
```
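Note that **matrix** fills the entries column by column by default. The sketch below compares this with the optional `byrow` argument of **matrix**. The object names `mat_by_col` and `mat_by_row` are our own.

```{r}
mat_by_col <- matrix(1:12, 4, 3)                # default: fill column by column
mat_by_row <- matrix(1:12, 4, 3, byrow = TRUE)  # fill row by row instead
# Compare the first row of each matrix
mat_by_col[1, ]
mat_by_row[1, ]
```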
You can access specific entries in a matrix using square brackets ([). If you want the second row, third column, you use:
```{r}
mat[2, 3]
```
If you want the entire second row, you leave the column spot empty:
```{r}
mat[2, ]
```
Notice that this returns a vector, not a matrix.
Similarly, if you want the entire third column, you leave the row spot empty:
```{r}
mat[, 3]
```
This is also a vector, not a matrix.
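If we would rather keep the result as a matrix, the square bracket operator accepts an optional `drop = FALSE` argument. A minimal sketch, using a made-up object `toy_mat` with the same entries as the matrix above:

```{r}
toy_mat <- matrix(1:12, 4, 3)
is.matrix(toy_mat[2, ])                # FALSE: simplified to a vector
is.matrix(toy_mat[2, , drop = FALSE])  # TRUE: still a matrix
dim(toy_mat[2, , drop = FALSE])        # one row, three columns
```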
You can access more than one column or more than one row if you like. This will give you a new matrix.
```{r}
mat[,2:3]
```
You can subset both rows and columns:
```{r}
mat[1:2, 2:3]
```
We can convert matrices into data frames using the function **as.data.frame**:
```{r}
as.data.frame(mat)
```
You can also use single square brackets ([) to access rows and columns of a data frame:
```{r}
data("murders")
murders[25, 1]
murders[2:3,]
```
# 3.5 Exercises
1. Load the US murders dataset.
```{r}
library(dslabs)
data(murders)
```
Use the function **str** to examine the structure of the **murders** object. We can see that this object is a data frame with 51 rows and five columns. Which of the following best describes the variables represented in this data frame?
A. The 51 states
B. The murder rates for all 50 states and DC.
**C**. The state name, the abbreviation of the state name, the state’s region, and the state’s population and total number of murders for 2010.
D. **str** shows no relevant information.
```{r}
str(murders)
```
2. What are the column names used by the data frame for these five variables?
```{r}
names(murders)
```
3. Use the accessor $ to extract the state abbreviations and assign them to the object **a**. What is the class of this object?
```{r}
a <- murders$abb
class(a)
```
4. Now use the square brackets to extract the state abbreviations and assign them to the object **b**. Use the **identical** function to determine if **a** and **b** are the same.
```{r}
b <- murders[["abb"]]
identical(a,b)
```
5. We saw that the **region** column stores a factor. You can corroborate this by typing:
```{r}
class(murders$region)
```
With one line of code, use the functions **levels** and **length** to determine the number of regions defined by this dataset.
```{r}
length(levels(murders$region))
```
6. The function **table** takes a vector and returns the frequency of each element. You can quickly see how many states are in each region by applying this function. Use this function in one line of code to create a table of states per region.
```{r}
table(murders$region)
```
# 3.6 Vectors
In R, the most basic objects available to store data are _vectors_. As we have seen, complex datasets can usually be broken down into components that are vectors. For example, in a data frame, each column is a vector. Here we learn more about this important class.
### 3.6.1 Creating vectors
We can create vectors using the function **c**, which stands for _concatenate_. We use **c** to concatenate entries in the following way:
```{r}
codes <- c(380, 124, 818)
codes
```
We can also create character vectors. We use the quotes to denote that the entries are characters rather than variable names.
```{r}
country <- c("italy","canada","egypt")
```
In R you can also use single quotes:
```{r}
country <- c('italy','canada','egypt')
```
But be careful not to confuse the single quote ' with the _back quote_ `` ` ``.
By now you should know that if you type:
```{r, error=TRUE}
country <- c(italy, canada, egypt)
```
you receive an error because the variables **italy**, **canada** and **egypt** are not defined. If we do not use the quotes, R looks for variables with those names and returns an error.
### 3.6.2 Names
Sometimes it is useful to name the entries of a vector. For example, when defining a vector of country codes, we can use the names to connect the two:
```{r}
codes <- c(italy = 380, canada = 124, egypt = 818)
codes
```
The object codes continues to be a numeric vector:
```{r}
class(codes)
```
but with names:
```{r}
names(codes)
```
If the use of strings without quotes looks confusing, know that you can use the quotes as well:
```{r}
codes <- c("italy" = 380, "canada" = 124, "egypt" = 818)
codes
```
There is no difference between this function call and the previous one. This is one of the many ways in which R is quirky compared to other languages.
We can also assign names using the **names** function:
```{r}
codes <- c(380, 124, 818)
country <- c("italy","canada","egypt")
names(codes) <- country
codes
```
### 3.6.3 Sequences
Another useful function for creating vectors generates sequences:
```{r}
seq(1, 10)
```
The first argument defines the start, and the second defines the end which is included. The default is to go up in increments of 1, but a third argument lets us tell it how much to jump by:
```{r}
seq(1, 10, 2)
```
If we want consecutive integers, we can use the following shorthand:
```{r}
1:10
```
When we use **seq** without an increment or the **:** shorthand, R produces integers, not numerics, because they are typically used to index something:
```{r}
class(1:10)
```
However, if we create a sequence including non-integers, the class changes:
```{r}
class(seq(1, 10, 0.5))
```
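The sketch below summarizes how the class depends on the call, and also shows the `length.out` argument of **seq**, which requests a fixed number of equally spaced values instead of a fixed increment:

```{r}
# Without an increment the result is integer
class(seq(1, 10))
# Supplying a numeric increment, even a whole number, gives numeric
class(seq(1, 10, 2))
# length.out asks for a fixed number of equally spaced values
seq(0, 100, length.out = 5)
```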
### 3.6.4 Subsetting
We use square brackets to access specific elements of a vector. For the vector **codes** we defined above, we can access the second element with a single index, or get more than one entry by using a multi-entry vector as an index:
```{r}
codes[2]
codes[c(1,3)]
```
The sequences defined above are particularly useful if we want to access, say, the first two elements:
```{r}
codes[1:2]
```
If the elements have names, we can also access the entries using these names. Below are two examples.
```{r}
codes["canada"]
codes[c("egypt","italy")]
```