# (PART) Part I: Preliminaries {-}
```{r echo = FALSE}
library(fortunes)
```
# R Preliminaries
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week1.pdf) a pdf of the lecture slides covering this topic.
## Objectives
After this chapter, you should:
- Know what free and open source software is and some of its advantages over proprietary software
- Understand the difference between R and RStudio
- Be able to download both R and RStudio to your own computer
- Understand that R has a basic core of code that you initially download, and
that this "base R" can be expanded by installing a variety of packages
- Be able to install a package from CRAN to your computer
- Be able to load a package that you have installed to use its functions within an R session
- Be able to access help documentation (vignettes, helpfiles) for a package and its functions
- Be able to submit R expressions at the console prompt to communicate with R
- Understand the structure for calling a function and specifying options for that function
- Know what an R object is and how to assign an R object a name to reference it in later code
- Be able to create vector objects of numeric and character classes
- Be able to explore and extract elements from vector objects
- Be able to create dataframe objects
- Be able to explore and extract elements from dataframe objects
- Be able to describe the difference between running R code from the console
versus writing and running R code in an R script
## R and RStudio
### What is R?
R is an open-source programming language that evolved from the S language. The S
language was developed at Bell Labs in the 1970s, which is the same place (and
about the same time) that the C programming language was developed.
R itself was developed in the 1990s--2000s at the University of Auckland. It is
open-source software, freely and openly distributed under the GNU General Public
License (GPL). The base version of R that you download when you install R on
your computer includes the critical code for running R, but you can also install
and run "packages" that people all over the world have developed to extend R.
With new developments, R is becoming more and more useful for a variety of
programming tasks. However, where it really shines is in working with data and
doing statistical analysis. R is currently popular in a number of fields,
including:
- Statistics
- Machine learning
- Data analysis
R is an **interpreted language**. That means that you can communicate with it
interactively, from a command line. Other common interpreted languages include
Python and Perl.
```{r interpreted-language, echo = FALSE, out.width = "600pt", fig.align = "center", fig.cap = "Broad types of software programs. R is an interpreted language. 'Point-and-click' programs, like Excel and Word, are often easiest for a new user to get started with, but are slower for the computer and are restricted in the functionality they offer. By contrast, compiled languages (like C and Java), assembly languages, and machine code are faster for the computer and allow you to create a wider range of things, but can take longer to code and take longer for a new user to learn to work with."}
knitr::include_graphics("figures/program_types2.jpg")
```
R has some of the same strengths (quick and easy to code, interfaces well with
other languages, easy to work interactively) and weaknesses (slower than
compiled languages) as Python. For data-related tasks, R and Python are fairly
neck-and-neck (with Julia an up-and-coming option). However, R is still the
first choice of statisticians in most fields, so I would argue that R has an
advantage if you want to have access to cutting-edge statistical methods.
> "The best thing about R is that it was developed by statisticians. The worst thing about R is that... it was developed by statisticians."
> -Bo Cowgill, Google, at the Bay Area R Users Group
### Free and open-source software
> "Life is too short to run proprietary software." -- Bdale Garbee
R is **free and open-source software**. Many other popular statistical
programming languages, conversely, are proprietary (for example, SAS and SPSS).
It's useful to know what it means for software to be "open-source", both
conceptually and in terms of how you will be able to use and add to R in your
own work.
R is free, and it's tempting to think of open-source software just as "free
software". Things, however, are a little more subtle than that. It helps to
consider some different meanings of the word "free". "Free" can mean:
- *Gratis*: Free as in beer
- *Libre*: Free as in speech
```{r open-source-overview, echo = FALSE, out.width = "500pt", fig.align = "center", fig.cap = "An overview of how software can be each type of free (beer and speech). For software programs developed using a compiled programming language, the final product that you open on your computer is run by machine-readable binary code. A developer can give you this code for free (as in beer) without sharing any of the original source code with you. This means you can't dig in to figure out how the software works and how you can extend it. By contrast, open-source software (free as in speech) is software for which you have access to the human-readable code that was used as an input in creating the software binaries. With open-source code, you can figure out exactly how the program is coded."}
knitr::include_graphics("figures/OpenSourceOverview.png")
```
Open-source software is the *libre* type of free (Figure
\@ref(fig:open-source-overview)). This means that, with software that is
open-source, you can:
- Access all of the code that makes up the software
- Change the code as you'd like for your own applications
- Build on the code with your own extensions
- Share the software and its code, as well as your extensions, with others
Often, open-source software is also free of charge (*gratis*), making it "free
and open-source software", or "FOSS".
Popular open source licenses for R and R packages include the GPL and MIT licenses.
> “Making Linux GPL'd was definitely the best thing I ever did.” -- Linus Torvalds
In practice, this means that, once you are familiar with the software, you can
dig deeply into the code to figure out exactly how it's performing certain
tasks. This can be useful for finding and eliminating bugs, and also can
help researchers figure out if there are any limitations in how the code works
for their specific research.
It also means that you can build your own software on top of existing R software
and its extensions. I explain more about R packages a bit later, but this
open-source nature of R (and other languages, including Python) has created a
large community of people worldwide who develop and share extensions to R. As a
result, you can pull in packages that let you do all kinds of things in R, like
visualizing Tweets, cleaning up accelerometer data, analyzing complex surveys,
fitting machine learning models, and a wealth of other cool things.
> "Despite its name, open-source software is less vulnerable to hacking than the secret, black box systems like those being used in polling places now. That’s because anyone can see how open-source systems operate. Bugs can be spotted and remedied, deterring those who would attempt attacks. This makes them much more secure than closed-source models like Microsoft’s, which only Microsoft employees can get into to fix." -- [Woolsey and Fox. *To Protect Voting, Use Open-Source Software.* New York Times. August 3, 2017.](https://www.nytimes.com/2017/08/03/opinion/open-source-software-hacker-voting.html?mcubz=3)
You can download the latest version of R from
[CRAN](https://cran.r-project.org). Be sure to select the distribution for your
type of computer system. R is updated occasionally; you should plan to
re-install R at least once a year, to make sure you're working with one of the
newer versions. Check your current R version (one way is by running
`sessionInfo()` at the R console) to make sure you're not using an outdated
version of R. Defaults should be fine for everything.
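
As a quick sketch of that version check, the lines below can be run at the console. Both use base R. The `"4.0.0"` threshold is only an illustrative assumption, not a requirement stated anywhere in this chapter; substitute whichever minimum version you care about.

```{r}
# One-line description of the R version that is currently running
R.version.string

# TRUE if the running version is at least the (illustrative) threshold
getRversion() >= "4.0.0"
```
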
> "The R engine ... is pretty well uniformly excellent code but you
have to take my word for that. Actually, you don't. The whole engine is open source so, if you wish, you can check every line of it. If people were out to push dodgy software, this is not the way they'd go about it."
> -- Bill Venables, R-help (January 2004)
> “Talk is cheap. Show me the code.” - Linus Torvalds
### What is RStudio?
To get the R software, you'll [download R](https://www.r-project.org) from the R
Project for Statistical Computing. This is enough for you to use R on your own
computer. However, I would suggest one additional, free piece of software to
improve your experience while working with R: RStudio.
RStudio is an integrated development environment (IDE) for R. This basically
means that it provides you an interface for running R and coding in R, with a
lot of nice extras that will make your life easier.
You download RStudio separately from R---you'll want to download and install R
itself first, and then you can [download
RStudio](https://www.rstudio.com/products/rstudio/download2/). You want the
Desktop version with the free license. Defaults should be fine for everything.
RStudio (the company) is a leader in the R community. Currently, the company:
- Develops and freely provides the RStudio IDE
- Provides excellent resources for learning and using R (e.g., cheatsheets, free
online books)
- Is producing some of the most-used R packages
- Employs some of the top people in R development
- Is a key member of The R Consortium (others include Microsoft, IBM, and Google)
R has been advancing by leaps and bounds in terms of what it can do and the
elegance with which it does it, in large part because of the enormous
contributions of people involved with RStudio.
## Communicating with R
Because R is an interpreted language, you can communicate with it interactively. You do
this using the following general steps:
1. Open an **R session**
2. At the **prompt** in the **console**, enter an **R expression**
3. Read R's "response" (the **output**)
4. Repeat 2 and 3
5. Close the R session
### R sessions, the console, and the command prompt
An **R session** is an instance of you using R. To open an R session,
double-click on the icon for "RStudio" on your computer. When RStudio opens, you
will be in a "fresh" R session, unless you restore a saved session (which I
strongly recommend against). This means that, once you open RStudio, you will
need to "set up" your session, including loading any packages you need (which
we'll talk about later) and reading in any data (which we'll also talk about).
In RStudio, the screen is divided into several "panes". We'll start with the
pane called "Console". The **console** is where you "talk" to R, by typing an
**expression** at the **prompt** (the greater-than symbol, ">"). You press the
"Return" key to send this message to R.
```{r r-console, echo = FALSE, out.width = "500pt", fig.align = "center", fig.cap = "Finding the 'Console' pane and the command prompt in RStudio."}
knitr::include_graphics("figures/r_console.jpg")
```
Once you press "Return", R will respond in one of three ways:
1. R does whatever you asked it to do with the expression and prints the output
(if any) of doing that, as well as a new prompt so you can ask it something new
2. R doesn't think you've finished asking it something, and instead of giving you
a new prompt (">") it gives you a "+". This means that R is still listening, waiting
for you to finish asking it something.
3. R tries to do what you asked it to, but it can't. It gives you an **error message**,
as well as a new prompt so you can try again or ask it something new.
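
The three responses can be sketched with a few lines of arithmetic and the base R `log` function (chosen here only for illustration). The error case is left as a comment so that the rest of the block still runs.

```{r}
# 1. A complete expression: R evaluates it, prints the output, and gives a new prompt
1 + 2

# 2. An incomplete expression: after `1 +`, the console shows "+" and waits
#    until the rest of the expression arrives on the next line
1 +
  2

# 3. An expression R cannot evaluate produces an error message
#    (uncomment the next line to see it)
# log("a")
```

If you end up at a "+" by accident, pressing the Esc key in the RStudio console cancels the unfinished expression and returns you to the regular prompt.
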
### R expressions, function calls, and objects
To "talk" with R, you need to know how to give it a complete **expression**.
Most expressions you'll want to give R will be some combination of two elements:
1. **Function calls**
2. **Object assignments**
We'll go through both these pieces and also look at how you can combine them
together for some expressions.
According to John Chambers, one of the creators of R's precursor S:
1. Everything that exists in R is an **object**
2. Everything that happens in R is a **call to a function**
In general, function calls in R take the following structure:
```{r eval = FALSE}
## Generic code (this won't run)
function_name(formal_argument_1 = named_argument_1,
              formal_argument_2 = named_argument_2,
              [etc.])
```
```{block, type = "rmdwarning"}
Sometimes, we'll show "generic" code in a code block that doesn't actually work if you run it in R, but instead shows the generic structure of an R call. We'll try to always include a comment with any generic code, so you'll know not to try to run it in R.
```
A function call forms a complete R expression, and the output will
be the result of running `print` or `show` on the object that is output
by the function call. Here is an example of this structure:
```{r}
print(x = "Hello world")
```
Figure \@ref(fig:function-call) shows an example of the typical elements of a
function call. In this example, we're **calling** a function with the **name**
`print`. It has one **argument**, with a **formal argument** of `x`, for which
this call provides the **named argument** "Hello world".
```{r function-call, echo = FALSE, out.width = "500pt", fig.align = "center", fig.cap = "Main parts of a function call. This example is calling a function with the name 'print'. The function call has one argument, with a formal argument of 'x', which in this call is provided the named argument 'Hello world'."}
knitr::include_graphics("figures/function_call.jpg")
```
The **arguments** are how you customize the call to an R function. For example,
you can change the named argument value to print different messages with the
`print` function:
```{r}
print(x = "Hello world")
print(x = "Hi Fort Collins")
```
Some functions do not require any arguments. For example, the `getRversion` function will
print out the version of R you are using.
```{r}
getRversion()
```
Some functions will accept multiple arguments. For example, the `print` function allows you
to specify whether the output should include quotation marks, using the `quote`
formal argument:
```{r}
print(x = "Hello world", quote = TRUE)
print(x = "Hello world", quote = FALSE)
```
Arguments can be **required** or **optional**.
For a required argument, if you don't provide a value for the argument when you
call the function, R will respond with an error. For example, `x` is a **required argument**
for the `print` function, so if you try to call the function without it, you'll get an
error:
```{r eval = FALSE}
print()
```
```
Error in print.default() : argument "x" is
missing, with no default
```
For an **optional argument** on the other hand, R knows a **default value** for that
argument, so if you don't give it a value for that argument, it will just use the
default value for that argument.
For example, for the `print` function, the `quote` argument has the default value
`TRUE`. So if you don't specify a value for that argument, R will assume it should
use `quote = TRUE`. That's why the following two calls give the same result:
```{r}
print(x = "Hello world", quote = TRUE)
print(x = "Hello world")
```
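
To see default values without opening any documentation, one option is the base R `args` function, which prints a function's arguments along with any defaults. The sketch below uses `paste`, whose `sep` argument has the default value `" "`; the choice of `paste` is just for illustration.

```{r}
# Show the arguments of `paste`; those written as `name = value` have defaults
args(paste)

# Because `sep` defaults to a single space, these two calls give the same result
paste("Hello", "world")
paste("Hello", "world", sep = " ")

# Overriding the default changes the output
paste("Hello", "world", sep = "-")
```
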
Often, you'll want to find out more about a function, including:
- Examples of how to use the function
- Which arguments you can include for the function
- Which arguments are required versus optional
- What the default values are for optional arguments.
You can find out all this information in the function's **helpfile**, which
you can access using the function `?`. For example, the `mean` function will let you calculate the mean (average) of a
group of numbers. To find out more about this function, at the console type:
```{r eval = FALSE}
?mean
```
This will open a helpfile in the "Help" pane in RStudio. Figure
\@ref(fig:helpfile) shows some of the key elements of an example helpfile, the
helpfile for the `mean` function. In particular, the "Usage" section helps you
figure out which arguments are **required** and which are **optional**:
arguments shown there with a default value (written as `argument = value`) are
optional.
```{r helpfile, echo = FALSE, out.width = "500pt", fig.align = "center", fig.cap = "Navigating a helpfile. This example shows some key parts of the helpfile for the 'mean' function."}
knitr::include_graphics("figures/helpfile_arguments.jpg")
```
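
A few related ways to explore a function from the console, sketched here with `mean`. These all use base R functions; the chunk is marked `eval = FALSE` only because some of the calls open documentation rather than return a value.

```{r eval = FALSE}
help("mean")      # opens the same helpfile as ?mean
example("mean")   # runs the code from the "Examples" section of the helpfile
args(mean)        # prints the arguments of the function at the console
```
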
There's one class of functions that looks a bit different from others. These are
the infix **operator** functions. Instead of using parentheses after the function
name, they usually go *between* two arguments. One common example is the `+`
operator:
```{r}
2 + 3
```
There are operators for several mathematical functions: `+`, `-`, `*`, `/`.
There are also other operators, including **logical operators** and **assignment
operators**, which we'll cover later.
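
Since operators are functions, an operator can also be called with the usual parentheses if its name is wrapped in backticks. You would rarely write code this way; the sketch is only meant to show that `2 + 3` is a function call like any other. It also shows that the usual order of operations applies, and that parentheses can override it.

```{r}
# Infix form and the equivalent "regular" function call
2 + 3
`+`(2, 3)

# Multiplication is evaluated before addition unless parentheses say otherwise
2 + 3 * 4
(2 + 3) * 4
```
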
In R, a variety of different types and structures of data can be saved in what
are called **objects**. For right now, you can just think of an R object as a discrete
container of data in R.
Function calls will produce an object. If you just
call a function, as we've been doing, then R will respond by printing out that
object. However, we'll often want to use that object some more. For example, we
might want to use it as an argument later in our "conversation" with R, when we
call another function later. If you want to re-use the results of a function
call later, you can **assign** that **object** to an **object name**. This kind
of expression is called an **assignment expression**.
Once you do this, you can use that *object name* to refer to the object. This
means that you don't need to re-create the object each time you need
it---instead you can create it once and then just reference it by name each time
you need it after that. For example, you can read in data from an external file
as a dataframe object and assign it an object name. Then, when you need that
data later, you won't need to read it in again from the external file.
The **gets arrow**, `<-`, is R's assignment operator. It takes whatever you've
created on the right hand side of the `<-` and saves it as an object with the
name you put on the left hand side of the `<-` :
```{r eval = FALSE}
## Note: Generic code-- this will not work
[object name] <- [object]
```
For example, if I just type `"Hello world"`, R will print it back to me, but
won't save it anywhere for me to use later:
```{r}
"Hello world"
```
However, if I assign it to an object, I can "refer" to that object in a later expression.
For example, the code below assigns the **object** `"Hello world"` the **object name** `message`.
Later, I can just refer to this object using the name `message`, for example in a function
call to the `print` function:
```{r}
message <- "Hello world"
print(x = message)
```
When you enter an **assignment expression** like this at the R console, if everything
goes right, then R will "respond" by giving you a new prompt, without any kind of
message.
However, there are three ways you can check to make sure that the object was
assigned to the object name:
1. Enter the object's name at the prompt and press return. The default if you do this
is for R to "respond" by calling the `print` function with that object as the `x`
argument.
2. Call the `ls` function (which doesn't require any arguments). This will list all the
object names that have been assigned in the current R session.
3. Look in the "Environment" pane in RStudio. This also lists all the object names that
have been assigned in the current R session.
Here are examples of these strategies:
1. Enter the object's name at the prompt and press return:
```{r}
message
```
2. Call the `ls` function:
```{r}
ls()
```
3. Look in the "Environment" pane in RStudio (see Figure \@ref(fig:environment)).
```{r environment, echo = FALSE, out.width = "500pt", fig.align = "center", fig.cap = "'Environment' pane in RStudio. This shows the names and first few values of all objects that have been assigned to object names in the global environment."}
knitr::include_graphics("figures/environment_pane.jpg")
```
You can make assignments in R using either the gets arrow (`<-`) or `=`. When
you read other people's code, you'll see both. R gurus advise using `<-` rather
than `=` when coding in R, and as you move to doing more complex things, some
subtle problems might crop up if you use `=`. I have heard from someone in the
know that you can tell the age of a programmer by whether he or she uses the
gets arrow or `=`, with `=` more common among the young and hip. For this
course, however, I am asking you to code according to [Hadley Wickham's R style
guide](http://adv-r.had.co.nz/Style.html), which specifies using the gets arrow
for assignment.
While you will be coding with the gets arrow exclusively in this course, it will
be helpful for you to know that the two assignment operators do pretty much the
same thing:
```{r}
one_to_ten <- 1:10
one_to_ten
one_to_ten = 1:10
one_to_ten
```
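
One place where `<-` and `=` do differ is inside a function call. There, `=` matches a value to one of the function's arguments, while `<-` still performs an assignment. The sketch below uses `mean` and a hypothetical object name, `x_vals`, purely for illustration.

```{r}
# `=` inside a call names the argument `x`; no new object is created
mean(x = 1:10)

# `<-` inside a call creates the object `x_vals` and then passes it to `mean`
mean(x_vals <- 1:10)
x_vals
```
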
While the gets arrow takes two keystrokes, compared with one for the equals
sign, you can somewhat get around this limitation by using RStudio's keyboard
shortcut for the gets arrow. This shortcut is Alt + - on Windows and Option + -
on Macs. To see a full list of RStudio keyboard shortcuts, go to the "Help" menu
in RStudio and select "Keyboard Shortcuts".
There are some absolute **rules** for the names you can use for an object name:
- Use only letters, numbers, underscores, and periods
- Start with a letter (or with a period that is not followed by a number)
If you try to assign an object to a name that doesn't follow the "hard" rules,
you'll get an error. For example, all of these expressions will give you an
error:
```{r eval = FALSE}
1message <- "Hello world"
_message <- "Hello world"
message! <- "Hello world"
```
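
If you are unsure whether a name is allowed, one way to check (a sketch, not the only approach) is the base R `make.names` function, which returns its input unchanged when the input is already a valid name and otherwise returns a repaired version.

```{r}
# A valid name comes back unchanged
make.names("message_one")

# Invalid names come back altered, which signals that the original was not allowed
make.names("1message")
make.names("message!")
```

One caveat: `make.names` also alters words that R reserves for the language itself (such as `if`), since those cannot be used as object names either.
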
In addition to these fixed rules, there are also some guidelines for naming
objects that you should adopt now, since they will make your life easier as you
advance to writing more complex code in R. The following three guidelines for
naming objects are from [Hadley Wickham's R style
guide](http://adv-r.had.co.nz/Style.html):
- Use lower case for variable names (`message`, not `Message`)
- Use an underscore as a separator (`message_one`, not `messageOne`)
- Avoid using names that are already defined in R (e.g., don't name an object
`mean`, because a `mean` function exists)
> "Don't call your matrix 'matrix'. Would you call your dog 'dog'? Anyway, it
might clash with the function 'matrix'." - Barry Rowlingson, R-help (October 2004)
Another good practice is to name objects after nouns (e.g., `message`) and
later, when you start writing functions, name those after verbs (e.g.,
`print_message`). You'll want your object names to be short enough that they
don't take forever to type as you're coding, but not so short that you can't
remember what they stand for.
```{block, type = "rmdtip"}
Sometimes, you'll want to create an object that you won't want to keep for very
long. For example, you might want to create a small object to test some code,
but you plan to not need the object again once you've done that. You may want to
come up with some short, generic object names that you use for these kinds of
objects, so that you'll know that you can delete them without problems when you
want to clean up your R session.
There are all kinds of traditions for these placeholder variable names in
computer science. `foo` and `bar` are two popular choices, as are, evidently,
`xyzzy`, `spam`, `ham`, and `norf`. There are different placeholder names in
different languages: for example, `toto`, `truc`, and `azerty` (French); and
`pippo`, `pluto`, `paperino` (Disney character names; Italian). See the
Wikipedia page on [metasyntactic
variables](https://en.wikipedia.org/wiki/Metasyntactic_variable) to find out
more.
```
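
As a sketch of this workflow, the code below creates a throwaway object named `foo`, checks that it exists, and then deletes it with the base R `rm` function. The name and the values are placeholders.

```{r}
foo <- 1:3      # a temporary object for testing
exists("foo")   # check whether an object with this name exists
rm(foo)         # delete the object once it is no longer needed
exists("foo")   # check again after the object has been removed
```
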
What if you want to "compose" a call from more than one function call? One way
to do it is to assign the output from the first function call to a name and then
use that name for the next call. For example:
```{r}
message <- paste("Hello", "world")
print(x = message)
```
If you give two objects the same name, the most recent definition will be used (i.e., objects can be overwritten by assigning new content to the same object name). For example:
```{r}
a <- 1:10
b <- LETTERS[1:3]
a
b
a <- b
a
```
Another way to compose calls is to "nest" one function call inside another
function call. For example:
```{r}
print(x = paste("Hello", "world"))
```
Just like with math, the order that the functions are evaluated moves from the
inner set of parentheses to the outer one (Figure
\@ref(fig:composing-functions)). There's one more way we'll look at later called
"piping".
```{r composing-functions, echo = FALSE, out.width = "500pt", fig.align = "center", fig.cap = "Composing function calls by nesting one call inside another. R evaluates the innermost call first and passes its output to the outer call."}
knitr::include_graphics("figures/composing_function_calls.jpg")
```
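
Assigning intermediate results to names and nesting calls give the same result. A short sketch with three base R functions (`paste`, `toupper`, and `nchar`, chosen only for illustration) shows the step-by-step version next to the nested version:

```{r}
# Step by step, assigning each intermediate result to a name
greeting <- paste("Hello", "world")
loud_greeting <- toupper(greeting)
nchar(loud_greeting)

# Nested: `paste` is evaluated first, then `toupper`, then `nchar`
nchar(toupper(paste("Hello", "world")))
```

The nested version is more compact, but each extra layer of parentheses makes it harder to read, which is one reason to keep intermediate names when a chain gets long.
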
## R scripts
This is a good point in learning R for you to start putting your code in R
scripts, rather than entering commands at the console.
An R script is a plain text file where you can save a series of R commands. You
can save the script and open it up later to see (or re-do) what you did earlier,
just like you could with something like a Word document when you're writing a
paper.
To open a new R script in RStudio, go to the menu bar and select "File" -> "New
File" -> "R Script". Alternatively, you can use the keyboard shortcut
Command-Shift-N (Ctrl-Shift-N on Windows). Figure \@ref(fig:rscript) gives an example of an R script file
opened in RStudio and points out some interesting elements.
```{r rscript, echo = FALSE, fig.align="center", fig.cap = "Example of an R script in RStudio.", out.width= "600pt"}
knitr::include_graphics("figures/ExampleOfRScript.jpg")
```
To save a script you're working on, you can click on the "Save" button (which
looks like a floppy disk) at the top of your R script window in RStudio or use
the keyboard shortcut Command-S (Ctrl-S on Windows). You should save R scripts using a ".R" file
extension.
Within the R script, you'll usually want to type your code so there's one
command per line. If your command runs long, you can write a single call over
multiple lines. It's unusual to put more than one command on a single line of a
script file, but you can if you separate the commands with semicolons (`;`).
These rules all correspond to how you can enter commands at the console.
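
These layout options can be sketched as follows. The object names are made up for this example.

```{r}
# One command per line (the usual style)
short_greeting <- "Hello"

# A single call written over multiple lines
long_greeting <- paste("Hello",
                       "world")

# Two commands on one line, separated by a semicolon (allowed, but unusual)
first_num <- 1; second_num <- 2
```
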
Running R code from a script file is very easy in RStudio. You can use either
the "Run" button or Command-Return (Ctrl-Enter on Windows), and any code that is selected (i.e., that
you've highlighted with your cursor) will run at the console. You can use this
functionality to run a single line of code, multiple lines of code, or even just
part of a specific line of code. If no code is highlighted, then R will instead
run all the code on the line with the cursor and then move the cursor down to
the next line in the script.
You can also run all of the code in a script. To do this, use the "Source"
button at the top of the script window. You can also run the entire script
either from the console or from within another script by using the `source()`
function, with the filename of the script you want to run as the argument. For
example, to run all of the code in a file named "MyFile.R" that is saved in your
current working directory, run:
```{r, eval = FALSE}
source("MyFile.R")
```
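
For a version of this that runs without any file of your own, the sketch below writes a two-line script to a temporary file, runs it with `source()`, and then uses an object the script created. The object name `script_greeting` and the script's contents are made up for this example.

```{r}
# Write a small script to a temporary file with a ".R" extension
script_path <- tempfile(fileext = ".R")
writeLines(
  c("# A tiny example script",
    "script_greeting <- paste('Hello', 'from a script')"),
  con = script_path
)

# Run every line of the script, then use the object it created
source(script_path)
script_greeting
```
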
You can add comments into an R script to let others know (and remind yourself)
what you're doing and why. To do this, use R's comment character, `#`. Any line
in a script that starts with `#` will not be run by R. You can also take
advantage of commenting to comment out certain parts of code that you don't want
to run at the moment.
While it's generally best to write your R code in a script and run it from there
rather than entering it interactively at the R console, there are some
exceptions. A main example is when you're initially checking out a dataset, to
make sure you've read it in correctly. It often makes more sense to run commands
for this task, like `str()`, `head()`, `tail()`, and `summary()`, at the
console. These are all examples of commands where you're trying to look at
something about your data **right now**, rather than code that builds toward
your analysis, or helps you read in or clean up your data.
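As a quick sketch of what this kind of checking looks like, the calls below use `mtcars`, a small example dataset that ships with R, as a stand-in for data you've just read in:

```{r}
# `mtcars` is a built-in example dataset, used here in place of your own data
str(mtcars)       # Structure: column names, types, and first few values
head(mtcars)      # First six rows
tail(mtcars)      # Last six rows
summary(mtcars)   # Summary statistics for each column
```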
### Commenting code
Sometimes, you'll want to include notes in your code. Most programming
languages let you do this by using a *comment character* to start the line with your
comment. In R, the comment character is the hash symbol, `#`. R will skip any
line that starts with `#` in a script. For example, if you run the following
code:
```{r}
# Don't print this.
"But print this"
```
R will only print the second, uncommented line.
You can also use a comment in the middle of a line, to add a note on what you're
doing in that line of the code. R will skip any part of the code from the hash
symbol on. For example:
```{r}
"Print this" ## But not this, it's a comment.
```
There's typically no reason to use code comments when running commands at the R
console. However, it's very important to get in the practice of including
meaningful comments in R scripts. This helps you remember what you did when you
revisit your code later.
> “You know you're brilliant, but maybe you'd like to understand what you did 2 weeks from now.” -- Linus Torvalds
## The "package" system
### R packages
> "Any doubts about R's big-league status should be put to rest, now that we have a Sudoku Puzzle Solver. Take that, SAS!" -- David Brahm (announcing the sudoku package), R-packages (January 2006)
Your original download of R is only a starting point. You can expand
functionality of R with what are called *packages*, or extensions with new code
and functionality that add to the basic "base R" environment. To me, this is a
bit like the toy train set that my son was obsessed with for a while. You first
buy a very basic set that looks something like Figure
\@ref(fig:toy-train-basic).
```{r toy-train-basic, echo = FALSE, out.width = "400pt", fig.align = "center", fig.cap = "The toy version of base R."}
knitr::include_graphics("figures/TrainBasic.JPG")
```
To take full advantage of R, you'll want to add on packages. In the case of the
train set, at this point, a doting grandparent adds on extensively through
birthday presents, so you end up with something that looks like Figure
\@ref(fig:toy-train-fancy).
```{r toy-train-fancy, echo = FALSE, out.width = "400pt", fig.align = "center", fig.cap = "The toy version of what your R set-up will look like once you find cool packages to use for your research."}
knitr::include_graphics("figures/TrainComplex.JPG")
```
Each package is basically a bundle of extra R functions. They may also include
help documentation, datasets, and some other objects, but typically the heart of
an R package is the new functions it provides.
You can get these "add-on" packages in a number of ways. The main source for
installing packages for R remains the Comprehensive R Archive Network, or
[CRAN](https://cran.r-project.org). However, [GitHub](https://github.com) is
growing in popularity, especially for packages that are still in development.
You can also create and share packages among your collaborators or co-workers,
without ever posting them publicly. In the "Advanced" section of this course,
you will learn some about writing your own R package.
### Installing from CRAN
```{r cran10000, echo = FALSE, out.width = "600pt", fig.align = "center", fig.cap = "Celebrating CRAN's 10,000th package."}
knitr::include_graphics("figures/CRAN_package_10000.png")
```
The most popular place from which to get packages is currently CRAN, which has
over 10,000 R packages available (Figure \@ref(fig:cran10000)). You can install
packages from CRAN using R code, with the `install.packages` function. For
example, telephone keypads include letters for each number (Figure
\@ref(fig:phone-keypad)), which allow companies to have "named" phone numbers
that are easier for people to remember, like 1-800-GO-FEDEX and 1-800-FLOWERS.
```{r phone-keypad, echo = FALSE, out.width = "150pt", fig.align = "center", fig.cap="Telephone keypad with letters corresponding to each number."}
knitr::include_graphics("figures/telephone_keypad.png")
```
The `phonenumber` package is a cool little package that will convert between
numbers and letters based on the telephone keypad. Since this package is on
CRAN, you can install the package to your computer using the `install.packages`
function:
```{r, eval = FALSE, message = FALSE, warning = FALSE, results = FALSE}
install.packages(pkgs = "phonenumber")
```
This downloads the package from CRAN and saves it in a special location on your
computer where R can load it when you're ready to use it. Once you've installed
a package to your computer this way, you don't need to re-run
`install.packages` for that package again (unless the package maintainer
posts an updated version).
Just like R itself, packages often evolve and are updated by their maintainers.
You should update your packages as new versions come out. Typically, you have to
reinstall packages when you update your version of R, so this is a good chance
to get the most up-to-date version of the packages you use.
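If you're not sure whether a package is already on your computer, or which version you have, you can check from R. The sketch below is one possible approach: `is_installed` is a small helper written just for this example (it is not a base R function), and it uses "stats", which ships with R, so the code runs without an internet connection.

```{r}
# Hypothetical helper (not part of base R): is a package installed?
is_installed <- function(pkg) {
  pkg %in% rownames(installed.packages())
}

is_installed("stats")     # "stats" ships with R, so this is TRUE

# Check which version of an installed package you have
packageVersion("stats")
```

When you are connected to the internet, the base R function `update.packages()` will check CRAN for newer versions of the packages you have installed.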
### Loading an installed package
Once you have installed a package, it will be saved to your computer. However,
you won't be able to access its functions within an R session until you *load*
it in that R session. Loading a package essentially makes all of the package's
functions available to you.
You can load a package in an R session using the
`library` function, with the package name inside the parentheses.
```{r message = FALSE, warning = FALSE, results = FALSE}
library(package = "phonenumber")
```
Figure \@ref(fig:install-vs-load) provides a conceptual
picture of the different steps of installing and loading a package.
```{r install-vs-load, echo = FALSE, out.width = "400pt", fig.align = "center", fig.cap="Install a package (with 'install.packages') to get it onto your computer. Load it (with 'library') to get it into your R session."}
knitr::include_graphics("figures/install_vs_library.jpg")
```
Once a package is loaded, you can use all its exported (i.e., public) functions
by calling them directly. For example, the `phonenumber` package has a function called
`letterToNumber` that converts a character string to a number. If you have not
loaded the `phonenumber` package in your current R session and try to use this
function, you will get an error. However, once you've loaded `phonenumber` using
the `library` function, you can use this function in your R session:
```{r}
fedex_number <- "GoFedEx"
letterToNumber(value = fedex_number)
```
```{block, type = "rmdnote"}
R vectors can have several different *classes*. One common class is the
character class, which is the class of the character string we're using here
("GoFedEx"). You'll always put character strings in quotation marks. Another key
class is numeric (numbers). Later in the course, we'll introduce other classes
that vectors can have, including factors and dates. For the simplest vector
classes, these classes are determined by the type of data that the vector
stores.
```
When you open RStudio, unless you reload the history of a previous R session
(which I typically strongly **do not** recommend), you will start your work in a
"fresh" R session. This means that, once you open RStudio, you will need to run
the code to load any packages, define any objects, and read in any data that you
will need for analysis in that session.
If you are using a package in academic research, you should cite it, especially
if it implements an algorithm or method that is not standard. You can use the
`citation` function to get the information you need about how to cite a package:
```{r}
citation(package = "phonenumber")
```
```{block, type = "rmdnote"}
We've talked here about loading packages using the `library` function to access
their functions. However, this is not the only way to access the package's
functions. The syntax `[package name]::[function name]` (e.g.,
`phonenumber::letterToNumber(fedex_number)`) will allow you to use a function from a
package you have installed on your computer, even if its package has not been
loaded in the current R session. Typically, this syntax is not used much in data
analysis scripts, in part because it makes the code much longer. However, you
will occasionally see it used to distinguish between two functions from
different packages that have the same name, as this format makes the desired
function unambiguous. One example where this syntax often is needed is when both
`plyr` and `dplyr` packages are loaded in an R session, since these share
functions with the same name.
```
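To see the `[package name]::[function name]` syntax in action without installing anything, here is a short sketch using two packages that ship with R ("stats" and "utils"). The input values are made up for illustration.

```{r}
# Call functions with the package::function syntax. "stats" and "utils"
# ship with R, so these calls work in any R session.
stats::median(x = c(1, 5, 9))
utils::head(x = letters, n = 3)
```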
Packages typically include some documentation to help users. These include:
- **Package vignettes**: Longer, tutorial-style documents that walk the user
through the basics of how to use the package and often give some helpful example
cases of the package in use.
- **Function helpfiles**: Files for each exported function (i.e., each function the
package maintainer intends others to use) within the package, following an
established structure. These include information about what inputs are required
and optional for the function, what output will be created, and what options can
be selected by the user. In many cases, these also include examples of using the
function.
To determine which vignettes are available for a package, you can use the
`vignette` function, with the package's name specified for the `package` option:
```{r eval = FALSE}
vignette(package = "phonenumber")
```
From the output of this, you can call any of the package's vignettes directly.
For example, the previous call tells you that this package only has one
vignette, and that vignette has the same name as the package ("phonenumber").
Once you know the name of the vignette you would like to open, you can also use
`vignette` to open it:
```{r eval = FALSE}
vignette(topic = "phonenumber")
```
To access the helpfile for any function within a package you've loaded, you can
use `?` followed by the function's name:
```{r eval = FALSE}
?letterToNumber
```
## R's most basic object types
An R object stores some type of data that you want to use later in your R code,
without fully recreating it. The content of R objects can vary from very simple
(the `"GoFedEx"` string in the example code above) to very complex objects with
lots of elements (for example, a machine learning model).
Objects can be structured in different ways, in terms of how they "hold" data.
These different structures are called **object classes**. One class of objects
can be a subtype of a more general object class.
There are a variety of different object types in R, shaped to fit different
types of data ranging from the simple to the complex. In this section, we'll
start by describing two object types that you will use most often in basic data
analysis, **vectors** (1-dimensional objects) and **dataframes** (2-dimensional
objects).
For these two object classes (vectors and dataframes), we'll look at:
1. How that class is structured
2. How to make a new object with that class
3. How to extract values from objects with that class
In later classes, we'll spend a lot of time learning how to do other things
with objects from these two classes, plus learn some other classes.
### Vectors
To get an initial grasp of the *vector* object type in R, think of it as a
1-dimensional object, or a string of values. Figure \@ref(fig:vector-example)
provides an example of the structure for a very simple vector, one that holds
the names of the three main characters in the *Harry Potter* book series.
```{r vector-example, echo = FALSE, out.width = "400pt", fig.align = "center", fig.cap="An example of the structure of an R object with the vector class. This object class contains data as a string of values, all with the same data type."}
knitr::include_graphics("figures/example_vector.jpg")
```
All values in a vector must be of the same data type (i.e., all numbers, all
characters, all dates). If you try to create a vector with elements from
different types (like "FedEx", which is a character, and 3, a number), R will
coerce all of the elements to the most generic type of any of the elements
(i.e., "FedEx" and "3" will both become characters, since "3" can be changed to
a character, but "FedEx" can't be changed to a number). Figure
\@ref(fig:vector-example-classes) gives some examples of different classes of
vectors.
```{r vector-example-classes, echo = FALSE, out.width = "400pt", fig.align = "center", fig.cap="Examples of vectors of different classes. All the values in a vector must be of the same type (e.g., all numbers, all characters). There are different classes of vectors depending on the type of data they store."}
knitr::include_graphics("figures/vector_class_examples.jpg")
```
To create a vector from different elements, you'll use the concatenation
function, `c` to join them together, with commas between the elements. For
example, to create the vector shown in Figure \@ref(fig:vector-example), you
can run:
```{r}
c("Harry", "Ron", "Hermione")
```
If you want to use that object later, you can assign it an object name in the expression:
```{r}
main_characters <- c("Harry", "Ron", "Hermione")
print(x = main_characters)
```
This **assignment expression**, for assigning a vector an object name, follows
the structure we covered earlier for function calls and assignment expressions
(Figure \@ref(fig:vector-assignment)).
```{r vector-assignment, echo = FALSE, out.width = "400pt", fig.align = "center", fig.cap="Elements of the assignment expression for creating a vector and assigning it an object name."}
knitr::include_graphics("figures/vector_class_examples.jpg")
```
If you create a numeric vector, you should not put the values in quotation marks:
```{r}
n_kids <- c(1, 7, 1)
```
If you mix classes when you create the vector, R will coerce all the elements
to the most generic of the elements' classes:
```{r}
mixed_classes <- c(1, 3, "five")
mixed_classes
```
Notice that the two numbers, 1 and 3, are now in quotation marks, once they
are put in a vector with a value with the character data type. You can use the
`class` function to determine the class of an object:
```{r}
class(x = mixed_classes)
```
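As a further sketch, you can compare the classes of a few small vectors. The values below are made up for illustration, and the logical class (`TRUE` / `FALSE` values) appears here only to show how coercion moves toward the most generic class:

```{r}
class(x = c(1, 7, 1))              # "numeric"
class(x = c("Harry", "Ron"))       # "character"
class(x = c(TRUE, FALSE))          # "logical"

# Mixing types: logical is coerced to numeric, and numeric to character
class(x = c(TRUE, 2))              # "numeric"
class(x = c(TRUE, 2, "three"))     # "character"
```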
A vector's *length* is the number of elements in the vector. You can use the
`length` function to determine a vector's length:
```{r}
length(x = mixed_classes)
```
Once you create an object, you will often want to reference the whole object in
future code. However, there will be some times when you'll want to reference
just certain elements of the object (for example, the first three values). You
can pull out certain values from a vector by using indexing with square brackets
(`[...]`) to identify the locations of the element you want to extract. For
example, to extract the second element of the `main_characters` vector, you can
run:
```{r}
main_characters[2] # Get the second value
```
You can use this same method to extract more than one value. You just need to
create a numeric vector with the position of each element you want to extract
and pass that in the square brackets. For example, to extract the first and
third elements of the `main_characters` vector, you can run:
```{r}
main_characters[c(1, 3)] # Get first and third values
```
The `:` operator can be very helpful with extracting values from a vector.
This operator creates a sequence of values from the value before the `:` to the
value after `:`, going by units of 1. For example, if you want to create a vector
of the integers from 1 to 10, you can run:
```{r}
1:10
```
If you want to extract the first two values from the `main_characters` vector, you
can use the `:` operator:
```{r}
main_characters[1:2] # Get the first two values
```
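You can also combine `:` with `length` so that you don't have to count elements by hand. The sketch below recreates the `main_characters` vector so that it runs on its own, and also shows that `:` counts down when the first value is larger than the second:

```{r}
main_characters <- c("Harry", "Ron", "Hermione")

# From the second element through the last one
main_characters[2:length(main_characters)]

# The sequence can also count down
3:1
main_characters[3:1]
```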
You can also use logic to pull out some values of a vector. For example, you
might only want to pull out even values from the `fibonacci` vector. We'll cover
using logical expressions to index vectors later in the book.
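As a brief preview, here is a rough sketch of what that can look like. The `fib_values` vector is defined just for this example, as a stand-in holding the first few Fibonacci numbers.

```{r}
# Example vector, defined here for illustration
fib_values <- c(1, 1, 2, 3, 5, 8, 13, 21)

# TRUE where a value is even (remainder of 0 when divided by 2)
is_even <- fib_values %% 2 == 0
is_even

# Keep only the elements where `is_even` is TRUE
fib_values[is_even]
```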
```{block, type = 'rmdtip'}
One thing that people often find confusing when they start using R is knowing
when to use and not use quotation marks. The general rule is that you use
quotation marks when you want to refer to a character string literally, but no
quotation marks when you want to refer to the value in a previously-defined
object. For example, if you saved the string `"Anderson"` as the object
`my_name` (`my_name <- "Anderson"`), then in later code, if you type `my_name`
(no quotation marks), you'll get `"Anderson"`, while if you type out `"my_name"`
(with quotation marks), you'll get `"my_name"` (what you typed, literally).
One thing that makes this rule confusing is that there are a few cases in R
where you really should (by this rule) use quotation marks, but the function is
coded to let you be lazy and get away without them. One example is the `library`
function. In the code earlier in this section to load the "phonenumber" package,
you want to literally load the package "phonenumber", rather than load whatever
character string is saved in the object named `phonenumber`. However, `library`
is one of the functions where you can be lazy and skip the quotation marks, and
it will still load "phonenumber" for you. Therefore, if you want, this function
also works if you call `library(package = phonenumber)` (without the quotation marks)
instead of how we actually called it (`library(package = "phonenumber")`).
```
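Here is the `my_name` example from the tip above as runnable code. The `nchar` calls (a base R function that counts the characters in a string) are added just to make the difference easier to see.

```{r}
my_name <- "Anderson"

my_name      # No quotation marks: R returns the value stored in the object
"my_name"    # Quotation marks: R returns the literal string

nchar(my_name)     # 8, the number of characters in "Anderson"
nchar("my_name")   # 7, the number of characters in "my_name"
```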
### Dataframes
A dataframe is a 2-dimensional object, and is made of one or more vectors of the
same length stuck together side-by-side. It is the closest thing R has to an Excel
spreadsheet-type structure. Figure \@ref(fig:example-dataframe) gives a
conceptual example of a dataframe created from several of the vector examples in
Figure \@ref(fig:vector-example-classes).
```{r example-dataframe, echo = FALSE, out.width = "400pt", fig.align = "center", fig.cap="An example dataframe, created from several vectors of the same length and with observations aligned across vector positions (for example, the first value in each vector provides a value for Harry, the second for Ron)."}
knitr::include_graphics("figures/example_dataframe.jpg")