-
Notifications
You must be signed in to change notification settings - Fork 0
/
analysis-in-tidyverse.Rmd
158 lines (117 loc) · 7.73 KB
/
analysis-in-tidyverse.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
---
title: "Analysis functions in Tidyverse"
output:
---
At the top of every script in your first code chunk, best practice is to load all the packages that you need for your script.
Packages need to be installed once on your computer before you can use them; then to use them in a script, use the library() function. To install a package, use install.packages().
We primarily use the [tidyverse](https://www.tidyverse.org) to work with data, written primarily by Hadley Wickham of Posit (who also maintains RStudio). The Tidyverse is a suite of packages, and you can install them all at once by running the following code on your computer:
```{r}
install.packages("tidyverse")
```
Now to use the functions that come with the Tidyverse code, we need to load tidyverse into our script:
```{r}
library(tidyverse)
```
One of the packages, [dplyr](dplyr.tidyverse.org) somes with many useful functions for importing and analyzing data:
- read_csv() : import flat text files that are comma delimited into a "tibble" (a tidyverse table).
- arrange() : sort table data by a particular column or columns.
- filter() : pull out a subset of table data with one or multiple criteria.
- count() : quickly get a list of all unique values in a column and how many rows are associated with each value.
- summarise() : use summary functions to calculate basic statistics about your data as a whole, or the groups you create in group_by().
- group_by() : group the rows of your table based on the values in a column or columns so you can apply summary functions to those groups.
To start, we'll import a CSV file in our "data" folder. Because we're using an R Project (by opening RStudio using the .RProj file), R is already looking inside that project folder. We just need to tell it to step into the data folder and look for a particular CSV.
The data we'll start with is Major League Baseball salaries from 2021, courtesy of USA Today. This dataset is in a file called "mlb":
```{r}
read_csv("data/mlb.csv")
```
Run that code and R will spit out the table (tibble) below the code chunk. You'll see that the data has 898 rows and 5 columns (898 x 5). R will also paginate the results so you're only seeing ten rows at a time (flip through all the rows at the bottom of the table).
Remember, results of your code either print below the code chunk (as above) or get stored in the environment. To work with this dataset further, we want to store it in the environment. So add the assignment operator `<-` and a variable name in front of the read_csv() function:
```{r}
mlb <- read_csv("data/mlb.csv")
```
Now you'll see it in the environment.
## arrange() ##
Working through the functions mentioned above, try arrange(). Sort the data by the column *salary*. To do this, start with the variable name and then pipe it into the function you want to use. The shortcut for the pipe character is `cmd+option+m` (`ctrl+shift+m` on a PC):
```{r}
mlb %>% arrange(salary)
```
We're seeing the lowest paid players on top; by default arrange() uses ascending or alphabetical order. To reverse the order, add the desc() function inside arrange():
```{r}
mlb %>% arrange(desc(salary))
```
Now the highest paid player - for 2021, Mike Trout - appears on top.
You can also use two columns in your arrange(), but think through the order. If you want to sort by both team and salary (to see the best paid player for each team, for example), you need to put team first. If you put salary first, the team sort will only come into play when there are multiple players making the exact same salary.
```{r}
mlb %>% arrange(team, desc(salary))
```
## filter() ##
The filter() function takes one or more criteria to create a subset of your data. To see only players who played for the Baltimore Orioles in 2021, pipe mlb into the filter() function and choose only rows where the value in column *team* is equal to "Baltimore Orioles". To do this, use == instead of =; two equal signs tests for equality, one equal sign is used for assignment.
```{r}
mlb %>% filter(team == "Baltimore Orioles")
```
You can use the pipe to string together multiple functions, creating a pipeline for your data:
```{r}
mlb %>% filter(team == "Baltimore Orioles") %>% arrange(desc(salary))
```
To use multiple criteria you should use boolean logic: separate them with AND or OR. In tidyverse, these are represented as & and |.
So to look for the Baltimore Orioles OR the Washington Nationals:
```{r}
mlb %>%
filter(team=="Baltimore Orioles" | team=="Washington Nationals")
```
You can use line returns in your code to make it more readable; be sure to add line returns at the start of a function, so that pipes are always at the end of a line:
```{r}
mlb %>%
filter(team == "Baltimore Orioles" | team == "Washington Nationals") %>%
arrange(desc(salary))
```
When filtering, remember that OR *broadens* your results, AND *narrows* them. So adding `| team == "Washington Nationals"` to your original filter returns MORE results. If we add another element using &, we'll get fewer results:
```{r}
mlb %>%
filter(team=="Baltimore Orioles" & salary > 10000000) %>%
arrange(desc(salary))
```
This is a good time to point out that R is very case sensitive: function names, variable names, column names, text you're searching for: it must all match exactly with very few exceptions. If you try using Filter() it will not work.
## count() ##
So how did we know that the Orioles were written in the data as "Baltimore Orioles"? There are several ways to determine what the values look like in a column. Usually you do this when you're vetting your data, before you start analyzing. You can use the count() function to see what the unique values are in a column and how many rows are associated with each value:
```{r}
mlb %>% count(team)
```
There are 30 teams in Major League Baseball, and most of them have about 30 players. The names are clean, there aren't misspellings or duplicates.
You can also use the distinct() function, which returns the unique values in a column without counting rows:
```{r}
mlb %>% distinct(team)
```
Finally, you can search for a key word or words using str_detect() - this is part of the stringr package. This returns the full rows from your data:
```{r}
mlb %>% filter(str_detect(team, "Orioles"))
```
## summarise() ##
The summarise() function allows you to run summary functions (e.g. sum(), mean(), median()) on your data, distilling it down into one number. To find the average MLB salary, use mean() inside of summarise():
```{r}
mlb %>% summarise(avg_salary = mean(salary))
```
To find out how much is spent on MLB salaries, all together, use sum:
```{r}
mlb %>% summarise(total_salary = sum(salary))
```
Because sum(), mean(), etc are base R functions, they can't receive information from pipes. So summarise() prepares the data for you. You can calculate multiple descriptive statistics at once. It's also best practice to name the resulting calculations for reasons that will make more sense later.
```{r}
mlb %>% summarise(total_salary = sum(salary), avg_salary = mean(salary))
```
## group_by() ##
To group your rows into larger groups for further analysis, use group_by(). For example, we've been investigating individual player salaries. Then we calculated summary statistics on the whole dataset. But what if we want to know the average salary by team? We need to change our unit of analysis from player (individual rows) to team (groups of rows).
```{r}
mlb %>%
group_by(team) %>%
summarise(avg_salary = mean(salary)) %>%
arrange(desc(avg_salary))
```
To add the total number of players on each team, add the summary function n():
```{r}
mlb %>%
group_by(team) %>%
summarise(avg_salary = mean(salary), players = n()) %>%
arrange(desc(avg_salary))
```
If you're ready to do more with data analysis, go to bloomington-salaries-analysis.Rmd.