# (PART) Case studies {-}
# The first step to analyse a dataset
Weitao Chen and Jianing Li
```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
Oftentimes we get a large dataset to analyze, or to prepare for further use in a model. A common question data analysts hear is "where do I even start in my analysis?". We will introduce several functions that help you take a first glance at a giant dataset so you can build a fundamental understanding of your data.
```{r, include=FALSE}
# install.packages("mi")  # needed later for the missing-data plots
library(tidyverse)

# Keep a clean copy of iris, then randomly replace about 5% of the values
# in each column with NA so we can demonstrate missing-value checks later.
clean.iris <- iris
iris <- as.data.frame(lapply(iris, function(cc) {
  cc[sample(c(TRUE, NA), prob = c(0.95, 0.05), size = length(cc), replace = TRUE)]
}))
```
## A glimpse at the dataset
### What does the data look like?
When you come across a new dataset, you may ask: what does it look like?
"To see is to know. Not to see is to guess." To answer this question, you just need to view its content.
`View` works not only in RStudio, but also in the R terminal on Windows. However, its output can't be knitted into HTML or PDF.
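For example (this chunk is not evaluated when knitting, since `View` only works interactively):
```{r, eval=FALSE}
# Open the data frame in a spreadsheet-like viewer
View(iris)
```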
However, when the dataset is huge, you probably don't want to open and view the entire thing. Then what? Just take a sample of it.
The first function that comes to your mind? `head`!
```{r}
head(iris)
```
`head` returns the first 6 rows of a data frame by default. You can also specify how many rows you want.
```{r}
head(iris,3)
```
`tail` is similar, but it retrieves rows from the bottom of the data frame.
```{r}
tail(iris,2)
```
As you can see, we get only flowers of species "setosa" with `head` and only "virginica" with `tail`. That's because the original dataset is sorted by "Species", so its head and tail are homogeneous.
What if we want a heterogeneous view? You can use `sample_n` instead. It selects random rows from a table for you.
```{r}
library(dplyr)
sample_n(iris, 5)
```
When you want to take a glimpse at a huge dataset, `sample_n` is often the better choice. However, unlike `head`, `tail`, or `View`, it only works on data frames, not on vectors or other types. Also, you have to load `dplyr` to use `sample_n`, while the other three functions come with base R.
### Retrieve the metadata
For a data frame, one of the first properties we want to learn about is its size.
```{r}
dim(iris)
```
The first value in the result is the number of observations (rows) and the second is the number of variables (columns).
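If you only need one of the two dimensions, `nrow` and `ncol` return them individually:
```{r}
nrow(iris)  # number of observations
ncol(iris)  # number of variables
```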
Sometimes we don't really want to know the details of a dataset. We simply need a summary.
```{r}
summary(iris)
```
When used on a data frame, `summary` gives you the metadata of the dataset, that is, a summary of each variable. For numerical columns, you will see the quartiles and the mean. For categorical ones, you get the frequency of each factor level.
A simpler alternative is `str`. Note that `str` here is not an abbreviation for "string", but for "structure". According to its documentation, `str` "compactly displays the structure of an arbitrary R object". For a data frame, it displays the data type and the first few values of each column.
```{r}
str(iris)
```
## Dive into one column
### Summarise a numerical variable
If you would like to examine some columns more thoroughly, you can use `summarise` to customize your summary. It takes the data frame as its first argument, followed by name-value pairs of summary functions that define the output column names and their contents.
Commonly used summary functions include:
- Center: mean(), median()
- Spread: sd(), IQR(), mad()
- Range: min(), max(), quantile()
- Position: first(), last(), nth()
- Count: n(), n_distinct()
- Logical: any(), all()
```{r}
library(dplyr)
iris %>%
  summarize(mean.1 = mean(Petal.Length, na.rm = TRUE), sd.1 = sd(Petal.Length, na.rm = TRUE),
            mean.2 = mean(Sepal.Length, na.rm = TRUE), sd.2 = sd(Sepal.Length, na.rm = TRUE))
```
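`summarise` is often combined with `group_by` to compute the same statistics per group; a minimal sketch using the clean copy of iris created in the setup chunk:
```{r}
library(dplyr)
# Per-species mean and standard deviation of petal length
# (using the clean copy so no NA group appears)
clean.iris %>%
  group_by(Species) %>%
  summarise(mean.petal = mean(Petal.Length), sd.petal = sd(Petal.Length))
```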
### Understand a categorical variable
You may learn about the frequencies of a categorical column in a data frame using `summary`.
```{r}
summary(iris$Species)
```
However, it doesn't work when the column is stored not as a factor, but as a character vector.
```{r}
iris$Characters <- as.character(iris$Species)
summary(iris$Characters)
```
Under such circumstances, we may use `unique` and `table` instead. `unique` lists all the unique values in a vector, and `table` shows their frequencies.
```{r}
unique(iris$Characters)
```
```{r}
table(iris$Characters)
```
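Note that `table` silently drops `NA`s by default. Since we injected some missing values into `iris`, you can ask it to count them as a category of their own:
```{r}
# Count NA as its own category in the frequency table
table(iris$Characters, useNA = "ifany")
```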
## Advanced patterns in a dataset
### Locate the missing values
Sometimes there are missing values (NA) in our dataset. They can arise for a number of reasons, such as observations that were not recorded or data corruption, or a value can be left as NA on purpose to indicate something. The important thing is how to find the missing values in our dataset.
In this section we will introduce several ways to find them:
- Check missing values by columns
```{r}
colSums(is.na(iris)) %>%
  sort(decreasing = TRUE)
```
- Check missing values by row
```{r}
rowSums(is.na(iris)) %>%
  sort(decreasing = TRUE)
```
- Use a heatmap to check missing values
```{r}
library(ggplot2)
tidyiris <- iris %>%
  rownames_to_column("id") %>%
  gather(key, value, -id) %>%
  mutate(missing = ifelse(is.na(value), "yes", "no"))
ggplot(tidyiris, aes(x = key, y = fct_rev(id), fill = missing)) +
  geom_tile(color = "white") +
  ggtitle("iris with NAs added") +
  scale_fill_viridis_d() +  # discrete colour scale
  theme_bw()
```
Another option is the `mi` package, which builds a `missing_data.frame` and plots the missingness pattern with `image`. The chunk below only runs if `mi` is installed:
```{r, message=FALSE}
if (requireNamespace("mi", quietly = TRUE)) {
  library(mi)
  x <- missing_data.frame(iris)
  image(x)
}
```
### Find the outlier for numeric values
An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set.[3] An outlier can cause serious problems in statistical analyses.
Here we look for outliers in each numeric column using boxplots.
```{r}
tidyiris <- clean.iris[1:4] %>%
  rownames_to_column("id") %>%
  gather(key, value, -id)
ggplot(data = tidyiris) +
  geom_boxplot(mapping = aes(x = key, y = value))
```
Thus we can see that there are 4 outliers in the Sepal.Width variable (the points plotted beyond the whiskers).
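If you want the outlying values themselves rather than a picture, `boxplot.stats` applies the same 1.5 × IQR rule the boxplot uses; a minimal sketch on the clean data:
```{r}
# Values flagged as outliers by the 1.5 * IQR rule (the boxplot's default)
boxplot.stats(clean.iris$Sepal.Width)$out
```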
### Find the correlations among variables
For our numeric data, it is important to find the correlations between variables, since they provide extra information about the dataset.
```{r}
library(GGally)
ggpairs(clean.iris[1:4])
```
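If you prefer a plain numeric summary over the full plot matrix, the base function `cor` returns the correlation matrix directly:
```{r}
# Pairwise Pearson correlations of the four numeric columns
cor(clean.iris[1:4])
```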