---
# jupyter: julia-1.10
engine: julia
---
# Part 2: Dataframes
Dataframes are one of the most important objects in data science.
A dataframe is a table where each row is an observation and each column is a variable.
::: {.callout}
A dataframe `df` is a list of vectors, all with the same length.
A column of `df` is just one of its vectors.
The `i-th` row of `df` is the vector formed by the `i-th` coordinate of each of its columns.
:::
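To make the callout concrete, here is a minimal sketch with DataFrames.jl. The table is a made-up toy example (not the real penguins data), and the variable names are only illustrative.

```julia
using DataFrames

# Hypothetical toy table: three observations (rows), two variables (columns).
df = DataFrame(species = ["Adelie", "Gentoo", "Chinstrap"],
               body_mass_g = [3750, 5000, 3700])

# A column is just one of the underlying vectors.
species_column = df.species

# Every column has the same length, which is the number of rows.
same_length = all(length(col) == nrow(df) for col in eachcol(df))

# The 2nd row collects the 2nd entry of each column.
row2 = df[2, :]
(row2.species, row2.body_mass_g)
```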
We will use the Palmer Penguins dataset as a toy example for the remainder of the chapter.
```{julia}
#| eval: true
using DataFrames, PalmerPenguins
using Tidier, Chain
import DataFramesMeta as DFM
penguins = PalmerPenguins.load() |> DataFrame
```
## Libraries
### DataFrames
`DataFrames.jl` is the main package for dealing with dataframes in Julia. You can use it directly to manipulate tables, but we also have two alternatives: DataFramesMeta and Tidier.
### DataFramesMeta
DataFramesMeta is a collection of macros built on top of DataFrames. It provides many syntactic helpers to slice rows, create columns and summarise data.
### Tidier
Tidier is inspired by the `tidyverse` ecosystem in R. Tidier uses macros to rewrite your code into DataFrames.jl code. Because of this "tidy" heritage, we will often talk about the R packages that inspired the Julia ones (like `dplyr`, `tidyr` and many others).
In this book, whenever possible, we will show the different approaches in a tabset so you can compare them, with more emphasis on Tidier.
## Operations
Let's start with unary operations, i.e., operations that take a single dataframe as input and return a dataframe as output.^[Join operations will be dealt with later.] We can divide these operations into a few categories:
### Row operations
These are operations that only affect rows, leaving all columns untouched.
- *Filtering* or *subsetting* is when we select a subset of rows based on some criteria. Example: all male penguins of species Adelie. The output is a dataframe with the exact same columns, but possibly fewer rows.
- *Arranging* or *ordering* is when we reorder the rows of a dataframe using some criteria.
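As a rough illustration of both row operations, the sketch below uses plain DataFrames.jl functions (`subset` and `sort`) on a small made-up table; the data and variable names are assumptions for the example only.

```julia
using DataFrames

# Hypothetical toy data standing in for the penguins table.
df = DataFrame(species = ["Adelie", "Adelie", "Gentoo", "Adelie"],
               sex = ["male", "female", "male", "male"],
               body_mass_g = [3750, 3800, 5000, 3600])

# Filtering: keep only the rows where both conditions hold.
adelie_males = subset(df,
                      :species => ByRow(==("Adelie")),
                      :sex => ByRow(==("male")))

# Arranging: reorder the rows by body mass, lightest first.
by_mass = sort(df, :body_mass_g)
```

In both cases the result has the same columns as `df`; only the set or the order of rows changes.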
### Column operations
These are operations that only affect columns, leaving all rows untouched.
- *Selecting* is when we select some columns of a dataframe, while keeping all the rows. Example: select the `species` and `sex` columns.
- *Mutating* or *transforming* is when we create new columns. Example: a new column `body_mass_kg` can be obtained by dividing the column `body_mass_g` by 1000.
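The two column operations can be sketched with the DataFrames.jl functions `select` and `transform`. The table below is made-up toy data, used only to keep the example self-contained.

```julia
using DataFrames

# Hypothetical toy data standing in for the penguins table.
df = DataFrame(species = ["Adelie", "Gentoo"],
               sex = ["male", "female"],
               body_mass_g = [3750, 5000])

# Selecting: keep only the `species` and `sex` columns, with all rows.
two_cols = select(df, :species, :sex)

# Mutating: add `body_mass_kg`, computed row by row from `body_mass_g`.
with_kg = transform(df, :body_mass_g => ByRow(g -> g / 1000) => :body_mass_kg)
```

`transform` keeps the existing columns and appends the new one, so the number of rows is unchanged.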
### Reshaping operations
These operations change the shape of a dataframe, making it wider or longer.
- *Widening* (`pivot_wider` in `tidyr`);
- *Lengthening* (`pivot_longer` in `tidyr`).
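A minimal sketch of reshaping, assuming the DataFrames.jl functions `stack` (wide to long) and `unstack` (long to wide). The measurements below are made up for illustration.

```julia
using DataFrames

# Hypothetical wide table: one row per penguin, one column per measurement.
wide = DataFrame(id = [1, 2],
                 bill_length_mm = [39.1, 46.5],
                 bill_depth_mm = [18.7, 17.9])

# Lengthening: one row per (penguin, measurement) pair.
# The result has the columns `id`, `variable` and `value`.
long = stack(wide, [:bill_length_mm, :bill_depth_mm])

# Widening: spread the `variable` names back into separate columns.
back = unstack(long, :variable, :value)
```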
### Grouping operations
- *Grouping* is when we split the dataframe into a collection (array) of dataframes using some criteria. Example: grouping by `species` gives us 3 dataframes, each with only one species.
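As a small sketch of grouping, `groupby` from DataFrames.jl returns a `GroupedDataFrame`, which behaves like a collection of sub-dataframes. The toy data below is an assumption for the example.

```julia
using DataFrames

# Hypothetical toy data with three species.
df = DataFrame(species = ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
               body_mass_g = [3750, 5000, 3800, 3700])

# Grouping: split the rows by species.
groups = groupby(df, :species)

# One group per distinct species.
n_groups = length(groups)

# Look up a single group by its key; it contains only Adelie rows.
adelie = groups[(species = "Adelie",)]
```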
### Mixed operations
These operations can possibly change rows and columns at the same time.
- Distinct;
- Counting;
- *Summarising* or *combining* is when we apply some function to some columns in order to reduce the number of rows to some kind of summary (like a mean, median, max, and so on). Example: for each `species`, apply the `mean` function to the column `body_mass_g`. This will yield a dataframe with 3 rows, one for each species. Summarising is usually done after a grouping, so the summary is calculated within each of the groups.
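The three mixed operations can be sketched with DataFrames.jl as follows. The data is a made-up toy table, and the output column names (`n`, `mean_body_mass_g`) are chosen only for this example.

```julia
using DataFrames, Statistics

# Hypothetical toy data with three species.
df = DataFrame(species = ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
               body_mass_g = [3700, 5000, 3800, 3700])

# Distinct: keep the first row found for each species.
distinct_species = unique(df, :species)

# Counting: number of rows in each group.
counts = combine(groupby(df, :species), nrow => :n)

# Summarising: mean body mass per species, one row per group.
means = combine(groupby(df, :species),
                :body_mass_g => mean => :mean_body_mass_g)
```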
<!-- TODO: keep grouping and summarising together? -->
Since all these functions return a dataframe (or an array of dataframes, in the case of grouping), we can chain these operations together, with the convention that on grouped dataframes we apply the function to each of the groups.
Binary operations, i.e., operations that take two dataframes as input, are the joins:
- Left join;
- Right join;
- Inner join;
- Full join (also called outer join).
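As a quick sketch of how the joins differ, the example below uses the DataFrames.jl functions `leftjoin`, `rightjoin`, `innerjoin` and `outerjoin` on two made-up tables that share an `id` column. The tables and their contents are assumptions for illustration.

```julia
using DataFrames

# Hypothetical tables: ids 2 and 3 appear in both, 1 and 4 in only one.
penguin_ids = DataFrame(id = [1, 2, 3],
                        species = ["Adelie", "Gentoo", "Chinstrap"])
weights = DataFrame(id = [2, 3, 4],
                    body_mass_g = [5000, 3700, 4100])

# Left join: every row of the left table; unmatched values become missing.
left = leftjoin(penguin_ids, weights, on = :id)

# Right join: every row of the right table.
right = rightjoin(penguin_ids, weights, on = :id)

# Inner join: only the ids present in both tables.
inner = innerjoin(penguin_ids, weights, on = :id)

# Full (outer) join: every id present in either table.
outer = outerjoin(penguin_ids, weights, on = :id)
```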
## Comparing Tidier with DataFramesMeta
The following table lists the corresponding operations in each package:
| dplyr       | Tidier       | DataFramesMeta               | DataFrames  |
|-------------|--------------|------------------------------|-------------|
| `filter`    | `@filter`    | `@subset` / `@rsubset`       | `subset`    |
| `arrange`   | `@arrange`   | `@orderby` / `@rorderby`     | `sort`      |
| `select`    | `@select`    | `@select`                    | `select`    |
| `mutate`    | `@mutate`    | `@transform` / `@rtransform` | `transform` |
| `group_by`  | `@group_by`  | `@groupby`                   | `groupby`   |
| `summarise` | `@summarise` | `@combine`                   | `combine`   |
It is clear that for those coming from `R`, Tidier will look like the most natural approach.
Notice that Tidier and DataFramesMeta both define `@select`. To avoid this name clash, we `import DataFramesMeta as DFM` at the beginning and write `DFM.@select` when we want the DataFramesMeta version.
We will see each operation in more detail in the following chapters.
## Chaining operations
We can chain (or pipe) dataframe operations as follows with the `@chain` macro:
```{julia}
#| eval: false
@chain penguins begin
    @filter !ismissing(sex)
    @group_by sex
    @summarise mean = mean(bill_length_mm)
    @arrange mean
end
```
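For comparison, here is a rough sketch of a similar pipeline written with plain DataFrames.jl functions inside `@chain`, which passes the result of each step as the first argument of the next. It runs on a small made-up table rather than `penguins`, so the numbers are illustrative only.

```julia
using DataFrames, Chain, Statistics

# Hypothetical toy data with one missing `sex` value.
df = DataFrame(sex = ["male", "female", missing, "female", "male"],
               bill_length_mm = [40.0, 36.0, 38.0, 38.0, 50.0])

result = @chain df begin
    dropmissing(:sex)                            # drop rows with missing sex
    groupby(:sex)                                # one group per sex
    combine(:bill_length_mm => mean => :mean)    # mean bill length per group
    sort(:mean)                                  # smallest mean first
end
```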
## Using variables as column names
In Tidier, column names are written as if they were variables in the environment. This leads to some complications when we want to use other variables that are not column names.
For example, suppose you want to arrange `penguins` by a column whose name is stored in a variable.
In this case, we add `@eval` before the Tidier code and prefix the variable with `$` to force its evaluation, as in the following example:
```{julia}
#| eval: false
my_arrange_column = :body_mass_g;
@eval @arrange penguins $my_arrange_column
```
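For contrast, the plain DataFrames.jl functions take column names as ordinary values (a `Symbol` or a `String`), so a name stored in a variable can be passed directly. The sketch below assumes a small made-up table.

```julia
using DataFrames

# Hypothetical toy data standing in for the penguins table.
df = DataFrame(species = ["Gentoo", "Chinstrap", "Adelie"],
               body_mass_g = [5000, 3700, 3750])

# The column name lives in an ordinary variable.
my_arrange_column = :body_mass_g

# No interpolation is needed: the Symbol is just a function argument.
sorted = sort(df, my_arrange_column)
only_that_column = select(df, my_arrange_column)
```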
## Documentation
- DataFrames.jl: <https://dataframes.juliadata.org/stable/man/working_with_dataframes/>
- DataFramesMeta.jl: <https://juliadata.org/DataFramesMeta.jl/stable>
- TidierData.jl: <https://tidierorg.github.io/TidierData.jl/latest/reference/>