\n```\n:::\n:::\n\n\n\n\n\n\n\n## Filtering (or: throwing rows away)\n\nFiltering a dataframe means keeping only the rows that satisfy a certain criterion (i.e. a boolean condition).\n\nTo filter a dataframe in Tidier, we use the macro `@filter`. You can use it in the form\n\n\n\n\n\n::: {#4 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter(penguins, species == \"Adelie\")\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame (127 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
\n```\n:::\n:::\n\n\n\n\n\n\n\nor without parentheses, as in\n\n\n\n\n\n::: {#6 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame (127 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
\n```\n:::\n:::\n\n\n\n\n\n\n\nNotice that the columns are typed as if they were variables in the Julia environment. This is inspired by the `tidyverse` behaviour of data-masking: inside a tidyverse verb, column names are treated as \"statistical variables\" that live inside the dataframe.\n\nIn DataFramesMeta, we have two macros for filtering: `@subset` and `@rsubset`. Use the first when your criterion needs a whole column, for example:\n\n\n\n\n\n::: {#8 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@subset penguins :body_mass_g .>= mean(skipmissing(:body_mass_g))\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
149×7 DataFrame (124 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 2 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 3 | Adelie | Torgersen | 34.6 | 21.1 | 198 | 4400 | male |
| 4 | Adelie | Torgersen | 42.5 | 20.7 | 197 | 4500 | male |
| 5 | Adelie | Dream | 39.8 | 19.1 | 184 | 4650 | male |
| 6 | Adelie | Dream | 44.1 | 19.7 | 196 | 4400 | male |
| 7 | Adelie | Dream | 39.6 | 18.8 | 190 | 4600 | male |
| 8 | Adelie | Biscoe | 40.1 | 18.9 | 188 | 4300 | male |
| 9 | Adelie | Biscoe | 41.3 | 21.1 | 195 | 4400 | male |
| 10 | Adelie | Torgersen | 41.8 | 19.4 | 198 | 4450 | male |
| 11 | Adelie | Torgersen | 42.8 | 18.5 | 195 | 4250 | male |
| 12 | Adelie | Torgersen | 42.9 | 17.6 | 196 | 4700 | male |
| 13 | Adelie | Dream | 41.1 | 18.1 | 205 | 4300 | male |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 138 | Gentoo | Biscoe | 47.2 | 13.7 | 214 | 4925 | female |
| 139 | Gentoo | Biscoe | 46.8 | 14.3 | 215 | 4850 | female |
| 140 | Gentoo | Biscoe | 50.4 | 15.7 | 222 | 5750 | male |
| 141 | Gentoo | Biscoe | 45.2 | 14.8 | 212 | 5200 | female |
| 142 | Gentoo | Biscoe | 49.9 | 16.1 | 213 | 5400 | male |
| 143 | Chinstrap | Dream | 49.2 | 18.2 | 195 | 4400 | male |
| 144 | Chinstrap | Dream | 52.8 | 20.0 | 205 | 4550 | male |
| 145 | Chinstrap | Dream | 54.2 | 20.8 | 201 | 4300 | male |
| 146 | Chinstrap | Dream | 52.0 | 20.7 | 210 | 4800 | male |
| 147 | Chinstrap | Dream | 53.5 | 19.9 | 205 | 4500 | male |
| 148 | Chinstrap | Dream | 50.8 | 18.5 | 201 | 4450 | male |
| 149 | Chinstrap | Dream | 49.0 | 19.6 | 212 | 4300 | male |
\n```\n:::\n:::\n\n\n\n\n\n\n\nNotice the broadcast on `>=`. We need it because *each variable is interpreted as a vector (the whole column)*. Also, notice that we refer to columns as _symbols_ (i.e. we prepend `:` to the column name).\n\nIn the above example, we needed the whole column `body_mass_g` to take the mean and then filter the rows based on that. If, however, your filtering criterion only uses information about each row (without needing to see it in the context of the whole column), then `@rsubset` (row subset) is easier to use: it interprets each column as a single value (not an array), so no broadcasting is needed:\n\n\n\n\n\n::: {#10 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@rsubset penguins :species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame (127 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
\n```\n:::\n:::\n\n\n\n\n\n\n\nIn both Tidier and DataFramesMeta, only the rows for which the criterion is `true` are returned. This means that rows where it evaluates to `false` or `missing` are thrown away.\n\nIn pure DataFrames, we use the `subset` function, and the criterion is passed with the notation\n\n\n\n\n\n::: {#12 .cell execution_count=0}\n``` {.julia .cell-code}\nsubset(penguins, :column => boolean_function)\n\n```\n:::\n\n\n\n\n\n\n\nwhere `boolean_function` is a function of one variable that returns boolean values, possibly including `missing`. Add the keyword argument `skipmissing=true` to treat `missing` results as `false`, so that those rows are dropped.\n\n### Filtering with one criterion\n\nFiltering all the rows with `species` == \"Adelie\".\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#14 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFrames\n\n\n\n\n\n::: {#18 .cell execution_count=1}\n``` {.julia .cell-code}\nsubset(penguins, :species => x -> x .== \"Adelie\", skipmissing=true)\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame (127 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n### Filtering with several criteria\n\nFiltering all the rows with `species` == \"Adelie\", `sex` == \"male\" and `body_mass_g` > 4000.\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#20 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins species == \"Adelie\" sex == \"male\" body_mass_g > 4000\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n\nFiltering all the rows where the `flipper_length_mm` is greater than the mean.\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#32 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins flipper_length_mm > mean(skipmissing(flipper_length_mm))\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFrames\n\n\n\n\n\n::: {#36 .cell execution_count=1}\n``` {.julia .cell-code}\nsubset(penguins, :flipper_length_mm => x -> x .> mean(skipmissing(x)), skipmissing=true)\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
148×7 DataFrame (123 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Dream | 35.7 | 18.0 | 202 | 3550 | female |
| 2 | Adelie | Dream | 41.1 | 18.1 | 205 | 4300 | male |
| 3 | Adelie | Dream | 40.8 | 18.9 | 208 | 4300 | male |
| 4 | Adelie | Biscoe | 41.0 | 20.0 | 203 | 4725 | male |
| 5 | Adelie | Torgersen | 41.4 | 18.5 | 202 | 3875 | male |
| 6 | Adelie | Torgersen | 44.1 | 18.0 | 210 | 4000 | male |
| 7 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
| 8 | Gentoo | Biscoe | 46.1 | 13.2 | 211 | 4500 | female |
| 9 | Gentoo | Biscoe | 50.0 | 16.3 | 230 | 5700 | male |
| 10 | Gentoo | Biscoe | 48.7 | 14.1 | 210 | 4450 | female |
| 11 | Gentoo | Biscoe | 50.0 | 15.2 | 218 | 5700 | male |
| 12 | Gentoo | Biscoe | 47.6 | 14.5 | 215 | 5400 | male |
| 13 | Gentoo | Biscoe | 46.5 | 13.5 | 210 | 4550 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 137 | Chinstrap | Dream | 53.5 | 19.9 | 205 | 4500 | male |
| 138 | Chinstrap | Dream | 49.0 | 19.5 | 210 | 3950 | male |
| 139 | Chinstrap | Dream | 50.8 | 18.5 | 201 | 4450 | male |
| 140 | Chinstrap | Dream | 49.0 | 19.6 | 212 | 4300 | male |
| 141 | Chinstrap | Dream | 51.4 | 19.0 | 201 | 3950 | male |
| 142 | Chinstrap | Dream | 50.7 | 19.7 | 203 | 4050 | male |
| 143 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male |
| 144 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male |
| 145 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male |
| 146 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male |
| 147 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female |
| 148 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male |
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n### Filtering with a variable column name\n\nSuppose the name of the column you want to filter on is stored in a variable, say\n\n\n\n\n\n::: {#38 .cell execution_count=1}\n``` {.julia .cell-code}\nmy_column = :species;\n```\n:::\n\n\n\n\n\n\n\n::: {.panel-tabset}\n\n## DataFramesMeta\n\n\n\n\n\n::: {#40 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@rsubset penguins $my_column == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame (127 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFrames\n\n\n\n\n\n::: {#42 .cell execution_count=1}\n``` {.julia .cell-code}\nsubset(penguins, my_column => x -> x .== \"Adelie\")\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame (127 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\nIf the column name is stored as a string\n\n\n\n\n\n::: {#44 .cell execution_count=1}\n``` {.julia .cell-code}\nmy_column_string = \"species\";\n```\n:::\n\n\n\n\n\n\n\ninstead of a symbol, we can proceed in the same way\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#46 .cell execution_count=1}\n``` {.julia .cell-code}\n# @filter(penguins, !!my_column == \"Adelie\")\n```\n:::\n\n\n\n\n\n\n\n## DataFramesMeta\n\n\n\n\n\n::: {#48 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@rsubset penguins $(my_column_string) == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame (127 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFrames\n\n\n\n\n\n::: {#50 .cell execution_count=1}\n``` {.julia .cell-code}\nsubset(penguins, my_column_string => x -> x .== \"Adelie\")\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame (127 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n## Arranging\n\nArranging means reordering the rows of a dataframe according to the values of one or more columns. The rows are first arranged by the first column, then by the second (if any), and so on. In Tidier, to invert the ordering, wrap the column name in a `desc()` call.\n\n### Arranging by one column\n\nArrange by `body_mass_g`.\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#52 .cell execution_count=1}\n``` {.julia .cell-code}\n@arrange penguins body_mass_g\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n### Arranging by two columns, with one reversed\n\nFirst arrange by `island`, then by reversed `body_mass_g`.\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#58 .cell execution_count=1}\n``` {.julia .cell-code}\n@arrange penguins island desc(body_mass_g)\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
344×7 DataFrame (319 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Gentoo | Biscoe | missing | missing | missing | missing | missing |
| 2 | Gentoo | Biscoe | 49.2 | 15.2 | 221 | 6300 | male |
| 3 | Gentoo | Biscoe | 59.6 | 17.0 | 230 | 6050 | male |
| 4 | Gentoo | Biscoe | 51.1 | 16.3 | 220 | 6000 | male |
| 5 | Gentoo | Biscoe | 48.8 | 16.2 | 222 | 6000 | male |
| 6 | Gentoo | Biscoe | 45.2 | 16.4 | 223 | 5950 | male |
| 7 | Gentoo | Biscoe | 49.8 | 15.9 | 229 | 5950 | male |
| 8 | Gentoo | Biscoe | 48.4 | 14.6 | 213 | 5850 | male |
| 9 | Gentoo | Biscoe | 49.3 | 15.7 | 217 | 5850 | male |
| 10 | Gentoo | Biscoe | 55.1 | 16.0 | 230 | 5850 | male |
| 11 | Gentoo | Biscoe | 49.5 | 16.2 | 229 | 5800 | male |
| 12 | Gentoo | Biscoe | 48.6 | 16.0 | 230 | 5800 | male |
| 13 | Gentoo | Biscoe | 50.4 | 15.7 | 222 | 5750 | male |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 333 | Adelie | Torgersen | 41.1 | 18.6 | 189 | 3325 | male |
| 334 | Adelie | Torgersen | 38.5 | 17.9 | 190 | 3325 | female |
| 335 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 336 | Adelie | Torgersen | 38.8 | 17.6 | 191 | 3275 | female |
| 337 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 338 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| 339 | Adelie | Torgersen | 34.6 | 17.2 | 189 | 3200 | female |
| 340 | Adelie | Torgersen | 36.2 | 17.2 | 187 | 3150 | female |
| 341 | Adelie | Torgersen | 35.9 | 16.6 | 190 | 3050 | female |
| 342 | Adelie | Torgersen | 35.2 | 15.9 | 186 | 3050 | female |
| 343 | Adelie | Torgersen | 39.0 | 17.1 | 191 | 3050 | female |
| 344 | Adelie | Torgersen | 38.6 | 17.0 | 188 | 2900 | female |
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFramesMeta\n\n\n\n\n\n::: {#60 .cell execution_count=1}\n``` {.julia .cell-code}\n# works only when the reversed column is numeric?\n\nDFM.@orderby penguins :island :body_mass_g .* -1\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n### Arranging by one variable column\n\nLet's arrange the data by the following column:\n\n\n\n\n\n::: {#64 .cell execution_count=1}\n``` {.julia .cell-code}\nmy_arrange_column = :body_mass_g;\n```\n:::\n\n\n\n\n\n\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#66 .cell execution_count=1}\n``` {.julia .cell-code}\n#?? how to do it?\n# @arrange penguins !!my_arrange_column\n```\n:::\n\n\n\n\n\n\n\n## DataFramesMeta\n\n\n\n\n\n::: {#68 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@orderby penguins $my_arrange_column\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n",
+ "markdown": "---\n# jupyter: julia-1.10\nengine: julia\n---\n\n\n\n\n\n# Operations on rows\n\nIn this chapter we will see operations that deal with rows, be it ordering or throwing some rows away.\n\nThe following is necessary to run all examples:\n\n\n\n\n\n::: {#2 .cell execution_count=1}\n``` {.julia .cell-code}\nusing DataFrames, PalmerPenguins\nusing Tidier\nimport DataFramesMeta as DFM\n\npenguins = PalmerPenguins.load() |> DataFrame;\n@slice_head(penguins, n = 10)\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
10×7 DataFrame

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
\n```\n:::\n:::\n\n\n\n\n\n\n\n## Filtering (or: throwing rows away)\n\nTo *filter* a dataframe means keeping only the rows that satisfy a certain criterion (i.e. a boolean condition).\n\nTo filter in Tidier, we use the macro `@filter`. You can use it in the form\n\n\n\n\n\n::: {#4 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter(penguins, species == \"Adelie\")\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame (127 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
\n```\n:::\n:::\n\n\n\n\n\n\n\nor without parentheses, as in\n\n\n\n\n\n::: {#6 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame (127 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
\n```\n:::\n:::\n\n\n\n\n\n\n\nNotice that the columns are typed as if they were variables in the Julia environment. This is inspired by the `tidyverse` behaviour of data-masking: inside a tidyverse verb, column names are treated as \"statistical variables\" that live inside the dataframe.\n\nIn DataFramesMeta, we have two macros for filtering: `@subset` and `@rsubset`. Use the first when your criterion needs a whole column, for example:\n\n\n\n\n\n::: {#8 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@subset penguins :body_mass_g .>= mean(skipmissing(:body_mass_g))\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
149×7 DataFrame (124 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? |
| 1 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 2 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 3 | Adelie | Torgersen | 34.6 | 21.1 | 198 | 4400 | male |
| 4 | Adelie | Torgersen | 42.5 | 20.7 | 197 | 4500 | male |
| 5 | Adelie | Dream | 39.8 | 19.1 | 184 | 4650 | male |
| 6 | Adelie | Dream | 44.1 | 19.7 | 196 | 4400 | male |
| 7 | Adelie | Dream | 39.6 | 18.8 | 190 | 4600 | male |
| 8 | Adelie | Biscoe | 40.1 | 18.9 | 188 | 4300 | male |
| 9 | Adelie | Biscoe | 41.3 | 21.1 | 195 | 4400 | male |
| 10 | Adelie | Torgersen | 41.8 | 19.4 | 198 | 4450 | male |
| 11 | Adelie | Torgersen | 42.8 | 18.5 | 195 | 4250 | male |
| 12 | Adelie | Torgersen | 42.9 | 17.6 | 196 | 4700 | male |
| 13 | Adelie | Dream | 41.1 | 18.1 | 205 | 4300 | male |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 138 | Gentoo | Biscoe | 47.2 | 13.7 | 214 | 4925 | female |
| 139 | Gentoo | Biscoe | 46.8 | 14.3 | 215 | 4850 | female |
| 140 | Gentoo | Biscoe | 50.4 | 15.7 | 222 | 5750 | male |
| 141 | Gentoo | Biscoe | 45.2 | 14.8 | 212 | 5200 | female |
| 142 | Gentoo | Biscoe | 49.9 | 16.1 | 213 | 5400 | male |
| 143 | Chinstrap | Dream | 49.2 | 18.2 | 195 | 4400 | male |
| 144 | Chinstrap | Dream | 52.8 | 20.0 | 205 | 4550 | male |
| 145 | Chinstrap | Dream | 54.2 | 20.8 | 201 | 4300 | male |
| 146 | Chinstrap | Dream | 52.0 | 20.7 | 210 | 4800 | male |
| 147 | Chinstrap | Dream | 53.5 | 19.9 | 205 | 4500 | male |
| 148 | Chinstrap | Dream | 50.8 | 18.5 | 201 | 4450 | male |
| 149 | Chinstrap | Dream | 49.0 | 19.6 | 212 | 4300 | male |
\n```\n:::\n:::\n\n\n\n\n\n\n\nNotice the broadcast on `>=`. We need it because *each variable is interpreted as a vector (the whole column)*. Also, notice that we refer to columns as _symbols_ (i.e. we prepend `:` to the column name).\n\nIn the above example, we needed the whole column `body_mass_g` to take the mean and then filter the rows based on that. If, however, your filtering criterion only uses information about each row (without needing to see it in the context of the whole column), then `@rsubset` (**r**ow subset) is easier to use: it interprets each column as a single value (not an array), so no broadcasting is needed:\n\n\n\n\n\n::: {#10 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@rsubset penguins :species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame (127 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
\n```\n:::\n:::\n\n\n\n\n\n\n\nIn both Tidier and DataFramesMeta, only the rows for which the criterion is `true` are returned. This means that rows where it evaluates to `false` or `missing` are thrown away.\n\nIn pure DataFrames, we use the `subset` function, and the criterion is passed with the notation\n\n\n\n\n\n::: {#12 .cell execution_count=0}\n``` {.julia .cell-code}\nsubset(penguins, :column => boolean_function)\n\n```\n:::\n\n\n\n\n\n\n\nwhere `boolean_function` is a function of one variable (the `:column` you passed) that returns boolean values, possibly including `missing`. Add the keyword argument `skipmissing=true` to treat `missing` results as `false`, so that those rows are dropped.\n\n### Filtering with one criterion\n\n**Problem:** Filtering all the rows with `species` == \"Adelie\".\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#14 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFrames\n\n\n\n\n\n::: {#18 .cell execution_count=1}\n``` {.julia .cell-code}\nsubset(penguins, :species => x -> x .== \"Adelie\", skipmissing=true)\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame (127 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n### Filtering with several criteria\n\n**Problem:** Filtering all the rows with `species` == \"Adelie\", `sex` == \"male\" and `body_mass_g` > 4000.\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#20 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins species == \"Adelie\" sex == \"male\" body_mass_g > 4000\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n### Filtering with metadata\n\nBy metadata we mean here values computed from the dataframe itself, such as the mean/max/min of a column.\n\n**Problem:** Filtering all the rows where the `flipper_length_mm` is greater than the mean.\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#32 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins flipper_length_mm > mean(skipmissing(flipper_length_mm))\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFrames\n\n\n\n\n\n::: {#36 .cell execution_count=1}\n``` {.julia .cell-code}\nsubset(penguins, :flipper_length_mm => x -> x .> mean(skipmissing(x)), skipmissing=true)\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
148×7 DataFrame, 123 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? |
| 1 | Adelie | Dream | 35.7 | 18.0 | 202 | 3550 | female |
| 2 | Adelie | Dream | 41.1 | 18.1 | 205 | 4300 | male |
| 3 | Adelie | Dream | 40.8 | 18.9 | 208 | 4300 | male |
| 4 | Adelie | Biscoe | 41.0 | 20.0 | 203 | 4725 | male |
| 5 | Adelie | Torgersen | 41.4 | 18.5 | 202 | 3875 | male |
| 6 | Adelie | Torgersen | 44.1 | 18.0 | 210 | 4000 | male |
| 7 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
| 8 | Gentoo | Biscoe | 46.1 | 13.2 | 211 | 4500 | female |
| 9 | Gentoo | Biscoe | 50.0 | 16.3 | 230 | 5700 | male |
| 10 | Gentoo | Biscoe | 48.7 | 14.1 | 210 | 4450 | female |
| 11 | Gentoo | Biscoe | 50.0 | 15.2 | 218 | 5700 | male |
| 12 | Gentoo | Biscoe | 47.6 | 14.5 | 215 | 5400 | male |
| 13 | Gentoo | Biscoe | 46.5 | 13.5 | 210 | 4550 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 137 | Chinstrap | Dream | 53.5 | 19.9 | 205 | 4500 | male |
| 138 | Chinstrap | Dream | 49.0 | 19.5 | 210 | 3950 | male |
| 139 | Chinstrap | Dream | 50.8 | 18.5 | 201 | 4450 | male |
| 140 | Chinstrap | Dream | 49.0 | 19.6 | 212 | 4300 | male |
| 141 | Chinstrap | Dream | 51.4 | 19.0 | 201 | 3950 | male |
| 142 | Chinstrap | Dream | 50.7 | 19.7 | 203 | 4050 | male |
| 143 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male |
| 144 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male |
| 145 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male |
| 146 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male |
| 147 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female |
| 148 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male |
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n### Filtering with a variable column name\n\nSuppose the name of the column you want to filter on is stored in a variable, let's say as a symbol\n\n\n\n\n\n::: {#38 .cell execution_count=1}\n``` {.julia .cell-code}\nmy_column = :species;\n```\n:::\n\n\n\n\n\n\n\n**Problem:** Filtering all the rows where the column stored in `my_column` is \"Adelie\".\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#40 .cell execution_count=1}\n``` {.julia .cell-code}\n@eval @filter penguins $my_column == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFrames\n\n\n\n\n\n::: {#44 .cell execution_count=1}\n``` {.julia .cell-code}\nsubset(penguins, my_column => x -> x .== \"Adelie\")\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame, 127 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\nIn case the column name is a string\n\n\n\n\n\n::: {#46 .cell execution_count=1}\n``` {.julia .cell-code}\nmy_column_string = \"species\";\n```\n:::\n\n\n\n\n\n\n\ninstead of a symbol, we can write it in the same way; in Tidier, just take care to convert it to a symbol first\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#48 .cell execution_count=1}\n``` {.julia .cell-code}\n@eval @filter penguins $(Symbol(my_column_string)) == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFrames\n\n\n\n\n\n::: {#52 .cell execution_count=1}\n``` {.julia .cell-code}\nsubset(penguins, my_column_string => x -> x .== \"Adelie\")\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame, 127 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n## Arranging\n\nTo *arrange* a dataframe means to reorder the rows according to the values of some columns. The rows are first arranged by the first column, then by the second (if any), and so on. In Tidier, to invert the ordering, just put the column name inside a `desc()` call.\n\n### Arranging by one column\n\n**Problem:** Arrange by `body_mass_g`.\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#54 .cell execution_count=1}\n``` {.julia .cell-code}\n@arrange penguins body_mass_g\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n### Arranging by two columns, with one reversed\n\n**Problem:** First arrange by `island`, then by reversed `body_mass_g`.\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#60 .cell execution_count=1}\n``` {.julia .cell-code}\n@arrange penguins island desc(body_mass_g)\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
344×7 DataFrame, 319 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? |
| 1 | Gentoo | Biscoe | missing | missing | missing | missing | missing |
| 2 | Gentoo | Biscoe | 49.2 | 15.2 | 221 | 6300 | male |
| 3 | Gentoo | Biscoe | 59.6 | 17.0 | 230 | 6050 | male |
| 4 | Gentoo | Biscoe | 51.1 | 16.3 | 220 | 6000 | male |
| 5 | Gentoo | Biscoe | 48.8 | 16.2 | 222 | 6000 | male |
| 6 | Gentoo | Biscoe | 45.2 | 16.4 | 223 | 5950 | male |
| 7 | Gentoo | Biscoe | 49.8 | 15.9 | 229 | 5950 | male |
| 8 | Gentoo | Biscoe | 48.4 | 14.6 | 213 | 5850 | male |
| 9 | Gentoo | Biscoe | 49.3 | 15.7 | 217 | 5850 | male |
| 10 | Gentoo | Biscoe | 55.1 | 16.0 | 230 | 5850 | male |
| 11 | Gentoo | Biscoe | 49.5 | 16.2 | 229 | 5800 | male |
| 12 | Gentoo | Biscoe | 48.6 | 16.0 | 230 | 5800 | male |
| 13 | Gentoo | Biscoe | 50.4 | 15.7 | 222 | 5750 | male |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 333 | Adelie | Torgersen | 41.1 | 18.6 | 189 | 3325 | male |
| 334 | Adelie | Torgersen | 38.5 | 17.9 | 190 | 3325 | female |
| 335 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 336 | Adelie | Torgersen | 38.8 | 17.6 | 191 | 3275 | female |
| 337 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 338 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| 339 | Adelie | Torgersen | 34.6 | 17.2 | 189 | 3200 | female |
| 340 | Adelie | Torgersen | 36.2 | 17.2 | 187 | 3150 | female |
| 341 | Adelie | Torgersen | 35.9 | 16.6 | 190 | 3050 | female |
| 342 | Adelie | Torgersen | 35.2 | 15.9 | 186 | 3050 | female |
| 343 | Adelie | Torgersen | 39.0 | 17.1 | 191 | 3050 | female |
| 344 | Adelie | Torgersen | 38.6 | 17.0 | 188 | 2900 | female |
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFramesMeta\n\n\n\n\n\n::: {#62 .cell execution_count=1}\n``` {.julia .cell-code}\n# works only when the reversed column is numeric?\n\nDFM.@orderby penguins :island :body_mass_g .* -1\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n",
"supporting": [
"dataframes-rows_files"
],
diff --git a/_freeze/dataframes/execute-results/html.json b/_freeze/dataframes/execute-results/html.json
index 10da5c1..891469a 100644
--- a/_freeze/dataframes/execute-results/html.json
+++ b/_freeze/dataframes/execute-results/html.json
@@ -1,10 +1,10 @@
{
- "hash": "bf3bf9fbea01582cf3465d388dc8f6aa",
+ "hash": "1c99b94b83399a24e7a5d3f101a0a8b5",
"result": {
"engine": "julia",
- "markdown": "---\n# jupyter: julia-1.10\nengine: julia\n---\n\n\n\n\n\n\n\n\n\n# Part 2: Dataframes\n\nDataframes are one of the most important objects in data science. \n\nA dataframe is a table where each row is an observation and each column is a variable.\n\n::: {.callout}\nA dataframe is a list of vectors all with the same length. \n:::\n\nWe will use the Palmer Penguin dataset as a toy example for the remaining of the chapter.\n\n\n\n\n\n\n\n\n\n::: {#2 .cell execution_count=1}\n``` {.julia .cell-code}\nusing DataFrames, PalmerPenguins\nusing Tidier, Chain\nimport DataFramesMeta as DFM\n\npenguins = PalmerPenguins.load() |> DataFrame\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
344×7 DataFrame, 319 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 333 | Chinstrap | Dream | 45.2 | 16.6 | 191 | 3250 | female |
| 334 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male |
| 335 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male |
| 336 | Chinstrap | Dream | 45.6 | 19.4 | 194 | 3525 | female |
| 337 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male |
| 338 | Chinstrap | Dream | 46.8 | 16.5 | 189 | 3650 | female |
| 339 | Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female |
| 340 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male |
| 341 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female |
| 342 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male |
| 343 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male |
| 344 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female |
\n```\n:::\n:::\n\n\n\n\n\n\n\n\n\n\n\n::: {.callout-note}\n\n`Dataframes.jl` is the main package for dealing with dataframes in Julia. You can use it directly to manipulate tables, but we also have 2 alternatives: DataFramesMeta and Tidier. \n\nDataFramesMeta is a collection of macros based on DataFrames.\n\nTidier is inspired by the `tidyverse` ecosystem in R. Tidier use macros to rewrite your code into DataFrames.jl code. Because of this \"tidy\" heritance, we will often talk about the R packages that inspired the Julia ones (like `dplyr`, `tidyr` and many others).\n\nIn this book, whenever possible, we will show the different approaches in a tabset so you can compare them, giving more emphasis on Tidier.\n:::\n\n## Operations\n\nLet's start with some unary operations, ie. operations that take only one dataframe as input and return one dataframe as output.^[Join operations will be dealt later.]. We can divide these operations in some categories:\n\n### Rows operations\n\nThese are operations that only affect rows, leaving all columns untouched.\n\n- *Filtering* or *subsetting* is when we select a subset of rows based on some criteria. Example: all male penguins of species Adelie. The output is a dataframe with the exact same columns, but possibly fewer rows.\n\n- *Arranging* or *ordering* is when we reorder the rows of a dataframe using some criteria.\n\n### Column operations\n\nThese are operations that only affect columns, leaving all rows untouched.\n\n- *Selecting* is when we select some columns of a dataframe, while keeping all the rows. Example: select the `species` and `sex` columns.\n\n- *Mutating* or *transforming* is when we create new columns. 
Example: a new column `body_mass_kg` can be obtained dividing the column `body_mass_g` by 1000.\n\n### Reshaping operations\n\nThese operations change the shape of a dataframe, making it wider or longer.\n\n- `Widening`\n\n- `Longering`?\n\n### Grouping operations\n\n- *Grouping* is when we split the dataframe into a collection (array) of dataframes using some criteria. Example: grouping by `species` gives us 3 dataframes, each with only one species.\n\n### Mixed operations\n\nThese operations can possibly change rows and columns at the same time.\n\n- Distinct;\n- Counting;\n- *Summarising* or *combining* is when we apply some function to some columns in order to reduce the amount of rows with some kind of summary (like a mean, median, max, and so on). Example: for each `species`, apply the `mean` function to the columns `body_mass_g`. This will yield a dataframe with 3 rows, one for each species. Summarising is usually done after a grouping, so the summary is calculated with relation to each of the groups.\n\n??? deixar grupo e sumário juntos?\n\nSince all these functions return a dataframe (or an array of dataframes, in the case of grouping), we can chain these operations together, with the convention that on grouped dataframes we apply the function in each one of the groups.\n\nNow for binary operations (ie. 
operations that take two dataframes), we have all the joins:\n\n- Left join;\n- Right join;\n- Inner join;\n- Outer join;\n- Full join.\n\n## Comparing Tidier with DataFramesMeta\n\nThe following table list the operations on each package:\n\n| dplyr | Tidier | DataFramesMeta | DataFrames |\n|-------------|--------------|------------------------------|--------------|\n| `filter` | `@filter` | `@subset` / `@rsubset` | `subset` |\n| `arrange` | `@arrange` | `@orderby` / `@rorderby` | `sort!` |\n| `select` | `@select` | `@select` | array sintax |\n| `mutate` | `@mutate` | `@transform` / `@rtransform` | array sintax |\n| `group_by` | `@group_by` | `@groupby` | `groupby` |\n| `summarise` | `@summarise` | `@combine` | `combine` |\n\nIt is clear that for those coming from `R`, Tidier will look like the most natural approach.\n\nNotice that we have a name clash with `@select`: that is why we `import DataFramesMeta as DFM` at the beginning.\n\nWe will see each operation with more details in the following chapters.\n\n## Chaining operations\n\nWe can chain (or pipe) dataframe operations as follows with the `@chain` macro:\n\n\n\n\n\n\n\n\n\n::: {#4 .cell execution_count=0}\n``` {.julia .cell-code}\n@chain penguins begin\n @filter !ismissing(sex)\n @group_by sex\n @summarise mean = mean(bill_length_mm)\n @arrange mean\nend\n```\n:::\n\n\n",
+ "markdown": "---\n# jupyter: julia-1.10\nengine: julia\n---\n\n\n\n\n\n# Part 2: Dataframes\n\nDataframes are one of the most important objects in data science. \n\nA dataframe is a table where each row is an observation and each column is a variable.\n\n::: {.callout}\nA dataframe `df` is a list of vectors, all with the same length.\n\nA column of `df` is just one of its vectors.\n\nThe `i-th` row of `df` is the vector formed by the `i-th` entry of each of its columns.\n:::\n\nWe will use the Palmer Penguins dataset as a toy example for the remainder of the chapter.\n\n\n\n\n\n::: {#2 .cell execution_count=1}\n``` {.julia .cell-code}\nusing DataFrames, PalmerPenguins\nusing Tidier, Chain\nimport DataFramesMeta as DFM\n\npenguins = PalmerPenguins.load() |> DataFrame\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
344×7 DataFrame, 319 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 333 | Chinstrap | Dream | 45.2 | 16.6 | 191 | 3250 | female |
| 334 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male |
| 335 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male |
| 336 | Chinstrap | Dream | 45.6 | 19.4 | 194 | 3525 | female |
| 337 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male |
| 338 | Chinstrap | Dream | 46.8 | 16.5 | 189 | 3650 | female |
| 339 | Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female |
| 340 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male |
| 341 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female |
| 342 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male |
| 343 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male |
| 344 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female |
\n```\n:::\n:::\n\n\n\n\n\n\n\n## Libraries\n\n### DataFrames\n\n`DataFrames.jl` is the main package for dealing with dataframes in Julia. You can use it directly to manipulate tables, but we also have 2 alternatives: DataFramesMeta and Tidier. \n\n### DataFramesMeta\n\nDataFramesMeta is a collection of macros based on DataFrames. It provides many syntactic helpers to slice rows, create columns and summarise data.\n\n### Tidier\n\nTidier is inspired by the `tidyverse` ecosystem in R. Tidier uses macros to rewrite your code into DataFrames.jl code. Because of this \"tidy\" heritage, we will often talk about the R packages that inspired the Julia ones (like `dplyr`, `tidyr` and many others).\n\nIn this book, whenever possible, we will show the different approaches in a tabset so you can compare them, with more emphasis on Tidier.\n\n## Operations\n\nLet's start with some unary operations, i.e. operations that take only one dataframe as input and return one dataframe as output.^[Join operations will be dealt with later.] We can divide these operations into some categories:\n\n### Row operations\n\nThese are operations that only affect rows, leaving all columns untouched.\n\n- *Filtering* or *subsetting* is when we select a subset of rows based on some criteria. Example: all male penguins of species Adelie. The output is a dataframe with the exact same columns, but possibly fewer rows.\n\n- *Arranging* or *ordering* is when we reorder the rows of a dataframe using some criteria.\n\n### Column operations\n\nThese are operations that only affect columns, leaving all rows untouched.\n\n- *Selecting* is when we select some columns of a dataframe, while keeping all the rows. Example: select the `species` and `sex` columns.\n\n- *Mutating* or *transforming* is when we create new columns. 
Example: a new column `body_mass_kg` can be obtained by dividing the column `body_mass_g` by 1000.\n\n### Reshaping operations\n\nThese operations change the shape of a dataframe, making it wider or longer.\n\n- `Widening`\n\n- `Longering`?\n\n### Grouping operations\n\n- *Grouping* is when we split the dataframe into a collection (array) of dataframes using some criteria. Example: grouping by `species` gives us 3 dataframes, each with only one species.\n\n### Mixed operations\n\nThese operations can possibly change rows and columns at the same time.\n\n- Distinct;\n- Counting;\n- *Summarising* or *combining* is when we apply some function to some columns in order to reduce the number of rows with some kind of summary (like a mean, median, max, and so on). Example: for each `species`, apply the `mean` function to the column `body_mass_g`. This will yield a dataframe with 3 rows, one for each species. Summarising is usually done after a grouping, so the summary is calculated within each of the groups.\n\n??? keep grouping and summarising together?\n\nSince all these functions return a dataframe (or an array of dataframes, in the case of grouping), we can chain these operations together, with the convention that on grouped dataframes we apply the function to each one of the groups.\n\nNow for binary operations (i.e. 
operations that take two dataframes), we have all the joins:\n\n- Left join;\n- Right join;\n- Inner join;\n- Outer join;\n- Full join.\n\n## Comparing Tidier with DataFramesMeta\n\nThe following table lists the operations in each package:\n\n| dplyr | Tidier | DataFramesMeta | DataFrames |\n|-------------|--------------|------------------------------|--------------|\n| `filter` | `@filter` | `@subset` / `@rsubset` | `subset` |\n| `arrange` | `@arrange` | `@orderby` / `@rorderby` | `sort!` |\n| `select` | `@select` | `@select` | array syntax |\n| `mutate` | `@mutate` | `@transform` / `@rtransform` | array syntax |\n| `group_by` | `@group_by` | `@groupby` | `groupby` |\n| `summarise` | `@summarise` | `@combine` | `combine` |\n\nIt is clear that for those coming from `R`, Tidier will look like the most natural approach.\n\nNotice that we have a name clash with `@select`: that is why we `import DataFramesMeta as DFM` at the beginning.\n\nWe will see each operation in more detail in the following chapters.\n\n## Chaining operations\n\nWe can chain (or pipe) dataframe operations as follows with the `@chain` macro:\n\n\n\n\n\n::: {#4 .cell execution_count=0}\n``` {.julia .cell-code}\n@chain penguins begin\n @filter !ismissing(sex)\n @group_by sex\n @summarise mean = mean(bill_length_mm)\n @arrange mean\nend\n```\n:::\n\n\n\n\n\n\n\n## Using variables as column names\n\nIn Tidier, using the column names as if they were variables in the environment leads to some complications when we want to use other variables that are not column names.\n\nFor example, suppose you want to arrange penguins by a column whose name is stored in a variable.\n\nWhen this happens, we add `@eval` before the Tidier code and add a `$` to force evaluation of the variable, as in the following example:\n\n\n\n\n\n::: {#6 .cell execution_count=0}\n``` {.julia .cell-code}\nmy_arrange_column = :body_mass_g;\n\n@eval @arrange penguins $my_arrange_column\n```\n:::\n\n\n\n\n\n\n\n\n## 
Documentation\n\nhttps://dataframes.juliadata.org/stable/man/working_with_dataframes/\n\nhttps://juliadata.org/DataFramesMeta.jl/stable\n\nhttps://tidierorg.github.io/TidierData.jl/latest/reference/\n\n",
"supporting": [
- "dataframes_files/figure-html"
+ "dataframes_files"
],
"filters": [],
"includes": {
diff --git a/dataframes-columns.qmd b/dataframes-columns.qmd
index d24633d..9111d99 100644
--- a/dataframes-columns.qmd
+++ b/dataframes-columns.qmd
@@ -18,6 +18,8 @@ penguins = PalmerPenguins.load() |> DataFrame;
### Selecting `n` columns
+**Problem:** Select only some columns.
+
::: {.panel-tabset}
## Tidier
@@ -42,6 +44,8 @@ DFM.select(penguins, [:species, :body_mass_g])
### Selecting columns from a variable
+**Problem:** Select only some columns whose names are stored in a variable.
+
::: {.panel-tabset}
```{julia}
@@ -51,7 +55,7 @@ my_columns = [:species, :body_mass_g];
## Tidier
```{julia}
-@select penguins !!my_columns
+@eval @select penguins $my_columns...
```
## DataFramesMeta
@@ -72,7 +76,7 @@ DFM.select(penguins, my_columns)
### Creating one column based on another one
-Create the column `body_mass_kg` by dividing `body_mass_g` by 1000.
+**Problem:** Create the column `body_mass_kg` by dividing `body_mass_g` by 1000.
::: {.panel-tabset}
diff --git a/dataframes-rows.qmd b/dataframes-rows.qmd
index 2040b44..6048bb3 100644
--- a/dataframes-rows.qmd
+++ b/dataframes-rows.qmd
@@ -5,6 +5,10 @@ engine: julia
# Operations on rows
+In this chapter we will see operations that act on rows: reordering them or throwing some of them away.
+
+The following is necessary to run all examples:
+
```{julia}
using DataFrames, PalmerPenguins
using Tidier
@@ -14,11 +18,11 @@ penguins = PalmerPenguins.load() |> DataFrame;
@slice_head(penguins, n = 10)
```
-## Filtering (or: throwing lines away)
+## Filtering (or: throwing rows away)
-To filter a dataframe means keeping only the rows that satisfy a certain criteria (ie. a boolean condition).
+To *filter* a dataframe means keeping only the rows that satisfy a certain criterion (i.e. a boolean condition).
-To filter a dataframe in Tidier, we use the macro `@filter`. You can use it in the form
+To filter in Tidier, we use the macro `@filter`. You can use it in the form
```{julia}
@filter(penguins, species == "Adelie")
@@ -40,7 +44,7 @@ DFM.@subset penguins :body_mass_g .>= mean(skipmissing(:body_mass_g))
Notice the broadcast on >=. We need it because *each variable is interpreted as a vector (the whole column)*. Also, notice that we refer to columns as _symbols_ (i.e. we append `:` to it).
-In the above example, we needed the whole column `body_mass_g` to take the mean and then filter the rows based on that. If, however, your filtering criteria only uses information about each row (without needing to see it in context of the whole column), then `@rsubset` (row subset) is easier to use: it interprets each columns as a value (not an array), so no broadcasting is needed:
+In the above example, we needed the whole column `body_mass_g` to take the mean and then filter the rows based on that. If, however, your filtering criterion only uses information about each row (without needing to see it in the context of the whole column), then `@rsubset` (**r**ow subset) is easier to use: it interprets each column as a value (not an array), so no broadcasting is needed:
```{julia}
DFM.@rsubset penguins :species == "Adelie"
@@ -57,11 +61,11 @@ subset(penguins, :column => boolean_function)
```
-where `boolean_function` is a boolean (with possibly `missing` values) function on 1 variable. Add the kwarg `skipmissing=true` if you want to get rid of missing values.
+where `boolean_function` is a function of one variable (the `:column` you passed) that returns booleans (possibly with `missing` values). Add the kwarg `skipmissing=true` to drop the rows where the result is `missing`; without it, `subset` throws an error when the result contains `missing`.
### Filtering with one criteria
-Filtering all the rows with `species` == "Adelie".
+**Problem:** Filtering all the rows with `species` == "Adelie".
::: {.panel-tabset}
@@ -87,7 +91,7 @@ subset(penguins, :species => x -> x .== "Adelie", skipmissing=true)
### Filtering with several criteria
-Filtering all the rows with `species` == "Adelie", `sex` == "male" and `body_mass_g` > 4000.
+**Problem:** Filtering all the rows with `species` == "Adelie", `sex` == "male" and `body_mass_g` > 4000.
::: {.panel-tabset}
@@ -116,8 +120,7 @@ subset(
:::
-
-Filtering all the rows with `species` == "Adelie" OR `sex` == "male".
+**Problem:** Filtering all the rows with `species` == "Adelie" OR `sex` == "male".
::: {.panel-tabset}
@@ -141,8 +144,11 @@ subset(penguins, [:species, :sex] => (x, y) -> (x .== "Adelie") .| (y .== "male"
:::
+### Filtering with metadata
-Filtering all the rows where the `flipper_length_mm` is greater than the mean.
+By metadata we mean here values computed from the dataframe itself, such as the mean/max/min of a column.
+
+**Problem:** Filtering all the rows where the `flipper_length_mm` is greater than the mean.
::: {.panel-tabset}
@@ -168,14 +174,22 @@ subset(penguins, :flipper_length_mm => x -> x .> mean(skipmissing(x)), skipmissi
### Filtering with a variable column name
-Suppose the column you want to filter is a variable, let's say
+Suppose the name of the column you want to filter on is stored in a variable, let's say as a symbol
```{julia}
my_column = :species;
```
+**Problem:** Filtering all the rows where the column stored in `my_column` is "Adelie".
+
::: {.panel-tabset}
+## Tidier
+
+```{julia}
+@eval @filter penguins $my_column == "Adelie"
+```
+
## DataFramesMeta
```{julia}
@@ -196,16 +210,17 @@ In case the column is a string
my_column_string = "species";
```
-instead of a symbol, we can write in the same way
+instead of a symbol, we can write it in the same way; in Tidier, just take care to convert it to a symbol first
::: {.panel-tabset}
## Tidier
```{julia}
-# @filter(penguins, !!my_column == "Adelie")
+@eval @filter penguins $(Symbol(my_column_string)) == "Adelie"
```
+
## DataFramesMeta
```{julia}
@@ -222,11 +237,11 @@ subset(penguins, my_column_string => x -> x .== "Adelie")
## Arranging
-Arranging is when we reorder the rows of a dataframe according to some columns. The rows are first arranged by the first column, then by the second (if any), and so on. In Tidier, when we want to invert the ordering, just put the column name inside a `desc()` call.
+To *arrange* a dataframe means to reorder the rows according to the values of some columns. The rows are first arranged by the first column, then by the second (if any), and so on. In Tidier, to invert the ordering, just put the column name inside a `desc()` call.
### Arranging by one column
-Arrange by `body_mass_g`.
+**Problem:** Arrange by `body_mass_g`.
::: {.panel-tabset}
@@ -252,7 +267,7 @@ sort(penguins, :body_mass_g)
### Arranging by two columns, with one reversed
-First arrange by `island`, then by reversed `body_mass_g`.
+**Problem:** First arrange by `island`, then by reversed `body_mass_g`.
::: {.panel-tabset}
@@ -280,7 +295,7 @@ sort(penguins, [order(:island), order(:body_mass_g, rev=true)])
### Arranging by one variable column
-Let's arrange the data by the following column:
+**Problem:** Arrange by a column stored in a variable `my_arrange_column`.
```{julia}
my_arrange_column = :body_mass_g;
@@ -291,8 +306,7 @@ my_arrange_column = :body_mass_g;
## Tidier
```{julia}
-#?? how to do it?
-# @arrange penguins !!my_arrange_column
+@eval @arrange penguins $my_arrange_column
```
## DataFramesMeta
diff --git a/dataframes-rows.quarto_ipynb b/dataframes-rows.quarto_ipynb
new file mode 100644
index 0000000..488b518
--- /dev/null
+++ b/dataframes-rows.quarto_ipynb
@@ -0,0 +1,772 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "jupyter: julia-1.10\n",
+ "# engine: julia\n",
+ "---\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "# Operations on rows\n"
+ ],
+ "id": "ea5be365"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "using DataFrames, PalmerPenguins\n",
+ "using Tidier\n",
+ "import DataFramesMeta as DFM\n",
+ "\n",
+ "penguins = PalmerPenguins.load() |> DataFrame;\n",
+ "@slice_head(penguins, n = 10)"
+ ],
+ "id": "4e698924",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Filtering (or: throwing lines away)\n",
+ "\n",
+ "To filter a dataframe means keeping only the rows that satisfy a certain criterion (i.e. a boolean condition).\n",
+ "\n",
+ "To filter a dataframe in Tidier, we use the macro `@filter`. You can use it in the form\n"
+ ],
+ "id": "8abe75b9"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "@filter(penguins, species == \"Adelie\")"
+ ],
+ "id": "861ae2cd",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "or without parentheses, as in\n"
+ ],
+ "id": "9c978b03"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "@filter penguins species == \"Adelie\""
+ ],
+ "id": "5fc51708",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Notice that the columns are typed as if they were variables in the Julia environment. This is inspired by the `tidyverse` behaviour of data-masking: inside a tidyverse verb, the columns are taken as \"statistical variables\" that exist inside the dataframe as columns.\n",
+ "\n",
+ "In DataFramesMeta, we have two macros for filtering: `@subset` and `@rsubset`. Use the first when your criterion needs a whole column, for example:\n"
+ ],
+ "id": "c749e8e7"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "DFM.@subset penguins :body_mass_g .>= mean(skipmissing(:body_mass_g))"
+ ],
+ "id": "5674e7ca",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Notice the broadcast on `>=`. We need it because *each variable is interpreted as a vector (the whole column)*. Also, notice that we refer to columns as _symbols_ (i.e. we prepend `:` to the name).\n",
+ "\n",
+ "In the above example, we needed the whole column `body_mass_g` to take the mean and then filter the rows based on that. If, however, your filtering criterion only uses information about each row (without needing to see it in the context of the whole column), then `@rsubset` (row subset) is easier to use: it interprets each column as a single value (not an array), so no broadcasting is needed:\n"
+ ],
+ "id": "650f2341"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "DFM.@rsubset penguins :species == \"Adelie\""
+ ],
+ "id": "165e3e30",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In both Tidier and DataFramesMeta, only the rows for which the criterion is `true` are returned. This means that rows evaluating to `false` or `missing` are thrown away.\n",
+ "\n",
+ "In pure DataFrames, we use the `subset` function, and the criteria is passed with the notation\n"
+ ],
+ "id": "c19e21d8"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "#| eval: false\n",
+ "\n",
+ "subset(penguins, :column => boolean_function)"
+ ],
+ "id": "e52816cb",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "where `boolean_function` is a function of one variable (the `:column` you passed) that returns booleans, possibly with `missing` values. Add the kwarg `skipmissing=true` if you want to get rid of missing values.\n",
+ "\n",
+ "### Filtering with one criteria\n",
+ "\n",
+ "Filtering all the rows with `species` == \"Adelie\".\n",
+ "\n",
+ "::: {.panel-tabset}\n",
+ "\n",
+ "## Tidier\n"
+ ],
+ "id": "ebcd6346"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "@filter penguins species == \"Adelie\""
+ ],
+ "id": "7fb1666d",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## DataFramesMeta\n"
+ ],
+ "id": "e8f686ea"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "DFM.@rsubset penguins :species == \"Adelie\""
+ ],
+ "id": "95c17061",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## DataFrames\n"
+ ],
+ "id": "fa2c5547"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "subset(penguins, :species => x -> x .== \"Adelie\", skipmissing=true)"
+ ],
+ "id": "6fb2812e",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ ":::\n",
+ "\n",
+ "### Filtering with several criteria\n",
+ "\n",
+ "Filtering all the rows with `species` == \"Adelie\", `sex` == \"male\" and `body_mass_g` > 4000.\n",
+ "\n",
+ "::: {.panel-tabset}\n",
+ "\n",
+ "## Tidier\n"
+ ],
+ "id": "09049eb9"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "@filter penguins species == \"Adelie\" sex == \"male\" body_mass_g > 4000"
+ ],
+ "id": "11d29a51",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## DataFramesMeta\n"
+ ],
+ "id": "0ce455c1"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "DFM.@rsubset penguins :species == \"Adelie\" :sex == \"male\" :body_mass_g > 4000"
+ ],
+ "id": "cb5749ba",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## DataFrames\n"
+ ],
+ "id": "df8f2354"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "subset(\n",
+ " penguins\n",
+ " , [:species, :sex, :body_mass_g] => \n",
+ " (x, y, z) -> (x .== \"Adelie\") .& (y .== \"male\") .& (z .> 4000)\n",
+ " ,skipmissing=true\n",
+ ")"
+ ],
+ "id": "7599d3f0",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ ":::\n",
+ "\n",
+ "\n",
+ "Filtering all the rows with `species` == \"Adelie\" OR `sex` == \"male\".\n",
+ "\n",
+ "::: {.panel-tabset}\n",
+ "\n",
+ "## Tidier\n"
+ ],
+ "id": "db002280"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "@filter penguins (species == \"Adelie\") | (sex == \"male\")"
+ ],
+ "id": "d28e9318",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## DataFramesMeta\n"
+ ],
+ "id": "b3d63fe1"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "DFM.@rsubset penguins (:species == \"Adelie\") | (:sex == \"male\")"
+ ],
+ "id": "9276b145",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## DataFrames\n"
+ ],
+ "id": "e7096279"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "subset(penguins, [:species, :sex] => (x, y) -> (x .== \"Adelie\") .| (y .== \"male\"), skipmissing=true)"
+ ],
+ "id": "a0668fa9",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ ":::\n",
+ "\n",
+ "\n",
+ "Filtering all the rows where the `flipper_length_mm` is greater than the mean.\n",
+ "\n",
+ "::: {.panel-tabset}\n",
+ "\n",
+ "## Tidier\n"
+ ],
+ "id": "2a22c3ed"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "@filter penguins flipper_length_mm > mean(skipmissing(flipper_length_mm))"
+ ],
+ "id": "a5ddbae0",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## DataFramesMeta\n"
+ ],
+ "id": "be93d74e"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "DFM.@subset penguins :flipper_length_mm .>= mean(skipmissing(:flipper_length_mm))"
+ ],
+ "id": "8d8d6b77",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## DataFrames\n"
+ ],
+ "id": "57b2b239"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "subset(penguins, :flipper_length_mm => x -> x .> mean(skipmissing(x)), skipmissing=true)"
+ ],
+ "id": "9ed74597",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ ":::\n",
+ "\n",
+ "### Filtering with a variable column name\n",
+ "\n",
+ "Suppose the column you want to filter on is stored in a variable, say as a symbol:\n"
+ ],
+ "id": "33256162"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "my_column = :species;"
+ ],
+ "id": "493c5c7a",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "::: {.panel-tabset}\n",
+ "\n",
+ "## Tidier\n"
+ ],
+ "id": "15661579"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "@eval @filter penguins $my_column == \"Adelie\""
+ ],
+ "id": "b3965259",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## DataFramesMeta\n"
+ ],
+ "id": "c07f6b56"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "DFM.@rsubset penguins $my_column == \"Adelie\""
+ ],
+ "id": "4624b99a",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## DataFrames\n"
+ ],
+ "id": "fcbfbc4b"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "subset(penguins, my_column => x -> x .== \"Adelie\")"
+ ],
+ "id": "7066efde",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ ":::\n",
+ "\n",
+ "In case the column is a string\n"
+ ],
+ "id": "c83c792f"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "my_column_string = \"species\";"
+ ],
+ "id": "756fd48f",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "instead of a symbol, we can write it the same way; in Tidier, just take care to convert the string to a symbol first\n",
+ "\n",
+ "::: {.panel-tabset}\n",
+ "\n",
+ "## Tidier\n"
+ ],
+ "id": "53362155"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "@eval @filter penguins $(Symbol(my_column_string)) == \"Adelie\""
+ ],
+ "id": "0df46c80",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## DataFramesMeta\n"
+ ],
+ "id": "38820cc4"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "DFM.@rsubset penguins $(my_column_string) == \"Adelie\""
+ ],
+ "id": "642a18a8",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## DataFrames\n"
+ ],
+ "id": "ed35fc5d"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "subset(penguins, my_column_string => x -> x .== \"Adelie\")"
+ ],
+ "id": "38af65d1",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ ":::\n",
+ "\n",
+ "## Arranging\n",
+ "\n",
+ "Arranging is when we reorder the rows of a dataframe according to some columns. The rows are first arranged by the first column, then by the second (if any), and so on. In Tidier, when we want to invert the ordering, just put the column name inside a `desc()` call.\n",
+ "\n",
+ "### Arranging by one column\n",
+ "\n",
+ "Arrange by `body_mass_g`.\n",
+ "\n",
+ "::: {.panel-tabset}\n",
+ "\n",
+ "## Tidier\n"
+ ],
+ "id": "791c4586"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "@arrange penguins body_mass_g"
+ ],
+ "id": "39a99cf6",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## DataFramesMeta\n"
+ ],
+ "id": "a5fe4174"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "DFM.@orderby penguins :body_mass_g"
+ ],
+ "id": "548845b3",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## DataFrames\n"
+ ],
+ "id": "f866a4c9"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "sort(penguins, :body_mass_g)"
+ ],
+ "id": "0153f423",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ ":::\n",
+ "\n",
+ "### Arranging by two columns, with one reversed\n",
+ "\n",
+ "First arrange by `island`, then by reversed `body_mass_g`.\n",
+ "\n",
+ "::: {.panel-tabset}\n",
+ "\n",
+ "## Tidier\n"
+ ],
+ "id": "0cc1eafa"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "@arrange penguins island desc(body_mass_g)"
+ ],
+ "id": "0316cc75",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## DataFramesMeta\n"
+ ],
+ "id": "c4d5ad8a"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "# negating the column reverses its order; this trick only works for numeric columns\n",
+ "\n",
+ "DFM.@orderby penguins :island :body_mass_g .* -1"
+ ],
+ "id": "337343f4",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## DataFrames\n"
+ ],
+ "id": "e77d71e7"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "sort(penguins, [order(:island), order(:body_mass_g, rev=true)])"
+ ],
+ "id": "3a7cc7c7",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ ":::\n",
+ "\n",
+ "### Arranging by one variable column\n",
+ "\n",
+ "Let's arrange the data by the following column:\n"
+ ],
+ "id": "0bcd783c"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "my_arrange_column = :body_mass_g;"
+ ],
+ "id": "a8236e40",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "::: {.panel-tabset}\n",
+ "\n",
+ "## Tidier\n"
+ ],
+ "id": "6c69980b"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "@eval @arrange penguins $my_arrange_column"
+ ],
+ "id": "900bfcb2",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## DataFramesMeta\n"
+ ],
+ "id": "a05a1200"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "DFM.@orderby penguins $my_arrange_column"
+ ],
+ "id": "874ac5cd",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## DataFrames\n"
+ ],
+ "id": "1889a601"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "sort(penguins, my_arrange_column)"
+ ],
+ "id": "5e0515a5",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ ":::"
+ ],
+ "id": "fac2f7e2"
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "name": "julia-1.10",
+ "language": "julia",
+ "display_name": "Julia 1.10.4",
+ "path": "/home/vituri/.local/share/jupyter/kernels/julia-1.10"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
\ No newline at end of file
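
Note on the filtering text in the notebook above: it says that rows whose criterion evaluates to `false` or `missing` are dropped. A minimal plain-Julia sketch of why `missing` needs special care, using a made-up `sex` vector rather than the real `penguins` data (the vector and the names `mask` and `keep` are illustrative assumptions, not part of the original files):

```julia
# Hypothetical stand-in for a column with missing entries.
sex = ["male", missing, "female"]

# Broadcasting == over the column propagates `missing` (three-valued logic).
mask = sex .== "male"

# Treat `missing` as `false`, so only rows that are definitely `true` are kept.
keep = coalesce.(mask, false)

# Boolean indexing with `keep` returns only the matching entries.
kept_values = sex[keep]
```

Here `coalesce.(mask, false)` mirrors, in spirit, what `skipmissing=true` asks `subset` to do: a `missing` result is treated as `false`. This is a sketch of the idea, not a description of how DataFrames implements it.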
diff --git a/dataframes.qmd b/dataframes.qmd
index 80eee14..0af9cfe 100644
--- a/dataframes.qmd
+++ b/dataframes.qmd
@@ -129,10 +129,26 @@ We can chain (or pipe) dataframe operations as follows with the `@chain` macro:
end
```
+## Using variables as column names
+
+In Tidier, column names are written as if they were variables in the environment. This complicates things when we want to use actual variables that are not column names.
+
+For example, suppose you want to arrange `penguins` by a column whose name is stored in a variable.
+
+In that case, we add `@eval` before the Tidier code and prefix the variable with `$` to interpolate its value, as in the following example:
+
+```{julia}
+#| eval: false
+my_arrange_column = :body_mass_g;
+
+@eval @arrange penguins $my_arrange_column
+```
+
+
## Documentation
https://dataframes.juliadata.org/stable/man/working_with_dataframes/
-https://juliadata.org/DataFramesMeta.jl/stable/#@orderby
+https://juliadata.org/DataFramesMeta.jl/stable
https://tidierorg.github.io/TidierData.jl/latest/reference/
\ No newline at end of file
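
Note on the `@eval` / `$` pattern added to `dataframes.qmd` above: it can be illustrated in plain Julia without Tidier. The sketch below only shows how `$` splices the value of a variable into an expression before it is evaluated; the global vector `body_mass_g` is a hypothetical stand-in, not a dataframe column.

```julia
# The variable holds a column name as a Symbol.
my_column = :species

# `$` splices the *value* of `my_column` into the quoted expression,
# so the result mentions `species` directly, not `my_column`.
ex = :($my_column == "Adelie")
same = ex == :(species == "Adelie")

# `@eval` builds the expression (with `$` spliced in) and then evaluates it
# at global scope. Here the spliced name refers to an ordinary global vector.
body_mass_g = [3750, 3800, 3250]   # hypothetical stand-in data
my_arrange_column = :body_mass_g
total = @eval sum($my_arrange_column)
```

Because `@eval` evaluates at global (module) scope, any names in the expression that are not interpolated with `$`, such as `penguins` in the Tidier example, need to be visible as globals; this is worth keeping in mind when using the pattern inside a function.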
diff --git a/docs/dataframes-columns.html b/docs/dataframes-columns.html
index a2d7e03..4a5e46e 100644
--- a/docs/dataframes-columns.html
+++ b/docs/dataframes-columns.html
@@ -2,7 +2,7 @@
-
+
@@ -71,10 +71,10 @@
-
+
-
+
-
+
-
+
-
+
-
+