\n```\n:::\n:::\n\n\n\n\n\n\n\nTo filter a dataframe in Tidier, we use the macro `@filter`. You can use it in the form\n\n\n\n\n\n::: {#4 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter(penguins, species == \"Adelie\")\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Row
species
island
bill_length_mm
bill_depth_mm
flipper_length_mm
body_mass_g
sex
String15
String15
Float64?
Float64?
Int64?
Int64?
String7
1
Adelie
Torgersen
39.1
18.7
181
3750
male
2
Adelie
Torgersen
39.5
17.4
186
3800
female
3
Adelie
Torgersen
40.3
18.0
195
3250
female
4
Adelie
Torgersen
missing
missing
missing
missing
missing
5
Adelie
Torgersen
36.7
19.3
193
3450
female
6
Adelie
Torgersen
39.3
20.6
190
3650
male
7
Adelie
Torgersen
38.9
17.8
181
3625
female
8
Adelie
Torgersen
39.2
19.6
195
4675
male
9
Adelie
Torgersen
34.1
18.1
193
3475
missing
10
Adelie
Torgersen
42.0
20.2
190
4250
missing
11
Adelie
Torgersen
37.8
17.1
186
3300
missing
12
Adelie
Torgersen
37.8
17.3
180
3700
missing
13
Adelie
Torgersen
41.1
17.6
182
3200
female
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
141
Adelie
Dream
40.2
17.1
193
3400
female
142
Adelie
Dream
40.6
17.2
187
3475
male
143
Adelie
Dream
32.1
15.5
188
3050
female
144
Adelie
Dream
40.7
17.0
190
3725
male
145
Adelie
Dream
37.3
16.8
192
3000
female
146
Adelie
Dream
39.0
18.7
185
3650
male
147
Adelie
Dream
39.2
18.6
190
4250
male
148
Adelie
Dream
36.6
18.4
184
3475
female
149
Adelie
Dream
36.0
17.8
195
3450
female
150
Adelie
Dream
37.8
18.1
193
3750
male
151
Adelie
Dream
36.0
17.1
187
3700
female
152
Adelie
Dream
41.5
18.5
201
4000
male
\n```\n:::\n:::\n\n\n\n\n\n\n\nor without parentesis as in \n\n\n\n\n\n::: {#6 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Row
species
island
bill_length_mm
bill_depth_mm
flipper_length_mm
body_mass_g
sex
String15
String15
Float64?
Float64?
Int64?
Int64?
String7
1
Adelie
Torgersen
39.1
18.7
181
3750
male
2
Adelie
Torgersen
39.5
17.4
186
3800
female
3
Adelie
Torgersen
40.3
18.0
195
3250
female
4
Adelie
Torgersen
missing
missing
missing
missing
missing
5
Adelie
Torgersen
36.7
19.3
193
3450
female
6
Adelie
Torgersen
39.3
20.6
190
3650
male
7
Adelie
Torgersen
38.9
17.8
181
3625
female
8
Adelie
Torgersen
39.2
19.6
195
4675
male
9
Adelie
Torgersen
34.1
18.1
193
3475
missing
10
Adelie
Torgersen
42.0
20.2
190
4250
missing
11
Adelie
Torgersen
37.8
17.1
186
3300
missing
12
Adelie
Torgersen
37.8
17.3
180
3700
missing
13
Adelie
Torgersen
41.1
17.6
182
3200
female
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
141
Adelie
Dream
40.2
17.1
193
3400
female
142
Adelie
Dream
40.6
17.2
187
3475
male
143
Adelie
Dream
32.1
15.5
188
3050
female
144
Adelie
Dream
40.7
17.0
190
3725
male
145
Adelie
Dream
37.3
16.8
192
3000
female
146
Adelie
Dream
39.0
18.7
185
3650
male
147
Adelie
Dream
39.2
18.6
190
4250
male
148
Adelie
Dream
36.6
18.4
184
3475
female
149
Adelie
Dream
36.0
17.8
195
3450
female
150
Adelie
Dream
37.8
18.1
193
3750
male
151
Adelie
Dream
36.0
17.1
187
3700
female
152
Adelie
Dream
41.5
18.5
201
4000
male
\n```\n:::\n:::\n\n\n\n\n\n\n\nNotice that the columns are typed as if they were variables on the Julia environment. This is inspired by the `tidyverse` behaviour of data-masking: inside a tidyverse verb, the columns are taken as \"statistical variables\" that exist inside the dataframe as columns.\n\nIn DataFramesMeta, we have two macros for filtering: `@subset` and `@rsubset`. Use the first when you have some criteria that uses the whole dataframe, for example:\n\n\n\n\n\n::: {#8 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@subset penguins :body_mass_g .>= mean(skipmissing(:body_mass_g))\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
149×7 DataFrame
124 rows omitted
Row
species
island
bill_length_mm
bill_depth_mm
flipper_length_mm
body_mass_g
sex
String15
String15
Float64?
Float64?
Int64?
Int64?
String7
1
Adelie
Torgersen
39.2
19.6
195
4675
male
2
Adelie
Torgersen
42.0
20.2
190
4250
missing
3
Adelie
Torgersen
34.6
21.1
198
4400
male
4
Adelie
Torgersen
42.5
20.7
197
4500
male
5
Adelie
Dream
39.8
19.1
184
4650
male
6
Adelie
Dream
44.1
19.7
196
4400
male
7
Adelie
Dream
39.6
18.8
190
4600
male
8
Adelie
Biscoe
40.1
18.9
188
4300
male
9
Adelie
Biscoe
41.3
21.1
195
4400
male
10
Adelie
Torgersen
41.8
19.4
198
4450
male
11
Adelie
Torgersen
42.8
18.5
195
4250
male
12
Adelie
Torgersen
42.9
17.6
196
4700
male
13
Adelie
Dream
41.1
18.1
205
4300
male
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
138
Gentoo
Biscoe
47.2
13.7
214
4925
female
139
Gentoo
Biscoe
46.8
14.3
215
4850
female
140
Gentoo
Biscoe
50.4
15.7
222
5750
male
141
Gentoo
Biscoe
45.2
14.8
212
5200
female
142
Gentoo
Biscoe
49.9
16.1
213
5400
male
143
Chinstrap
Dream
49.2
18.2
195
4400
male
144
Chinstrap
Dream
52.8
20.0
205
4550
male
145
Chinstrap
Dream
54.2
20.8
201
4300
male
146
Chinstrap
Dream
52.0
20.7
210
4800
male
147
Chinstrap
Dream
53.5
19.9
205
4500
male
148
Chinstrap
Dream
50.8
18.5
201
4450
male
149
Chinstrap
Dream
49.0
19.6
212
4300
male
\n```\n:::\n:::\n\n\n\n\n\n\n\nNotice the broadcast on >=. We need it because *each row is interpreted as an array*. Also, notice that we refer to columns as _symbols_ (i.e. we append `:` to it).\n\nIn the above example, we needed the whole column `body_mass_g` to take the mean and then filter the rows based on that. If, however, your filtering criteria only uses information about each row, then `@rsubset` (row subset) is easier to use: it interprets each columns as a value (not an array), so no broadcasting is needed:\n\n\n\n\n\n::: {#10 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@rsubset penguins :species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Row
species
island
bill_length_mm
bill_depth_mm
flipper_length_mm
body_mass_g
sex
String15
String15
Float64?
Float64?
Int64?
Int64?
String7
1
Adelie
Torgersen
39.1
18.7
181
3750
male
2
Adelie
Torgersen
39.5
17.4
186
3800
female
3
Adelie
Torgersen
40.3
18.0
195
3250
female
4
Adelie
Torgersen
missing
missing
missing
missing
missing
5
Adelie
Torgersen
36.7
19.3
193
3450
female
6
Adelie
Torgersen
39.3
20.6
190
3650
male
7
Adelie
Torgersen
38.9
17.8
181
3625
female
8
Adelie
Torgersen
39.2
19.6
195
4675
male
9
Adelie
Torgersen
34.1
18.1
193
3475
missing
10
Adelie
Torgersen
42.0
20.2
190
4250
missing
11
Adelie
Torgersen
37.8
17.1
186
3300
missing
12
Adelie
Torgersen
37.8
17.3
180
3700
missing
13
Adelie
Torgersen
41.1
17.6
182
3200
female
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
141
Adelie
Dream
40.2
17.1
193
3400
female
142
Adelie
Dream
40.6
17.2
187
3475
male
143
Adelie
Dream
32.1
15.5
188
3050
female
144
Adelie
Dream
40.7
17.0
190
3725
male
145
Adelie
Dream
37.3
16.8
192
3000
female
146
Adelie
Dream
39.0
18.7
185
3650
male
147
Adelie
Dream
39.2
18.6
190
4250
male
148
Adelie
Dream
36.6
18.4
184
3475
female
149
Adelie
Dream
36.0
17.8
195
3450
female
150
Adelie
Dream
37.8
18.1
193
3750
male
151
Adelie
Dream
36.0
17.1
187
3700
female
152
Adelie
Dream
41.5
18.5
201
4000
male
\n```\n:::\n:::\n\n\n\n\n\n\n\nIn both Tidier and DataFramesMeta, only the rows to which the criteria is `true` are returned. This means that you don't need to worry about `missing` values in cases where the criteria do not return `false` nor `true.\n\n## Filtering with one criteria\n\nFiltering all the rows with `species` = \"Adelie\".\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#12 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n## Filtering with several criteria\n\nFiltering all the rows with `species` = \"Adelie\", `sex` = \"male\" and `body_mass_g` > 4000.\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#18 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins species == \"Adelie\" sex == \"male\" body_mass_g > 4000\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n\nFiltering all the rows where the `flipper_length_mm` is greater than the mean.\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#24 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins flipper_length_mm > mean(skipmissing(flipper_length_mm))\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n## Filtering with a variable column name\n\nSuppose the column you want to filter is a variable, let's say\n\n\n\n\n\n::: {#30 .cell execution_count=1}\n``` {.julia .cell-code}\n# filter_column = \"species\"\ncolumn_symbol = :species\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```\n:species\n```\n:::\n:::\n\n\n\n\n\n\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#32 .cell execution_count=1}\n``` {.julia .cell-code}\n# @chain penguins begin\n# @filter(!!filter_column == \"Adelie\")\n# # @select(!!filter_column)\n# end\n# @filter(penguins, !!filter_column == \"Adelie\")\n```\n:::\n\n\n\n\n\n\n\n## DataFramesMeta\n\n\n\n\n\n::: {#34 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@rsubset penguins $column_symbol == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Row
species
island
bill_length_mm
bill_depth_mm
flipper_length_mm
body_mass_g
sex
String15
String15
Float64?
Float64?
Int64?
Int64?
String7
1
Adelie
Torgersen
39.1
18.7
181
3750
male
2
Adelie
Torgersen
39.5
17.4
186
3800
female
3
Adelie
Torgersen
40.3
18.0
195
3250
female
4
Adelie
Torgersen
missing
missing
missing
missing
missing
5
Adelie
Torgersen
36.7
19.3
193
3450
female
6
Adelie
Torgersen
39.3
20.6
190
3650
male
7
Adelie
Torgersen
38.9
17.8
181
3625
female
8
Adelie
Torgersen
39.2
19.6
195
4675
male
9
Adelie
Torgersen
34.1
18.1
193
3475
missing
10
Adelie
Torgersen
42.0
20.2
190
4250
missing
11
Adelie
Torgersen
37.8
17.1
186
3300
missing
12
Adelie
Torgersen
37.8
17.3
180
3700
missing
13
Adelie
Torgersen
41.1
17.6
182
3200
female
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
141
Adelie
Dream
40.2
17.1
193
3400
female
142
Adelie
Dream
40.6
17.2
187
3475
male
143
Adelie
Dream
32.1
15.5
188
3050
female
144
Adelie
Dream
40.7
17.0
190
3725
male
145
Adelie
Dream
37.3
16.8
192
3000
female
146
Adelie
Dream
39.0
18.7
185
3650
male
147
Adelie
Dream
39.2
18.6
190
4250
male
148
Adelie
Dream
36.6
18.4
184
3475
female
149
Adelie
Dream
36.0
17.8
195
3450
female
150
Adelie
Dream
37.8
18.1
193
3750
male
151
Adelie
Dream
36.0
17.1
187
3700
female
152
Adelie
Dream
41.5
18.5
201
4000
male
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\nIn case the column is a string instead of a symbol, we can write\n\n\n\n\n\n::: {#36 .cell execution_count=1}\n``` {.julia .cell-code}\ncolumn_string = \"species\"\n\nDFM.@rsubset penguins $(Symbol(column_string)) == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Row
species
island
bill_length_mm
bill_depth_mm
flipper_length_mm
body_mass_g
sex
String15
String15
Float64?
Float64?
Int64?
Int64?
String7
1
Adelie
Torgersen
39.1
18.7
181
3750
male
2
Adelie
Torgersen
39.5
17.4
186
3800
female
3
Adelie
Torgersen
40.3
18.0
195
3250
female
4
Adelie
Torgersen
missing
missing
missing
missing
missing
5
Adelie
Torgersen
36.7
19.3
193
3450
female
6
Adelie
Torgersen
39.3
20.6
190
3650
male
7
Adelie
Torgersen
38.9
17.8
181
3625
female
8
Adelie
Torgersen
39.2
19.6
195
4675
male
9
Adelie
Torgersen
34.1
18.1
193
3475
missing
10
Adelie
Torgersen
42.0
20.2
190
4250
missing
11
Adelie
Torgersen
37.8
17.1
186
3300
missing
12
Adelie
Torgersen
37.8
17.3
180
3700
missing
13
Adelie
Torgersen
41.1
17.6
182
3200
female
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
141
Adelie
Dream
40.2
17.1
193
3400
female
142
Adelie
Dream
40.6
17.2
187
3475
male
143
Adelie
Dream
32.1
15.5
188
3050
female
144
Adelie
Dream
40.7
17.0
190
3725
male
145
Adelie
Dream
37.3
16.8
192
3000
female
146
Adelie
Dream
39.0
18.7
185
3650
male
147
Adelie
Dream
39.2
18.6
190
4250
male
148
Adelie
Dream
36.6
18.4
184
3475
female
149
Adelie
Dream
36.0
17.8
195
3450
female
150
Adelie
Dream
37.8
18.1
193
3750
male
151
Adelie
Dream
36.0
17.1
187
3700
female
152
Adelie
Dream
41.5
18.5
201
4000
male
\n```\n:::\n:::\n\n\n",
+ "supporting": [
+ "dataframes-filtering_files"
+ ],
+ "filters": [],
+ "includes": {
+ "include-in-header": [
+ "\n\n\n"
+ ]
+ }
+ }
+}
\ No newline at end of file
diff --git a/_freeze/dataframes/execute-results/html.json b/_freeze/dataframes/execute-results/html.json
index 5338562..ea227e6 100644
--- a/_freeze/dataframes/execute-results/html.json
+++ b/_freeze/dataframes/execute-results/html.json
@@ -1,8 +1,8 @@
{
- "hash": "07497448379d045b81637577fc049da2",
+ "hash": "103b5252701e836620eb447a28e1e311",
"result": {
"engine": "julia",
- "markdown": "---\n# jupyter: julia-1.10\nengine: julia\n---\n\n\n\n\n\n\n# Dataframes\n\nDataframes are one of the most important objects in data science. A dataframe is a table where each row is an observation and each column is a variable.\n\nWe will use the Palmer Penguin dataset as a toy example for the remaining of the chapter.\n\n\n\n\n\n\n::: {#2 .cell execution_count=1}\n``` {.julia .cell-code}\nusing DataFrames, PalmerPenguins\nusing Tidier\nimport DataFramesMeta as DFM\n\npenguins = PalmerPenguins.load() |> DataFrame\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
344×7 DataFrame
319 rows omitted
Row
species
island
bill_length_mm
bill_depth_mm
flipper_length_mm
body_mass_g
sex
String15
String15
Float64?
Float64?
Int64?
Int64?
String7
1
Adelie
Torgersen
39.1
18.7
181
3750
male
2
Adelie
Torgersen
39.5
17.4
186
3800
female
3
Adelie
Torgersen
40.3
18.0
195
3250
female
4
Adelie
Torgersen
missing
missing
missing
missing
missing
5
Adelie
Torgersen
36.7
19.3
193
3450
female
6
Adelie
Torgersen
39.3
20.6
190
3650
male
7
Adelie
Torgersen
38.9
17.8
181
3625
female
8
Adelie
Torgersen
39.2
19.6
195
4675
male
9
Adelie
Torgersen
34.1
18.1
193
3475
missing
10
Adelie
Torgersen
42.0
20.2
190
4250
missing
11
Adelie
Torgersen
37.8
17.1
186
3300
missing
12
Adelie
Torgersen
37.8
17.3
180
3700
missing
13
Adelie
Torgersen
41.1
17.6
182
3200
female
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
333
Chinstrap
Dream
45.2
16.6
191
3250
female
334
Chinstrap
Dream
49.3
19.9
203
4050
male
335
Chinstrap
Dream
50.2
18.8
202
3800
male
336
Chinstrap
Dream
45.6
19.4
194
3525
female
337
Chinstrap
Dream
51.9
19.5
206
3950
male
338
Chinstrap
Dream
46.8
16.5
189
3650
female
339
Chinstrap
Dream
45.7
17.0
195
3650
female
340
Chinstrap
Dream
55.8
19.8
207
4000
male
341
Chinstrap
Dream
43.5
18.1
202
3400
female
342
Chinstrap
Dream
49.6
18.2
193
3775
male
343
Chinstrap
Dream
50.8
19.0
210
4100
male
344
Chinstrap
Dream
50.2
18.7
198
3775
female
\n```\n:::\n:::\n\n\n\n\n\n\n\n\n::: {.callout-note}\n\n`Dataframes.jl` is the main package for dealing with dataframes in Julia. You can use it directly to manipulate tables, but we also have 2 alternatives: DataFramesMeta and Tidier. \n\nDataFramesMeta is a collection of macros \n\nTidier is inspired by the `tidyverse` ecosystem in R. They use macros to rewrite your code into DataFrames.jl code.\n\nIn this book, whenever reasonable, we will show the different approaches in a tabset so you can compare them!\n:::\n\n## Operations\n\nIn this chapter, we will see some unary operations on dataframes. These functions take just 1 dataframe. Joins are binary operations and will be seen later.\n\n- *Selecting* is when we select some columns of a dataframe, while keeping all the rows. Example: select the `species` and `sex` columns.\n\n- *Filtering* or *subsetting* is when we select a subset of rows based on some criteria. Example: all male penguins of species Adelie. The output is a dataframe with the exact same columns, but possibly fewer rows.\n\n- *Mutating* is when we create new columns. Example: The body mass in kg is obtained dividing the column `body_mass_g` by 1000.\n\n- *Grouping* is when we split the dataframe into a collection (array) of dataframes using some criteria. Example: grouping by `species` gives us 3 dataframes, each with only one species.\n\n- *Summarising* is when we apply some function to some columns in order to reduce the amount of rows with some kind of summary (like a mean, median, max, and so on). Example: for each `species`, apply the `mean` function to the columns `body_mass_g`. This will yield a dataframe with 3 rows, one for each species. 
Summarising is usually done after a grouping, so the summary is calculated with relation to each of the groups.\n\n- *Arranging* or *ordering* is when we reorder the rows of a dataframe using some criteria.\n\nSince all these functions return a dataframe (or an array of dataframes, in the case of grouping), we can chain these operations together, with the convention that on grouped dataframes we apply the function in each one of the groups.\n\nLet's see each operation with more details.\n\n## Comparing Tidier with DataFramesMeta\n\nThe following table list the operations on each package:\n\n| dplyr | Tidier | DataFramesMeta | DataFrames |\n|-------------|--------------|------------------------------|--------------|\n| `select` | `@select` | `@select` | array sintax |\n| `filter` | `@filter` | `@subset` / `@rsubset` | `filter` |\n| `mutate` | `@mutate` | `@transform` / `@rtransform` | array sintax |\n| `group_by` | `@group_by` | `@groupby` | `groupby` |\n| `summarise` | `@summarise` | `@combine` | `combine` |\n| `arrange` | `@arrange` | `@orderby` / `@rorderby` | `sort!` |\n\n\nNotice that we have a name clash with `@select`: that is why we `import DataFramesMeta as DFM` at the beginning.\n\n## Filtering / subsetting\n\nTo filter a dataframe in Tidier, we use the macro `@filter`. You can use it in the form\n\n\n\n\n\n\n::: {#4 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter(penguins, species == \"Adelie\")\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Row
species
island
bill_length_mm
bill_depth_mm
flipper_length_mm
body_mass_g
sex
String15
String15
Float64?
Float64?
Int64?
Int64?
String7
1
Adelie
Torgersen
39.1
18.7
181
3750
male
2
Adelie
Torgersen
39.5
17.4
186
3800
female
3
Adelie
Torgersen
40.3
18.0
195
3250
female
4
Adelie
Torgersen
missing
missing
missing
missing
missing
5
Adelie
Torgersen
36.7
19.3
193
3450
female
6
Adelie
Torgersen
39.3
20.6
190
3650
male
7
Adelie
Torgersen
38.9
17.8
181
3625
female
8
Adelie
Torgersen
39.2
19.6
195
4675
male
9
Adelie
Torgersen
34.1
18.1
193
3475
missing
10
Adelie
Torgersen
42.0
20.2
190
4250
missing
11
Adelie
Torgersen
37.8
17.1
186
3300
missing
12
Adelie
Torgersen
37.8
17.3
180
3700
missing
13
Adelie
Torgersen
41.1
17.6
182
3200
female
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
141
Adelie
Dream
40.2
17.1
193
3400
female
142
Adelie
Dream
40.6
17.2
187
3475
male
143
Adelie
Dream
32.1
15.5
188
3050
female
144
Adelie
Dream
40.7
17.0
190
3725
male
145
Adelie
Dream
37.3
16.8
192
3000
female
146
Adelie
Dream
39.0
18.7
185
3650
male
147
Adelie
Dream
39.2
18.6
190
4250
male
148
Adelie
Dream
36.6
18.4
184
3475
female
149
Adelie
Dream
36.0
17.8
195
3450
female
150
Adelie
Dream
37.8
18.1
193
3750
male
151
Adelie
Dream
36.0
17.1
187
3700
female
152
Adelie
Dream
41.5
18.5
201
4000
male
\n```\n:::\n:::\n\n\n\n\n\n\n\n\nor without parentesis as in \n\n\n\n\n\n\n::: {#6 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Row
species
island
bill_length_mm
bill_depth_mm
flipper_length_mm
body_mass_g
sex
String15
String15
Float64?
Float64?
Int64?
Int64?
String7
1
Adelie
Torgersen
39.1
18.7
181
3750
male
2
Adelie
Torgersen
39.5
17.4
186
3800
female
3
Adelie
Torgersen
40.3
18.0
195
3250
female
4
Adelie
Torgersen
missing
missing
missing
missing
missing
5
Adelie
Torgersen
36.7
19.3
193
3450
female
6
Adelie
Torgersen
39.3
20.6
190
3650
male
7
Adelie
Torgersen
38.9
17.8
181
3625
female
8
Adelie
Torgersen
39.2
19.6
195
4675
male
9
Adelie
Torgersen
34.1
18.1
193
3475
missing
10
Adelie
Torgersen
42.0
20.2
190
4250
missing
11
Adelie
Torgersen
37.8
17.1
186
3300
missing
12
Adelie
Torgersen
37.8
17.3
180
3700
missing
13
Adelie
Torgersen
41.1
17.6
182
3200
female
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
141
Adelie
Dream
40.2
17.1
193
3400
female
142
Adelie
Dream
40.6
17.2
187
3475
male
143
Adelie
Dream
32.1
15.5
188
3050
female
144
Adelie
Dream
40.7
17.0
190
3725
male
145
Adelie
Dream
37.3
16.8
192
3000
female
146
Adelie
Dream
39.0
18.7
185
3650
male
147
Adelie
Dream
39.2
18.6
190
4250
male
148
Adelie
Dream
36.6
18.4
184
3475
female
149
Adelie
Dream
36.0
17.8
195
3450
female
150
Adelie
Dream
37.8
18.1
193
3750
male
151
Adelie
Dream
36.0
17.1
187
3700
female
152
Adelie
Dream
41.5
18.5
201
4000
male
\n```\n:::\n:::\n\n\n\n\n\n\n\n\nNotice that the columns are typed as if they were variables on the Julia environment. This is inspired by the `tidyverse` behaviour of data-masking: inside a tidyverse verb, the columns are taken as \"statistical variables\" that exist inside the dataframe.\n\nIn DataFramesMeta, we have two macros for filtering: `@subset` and `@rsubset`. Use the first when you have some criteria that uses the whole dataframe, for example:\n\n\n\n\n\n\n::: {#8 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@subset penguins :body_mass_g .>= mean(skipmissing(:body_mass_g))\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
149×7 DataFrame
124 rows omitted
Row
species
island
bill_length_mm
bill_depth_mm
flipper_length_mm
body_mass_g
sex
String15
String15
Float64?
Float64?
Int64?
Int64?
String7
1
Adelie
Torgersen
39.2
19.6
195
4675
male
2
Adelie
Torgersen
42.0
20.2
190
4250
missing
3
Adelie
Torgersen
34.6
21.1
198
4400
male
4
Adelie
Torgersen
42.5
20.7
197
4500
male
5
Adelie
Dream
39.8
19.1
184
4650
male
6
Adelie
Dream
44.1
19.7
196
4400
male
7
Adelie
Dream
39.6
18.8
190
4600
male
8
Adelie
Biscoe
40.1
18.9
188
4300
male
9
Adelie
Biscoe
41.3
21.1
195
4400
male
10
Adelie
Torgersen
41.8
19.4
198
4450
male
11
Adelie
Torgersen
42.8
18.5
195
4250
male
12
Adelie
Torgersen
42.9
17.6
196
4700
male
13
Adelie
Dream
41.1
18.1
205
4300
male
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
138
Gentoo
Biscoe
47.2
13.7
214
4925
female
139
Gentoo
Biscoe
46.8
14.3
215
4850
female
140
Gentoo
Biscoe
50.4
15.7
222
5750
male
141
Gentoo
Biscoe
45.2
14.8
212
5200
female
142
Gentoo
Biscoe
49.9
16.1
213
5400
male
143
Chinstrap
Dream
49.2
18.2
195
4400
male
144
Chinstrap
Dream
52.8
20.0
205
4550
male
145
Chinstrap
Dream
54.2
20.8
201
4300
male
146
Chinstrap
Dream
52.0
20.7
210
4800
male
147
Chinstrap
Dream
53.5
19.9
205
4500
male
148
Chinstrap
Dream
50.8
18.5
201
4450
male
149
Chinstrap
Dream
49.0
19.6
212
4300
male
\n```\n:::\n:::\n\n\n\n\n\n\n\n\nNotice the broadcast on >=. We need it because each *row is interpreted as an array*. Also, notice that we call columns as _symbols_ (i.e. we append `:` to it).\n\nIn this case, we need the whole column `body_mass_g` to take the mean and then filter the rows based on that. If, however, your filtering criteria only uses information about each row, then `@rsubset` (row subset) is easier to use: it interprets each columns as a value (not an array), so no broadcasting is needed:\n\n\n\n\n\n\n::: {#10 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@rsubset penguins :species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Row
species
island
bill_length_mm
bill_depth_mm
flipper_length_mm
body_mass_g
sex
String15
String15
Float64?
Float64?
Int64?
Int64?
String7
1
Adelie
Torgersen
39.1
18.7
181
3750
male
2
Adelie
Torgersen
39.5
17.4
186
3800
female
3
Adelie
Torgersen
40.3
18.0
195
3250
female
4
Adelie
Torgersen
missing
missing
missing
missing
missing
5
Adelie
Torgersen
36.7
19.3
193
3450
female
6
Adelie
Torgersen
39.3
20.6
190
3650
male
7
Adelie
Torgersen
38.9
17.8
181
3625
female
8
Adelie
Torgersen
39.2
19.6
195
4675
male
9
Adelie
Torgersen
34.1
18.1
193
3475
missing
10
Adelie
Torgersen
42.0
20.2
190
4250
missing
11
Adelie
Torgersen
37.8
17.1
186
3300
missing
12
Adelie
Torgersen
37.8
17.3
180
3700
missing
13
Adelie
Torgersen
41.1
17.6
182
3200
female
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
141
Adelie
Dream
40.2
17.1
193
3400
female
142
Adelie
Dream
40.6
17.2
187
3475
male
143
Adelie
Dream
32.1
15.5
188
3050
female
144
Adelie
Dream
40.7
17.0
190
3725
male
145
Adelie
Dream
37.3
16.8
192
3000
female
146
Adelie
Dream
39.0
18.7
185
3650
male
147
Adelie
Dream
39.2
18.6
190
4250
male
148
Adelie
Dream
36.6
18.4
184
3475
female
149
Adelie
Dream
36.0
17.8
195
3450
female
150
Adelie
Dream
37.8
18.1
193
3750
male
151
Adelie
Dream
36.0
17.1
187
3700
female
152
Adelie
Dream
41.5
18.5
201
4000
male
\n```\n:::\n:::\n\n\n\n\n\n\n\n\nIn both Tidier and DataFramesMeta, only the rows to which the criteria is `true` are returned. This means that you don't need to worry about `missing` values in cases where the criteria do not return `false` nor `true.\n\n### Filtering with one criteria\n\nFiltering all the rows with `species` = \"Adelie\".\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n\n::: {#12 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
\n```\n:::\n:::\n\n\n\n\n\n\n\n\n:::\n\n### Filtering with several criteria\n\nFiltering all the rows with `species` = \"Adelie\", `sex` = \"male\" and `body_mass_g` > 4000.\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n\n::: {#18 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins species == \"Adelie\" sex == \"male\" body_mass_g > 4000\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
\n```\n:::\n:::\n\n\n\n\n\n\n\n\n:::\n\n\n## Creating columns\n\n::: {.panel-tabset}\n\n## Tidier\n\n## DataFramesMeta\n\n## DataFrames\n\n:::\n\n",
+ "markdown": "---\n# jupyter: julia-1.10\nengine: julia\n---\n\n\n\n\n\n# Part 2: Dataframes\n\nDataframes are one of the most important objects in data science. A dataframe is a table where each row is an observation and each column is a variable.\n\nWe will use the Palmer Penguin dataset as a toy example for the remaining of the chapter.\n\n\n\n\n\n::: {#2 .cell execution_count=1}\n``` {.julia .cell-code}\nusing DataFrames, PalmerPenguins\nusing Tidier\nimport DataFramesMeta as DFM\n\npenguins = PalmerPenguins.load() |> DataFrame\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
344×7 DataFrame
319 rows omitted
Row
species
island
bill_length_mm
bill_depth_mm
flipper_length_mm
body_mass_g
sex
String15
String15
Float64?
Float64?
Int64?
Int64?
String7
1
Adelie
Torgersen
39.1
18.7
181
3750
male
2
Adelie
Torgersen
39.5
17.4
186
3800
female
3
Adelie
Torgersen
40.3
18.0
195
3250
female
4
Adelie
Torgersen
missing
missing
missing
missing
missing
5
Adelie
Torgersen
36.7
19.3
193
3450
female
6
Adelie
Torgersen
39.3
20.6
190
3650
male
7
Adelie
Torgersen
38.9
17.8
181
3625
female
8
Adelie
Torgersen
39.2
19.6
195
4675
male
9
Adelie
Torgersen
34.1
18.1
193
3475
missing
10
Adelie
Torgersen
42.0
20.2
190
4250
missing
11
Adelie
Torgersen
37.8
17.1
186
3300
missing
12
Adelie
Torgersen
37.8
17.3
180
3700
missing
13
Adelie
Torgersen
41.1
17.6
182
3200
female
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
333
Chinstrap
Dream
45.2
16.6
191
3250
female
334
Chinstrap
Dream
49.3
19.9
203
4050
male
335
Chinstrap
Dream
50.2
18.8
202
3800
male
336
Chinstrap
Dream
45.6
19.4
194
3525
female
337
Chinstrap
Dream
51.9
19.5
206
3950
male
338
Chinstrap
Dream
46.8
16.5
189
3650
female
339
Chinstrap
Dream
45.7
17.0
195
3650
female
340
Chinstrap
Dream
55.8
19.8
207
4000
male
341
Chinstrap
Dream
43.5
18.1
202
3400
female
342
Chinstrap
Dream
49.6
18.2
193
3775
male
343
Chinstrap
Dream
50.8
19.0
210
4100
male
344
Chinstrap
Dream
50.2
18.7
198
3775
female
\n```\n:::\n:::\n\n\n\n\n\n\n\n::: {.callout-note}\n\n`Dataframes.jl` is the main package for dealing with dataframes in Julia. You can use it directly to manipulate tables, but we also have 2 alternatives: DataFramesMeta and Tidier. \n\nDataFramesMeta is a collection of macros based on DataFrames.\n\nTidier is inspired by the `tidyverse` ecosystem in R. Tidier use macros to rewrite your code into DataFrames.jl code. Because of this \"tidy\" heritance, we will often talk about the R packages that inspired the Julia ones (like `dplyr`, `tidyr` and many others).\n\nIn this book, whenever possible, we will show the different approaches in a tabset so you can compare them.\n:::\n\n## Operations\n\nLet's start with some operations that take only one dataframe as input.^[Join operations will be dealt later.]. Here is the basic terminology:\n\n- *Selecting* is when we select some columns of a dataframe, while keeping all the rows. Example: select the `species` and `sex` columns.\n\n- *Filtering* or *subsetting* is when we select a subset of rows based on some criteria. Example: all male penguins of species Adelie. The output is a dataframe with the exact same columns, but possibly fewer rows.\n\n- *Mutating* or *transforming* is when we create new columns. Example: a new column `body_mass_kg` can be obtained dividing the column `body_mass_g` by 1000.\n\n- *Grouping* is when we split the dataframe into a collection (array) of dataframes using some criteria. Example: grouping by `species` gives us 3 dataframes, each with only one species.\n\n- *Summarising* or *combining* is when we apply some function to some columns in order to reduce the amount of rows with some kind of summary (like a mean, median, max, and so on). Example: for each `species`, apply the `mean` function to the columns `body_mass_g`. This will yield a dataframe with 3 rows, one for each species. 
Summarising is usually done after a grouping, so the summary is calculated with relation to each of the groups.\n\n- *Arranging* or *ordering* is when we reorder the rows of a dataframe using some criteria.\n\nSince all these functions return a dataframe (or an array of dataframes, in the case of grouping), we can chain these operations together, with the convention that on grouped dataframes we apply the function in each one of the groups.\n\nLet's see each operation with more details.\n\n## Comparing Tidier with DataFramesMeta\n\nThe following table list the operations on each package:\n\n| dplyr | Tidier | DataFramesMeta | DataFrames |\n|-------------|--------------|------------------------------|--------------|\n| `select` | `@select` | `@select` | array sintax |\n| `filter` | `@filter` | `@subset` / `@rsubset` | `filter` |\n| `mutate` | `@mutate` | `@transform` / `@rtransform` | array sintax |\n| `group_by` | `@group_by` | `@groupby` | `groupby` |\n| `summarise` | `@summarise` | `@combine` | `combine` |\n| `arrange` | `@arrange` | `@orderby` / `@rorderby` | `sort!` |\n\n\nNotice that we have a name clash with `@select`: that is why we `import DataFramesMeta as DFM` at the beginning.\n\n",
"supporting": [
"dataframes_files"
],
diff --git a/_quarto.yml b/_quarto.yml
index 2412282..b14d4f8 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -16,14 +16,26 @@ book:
title: "Tidier Data Science with Julia"
author: "Guilherme Vituri and Christoph Scheuch"
date: "15/08/2024"
+ repo-url: https://github.com/vituri/TidierBook2
+
+ page-navigation: true
reader-mode: true
+ page-footer:
+ left: |
+ This book is part of the Tidier organization, bringing joy to
+ data science in Julia.
+ right: |
+ This book was built with Quarto.
chapters:
- index.qmd
- part: "Part 1: Julia basics"
+ # chapters:
+ # - dataframes.qmd
+ - part: dataframes.qmd
chapters:
- - dataframes.qmd
- - part: "Part 2: Manipulating data"
+ - dataframes-filtering.qmd
+ # - part: "Part 2: Dataframes"
- part: "Part 3: Reading data"
- part: "Part 4: Plotting data"
- part: "Part 5: Applications"
diff --git a/dataframes-filtering.qmd b/dataframes-filtering.qmd
new file mode 100644
index 0000000..a0c616f
--- /dev/null
+++ b/dataframes-filtering.qmd
@@ -0,0 +1,159 @@
+---
+# jupyter: julia-1.10
+engine: julia
+---
+
+# Filtering
+
+```{julia}
+using DataFrames, PalmerPenguins
+using Tidier
+import DataFramesMeta as DFM
+
+penguins = PalmerPenguins.load() |> DataFrame;
+@slice_head(penguins, n = 15)
+```
+
+To filter a dataframe in Tidier, we use the macro `@filter`. You can use it in the form
+
+```{julia}
+@filter(penguins, species == "Adelie")
+```
+
+or without parentheses, as in
+
+```{julia}
+@filter penguins species == "Adelie"
+```
+
+Notice that the columns are typed as if they were variables in the Julia environment. This is inspired by the `tidyverse` behaviour of data-masking: inside a tidyverse verb, column names are treated as "statistical variables" that live inside the dataframe.
+
+In DataFramesMeta, we have two macros for filtering: `@subset` and `@rsubset`. Use the first when your criterion needs a whole column at once, for example:
+
+```{julia}
+DFM.@subset penguins :body_mass_g .>= mean(skipmissing(:body_mass_g))
+```
+
+Notice the broadcast on `>=`. We need it because *each column is interpreted as an array*. Also, notice that we refer to columns as _symbols_ (i.e. we prepend `:` to the column name).
+
+In the above example, we needed the whole column `body_mass_g` to take the mean and then filter the rows based on that. If, however, your filtering criterion only uses information about each row, then `@rsubset` (row subset) is easier to use: it interprets each column as a single value (not an array), so no broadcasting is needed:
+
+```{julia}
+DFM.@rsubset penguins :species == "Adelie"
+```
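+
+As a hedged illustration of the difference, the sketch below writes the same row-wise filter in both styles. The dataframe `toy` is made up for this example and is not part of the penguins data.
+
+```{julia}
+using DataFrames
+import DataFramesMeta as DFM
+
+# Hypothetical toy data, for illustration only.
+toy = DataFrame(species = ["Adelie", "Gentoo", "Adelie"], body_mass_g = [3750, 5000, 4675])
+
+# Column-wise: `:species` is the whole vector, so `==` must be broadcast.
+by_column = DFM.@subset toy :species .== "Adelie"
+
+# Row-wise: `:species` is a single value, so plain `==` works.
+by_row = DFM.@rsubset toy :species == "Adelie"
+
+# Both styles keep the same rows.
+by_column == by_row
+```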
+
+In both Tidier and DataFramesMeta, only the rows for which the criterion evaluates to `true` are returned. This means that you don't need to worry about `missing` values, that is, cases where the criterion returns neither `false` nor `true`.
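+
+A minimal sketch of why this matters, using only base Julia (the values are made up for illustration): comparisons involving `missing` propagate `missing` instead of returning `true` or `false`, which is why the plain DataFrames examples in this chapter wrap multi-condition predicates in `=== true`.
+
+```{julia}
+# Comparing with `missing` yields `missing`, not `false`.
+ismissing(missing == "male")
+
+# `&` follows three-valued logic: a definite `false` wins over `missing`...
+(false & missing) === false
+
+# ...but `true & missing` is still `missing`.
+ismissing(true & missing)
+
+# `=== true` collapses `missing` to `false`, so such a row is simply dropped.
+(missing > 4000) === true
+```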
+
+## Filtering with one criterion
+
+Filtering all the rows with `species` = "Adelie".
+
+::: {.panel-tabset}
+
+## Tidier
+
+```{julia}
+@filter penguins species == "Adelie"
+```
+
+## DataFramesMeta
+
+```{julia}
+DFM.@rsubset penguins :species == "Adelie"
+```
+
+## DataFrames
+
+```{julia}
+filter(r -> r.species == "Adelie", penguins)
+```
+
+:::
+
+## Filtering with several criteria
+
+Filtering all the rows with `species` = "Adelie", `sex` = "male" and `body_mass_g` > 4000.
+
+::: {.panel-tabset}
+
+## Tidier
+
+```{julia}
+@filter penguins species == "Adelie" sex == "male" body_mass_g > 4000
+```
+
+## DataFramesMeta
+
+```{julia}
+DFM.@rsubset penguins :species == "Adelie" :sex == "male" :body_mass_g > 4000
+```
+
+## DataFrames
+
+```{julia}
+filter(r -> ((r.species == "Adelie") & (r.sex == "male") & (r.body_mass_g > 4000)) === true, penguins)
+```
+
+:::
+
+
+Filtering all the rows where the `flipper_length_mm` is greater than the mean.
+
+::: {.panel-tabset}
+
+## Tidier
+
+```{julia}
+@filter penguins flipper_length_mm > mean(skipmissing(flipper_length_mm))
+```
+
+## DataFramesMeta
+
+```{julia}
+DFM.@subset penguins :flipper_length_mm .> mean(skipmissing(:flipper_length_mm))
+```
+
+## DataFrames
+
+```{julia}
+filter(r -> (r.flipper_length_mm > mean(skipmissing(penguins.flipper_length_mm))) === true, penguins)
+```
+
+:::
+
+## Filtering with a variable column name
+
+Suppose the name of the column you want to filter on is stored in a variable, say
+
+```{julia}
+# filter_column = "species"
+column_symbol = :species
+```
+
+::: {.panel-tabset}
+
+## Tidier
+
+```{julia}
+# @chain penguins begin
+# @filter(!!filter_column == "Adelie")
+# # @select(!!filter_column)
+# end
+# @filter(penguins, !!filter_column == "Adelie")
+```
+
+## DataFramesMeta
+
+```{julia}
+DFM.@rsubset penguins $column_symbol == "Adelie"
+```
+
+:::
+
+If the column name is stored as a string instead of a symbol, we can write
+
+```{julia}
+column_string = "species"
+
+DFM.@rsubset penguins $(Symbol(column_string)) == "Adelie"
+```
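+
+For comparison, here is a sketch of the same idea in plain DataFrames, where the column name is an ordinary value and no interpolation is needed. The dataframe `toy` and the variable `toy_column` are made up for illustration.
+
+```{julia}
+using DataFrames
+
+# Hypothetical toy data, for illustration only.
+toy = DataFrame(species = ["Adelie", "Gentoo", "Adelie"], island = ["Dream", "Biscoe", "Torgersen"])
+toy_column = :species
+
+# Index each row by the variable holding the column name.
+by_row = filter(r -> r[toy_column] == "Adelie", toy)
+
+# Or pass a `column => predicate` pair; the column can be a Symbol or a String.
+by_pair = filter(toy_column => ==("Adelie"), toy)
+
+# Both forms keep the same rows.
+by_row == by_pair
+```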
\ No newline at end of file
diff --git a/dataframes-mutating.qmd b/dataframes-mutating.qmd
new file mode 100644
index 0000000..e58845d
--- /dev/null
+++ b/dataframes-mutating.qmd
@@ -0,0 +1,16 @@
+---
+# jupyter: julia-1.10
+engine: julia
+---
+
+## Creating columns
+
+::: {.panel-tabset}
+
+## Tidier
+
+## DataFramesMeta
+
+## DataFrames
+
+:::
\ No newline at end of file
diff --git a/dataframes.qmd b/dataframes.qmd
index 36c7526..0ce7baa 100644
--- a/dataframes.qmd
+++ b/dataframes.qmd
@@ -3,7 +3,7 @@
engine: julia
---
-# Dataframes
+# Part 2: Dataframes
Dataframes are one of the most important objects in data science. A dataframe is a table where each row is an observation and each column is a variable.
@@ -11,7 +11,7 @@ We will use the Palmer Penguin dataset as a toy example for the remaining of the
```{julia}
using DataFrames, PalmerPenguins
-using Tidier
+using Tidier, Chain
import DataFramesMeta as DFM
penguins = PalmerPenguins.load() |> DataFrame
@@ -21,33 +21,31 @@ penguins = PalmerPenguins.load() |> DataFrame
`Dataframes.jl` is the main package for dealing with dataframes in Julia. You can use it directly to manipulate tables, but we also have 2 alternatives: DataFramesMeta and Tidier.
-DataFramesMeta is a collection of macros
+DataFramesMeta is a collection of macros based on DataFrames.
-Tidier is inspired by the `tidyverse` ecosystem in R. They use macros to rewrite your code into DataFrames.jl code.
+Tidier is inspired by the `tidyverse` ecosystem in R. Tidier uses macros to rewrite your code into DataFrames.jl code. Because of this "tidy" heritage, we will often talk about the R packages that inspired the Julia ones (like `dplyr`, `tidyr` and many others).
-In this book, whenever reasonable, we will show the different approaches in a tabset so you can compare them!
+In this book, whenever possible, we will show the different approaches in a tabset so you can compare them.
:::
## Operations
-In this chapter, we will see some unary operations on dataframes. These functions take just 1 dataframe. Joins are binary operations and will be seen later.
+Let's start with some operations that take only one dataframe as input.^[Join operations will be dealt with later.] Here is the basic terminology:
- *Selecting* is when we select some columns of a dataframe, while keeping all the rows. Example: select the `species` and `sex` columns.
- *Filtering* or *subsetting* is when we select a subset of rows based on some criteria. Example: all male penguins of species Adelie. The output is a dataframe with the exact same columns, but possibly fewer rows.
-- *Mutating* is when we create new columns. Example: The body mass in kg is obtained dividing the column `body_mass_g` by 1000.
+- *Mutating* or *transforming* is when we create new columns. Example: a new column `body_mass_kg` can be obtained by dividing the column `body_mass_g` by 1000.
- *Grouping* is when we split the dataframe into a collection (array) of dataframes using some criteria. Example: grouping by `species` gives us 3 dataframes, each with only one species.
-- *Summarising* is when we apply some function to some columns in order to reduce the amount of rows with some kind of summary (like a mean, median, max, and so on). Example: for each `species`, apply the `mean` function to the columns `body_mass_g`. This will yield a dataframe with 3 rows, one for each species. Summarising is usually done after a grouping, so the summary is calculated with relation to each of the groups.
+- *Summarising* or *combining* is when we apply some function to some columns in order to reduce the number of rows to some kind of summary (like a mean, median, max, and so on). Example: for each `species`, apply the `mean` function to the column `body_mass_g`. This will yield a dataframe with 3 rows, one for each species. Summarising is usually done after a grouping, so the summary is calculated separately for each group.
- *Arranging* or *ordering* is when we reorder the rows of a dataframe using some criteria.
Since all these functions return a dataframe (or an array of dataframes, in the case of grouping), we can chain these operations together, with the convention that on grouped dataframes we apply the function in each one of the groups.
-Let's see each operation with more details.
-
## Comparing Tidier with DataFramesMeta
The following table list the operations on each package:
@@ -61,102 +59,21 @@ The following table list the operations on each package:
| `summarise` | `@summarise` | `@combine` | `combine` |
| `arrange` | `@arrange` | `@orderby` / `@rorderby` | `sort!` |
+For those coming from R, Tidier will probably look like the most natural approach.
Notice that we have a name clash with `@select`: that is why we `import DataFramesMeta as DFM` at the beginning.
-## Filtering / subsetting
-
-To filter a dataframe in Tidier, we use the macro `@filter`. You can use it in the form
+We will see each operation in more detail in the following chapters.
-```{julia}
-@filter(penguins, species == "Adelie")
-```
+## Chaining operations
-or without parentesis as in
+We can chain (or pipe) dataframe operations as follows with the `@chain` macro:
```{julia}
-@filter penguins species == "Adelie"
-```
-
-Notice that the columns are typed as if they were variables on the Julia environment. This is inspired by the `tidyverse` behaviour of data-masking: inside a tidyverse verb, the columns are taken as "statistical variables" that exist inside the dataframe.
-
-In DataFramesMeta, we have two macros for filtering: `@subset` and `@rsubset`. Use the first when you have some criteria that uses the whole dataframe, for example:
-
-```{julia}
-DFM.@subset penguins :body_mass_g .>= mean(skipmissing(:body_mass_g))
-```
-
-Notice the broadcast on >=. We need it because each *row is interpreted as an array*. Also, notice that we call columns as _symbols_ (i.e. we append `:` to it).
-
-In this case, we need the whole column `body_mass_g` to take the mean and then filter the rows based on that. If, however, your filtering criteria only uses information about each row, then `@rsubset` (row subset) is easier to use: it interprets each columns as a value (not an array), so no broadcasting is needed:
-
-```{julia}
-DFM.@rsubset penguins :species == "Adelie"
-```
-
-In both Tidier and DataFramesMeta, only the rows to which the criteria is `true` are returned. This means that you don't need to worry about `missing` values in cases where the criteria do not return `false` nor `true.
-
-### Filtering with one criteria
-
-Filtering all the rows with `species` = "Adelie".
-
-::: {.panel-tabset}
-
-## Tidier
-
-```{julia}
-@filter penguins species == "Adelie"
-```
-
-## DataFramesMeta
-
-```{julia}
-DFM.@rsubset penguins :species == "Adelie"
-```
-
-## DataFrames
-
-```{julia}
-filter(r -> r.species == "Adelie", penguins)
-```
-
-:::
-
-### Filtering with several criteria
-
-Filtering all the rows with `species` = "Adelie", `sex` = "male" and `body_mass_g` > 4000.
-
-::: {.panel-tabset}
-
-## Tidier
-
-```{julia}
-@filter penguins species == "Adelie" sex == "male" body_mass_g > 4000
-```
-
-## DataFramesMeta
-
-```{julia}
-DFM.@rsubset penguins :species == "Adelie" :sex == "male" :body_mass_g > 4000
-```
-
-## DataFrames
-
-```{julia}
-filter(r -> ((r.species == "Adelie") & (r.sex == "male") & (r.body_mass_g > 4000)) === true, penguins)
-```
-
-:::
-
-
-## Creating columns
-
-::: {.panel-tabset}
-
-## Tidier
-
-## DataFramesMeta
-
-## DataFrames
-
-:::
\ No newline at end of file
+@chain penguins begin
+ @filter !ismissing(sex)
+ @group_by sex
+ @summarise mean = mean(bill_length_mm)
+ @arrange mean
+end
+```
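+
+For comparison, here is a rough sketch of the same pipeline written with plain DataFrames functions. The dataframe `toy` is made up for illustration, and `mean` comes from the `Statistics` standard library.
+
+```{julia}
+using DataFrames, Statistics
+
+# Hypothetical toy data, for illustration only.
+toy = DataFrame(
+    sex = ["male", "female", missing, "female"],
+    bill_length_mm = [40.0, 36.0, 38.0, 38.0],
+)
+
+no_missing = dropmissing(toy, :sex)                              # like @filter !ismissing(sex)
+grouped = groupby(no_missing, :sex)                              # like @group_by sex
+summarised = combine(grouped, :bill_length_mm => mean => :mean)  # like @summarise
+sorted = sort(summarised, :mean)                                 # like @arrange mean
+```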
\ No newline at end of file
diff --git a/dataframes.quarto_ipynb b/dataframes.quarto_ipynb
deleted file mode 100644
index 89bc9f7..0000000
--- a/dataframes.quarto_ipynb
+++ /dev/null
@@ -1,319 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "---\n",
- "jupyter: julia-1.10\n",
- "---\n",
- "\n",
- "\n",
- "\n",
- "\n",
- "\n",
- "\n",
- "# Dataframes\n",
- "\n",
- "Dataframes are on of the most important objects in data science. A dataframe is a table where each row is an observation and each column is a variable.\n",
- "\n",
- "We will use the Palmer Penguin dataset as a toy example for the remaining of the chapter.\n"
- ],
- "id": "ff50af05"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "using DataFrames, PalmerPenguins\n",
- "using Tidier\n",
- "import DataFramesMeta as DFM\n",
- "\n",
- "penguins = PalmerPenguins.load() |> DataFrame"
- ],
- "id": "68ceafc8",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "::: {.callout-note}\n",
- "\n",
- "`Dataframes.jl` is the main package for dealing with dataframes in Julia. You can use it directly to manipulate tables, but we also have 2 alternatives: DataFramesMeta and Tidier. \n",
- "\n",
- "DataFramesMeta is a collection of macros \n",
- "\n",
- "Tidier is inspired by the `tidyverse` ecosystem in R. They use macros to rewrite your code into DataFrames.jl code.\n",
- "\n",
- "In this book, whenever reasonable, we will show the different approaches in a tabset so you can compare them!\n",
- ":::\n",
- "\n",
- "## Operations\n",
- "\n",
- "In this chapter, we will see some unary operations on dataframes. These functions take just 1 dataframe. Joins are binary operations and will be seen later.\n",
- "\n",
- "- *Selecting* is when we select some columns of a dataframe, while keeping all the rows. Example: select the `species` and `sex` columns.\n",
- "\n",
- "- *Filtering* or *subsetting* is when we select a subset of rows based on some criteria. Example: all male penguins of species Adelie. The output is a dataframe with the exact same columns, but possibly fewer rows.\n",
- "\n",
- "- *Mutating* is when we create new columns. Example: The body mass in kg is obtained dividing the column `body_mass_g` by 1000.\n",
- "\n",
- "- *Grouping* is when we split the dataframe into a collection (array) of dataframes using some criteria. Example: grouping by `species` gives us 3 dataframes, each with only one species.\n",
- "\n",
- "- *Summarising* is when we apply some function to some columns in order to reduce the amount of rows with some kind of summary (like a mean, median, max, and so on). Example: for each `species`, apply the `mean` function to the columns `body_mass_g`. This will yield a dataframe with 3 rows, one for each species. Summarising is usually done after a grouping, so the summary is calculated with relation to each of the groups.\n",
- "\n",
- "- *Arranging* or *ordering* is when we reorder the rows of a dataframe using some criteria.\n",
- "\n",
- "Since all these functions return a dataframe (or an array of dataframes, in the case of grouping), we can chain these operations together, with the convention that on grouped dataframes we apply the function in each one of the groups.\n",
- "\n",
- "Let's see each operation with more details.\n",
- "\n",
- "## Comparing Tidier with DataFramesMeta\n",
- "\n",
- "The following table list the operations on each package:\n",
- "\n",
- "| dplyr | Tidier | DataFramesMeta | DataFrames |\n",
- "|-------------|--------------|------------------------------|--------------|\n",
- "| `select` | `@select` | `@select` | array sintax |\n",
- "| `filter` | `@filter` | `@subset` / `@rsubset` | `filter` |\n",
- "| `mutate` | `@mutate` | `@transform` / `@rtransform` | array sintax |\n",
- "| `group_by` | `@group_by` | `@groupby` | `groupby` |\n",
- "| `summarise` | `@summarise` | `@combine` | `combine` |\n",
- "| `arrange` | `@arrange` | `@orderby` / `@rorderby` | `sort!` |\n",
- "\n",
- "\n",
- "Notice that we have a name clash with `@select`: that is why we `import DataFramesMeta as DFM`.\n",
- "\n",
- "## Filtering / subsetting\n",
- "\n",
- "To filter a dataframe in Tidier, we use the macro `@filter`. You can use it in the form\n"
- ],
- "id": "b62dba04"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "@filter(penguins, species == \"Adelie\")"
- ],
- "id": "aa87bf63",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "or without parentesis as in \n"
- ],
- "id": "e54568e0"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "@filter penguins species == \"Adelie\""
- ],
- "id": "447cc245",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Notice that the columns are typed as if they were variables on the Julia environment. This is inspired by the `tidyverse` behaviour of data-masking: inside a tidyverse verb, the columns are taken as \"statistical variables\" that exist inside the dataframe.\n",
- "\n",
- "In DataFramesMeta, we have two macros for filtering: `@subset` and `@rsubset`. Use the first when you have some criteria that uses the whole dataframe, for example:\n"
- ],
- "id": "1ac1b7a6"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "DFM.@subset penguins :body_mass_g .>= mean(skipmissing(:body_mass_g))"
- ],
- "id": "996bd089",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Notice the broadcast on >=. We need it because each *row is interpreted as an array*. Also, notice that we call columns as _symbols_ (i.e. we append `:` to it).\n",
- "\n",
- "In this case, we need the whole column `body_mass_g` to take the mean and then filter the rows based on that. If, however, your filtering criteria only uses information about each row, then `@rsubset` (row subset) is easier to use: it interprets each columns as a value (not an array), so no broadcasting is needed:\n"
- ],
- "id": "c1fa3b5d"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "DFM.@rsubset penguins :species == \"Adelie\""
- ],
- "id": "2bfff26f",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "In both Tidier and DataFramesMeta, only the rows to which the criteria is `true` are returned. This means that you don't need to worry about `missing` values in cases where the criteria do not return `false` nor `true.\n",
- "\n",
- "### Filtering with one criteria\n",
- "\n",
- "Filtering all the rows with `species` = \"Adelie\".\n",
- "\n",
- "::: {.panel-tabset}\n",
- "\n",
- "## Tidier\n"
- ],
- "id": "3ab3f109"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "@filter penguins species == \"Adelie\""
- ],
- "id": "145b4dbe",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## DataFramesMeta\n"
- ],
- "id": "ec5ac109"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "DFM.@rsubset penguins :species == \"Adelie\""
- ],
- "id": "23034937",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## DataFrames\n"
- ],
- "id": "2e1677c6"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "filter(r -> r.species == \"Adelie\", penguins)"
- ],
- "id": "db6373f2",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- ":::\n",
- "\n",
- "### Filtering with several criteria\n",
- "\n",
- "Filtering all the rows with `species` = \"Adelie\", `sex` = \"male\" and `body_mass_g` > 4000.\n",
- "\n",
- "::: {.panel-tabset}\n",
- "\n",
- "## Tidier\n"
- ],
- "id": "34324ff3"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "@filter penguins species == \"Adelie\" sex == \"male\" body_mass_g > 4000"
- ],
- "id": "d09aae2f",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## DataFramesMeta\n"
- ],
- "id": "b62d6a8f"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "DFM.@rsubset penguins :species == \"Adelie\" :sex == \"male\" :body_mass_g > 4000"
- ],
- "id": "859fc6c6",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## DataFrames\n"
- ],
- "id": "94de5470"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "filter(r -> ((r.species == \"Adelie\") & (r.sex == \"male\") & (r.body_mass_g > 4000)) == true, penguins)"
- ],
- "id": "8b2476e0",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- ":::\n",
- "\n",
- "\n",
- "## Creating columns\n",
- "\n",
- "::: {.panel-tabset}\n",
- "\n",
- "## Tidier\n",
- "\n",
- "## DataFramesMeta\n",
- "\n",
- "## DataFrames\n",
- "\n",
- ":::"
- ],
- "id": "34a91faa"
- }
- ],
- "metadata": {
- "kernelspec": {
- "name": "julia-1.10",
- "language": "julia",
- "display_name": "Julia 1.10.4",
- "path": "/home/vituri/.local/share/jupyter/kernels/julia-1.10"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
\ No newline at end of file
diff --git a/docs/.nojekyll b/docs/.nojekyll
new file mode 100644
index 0000000..e69de29
diff --git a/docs/dataframes-filtering.html b/docs/dataframes-filtering.html
new file mode 100644
index 0000000..94ff961
--- /dev/null
+++ b/docs/dataframes-filtering.html
@@ -0,0 +1,5360 @@
+
+1 Filtering – Tidier Data Science with Julia
+
+using DataFrames, PalmerPenguins
+using Tidier
+import DataFramesMeta as DFM
+
+penguins = PalmerPenguins.load() |> DataFrame;
+@slice_head(penguins, n = 15)
+
+
+15×7 DataFrame
+
+| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
+|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
+| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
+| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
+| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
+| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
+| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
+| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
+| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
+| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
+| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
+| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
+| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
+| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
+| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
+| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
+| 14 | Adelie | Torgersen | 38.6 | 21.2 | 191 | 3800 | male |
+| 15 | Adelie | Torgersen | 34.6 | 21.1 | 198 | 4400 | male |
+
+To filter a dataframe in Tidier, we use the macro `@filter`. You can use it in the form
+
+
+@filter(penguins, species == "Adelie")
+
+
+152×7 DataFrame (127 rows omitted)
+
+| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
+|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
+| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
+| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
+| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
+| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
+| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
+| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
+| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
+| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
+| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
+| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
+| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
+| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
+| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
+| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
+| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
+| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
+| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
+| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
+| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
+| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
+| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
+| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
+| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
+| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
+| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
+| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
+| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
+
+or without parentheses, as in
+
+
+@filter penguins species == "Adelie"
+
+
+152×7 DataFrame (127 rows omitted)
+
+| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
+|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
+| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
+| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
+| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
+| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
+| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
+| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
+| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
+| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
+| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
+| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
+| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
+| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
+| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
+| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
+| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
+| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
+| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
+| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
+| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
+| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
+| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
+| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
+| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
+| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
+| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
+| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
+| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
+
+Notice that the columns are typed as if they were variables in the Julia environment. This is inspired by the `tidyverse` behaviour of data-masking: inside a tidyverse verb, column names are treated as “statistical variables” that live inside the dataframe.
+
+In DataFramesMeta, we have two macros for filtering: `@subset` and `@rsubset`. Use the first when your criterion needs a whole column at once, for example:
+Notice the broadcast on `>=`. We need it because each column is interpreted as an array. Also, notice that we refer to columns as symbols (i.e. we prepend `:` to the column name).
+
+In the above example, we needed the whole column `body_mass_g` to take the mean and then filter the rows based on that. If, however, your filtering criterion only uses information about each row, then `@rsubset` (row subset) is easier to use: it interprets each column as a single value (not an array), so no broadcasting is needed:
+
+
+DFM.@rsubset penguins :species == "Adelie"
+
+
+152×7 DataFrame (127 rows omitted)
+
+| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
+|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
+| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
+| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
+| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
+| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
+| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
+| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
+| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
+| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
+| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
+| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
+| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
+| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
+| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
+| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
+| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
+| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
+| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
+| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
+| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
+| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
+| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
+| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
+| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
+| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
+| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
+| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
+| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
+
+In both Tidier and DataFramesMeta, only the rows for which the criterion evaluates to `true` are returned. This means that you don’t need to worry about `missing` values, that is, cases where the criterion returns neither `false` nor `true`.
+
+
+
+
+
+
\ No newline at end of file
diff --git a/docs/dataframes.html b/docs/dataframes.html
index 0314771..47eab8e 100644
--- a/docs/dataframes.html
+++ b/docs/dataframes.html
@@ -7,7 +7,7 @@
-1 Dataframes – Tidier Data Science with Julia
+Part 2: Dataframes – Tidier Data Science with Julia