From 57e1de056c84f7fcec35a030f18dcbf7ae5a1dd6 Mon Sep 17 00:00:00 2001
From: "G. Vituri" <56522687+vituri@users.noreply.github.com>
Date: Fri, 13 Sep 2024 01:56:37 -0300
Subject: [PATCH] add dataframes examples with subset

---
 Project.toml | 1 +
 .../dataframes-rows/execute-results/html.json | 16 +
 _freeze/dataframes/execute-results/html.json | 4 +-
 _quarto.yml | 4 +-
 dataframes-columns.qmd | 19 +
 dataframes-filtering.qmd | 159 -
 ...ames-mutating.qmd => dataframes-groups.qmd | 0
 dataframes-reshape.qmd | 16 +
 dataframes-rows.qmd | 233 +
 dataframes.qmd | 48 +-
 docs/dataframes-rows.html | 7745 +++++++++++++++++
 docs/dataframes.html | 117 +-
 docs/index.html | 18 +-
 docs/search.json | 49 +-
 14 files changed, 8228 insertions(+), 201 deletions(-)
 create mode 100644 _freeze/dataframes-rows/execute-results/html.json
 create mode 100644 dataframes-columns.qmd
 delete mode 100644 dataframes-filtering.qmd
 rename dataframes-mutating.qmd => dataframes-groups.qmd (100%)
 create mode 100644 dataframes-reshape.qmd
 create mode 100644 dataframes-rows.qmd
 create mode 100644 docs/dataframes-rows.html

diff --git a/Project.toml b/Project.toml
index f3f132d..e361e97 100644
--- a/Project.toml
+++ b/Project.toml
@@ -1,4 +1,5 @@
 [deps]
+Chain = "8be319e6-bccf-4806-a6f7-6fae938471bc"
 DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
 DataFramesMeta = "1313f7d8-7da2-5740-9ea0-a2ca25f37964"
 PalmerPenguins = "8b842266-38fa-440a-9b57-31493939ab85"
diff --git a/_freeze/dataframes-rows/execute-results/html.json b/_freeze/dataframes-rows/execute-results/html.json
new file mode 100644
index 0000000..a5c46a7
--- /dev/null
+++ b/_freeze/dataframes-rows/execute-results/html.json
@@ -0,0 +1,16 @@
+{
+  "hash": "259174b4bca01a0c1c6b0a573b8e2131",
+  "result": {
+    "engine": "julia",
+    "markdown": "---\n# jupyter: julia-1.10\nengine: julia\n---\n\n\n\n\n\n# Operations on rows\n\n\n\n\n\n::: {#2 .cell execution_count=1}\n``` {.julia .cell-code}\nusing DataFrames, PalmerPenguins\nusing Tidier\nimport DataFramesMeta as DFM\n\npenguins = PalmerPenguins.load() |> DataFrame;\n@slice_head(penguins, n = 15)\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
15×7 DataFrame
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
14AdelieTorgersen38.621.21913800male
15AdelieTorgersen34.621.11984400male
\n```\n:::\n:::\n\n\n\n\n\n\n\n## Filtering\n\nTo filter is to keep only the rows that satisfy a certain criterion (i.e. a boolean condition).\n\nTo filter a dataframe in Tidier, we use the macro `@filter`. You can use it in the form\n\n\n\n\n\n::: {#4 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter(penguins, species == \"Adelie\")\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
\n```\n:::\n:::\n\n\n\n\n\n\n\nor without parentheses, as in\n\n\n\n\n\n::: {#6 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
\n```\n:::\n:::\n\n\n\n\n\n\n\nNotice that the columns are written as if they were variables in the Julia environment. This is inspired by the data-masking behaviour of the `tidyverse`: inside a tidyverse verb, column names are treated as \"statistical variables\" that live inside the dataframe.\n\nIn DataFramesMeta, we have two macros for filtering: `@subset` and `@rsubset`. Use the first when your criterion needs a whole column, for example:\n\n\n\n\n\n::: {#8 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@subset penguins :body_mass_g .>= mean(skipmissing(:body_mass_g))\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
149×7 DataFrame
124 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.219.61954675male
2AdelieTorgersen42.020.21904250missing
3AdelieTorgersen34.621.11984400male
4AdelieTorgersen42.520.71974500male
5AdelieDream39.819.11844650male
6AdelieDream44.119.71964400male
7AdelieDream39.618.81904600male
8AdelieBiscoe40.118.91884300male
9AdelieBiscoe41.321.11954400male
10AdelieTorgersen41.819.41984450male
11AdelieTorgersen42.818.51954250male
12AdelieTorgersen42.917.61964700male
13AdelieDream41.118.12054300male
138GentooBiscoe47.213.72144925female
139GentooBiscoe46.814.32154850female
140GentooBiscoe50.415.72225750male
141GentooBiscoe45.214.82125200female
142GentooBiscoe49.916.12135400male
143ChinstrapDream49.218.21954400male
144ChinstrapDream52.820.02054550male
145ChinstrapDream54.220.82014300male
146ChinstrapDream52.020.72104800male
147ChinstrapDream53.519.92054500male
148ChinstrapDream50.818.52014450male
149ChinstrapDream49.019.62124300male
\n```\n:::\n:::\n\n\n\n\n\n\n\nNotice the broadcast on `>=`. We need it because *each variable is interpreted as an array (the whole column)*. Also, notice that we refer to columns as _symbols_ (i.e. we prepend `:` to the column name).\n\nIn the above example, we needed the whole column `body_mass_g` to take the mean and then filter the rows based on that. If, however, your filtering criterion only uses information about each row (without needing to see it in the context of the whole column), then `@rsubset` (row subset) is easier to use: it interprets each column as a value (not an array), so no broadcasting is needed:\n\n\n\n\n\n::: {#10 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@rsubset penguins :species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
\n```\n:::\n:::\n\n\n\n\n\n\n\nIn both Tidier and DataFramesMeta, only the rows for which the criterion is `true` are returned. This means that rows where it evaluates to `false` or `missing` are thrown away.\n\nIn DataFrames, we use the `subset` function, and the criterion is passed with the notation\n\n\n\n\n\n::: {#12 .cell execution_count=0}\n``` {.julia .cell-code}\nsubset(penguins, :column => boolean_function)\n\n```\n:::\n\n\n\n\n\n\n\nwhere `boolean_function` is a function of one variable (the whole column) that returns a vector of booleans, possibly containing `missing` values. Add the kwarg `skipmissing=true` to treat `missing` results as `false`; without it, a `missing` result raises an error.\n\n### Filtering with one criteria\n\nFiltering all the rows with `species` = \"Adelie\".\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#14 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFramesMeta\n\n\n\n\n\n::: {#16 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@rsubset penguins :species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFrames\n\n\n\n\n\n::: {#18 .cell execution_count=1}\n``` {.julia .cell-code}\nsubset(penguins, :species => x -> x .== \"Adelie\", skipmissing=true)\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n### Filtering with several criteria\n\nFiltering all the rows with `species` = \"Adelie\", `sex` = \"male\" and `body_mass_g` > 4000.\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#20 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins species == \"Adelie\" sex == \"male\" body_mass_g > 4000\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
34×7 DataFrame
9 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.219.61954675male
2AdelieTorgersen34.621.11984400male
3AdelieTorgersen42.520.71974500male
4AdelieTorgersen46.021.51944200male
5AdelieDream39.221.11964150male
6AdelieDream39.819.11844650male
7AdelieDream44.119.71964400male
8AdelieDream39.618.81904600male
9AdelieDream42.321.21914150male
10AdelieBiscoe40.118.91884300male
11AdelieBiscoe42.019.52004050male
12AdelieBiscoe41.321.11954400male
13AdelieBiscoe41.118.21924050male
23AdelieDream40.318.51964350male
24AdelieDream43.218.51924100male
25AdelieBiscoe41.020.02034725male
26AdelieBiscoe37.820.01904250male
27AdelieBiscoe43.219.01974775male
28AdelieBiscoe45.620.31914600male
29AdelieBiscoe42.219.51974275male
30AdelieBiscoe42.718.31964075male
31AdelieTorgersen41.518.31954300male
32AdelieDream37.518.51994475male
33AdelieDream39.717.91934250male
34AdelieDream39.218.61904250male
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFramesMeta\n\n\n\n\n\n::: {#22 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@rsubset penguins :species == \"Adelie\" :sex == \"male\" :body_mass_g > 4000\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
34×7 DataFrame
9 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.219.61954675male
2AdelieTorgersen34.621.11984400male
3AdelieTorgersen42.520.71974500male
4AdelieTorgersen46.021.51944200male
5AdelieDream39.221.11964150male
6AdelieDream39.819.11844650male
7AdelieDream44.119.71964400male
8AdelieDream39.618.81904600male
9AdelieDream42.321.21914150male
10AdelieBiscoe40.118.91884300male
11AdelieBiscoe42.019.52004050male
12AdelieBiscoe41.321.11954400male
13AdelieBiscoe41.118.21924050male
23AdelieDream40.318.51964350male
24AdelieDream43.218.51924100male
25AdelieBiscoe41.020.02034725male
26AdelieBiscoe37.820.01904250male
27AdelieBiscoe43.219.01974775male
28AdelieBiscoe45.620.31914600male
29AdelieBiscoe42.219.51974275male
30AdelieBiscoe42.718.31964075male
31AdelieTorgersen41.518.31954300male
32AdelieDream37.518.51994475male
33AdelieDream39.717.91934250male
34AdelieDream39.218.61904250male
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFrames\n\n\n\n\n\n::: {#24 .cell execution_count=1}\n``` {.julia .cell-code}\nsubset(penguins, [:species, :sex, :body_mass_g] => (x, y, z) -> (x .== \"Adelie\") .& (y .== \"male\") .& (z .> 4000), skipmissing=true)\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
34×7 DataFrame
9 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.219.61954675male
2AdelieTorgersen34.621.11984400male
3AdelieTorgersen42.520.71974500male
4AdelieTorgersen46.021.51944200male
5AdelieDream39.221.11964150male
6AdelieDream39.819.11844650male
7AdelieDream44.119.71964400male
8AdelieDream39.618.81904600male
9AdelieDream42.321.21914150male
10AdelieBiscoe40.118.91884300male
11AdelieBiscoe42.019.52004050male
12AdelieBiscoe41.321.11954400male
13AdelieBiscoe41.118.21924050male
23AdelieDream40.318.51964350male
24AdelieDream43.218.51924100male
25AdelieBiscoe41.020.02034725male
26AdelieBiscoe37.820.01904250male
27AdelieBiscoe43.219.01974775male
28AdelieBiscoe45.620.31914600male
29AdelieBiscoe42.219.51974275male
30AdelieBiscoe42.718.31964075male
31AdelieTorgersen41.518.31954300male
32AdelieDream37.518.51994475male
33AdelieDream39.717.91934250male
34AdelieDream39.218.61904250male
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n\nFiltering all the rows with `species` = \"Adelie\" OR `sex` = \"male\".\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#26 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins (species == \"Adelie\") | (sex == \"male\")\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
247×7 DataFrame
222 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
236ChinstrapDream50.818.52014450male
237ChinstrapDream49.019.62124300male
238ChinstrapDream51.518.71873250male
239ChinstrapDream51.419.02013950male
240ChinstrapDream50.719.72034050male
241ChinstrapDream52.218.81973450male
242ChinstrapDream49.319.92034050male
243ChinstrapDream50.218.82023800male
244ChinstrapDream51.919.52063950male
245ChinstrapDream55.819.82074000male
246ChinstrapDream49.618.21933775male
247ChinstrapDream50.819.02104100male
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFramesMeta\n\n\n\n\n\n::: {#28 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@rsubset penguins (:species == \"Adelie\") | (:sex == \"male\")\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
247×7 DataFrame
222 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
236ChinstrapDream50.818.52014450male
237ChinstrapDream49.019.62124300male
238ChinstrapDream51.518.71873250male
239ChinstrapDream51.419.02013950male
240ChinstrapDream50.719.72034050male
241ChinstrapDream52.218.81973450male
242ChinstrapDream49.319.92034050male
243ChinstrapDream50.218.82023800male
244ChinstrapDream51.919.52063950male
245ChinstrapDream55.819.82074000male
246ChinstrapDream49.618.21933775male
247ChinstrapDream50.819.02104100male
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFrames\n\n\n\n\n\n::: {#30 .cell execution_count=1}\n``` {.julia .cell-code}\nsubset(penguins, [:species, :sex] => (x, y) -> (x .== \"Adelie\") .| (y .== \"male\"), skipmissing=true)\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
247×7 DataFrame
222 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
236ChinstrapDream50.818.52014450male
237ChinstrapDream49.019.62124300male
238ChinstrapDream51.518.71873250male
239ChinstrapDream51.419.02013950male
240ChinstrapDream50.719.72034050male
241ChinstrapDream52.218.81973450male
242ChinstrapDream49.319.92034050male
243ChinstrapDream50.218.82023800male
244ChinstrapDream51.919.52063950male
245ChinstrapDream55.819.82074000male
246ChinstrapDream49.618.21933775male
247ChinstrapDream50.819.02104100male
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n\nFiltering all the rows where the `flipper_length_mm` is greater than the mean.\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#32 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins flipper_length_mm > mean(skipmissing(flipper_length_mm))\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
148×7 DataFrame
123 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieDream35.718.02023550female
2AdelieDream41.118.12054300male
3AdelieDream40.818.92084300male
4AdelieBiscoe41.020.02034725male
5AdelieTorgersen41.418.52023875male
6AdelieTorgersen44.118.02104000male
7AdelieDream41.518.52014000male
8GentooBiscoe46.113.22114500female
9GentooBiscoe50.016.32305700male
10GentooBiscoe48.714.12104450female
11GentooBiscoe50.015.22185700male
12GentooBiscoe47.614.52155400male
13GentooBiscoe46.513.52104550female
137ChinstrapDream53.519.92054500male
138ChinstrapDream49.019.52103950male
139ChinstrapDream50.818.52014450male
140ChinstrapDream49.019.62124300male
141ChinstrapDream51.419.02013950male
142ChinstrapDream50.719.72034050male
143ChinstrapDream49.319.92034050male
144ChinstrapDream50.218.82023800male
145ChinstrapDream51.919.52063950male
146ChinstrapDream55.819.82074000male
147ChinstrapDream43.518.12023400female
148ChinstrapDream50.819.02104100male
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFramesMeta\n\n\n\n\n\n::: {#34 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@subset penguins :flipper_length_mm .> mean(skipmissing(:flipper_length_mm))\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
148×7 DataFrame
123 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieDream35.718.02023550female
2AdelieDream41.118.12054300male
3AdelieDream40.818.92084300male
4AdelieBiscoe41.020.02034725male
5AdelieTorgersen41.418.52023875male
6AdelieTorgersen44.118.02104000male
7AdelieDream41.518.52014000male
8GentooBiscoe46.113.22114500female
9GentooBiscoe50.016.32305700male
10GentooBiscoe48.714.12104450female
11GentooBiscoe50.015.22185700male
12GentooBiscoe47.614.52155400male
13GentooBiscoe46.513.52104550female
137ChinstrapDream53.519.92054500male
138ChinstrapDream49.019.52103950male
139ChinstrapDream50.818.52014450male
140ChinstrapDream49.019.62124300male
141ChinstrapDream51.419.02013950male
142ChinstrapDream50.719.72034050male
143ChinstrapDream49.319.92034050male
144ChinstrapDream50.218.82023800male
145ChinstrapDream51.919.52063950male
146ChinstrapDream55.819.82074000male
147ChinstrapDream43.518.12023400female
148ChinstrapDream50.819.02104100male
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFrames\n\n\n\n\n\n::: {#36 .cell execution_count=1}\n``` {.julia .cell-code}\nsubset(penguins, :flipper_length_mm => x -> x .> mean(skipmissing(x)), skipmissing=true)\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
148×7 DataFrame
123 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieDream35.718.02023550female
2AdelieDream41.118.12054300male
3AdelieDream40.818.92084300male
4AdelieBiscoe41.020.02034725male
5AdelieTorgersen41.418.52023875male
6AdelieTorgersen44.118.02104000male
7AdelieDream41.518.52014000male
8GentooBiscoe46.113.22114500female
9GentooBiscoe50.016.32305700male
10GentooBiscoe48.714.12104450female
11GentooBiscoe50.015.22185700male
12GentooBiscoe47.614.52155400male
13GentooBiscoe46.513.52104550female
137ChinstrapDream53.519.92054500male
138ChinstrapDream49.019.52103950male
139ChinstrapDream50.818.52014450male
140ChinstrapDream49.019.62124300male
141ChinstrapDream51.419.02013950male
142ChinstrapDream50.719.72034050male
143ChinstrapDream49.319.92034050male
144ChinstrapDream50.218.82023800male
145ChinstrapDream51.919.52063950male
146ChinstrapDream55.819.82074000male
147ChinstrapDream43.518.12023400female
148ChinstrapDream50.819.02104100male
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n### Filtering with a variable column name\n\nSuppose the name of the column you want to filter on is stored in a variable, say\n\n\n\n\n\n::: {#38 .cell execution_count=1}\n``` {.julia .cell-code}\nmy_column = :species\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```\n:species\n```\n:::\n:::\n\n\n\n\n\n\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#40 .cell execution_count=1}\n``` {.julia .cell-code}\n# how to do it??\n# @filter(penguins, !!(my_column) .== \"Adelie\")\n```\n:::\n\n\n\n\n\n\n\n## DataFramesMeta\n\n\n\n\n\n::: {#42 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@rsubset penguins $my_column == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFrames\n\n\n\n\n\n::: {#44 .cell execution_count=1}\n``` {.julia .cell-code}\nsubset(penguins, my_column => x -> x .== \"Adelie\")\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\nIf the column name is a string\n\n\n\n\n\n::: {#46 .cell execution_count=1}\n``` {.julia .cell-code}\nmy_column2 = \"species\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```\n\"species\"\n```\n:::\n:::\n\n\n\n\n\n\n\ninstead of a symbol, we can write\n\n::: {.panel-tabset}\n\n## DataFramesMeta\n\n\n\n\n\n::: {#48 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@rsubset penguins $(Symbol(my_column2)) == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFrames\n\n\n\n\n\n::: {#50 .cell execution_count=1}\n``` {.julia .cell-code}\nsubset(penguins, my_column2 => x -> x .== \"Adelie\")\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n## Arranging\n\nArranging is when we reorder the rows of a dataframe according to some criteria.\n\n\n\n\n\n::: {#52 .cell execution_count=1}\n``` {.julia .cell-code}\n@arrange penguins body_mass_g\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
344×7 DataFrame
319 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1ChinstrapDream46.916.61922700female
2AdelieBiscoe36.516.61812850female
3AdelieBiscoe36.417.11842850female
4AdelieBiscoe34.518.11872900female
5AdelieDream33.116.11782900female
6AdelieTorgersen38.617.01882900female
7ChinstrapDream43.216.61872900female
8AdelieBiscoe37.918.61932925female
9AdelieDream37.518.91792975missing
10AdelieDream37.016.91853000female
11AdelieDream37.316.81923000female
12AdelieTorgersen35.916.61903050female
13AdelieTorgersen35.215.91863050female
333GentooBiscoe48.616.02305800male
334GentooBiscoe48.414.62135850male
335GentooBiscoe49.315.72175850male
336GentooBiscoe55.116.02305850male
337GentooBiscoe45.216.42235950male
338GentooBiscoe49.815.92295950male
339GentooBiscoe51.116.32206000male
340GentooBiscoe48.816.22226000male
341GentooBiscoe59.617.02306050male
342GentooBiscoe49.215.22216300male
343AdelieTorgersenmissingmissingmissingmissingmissing
344GentooBiscoemissingmissingmissingmissingmissing
\n```\n:::\n:::\n\n\n\n::: {#54 .cell execution_count=1}\n``` {.julia .cell-code}\n@arrange penguins species body_mass_g\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
344×7 DataFrame
319 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieBiscoe36.516.61812850female
2AdelieBiscoe36.417.11842850female
3AdelieBiscoe34.518.11872900female
4AdelieDream33.116.11782900female
5AdelieTorgersen38.617.01882900female
6AdelieBiscoe37.918.61932925female
7AdelieDream37.518.91792975missing
8AdelieDream37.016.91853000female
9AdelieDream37.316.81923000female
10AdelieTorgersen35.916.61903050female
11AdelieTorgersen35.215.91863050female
12AdelieTorgersen39.017.11913050female
13AdelieDream32.115.51883050female
333GentooBiscoe49.516.22295800male
334GentooBiscoe48.616.02305800male
335GentooBiscoe48.414.62135850male
336GentooBiscoe49.315.72175850male
337GentooBiscoe55.116.02305850male
338GentooBiscoe45.216.42235950male
339GentooBiscoe49.815.92295950male
340GentooBiscoe51.116.32206000male
341GentooBiscoe48.816.22226000male
342GentooBiscoe59.617.02306050male
343GentooBiscoe49.215.22216300male
344GentooBiscoemissingmissingmissingmissingmissing
\n```\n:::\n:::\n\n\n\n::: {#56 .cell execution_count=1}\n``` {.julia .cell-code}\n@arrange penguins island desc(body_mass_g)\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
344×7 DataFrame
319 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1GentooBiscoemissingmissingmissingmissingmissing
2GentooBiscoe49.215.22216300male
3GentooBiscoe59.617.02306050male
4GentooBiscoe51.116.32206000male
5GentooBiscoe48.816.22226000male
6GentooBiscoe45.216.42235950male
7GentooBiscoe49.815.92295950male
8GentooBiscoe48.414.62135850male
9GentooBiscoe49.315.72175850male
10GentooBiscoe55.116.02305850male
11GentooBiscoe49.516.22295800male
12GentooBiscoe48.616.02305800male
13GentooBiscoe50.415.72225750male
333AdelieTorgersen41.118.61893325male
334AdelieTorgersen38.517.91903325female
335AdelieTorgersen37.817.11863300missing
336AdelieTorgersen38.817.61913275female
337AdelieTorgersen40.318.01953250female
338AdelieTorgersen41.117.61823200female
339AdelieTorgersen34.617.21893200female
340AdelieTorgersen36.217.21873150female
341AdelieTorgersen35.916.61903050female
342AdelieTorgersen35.215.91863050female
343AdelieTorgersen39.017.11913050female
344AdelieTorgersen38.617.01882900female
\n```\n:::\n:::\n\n\n", + "supporting": [ + "dataframes-rows_files" + ], + "filters": [], + "includes": { + "include-in-header": [ + "\n\n\n" + ] + } + } +} \ No newline at end of file diff --git a/_freeze/dataframes/execute-results/html.json b/_freeze/dataframes/execute-results/html.json index ea227e6..be0a4ce 100644 --- a/_freeze/dataframes/execute-results/html.json +++ b/_freeze/dataframes/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "103b5252701e836620eb447a28e1e311", + "hash": "455f47e9eeff41c1f1437673418aca7b", "result": { "engine": "julia", - "markdown": "---\n# jupyter: julia-1.10\nengine: julia\n---\n\n\n\n\n\n# Part 2: Dataframes\n\nDataframes are one of the most important objects in data science. A dataframe is a table where each row is an observation and each column is a variable.\n\nWe will use the Palmer Penguin dataset as a toy example for the remaining of the chapter.\n\n\n\n\n\n::: {#2 .cell execution_count=1}\n``` {.julia .cell-code}\nusing DataFrames, PalmerPenguins\nusing Tidier\nimport DataFramesMeta as DFM\n\npenguins = PalmerPenguins.load() |> DataFrame\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
344×7 DataFrame
319 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
333ChinstrapDream45.216.61913250female
334ChinstrapDream49.319.92034050male
335ChinstrapDream50.218.82023800male
336ChinstrapDream45.619.41943525female
337ChinstrapDream51.919.52063950male
338ChinstrapDream46.816.51893650female
339ChinstrapDream45.717.01953650female
340ChinstrapDream55.819.82074000male
341ChinstrapDream43.518.12023400female
342ChinstrapDream49.618.21933775male
343ChinstrapDream50.819.02104100male
344ChinstrapDream50.218.71983775female
\n```\n:::\n:::\n\n\n\n\n\n\n\n::: {.callout-note}\n\n`Dataframes.jl` is the main package for dealing with dataframes in Julia. You can use it directly to manipulate tables, but we also have 2 alternatives: DataFramesMeta and Tidier. \n\nDataFramesMeta is a collection of macros based on DataFrames.\n\nTidier is inspired by the `tidyverse` ecosystem in R. Tidier use macros to rewrite your code into DataFrames.jl code. Because of this \"tidy\" heritance, we will often talk about the R packages that inspired the Julia ones (like `dplyr`, `tidyr` and many others).\n\nIn this book, whenever possible, we will show the different approaches in a tabset so you can compare them.\n:::\n\n## Operations\n\nLet's start with some operations that take only one dataframe as input.^[Join operations will be dealt later.]. Here is the basic terminology:\n\n- *Selecting* is when we select some columns of a dataframe, while keeping all the rows. Example: select the `species` and `sex` columns.\n\n- *Filtering* or *subsetting* is when we select a subset of rows based on some criteria. Example: all male penguins of species Adelie. The output is a dataframe with the exact same columns, but possibly fewer rows.\n\n- *Mutating* or *transforming* is when we create new columns. Example: a new column `body_mass_kg` can be obtained dividing the column `body_mass_g` by 1000.\n\n- *Grouping* is when we split the dataframe into a collection (array) of dataframes using some criteria. Example: grouping by `species` gives us 3 dataframes, each with only one species.\n\n- *Summarising* or *combining* is when we apply some function to some columns in order to reduce the amount of rows with some kind of summary (like a mean, median, max, and so on). Example: for each `species`, apply the `mean` function to the columns `body_mass_g`. This will yield a dataframe with 3 rows, one for each species. 
Summarising is usually done after a grouping, so the summary is calculated with relation to each of the groups.\n\n- *Arranging* or *ordering* is when we reorder the rows of a dataframe using some criteria.\n\nSince all these functions return a dataframe (or an array of dataframes, in the case of grouping), we can chain these operations together, with the convention that on grouped dataframes we apply the function in each one of the groups.\n\nLet's see each operation with more details.\n\n## Comparing Tidier with DataFramesMeta\n\nThe following table list the operations on each package:\n\n| dplyr | Tidier | DataFramesMeta | DataFrames |\n|-------------|--------------|------------------------------|--------------|\n| `select` | `@select` | `@select` | array sintax |\n| `filter` | `@filter` | `@subset` / `@rsubset` | `filter` |\n| `mutate` | `@mutate` | `@transform` / `@rtransform` | array sintax |\n| `group_by` | `@group_by` | `@groupby` | `groupby` |\n| `summarise` | `@summarise` | `@combine` | `combine` |\n| `arrange` | `@arrange` | `@orderby` / `@rorderby` | `sort!` |\n\n\nNotice that we have a name clash with `@select`: that is why we `import DataFramesMeta as DFM` at the beginning.\n\n", + "markdown": "---\n# jupyter: julia-1.10\nengine: julia\n---\n\n\n\n\n\n# Part 2: Dataframes\n\nDataframes are one of the most important objects in data science. A dataframe is a table where each row is an observation and each column is a variable.\n\nWe will use the Palmer Penguin dataset as a toy example for the remaining of the chapter.\n\n\n\n\n\n::: {#2 .cell execution_count=1}\n``` {.julia .cell-code}\nusing DataFrames, PalmerPenguins\nusing Tidier, Chain\nimport DataFramesMeta as DFM\n\npenguins = PalmerPenguins.load() |> DataFrame\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
344×7 DataFrame
319 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
333ChinstrapDream45.216.61913250female
334ChinstrapDream49.319.92034050male
335ChinstrapDream50.218.82023800male
336ChinstrapDream45.619.41943525female
337ChinstrapDream51.919.52063950male
338ChinstrapDream46.816.51893650female
339ChinstrapDream45.717.01953650female
340ChinstrapDream55.819.82074000male
341ChinstrapDream43.518.12023400female
342ChinstrapDream49.618.21933775male
343ChinstrapDream50.819.02104100male
344ChinstrapDream50.218.71983775female
\n```\n:::\n:::\n\n\n\n\n\n\n\n::: {.callout-note}\n\n`Dataframes.jl` is the main package for dealing with dataframes in Julia. You can use it directly to manipulate tables, but we also have 2 alternatives: DataFramesMeta and Tidier. \n\nDataFramesMeta is a collection of macros based on DataFrames.\n\nTidier is inspired by the `tidyverse` ecosystem in R. Tidier use macros to rewrite your code into DataFrames.jl code. Because of this \"tidy\" heritance, we will often talk about the R packages that inspired the Julia ones (like `dplyr`, `tidyr` and many others).\n\nIn this book, whenever possible, we will show the different approaches in a tabset so you can compare them, giving more emphasis on Tidier.\n:::\n\n## Operations\n\nLet's start with some unary operations, ie. operations that take only one dataframe as input and return one dataframe as output.^[Join operations will be dealt later.]. We can divide these operations in some categories:\n\n### Rows operations\n\nThese are operations that only affect rows, leaving all columns untouched.\n\n- *Filtering* or *subsetting* is when we select a subset of rows based on some criteria. Example: all male penguins of species Adelie. The output is a dataframe with the exact same columns, but possibly fewer rows.\n\n- *Arranging* or *ordering* is when we reorder the rows of a dataframe using some criteria.\n\n### Column operations\n\nThese are operations that only affect columns, leaving all rows untouched.\n\n- *Selecting* is when we select some columns of a dataframe, while keeping all the rows. Example: select the `species` and `sex` columns.\n\n- *Mutating* or *transforming* is when we create new columns. 
Example: a new column `body_mass_kg` can be obtained dividing the column `body_mass_g` by 1000.\n\n### Reshaping operations\n\nThese operations change the shape of a dataframe, making it wider or longer.\n\n- `Widening`\n\n- `Longering`?\n\n### Grouping operations\n\n- *Grouping* is when we split the dataframe into a collection (array) of dataframes using some criteria. Example: grouping by `species` gives us 3 dataframes, each with only one species.\n\n### Mixed operations\n\nThese operations can possibly change rows and columns at the same time.\n\n- Distinct;\n- Counting;\n- *Summarising* or *combining* is when we apply some function to some columns in order to reduce the amount of rows with some kind of summary (like a mean, median, max, and so on). Example: for each `species`, apply the `mean` function to the columns `body_mass_g`. This will yield a dataframe with 3 rows, one for each species. Summarising is usually done after a grouping, so the summary is calculated with relation to each of the groups.\n\n??? keep grouping and summarising together?\n\nSince all these functions return a dataframe (or an array of dataframes, in the case of grouping), we can chain these operations together, with the convention that on grouped dataframes we apply the function in each one of the groups.\n\nNow for binary operations (i.e. 
operations that take two dataframes), we have all the joins:\n\n- Left join;\n- Right join;\n- Inner join;\n- Outer join;\n- Full join.\n\n## Comparing Tidier with DataFramesMeta\n\nThe following table list the operations on each package:\n\n| dplyr | Tidier | DataFramesMeta | DataFrames |\n|-------------|--------------|------------------------------|--------------|\n| `filter` | `@filter` | `@subset` / `@rsubset` | `subset` |\n| `arrange` | `@arrange` | `@orderby` / `@rorderby` | `sort!` |\n| `select` | `@select` | `@select` | array sintax |\n| `mutate` | `@mutate` | `@transform` / `@rtransform` | array sintax |\n| `group_by` | `@group_by` | `@groupby` | `groupby` |\n| `summarise` | `@summarise` | `@combine` | `combine` |\n\nIt is clear that for those coming from `R`, Tidier will look like the most natural approach.\n\nNotice that we have a name clash with `@select`: that is why we `import DataFramesMeta as DFM` at the beginning.\n\nWe will see each operation with more details in the following chapters.\n\n## Chaining operations\n\nWe can chain (or pipe) dataframe operations as follows with the `@chain` macro:\n\n\n\n\n\n::: {#4 .cell execution_count=0}\n``` {.julia .cell-code}\n@chain penguins begin\n @filter !ismissing(sex)\n @group_by sex\n @summarise mean = mean(bill_length_mm)\n @arrange mean\nend\n```\n:::\n\n\n", "supporting": [ "dataframes_files" ], diff --git a/_quarto.yml b/_quarto.yml index b14d4f8..dbeb4e9 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -34,7 +34,9 @@ book: # - dataframes.qmd - part: dataframes.qmd chapters: - - dataframes-filtering.qmd + - dataframes-rows.qmd + # - dataframes-columns.qmd + # - dataframes-groups.qmd # - part: "Part 2: Dataframes" - part: "Part 3: Reading data" - part: "Part 4: Plotting data" diff --git a/dataframes-columns.qmd b/dataframes-columns.qmd new file mode 100644 index 0000000..1e67807 --- /dev/null +++ b/dataframes-columns.qmd @@ -0,0 +1,19 @@ +--- +# jupyter: julia-1.10 +engine: julia +--- + +## Operations on 
columns + +::: {.panel-tabset} + +## Tidier + +## DataFramesMeta + +## DataFrames + +::: + +## Conditionally mutating columns + diff --git a/dataframes-filtering.qmd b/dataframes-filtering.qmd deleted file mode 100644 index a0c616f..0000000 --- a/dataframes-filtering.qmd +++ /dev/null @@ -1,159 +0,0 @@ ---- -# jupyter: julia-1.10 -engine: julia ---- - -# Filtering - -```{julia} -using DataFrames, PalmerPenguins -using Tidier -import DataFramesMeta as DFM - -penguins = PalmerPenguins.load() |> DataFrame; -@slice_head(penguins, n = 15) -``` - -To filter a dataframe in Tidier, we use the macro `@filter`. You can use it in the form - -```{julia} -@filter(penguins, species == "Adelie") -``` - -or without parentesis as in - -```{julia} -@filter penguins species == "Adelie" -``` - -Notice that the columns are typed as if they were variables on the Julia environment. This is inspired by the `tidyverse` behaviour of data-masking: inside a tidyverse verb, the columns are taken as "statistical variables" that exist inside the dataframe as columns. - -In DataFramesMeta, we have two macros for filtering: `@subset` and `@rsubset`. Use the first when you have some criteria that uses the whole dataframe, for example: - -```{julia} -DFM.@subset penguins :body_mass_g .>= mean(skipmissing(:body_mass_g)) -``` - -Notice the broadcast on >=. We need it because *each row is interpreted as an array*. Also, notice that we refer to columns as _symbols_ (i.e. we append `:` to it). - -In the above example, we needed the whole column `body_mass_g` to take the mean and then filter the rows based on that. If, however, your filtering criteria only uses information about each row, then `@rsubset` (row subset) is easier to use: it interprets each columns as a value (not an array), so no broadcasting is needed: - -```{julia} -DFM.@rsubset penguins :species == "Adelie" -``` - -In both Tidier and DataFramesMeta, only the rows to which the criteria is `true` are returned. 
This means that you don't need to worry about `missing` values in cases where the criteria do not return `false` nor `true. - -## Filtering with one criteria - -Filtering all the rows with `species` = "Adelie". - -::: {.panel-tabset} - -## Tidier - -```{julia} -@filter penguins species == "Adelie" -``` - -## DataFramesMeta - -```{julia} -DFM.@rsubset penguins :species == "Adelie" -``` - -## DataFrames - -```{julia} -filter(r -> r.species == "Adelie", penguins) -``` - -::: - -## Filtering with several criteria - -Filtering all the rows with `species` = "Adelie", `sex` = "male" and `body_mass_g` > 4000. - -::: {.panel-tabset} - -## Tidier - -```{julia} -@filter penguins species == "Adelie" sex == "male" body_mass_g > 4000 -``` - -## DataFramesMeta - -```{julia} -DFM.@rsubset penguins :species == "Adelie" :sex == "male" :body_mass_g > 4000 -``` - -## DataFrames - -```{julia} -filter(r -> ((r.species == "Adelie") & (r.sex == "male") & (r.body_mass_g > 4000)) === true, penguins) -``` - -::: - - -Filtering all the rows where the `flipper_length_mm` is greater than the mean. 
- -::: {.panel-tabset} - -## Tidier - -```{julia} -@filter penguins flipper_length_mm > mean(skipmissing(flipper_length_mm)) -``` - -## DataFramesMeta - -```{julia} -DFM.@subset penguins :flipper_length_mm .>= mean(skipmissing(:flipper_length_mm)) -``` - -## DataFrames - -```{julia} -filter(r -> (r.flipper_length_mm > mean(skipmissing(penguins.flipper_length_mm))) === true, penguins) -``` - -::: - -## Filtering with a variable column name - -Suppose the column you want to filter is a variable, let's say - -```{julia} -# filter_column = "species" -column_symbol = :species -``` - -::: {.panel-tabset} - -## Tidier - -```{julia} -# @chain penguins begin -# @filter(!!filter_column == "Adelie") -# # @select(!!filter_column) -# end -# @filter(penguins, !!filter_column == "Adelie") -``` - -## DataFramesMeta - -```{julia} -DFM.@rsubset penguins $column_symbol == "Adelie" -``` - -::: - -In case the column is a string instead of a symbol, we can write - -```{julia} -column_string = "species" - -DFM.@rsubset penguins $(Symbol(column_string)) == "Adelie" -``` \ No newline at end of file diff --git a/dataframes-mutating.qmd b/dataframes-groups.qmd similarity index 100% rename from dataframes-mutating.qmd rename to dataframes-groups.qmd diff --git a/dataframes-reshape.qmd b/dataframes-reshape.qmd new file mode 100644 index 0000000..e58845d --- /dev/null +++ b/dataframes-reshape.qmd @@ -0,0 +1,16 @@ +--- +# jupyter: julia-1.10 +engine: julia +--- + +## Creating columns + +::: {.panel-tabset} + +## Tidier + +## DataFramesMeta + +## DataFrames + +::: \ No newline at end of file diff --git a/dataframes-rows.qmd b/dataframes-rows.qmd new file mode 100644 index 0000000..6971f15 --- /dev/null +++ b/dataframes-rows.qmd @@ -0,0 +1,233 @@ +--- +# jupyter: julia-1.10 +engine: julia +--- + +# Operations on rows + +```{julia} +using DataFrames, PalmerPenguins +using Tidier +import DataFramesMeta as DFM + +penguins = PalmerPenguins.load() |> DataFrame; +@slice_head(penguins, n = 15) +``` + +## 
Filtering + +To filter is to keep only the rows that satisfy a certain criterion (i.e. a boolean condition). + +To filter a dataframe in Tidier, we use the macro `@filter`. You can use it in the form + +```{julia} +@filter(penguins, species == "Adelie") +``` + +or without parentheses as in + +```{julia} +@filter penguins species == "Adelie" +``` + +Notice that the columns are typed as if they were variables in the Julia environment. This is inspired by the `tidyverse` behaviour of data-masking: inside a tidyverse verb, the columns are taken as "statistical variables" that exist inside the dataframe as columns. + +In DataFramesMeta, we have two macros for filtering: `@subset` and `@rsubset`. Use the first when you have a criterion that uses a whole column, for example: + +```{julia} +DFM.@subset penguins :body_mass_g .>= mean(skipmissing(:body_mass_g)) +``` + +Notice the broadcast on `>=`. We need it because *each variable is interpreted as an array (the whole column)*. Also, notice that we refer to columns as _symbols_ (i.e. we prefix the column name with `:`). + +In the above example, we needed the whole column `body_mass_g` to take the mean and then filter the rows based on that. If, however, your filtering criterion only uses information about each row (without needing to see it in context of the whole column), then `@rsubset` (row subset) is easier to use: it interprets each column as a value (not an array), so no broadcasting is needed: + +```{julia} +DFM.@rsubset penguins :species == "Adelie" +``` + +In both Tidier and DataFramesMeta, only the rows for which the criterion is `true` are returned. This means that rows where it evaluates to `false` or `missing` are dropped. + +In DataFrames, we use the `subset` function, and the criterion is passed with the notation + +```{julia} +#| eval: false + +subset(penguins, :column => boolean_function) + +``` + +where `boolean_function` is a boolean (with possibly `missing` values) function of one variable. 
Add the kwarg `skipmissing=true` to treat `missing` results as `false`; without it, `subset` throws an error when the condition returns `missing`. + +### Filtering with one criterion + +Filtering all the rows with `species` = "Adelie". + +::: {.panel-tabset} + +## Tidier + +```{julia} +@filter penguins species == "Adelie" +``` + +## DataFramesMeta + +```{julia} +DFM.@rsubset penguins :species == "Adelie" +``` + +## DataFrames + +```{julia} +subset(penguins, :species => x -> x .== "Adelie", skipmissing=true) +``` + +::: + +### Filtering with several criteria + +Filtering all the rows with `species` = "Adelie", `sex` = "male" and `body_mass_g` > 4000. + +::: {.panel-tabset} + +## Tidier + +```{julia} +@filter penguins species == "Adelie" sex == "male" body_mass_g > 4000 +``` + +## DataFramesMeta + +```{julia} +DFM.@rsubset penguins :species == "Adelie" :sex == "male" :body_mass_g > 4000 +``` + +## DataFrames + +```{julia} +subset(penguins, [:species, :sex, :body_mass_g] => (x, y, z) -> (x .== "Adelie") .& (y .== "male") .& (z .> 4000), skipmissing=true) +``` + +::: + + +Filtering all the rows with `species` = "Adelie" OR `sex` = "male". + +::: {.panel-tabset} + +## Tidier + +```{julia} +@filter penguins (species == "Adelie") | (sex == "male") +``` + +## DataFramesMeta + +```{julia} +DFM.@rsubset penguins (:species == "Adelie") | (:sex == "male") +``` + +## DataFrames + +```{julia} +subset(penguins, [:species, :sex] => (x, y) -> (x .== "Adelie") .| (y .== "male"), skipmissing=true) +``` + +::: + + +Filtering all the rows where the `flipper_length_mm` is greater than the mean. 
+ +::: {.panel-tabset} + +## Tidier + +```{julia} +@filter penguins flipper_length_mm > mean(skipmissing(flipper_length_mm)) +``` + +## DataFramesMeta + +```{julia} +DFM.@subset penguins :flipper_length_mm .>= mean(skipmissing(:flipper_length_mm)) +``` + +## DataFrames + +```{julia} +subset(penguins, :flipper_length_mm => x -> x .> mean(skipmissing(x)), skipmissing=true) +``` + +::: + +### Filtering with a variable column name + +Suppose the column you want to filter is a variable, let's say + +```{julia} +my_column = :species +``` + +::: {.panel-tabset} + +## Tidier + +```{julia} +# how to do it?? +# @filter(penguins, !!(my_column) .== "Adelie") +``` + +## DataFramesMeta + +```{julia} +DFM.@rsubset penguins $my_column == "Adelie" +``` + +## DataFrames + +```{julia} +subset(penguins, my_column => x -> x .== "Adelie") +``` + +::: + +In case the column is a string + +```{julia} +my_column2 = "species" +``` + +instead of a symbol, we can write + +::: {.panel-tabset} + +## DataFramesMeta + +```{julia} +DFM.@rsubset penguins $(Symbol(my_column2)) == "Adelie" +``` + +## DataFrames + +```{julia} +subset(penguins, my_column2 => x -> x .== "Adelie") +``` + +::: + +## Arranging + +Arranging is when we reorder the rows of a dataframe according to some criteria. + +```{julia} +@arrange penguins body_mass_g +``` + +```{julia} +@arrange penguins species body_mass_g +``` + +```{julia} +@arrange penguins island desc(body_mass_g) +``` \ No newline at end of file diff --git a/dataframes.qmd b/dataframes.qmd index 0ce7baa..6f89bf3 100644 --- a/dataframes.qmd +++ b/dataframes.qmd @@ -10,6 +10,7 @@ Dataframes are one of the most important objects in data science. A dataframe is We will use the Palmer Penguin dataset as a toy example for the remaining of the chapter. ```{julia} +#| eval: true using DataFrames, PalmerPenguins using Tidier, Chain import DataFramesMeta as DFM @@ -25,39 +26,73 @@ DataFramesMeta is a collection of macros based on DataFrames. 
Tidier is inspired by the `tidyverse` ecosystem in R. Tidier use macros to rewrite your code into DataFrames.jl code. Because of this "tidy" heritance, we will often talk about the R packages that inspired the Julia ones (like `dplyr`, `tidyr` and many others). -In this book, whenever possible, we will show the different approaches in a tabset so you can compare them. +In this book, whenever possible, we will show the different approaches in a tabset so you can compare them, giving more emphasis on Tidier. ::: ## Operations -Let's start with some operations that take only one dataframe as input.^[Join operations will be dealt later.]. Here is the basic terminology: +Let's start with some unary operations, ie. operations that take only one dataframe as input and return one dataframe as output.^[Join operations will be dealt later.]. We can divide these operations in some categories: -- *Selecting* is when we select some columns of a dataframe, while keeping all the rows. Example: select the `species` and `sex` columns. +### Rows operations + +These are operations that only affect rows, leaving all columns untouched. - *Filtering* or *subsetting* is when we select a subset of rows based on some criteria. Example: all male penguins of species Adelie. The output is a dataframe with the exact same columns, but possibly fewer rows. +- *Arranging* or *ordering* is when we reorder the rows of a dataframe using some criteria. + +### Column operations + +These are operations that only affect columns, leaving all rows untouched. + +- *Selecting* is when we select some columns of a dataframe, while keeping all the rows. Example: select the `species` and `sex` columns. + - *Mutating* or *transforming* is when we create new columns. Example: a new column `body_mass_kg` can be obtained dividing the column `body_mass_g` by 1000. +### Reshaping operations + +These operations change the shape of a dataframe, making it wider or longer. + +- `Widening` + +- `Longering`? 
+ +### Grouping operations + - *Grouping* is when we split the dataframe into a collection (array) of dataframes using some criteria. Example: grouping by `species` gives us 3 dataframes, each with only one species. +### Mixed operations + +These operations can possibly change rows and columns at the same time. + +- Distinct; +- Counting; - *Summarising* or *combining* is when we apply some function to some columns in order to reduce the amount of rows with some kind of summary (like a mean, median, max, and so on). Example: for each `species`, apply the `mean` function to the columns `body_mass_g`. This will yield a dataframe with 3 rows, one for each species. Summarising is usually done after a grouping, so the summary is calculated with relation to each of the groups. -- *Arranging* or *ordering* is when we reorder the rows of a dataframe using some criteria. +??? keep grouping and summarising together? Since all these functions return a dataframe (or an array of dataframes, in the case of grouping), we can chain these operations together, with the convention that on grouped dataframes we apply the function in each one of the groups. +Now for binary operations (i.e. operations that take two dataframes), we have all the joins: + +- Left join; +- Right join; +- Inner join; +- Outer join; +- Full join. 
+ ## Comparing Tidier with DataFramesMeta The following table list the operations on each package: | dplyr | Tidier | DataFramesMeta | DataFrames | |-------------|--------------|------------------------------|--------------| +| `filter` | `@filter` | `@subset` / `@rsubset` | `subset` | +| `arrange` | `@arrange` | `@orderby` / `@rorderby` | `sort!` | | `select` | `@select` | `@select` | array sintax | -| `filter` | `@filter` | `@subset` / `@rsubset` | `filter` | | `mutate` | `@mutate` | `@transform` / `@rtransform` | array sintax | | `group_by` | `@group_by` | `@groupby` | `groupby` | | `summarise` | `@summarise` | `@combine` | `combine` | -| `arrange` | `@arrange` | `@orderby` / `@rorderby` | `sort!` | It is clear that for those coming from `R`, Tidier will look like the most natural approach. @@ -70,6 +105,7 @@ We will see each operation with more details in the following chapters. We can chain (or pipe) dataframe operations as follows with the `@chain` macro: ```{julia} +#| eval: false @chain penguins begin @filter !ismissing(sex) @group_by sex diff --git a/docs/dataframes-rows.html b/docs/dataframes-rows.html new file mode 100644 index 0000000..b60d4d3 --- /dev/null +++ b/docs/dataframes-rows.html @@ -0,0 +1,7745 @@ + + + + + + + + + +1  Operations on rows – Tidier Data Science with Julia + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+ +
+ +
+ + +
+ + + +
+ +
+
+

1  Operations on rows

+
+ + + +
+ + + + +
+ + + +
+ + +
+
using DataFrames, PalmerPenguins
+using Tidier
+import DataFramesMeta as DFM
+
+penguins = PalmerPenguins.load() |> DataFrame;
+@slice_head(penguins, n = 15)
+
+
15×7 DataFrame

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| 14 | Adelie | Torgersen | 38.6 | 21.2 | 191 | 3800 | male |
| 15 | Adelie | Torgersen | 34.6 | 21.1 | 198 | 4400 | male |
+
+
+
+
+

1.1 Filtering

+

To filter is to keep only the rows that satisfy a certain criterion (i.e. a boolean condition).

+

To filter a dataframe in Tidier, we use the macro @filter. You can use it in the form

+
+
@filter(penguins, species == "Adelie")
+
+
152×7 DataFrame
127 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
+
+
+
+

or without parentheses, as in

+
+
@filter penguins species == "Adelie"
+
+
152×7 DataFrame
127 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
+
+
+
+

Notice that the columns are written as if they were variables in the Julia environment. This is inspired by the tidyverse behaviour of data-masking: inside a tidyverse verb, the columns are taken as “statistical variables” that exist inside the dataframe as columns.

+

In DataFramesMeta, we have two macros for filtering: @subset and @rsubset. Use the first when your criterion needs a whole column, for example:

+
+
DFM.@subset penguins :body_mass_g .>= mean(skipmissing(:body_mass_g))
+
+
149×7 DataFrame
124 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 2 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 3 | Adelie | Torgersen | 34.6 | 21.1 | 198 | 4400 | male |
| 4 | Adelie | Torgersen | 42.5 | 20.7 | 197 | 4500 | male |
| 5 | Adelie | Dream | 39.8 | 19.1 | 184 | 4650 | male |
| 6 | Adelie | Dream | 44.1 | 19.7 | 196 | 4400 | male |
| 7 | Adelie | Dream | 39.6 | 18.8 | 190 | 4600 | male |
| 8 | Adelie | Biscoe | 40.1 | 18.9 | 188 | 4300 | male |
| 9 | Adelie | Biscoe | 41.3 | 21.1 | 195 | 4400 | male |
| 10 | Adelie | Torgersen | 41.8 | 19.4 | 198 | 4450 | male |
| 11 | Adelie | Torgersen | 42.8 | 18.5 | 195 | 4250 | male |
| 12 | Adelie | Torgersen | 42.9 | 17.6 | 196 | 4700 | male |
| 13 | Adelie | Dream | 41.1 | 18.1 | 205 | 4300 | male |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 138 | Gentoo | Biscoe | 47.2 | 13.7 | 214 | 4925 | female |
| 139 | Gentoo | Biscoe | 46.8 | 14.3 | 215 | 4850 | female |
| 140 | Gentoo | Biscoe | 50.4 | 15.7 | 222 | 5750 | male |
| 141 | Gentoo | Biscoe | 45.2 | 14.8 | 212 | 5200 | female |
| 142 | Gentoo | Biscoe | 49.9 | 16.1 | 213 | 5400 | male |
| 143 | Chinstrap | Dream | 49.2 | 18.2 | 195 | 4400 | male |
| 144 | Chinstrap | Dream | 52.8 | 20.0 | 205 | 4550 | male |
| 145 | Chinstrap | Dream | 54.2 | 20.8 | 201 | 4300 | male |
| 146 | Chinstrap | Dream | 52.0 | 20.7 | 210 | 4800 | male |
| 147 | Chinstrap | Dream | 53.5 | 19.9 | 205 | 4500 | male |
| 148 | Chinstrap | Dream | 50.8 | 18.5 | 201 | 4450 | male |
| 149 | Chinstrap | Dream | 49.0 | 19.6 | 212 | 4300 | male |
+
+
+
+

Notice the broadcast on >=. We need it because each variable is interpreted as an array (the whole column). Also, notice that we refer to columns as symbols (i.e. we prepend : to the column name).
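To see why the dot is needed, here is a minimal sketch in plain Julia, outside any dataframe. The vector `body_mass` is a hypothetical stand-in for a column, not the actual penguins data.

```julia
# Hypothetical stand-in for a column of the dataframe
body_mass = [3750, 3800, 4675]

# The dotted operator compares element by element and returns a Bool vector
mask = body_mass .>= 4000

# Without the dot, Julia tries to compare the whole vector with a number,
# which is not defined and raises a MethodError
threw = try
    body_mass >= 4000
    false
catch err
    err isa MethodError
end
```

The resulting `mask` has one entry per row, which is exactly the shape a filtering criterion must have.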

+

In the above example, we needed the whole column body_mass_g to take the mean and then filter the rows based on that. If, however, your filtering criterion only uses information about each row (without needing to see it in the context of the whole column), then @rsubset (row subset) is easier to use: it interprets each column as a value (not an array), so no broadcasting is needed:

+
+
DFM.@rsubset penguins :species == "Adelie"
+
+
152×7 DataFrame
127 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
+
+
+
+

In both Tidier and DataFramesMeta, only the rows for which the criterion is true are returned. This means that rows where it evaluates to false or missing are thrown away.
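The difference between the two DataFramesMeta macros, and the way missing values are dropped, can be sketched on a small made-up dataframe. The `toy` table below is an assumption for illustration only; it is not the penguins dataset.

```julia
using DataFrames, Statistics
import DataFramesMeta as DFM

# Hypothetical toy data with one missing body mass
toy = DataFrame(species = ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
                body_mass_g = [3000, 5000, missing, 4000])

# Column-wise: :body_mass_g is the whole column, so the comparison is broadcast.
# The mean of the non-missing values is 4000.
above_mean = DFM.@subset toy :body_mass_g .>= mean(skipmissing(:body_mass_g))

# Row-wise: :species is a single value in each row, so no dot is needed
adelie = DFM.@rsubset toy :species == "Adelie"
```

Here `above_mean` should keep the Gentoo and Chinstrap rows, while the row with a missing body mass is dropped rather than raising an error.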

+

In DataFrames, we use the subset function, and the criteria is passed with the notation

+
+
subset(penguins, :column => boolean_function)
+
+

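where boolean_function takes the whole column (a vector) and returns a vector of booleans, possibly containing missing values. By default, subset throws an error if the result contains missing; add the kwarg skipmissing=true to treat missing as false and drop those rows.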
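As a minimal sketch of this notation, the example below filters a small made-up dataframe (the `toy` table is hypothetical, not the penguins data). It also shows `ByRow`, a DataFrames wrapper that applies a scalar function to each element so that no explicit broadcasting is needed.

```julia
using DataFrames

# Hypothetical toy data with one missing body mass
toy = DataFrame(species = ["Adelie", "Gentoo", "Adelie"],
                body_mass_g = [3750, missing, 4675])

# The function receives the whole column, so the comparison is broadcast
heavy = subset(toy, :body_mass_g => x -> x .> 4000, skipmissing = true)

# ByRow turns a function on single values into a function on the column
heavy_byrow = subset(toy, :body_mass_g => ByRow(x -> x > 4000), skipmissing = true)
```

Both calls should return the same single row; which form to use is mostly a matter of taste.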

+
+

1.1.1 Filtering with one criterion

+

Filtering all the rows with species = “Adelie”.

+
+ +
+
+
+
@filter penguins species == "Adelie"
+
+
152×7 DataFrame
127 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
+
+
+
+
+
+
+
DFM.@rsubset penguins :species == "Adelie"
+
+
152×7 DataFrame
127 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
+
+
+
+
+
+
+
subset(penguins, :species => x -> x .== "Adelie", skipmissing=true)
+
+
152×7 DataFrame
127 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
+
+
+
+
+
+
+
+
+

1.1.2 Filtering with several criteria

+

Filtering all the rows with species = “Adelie”, sex = “male” and body_mass_g > 4000.

+
+ +
+
+
+
@filter penguins species == "Adelie" sex == "male" body_mass_g > 4000
+
+
34×7 DataFrame
9 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 2 | Adelie | Torgersen | 34.6 | 21.1 | 198 | 4400 | male |
| 3 | Adelie | Torgersen | 42.5 | 20.7 | 197 | 4500 | male |
| 4 | Adelie | Torgersen | 46.0 | 21.5 | 194 | 4200 | male |
| 5 | Adelie | Dream | 39.2 | 21.1 | 196 | 4150 | male |
| 6 | Adelie | Dream | 39.8 | 19.1 | 184 | 4650 | male |
| 7 | Adelie | Dream | 44.1 | 19.7 | 196 | 4400 | male |
| 8 | Adelie | Dream | 39.6 | 18.8 | 190 | 4600 | male |
| 9 | Adelie | Dream | 42.3 | 21.2 | 191 | 4150 | male |
| 10 | Adelie | Biscoe | 40.1 | 18.9 | 188 | 4300 | male |
| 11 | Adelie | Biscoe | 42.0 | 19.5 | 200 | 4050 | male |
| 12 | Adelie | Biscoe | 41.3 | 21.1 | 195 | 4400 | male |
| 13 | Adelie | Biscoe | 41.1 | 18.2 | 192 | 4050 | male |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 23 | Adelie | Dream | 40.3 | 18.5 | 196 | 4350 | male |
| 24 | Adelie | Dream | 43.2 | 18.5 | 192 | 4100 | male |
| 25 | Adelie | Biscoe | 41.0 | 20.0 | 203 | 4725 | male |
| 26 | Adelie | Biscoe | 37.8 | 20.0 | 190 | 4250 | male |
| 27 | Adelie | Biscoe | 43.2 | 19.0 | 197 | 4775 | male |
| 28 | Adelie | Biscoe | 45.6 | 20.3 | 191 | 4600 | male |
| 29 | Adelie | Biscoe | 42.2 | 19.5 | 197 | 4275 | male |
| 30 | Adelie | Biscoe | 42.7 | 18.3 | 196 | 4075 | male |
| 31 | Adelie | Torgersen | 41.5 | 18.3 | 195 | 4300 | male |
| 32 | Adelie | Dream | 37.5 | 18.5 | 199 | 4475 | male |
| 33 | Adelie | Dream | 39.7 | 17.9 | 193 | 4250 | male |
| 34 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
+
+
+
+
+
+
+
DFM.@rsubset penguins :species == "Adelie" :sex == "male" :body_mass_g > 4000
+
+
34×7 DataFrame
9 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 2 | Adelie | Torgersen | 34.6 | 21.1 | 198 | 4400 | male |
| 3 | Adelie | Torgersen | 42.5 | 20.7 | 197 | 4500 | male |
| 4 | Adelie | Torgersen | 46.0 | 21.5 | 194 | 4200 | male |
| 5 | Adelie | Dream | 39.2 | 21.1 | 196 | 4150 | male |
| 6 | Adelie | Dream | 39.8 | 19.1 | 184 | 4650 | male |
| 7 | Adelie | Dream | 44.1 | 19.7 | 196 | 4400 | male |
| 8 | Adelie | Dream | 39.6 | 18.8 | 190 | 4600 | male |
| 9 | Adelie | Dream | 42.3 | 21.2 | 191 | 4150 | male |
| 10 | Adelie | Biscoe | 40.1 | 18.9 | 188 | 4300 | male |
| 11 | Adelie | Biscoe | 42.0 | 19.5 | 200 | 4050 | male |
| 12 | Adelie | Biscoe | 41.3 | 21.1 | 195 | 4400 | male |
| 13 | Adelie | Biscoe | 41.1 | 18.2 | 192 | 4050 | male |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 23 | Adelie | Dream | 40.3 | 18.5 | 196 | 4350 | male |
| 24 | Adelie | Dream | 43.2 | 18.5 | 192 | 4100 | male |
| 25 | Adelie | Biscoe | 41.0 | 20.0 | 203 | 4725 | male |
| 26 | Adelie | Biscoe | 37.8 | 20.0 | 190 | 4250 | male |
| 27 | Adelie | Biscoe | 43.2 | 19.0 | 197 | 4775 | male |
| 28 | Adelie | Biscoe | 45.6 | 20.3 | 191 | 4600 | male |
| 29 | Adelie | Biscoe | 42.2 | 19.5 | 197 | 4275 | male |
| 30 | Adelie | Biscoe | 42.7 | 18.3 | 196 | 4075 | male |
| 31 | Adelie | Torgersen | 41.5 | 18.3 | 195 | 4300 | male |
| 32 | Adelie | Dream | 37.5 | 18.5 | 199 | 4475 | male |
| 33 | Adelie | Dream | 39.7 | 17.9 | 193 | 4250 | male |
| 34 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
+
+
+
+
+
+
+
subset(penguins, [:species, :sex, :body_mass_g] => (x, y, z) -> (x .== "Adelie") .& (y .== "male") .& (z .> 4000), skipmissing=true)
+
+
34×7 DataFrame
9 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 2 | Adelie | Torgersen | 34.6 | 21.1 | 198 | 4400 | male |
| 3 | Adelie | Torgersen | 42.5 | 20.7 | 197 | 4500 | male |
| 4 | Adelie | Torgersen | 46.0 | 21.5 | 194 | 4200 | male |
| 5 | Adelie | Dream | 39.2 | 21.1 | 196 | 4150 | male |
| 6 | Adelie | Dream | 39.8 | 19.1 | 184 | 4650 | male |
| 7 | Adelie | Dream | 44.1 | 19.7 | 196 | 4400 | male |
| 8 | Adelie | Dream | 39.6 | 18.8 | 190 | 4600 | male |
| 9 | Adelie | Dream | 42.3 | 21.2 | 191 | 4150 | male |
| 10 | Adelie | Biscoe | 40.1 | 18.9 | 188 | 4300 | male |
| 11 | Adelie | Biscoe | 42.0 | 19.5 | 200 | 4050 | male |
| 12 | Adelie | Biscoe | 41.3 | 21.1 | 195 | 4400 | male |
| 13 | Adelie | Biscoe | 41.1 | 18.2 | 192 | 4050 | male |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 23 | Adelie | Dream | 40.3 | 18.5 | 196 | 4350 | male |
| 24 | Adelie | Dream | 43.2 | 18.5 | 192 | 4100 | male |
| 25 | Adelie | Biscoe | 41.0 | 20.0 | 203 | 4725 | male |
| 26 | Adelie | Biscoe | 37.8 | 20.0 | 190 | 4250 | male |
| 27 | Adelie | Biscoe | 43.2 | 19.0 | 197 | 4775 | male |
| 28 | Adelie | Biscoe | 45.6 | 20.3 | 191 | 4600 | male |
| 29 | Adelie | Biscoe | 42.2 | 19.5 | 197 | 4275 | male |
| 30 | Adelie | Biscoe | 42.7 | 18.3 | 196 | 4075 | male |
| 31 | Adelie | Torgersen | 41.5 | 18.3 | 195 | 4300 | male |
| 32 | Adelie | Dream | 37.5 | 18.5 | 199 | 4475 | male |
| 33 | Adelie | Dream | 39.7 | 17.9 | 193 | 4250 | male |
| 34 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
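In DataFrames, several criteria can also be passed as separate `column => function` pairs, which `subset` combines with AND. The sketch below compares both forms on a small made-up dataframe (the `toy` table is hypothetical, not the penguins data).

```julia
using DataFrames

# Hypothetical toy data; the last row has a missing sex
toy = DataFrame(species = ["Adelie", "Adelie", "Gentoo", "Adelie"],
                sex = ["male", "female", "male", missing],
                body_mass_g = [4200, 4300, 5000, 4500])

# One function over three columns, joined with the broadcast AND
one_function = subset(toy,
    [:species, :sex, :body_mass_g] =>
        (x, y, z) -> (x .== "Adelie") .& (y .== "male") .& (z .> 4000),
    skipmissing = true)

# One pair per criterion; a row is kept only if every criterion is true
separate = subset(toy,
    :species => x -> x .== "Adelie",
    :sex => x -> x .== "male",
    :body_mass_g => x -> x .> 4000,
    skipmissing = true)
```

Both calls should keep only the first row. The second form avoids the nested parentheses at the cost of repeating the anonymous function.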
+
+
+
+
+
+
+

Filtering all the rows with species = “Adelie” OR sex = “male”.

+
+ +
+
+
+
@filter penguins (species == "Adelie") | (sex == "male")
+
+
247×7 DataFrame
222 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 236 | Chinstrap | Dream | 50.8 | 18.5 | 201 | 4450 | male |
| 237 | Chinstrap | Dream | 49.0 | 19.6 | 212 | 4300 | male |
| 238 | Chinstrap | Dream | 51.5 | 18.7 | 187 | 3250 | male |
| 239 | Chinstrap | Dream | 51.4 | 19.0 | 201 | 3950 | male |
| 240 | Chinstrap | Dream | 50.7 | 19.7 | 203 | 4050 | male |
| 241 | Chinstrap | Dream | 52.2 | 18.8 | 197 | 3450 | male |
| 242 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male |
| 243 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male |
| 244 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male |
| 245 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male |
| 246 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male |
| 247 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male |
+
+
+
+
+
+
+
DFM.@rsubset penguins (:species == "Adelie") | (:sex == "male")
+
+
247×7 DataFrame
222 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 236 | Chinstrap | Dream | 50.8 | 18.5 | 201 | 4450 | male |
| 237 | Chinstrap | Dream | 49.0 | 19.6 | 212 | 4300 | male |
| 238 | Chinstrap | Dream | 51.5 | 18.7 | 187 | 3250 | male |
| 239 | Chinstrap | Dream | 51.4 | 19.0 | 201 | 3950 | male |
| 240 | Chinstrap | Dream | 50.7 | 19.7 | 203 | 4050 | male |
| 241 | Chinstrap | Dream | 52.2 | 18.8 | 197 | 3450 | male |
| 242 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male |
| 243 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male |
| 244 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male |
| 245 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male |
| 246 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male |
| 247 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male |
+
+
+
+
+
+
+
subset(penguins, [:species, :sex] => (x, y) -> (x .== "Adelie") .| (y .== "male"), skipmissing=true)
+
+
247×7 DataFrame
222 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 236 | Chinstrap | Dream | 50.8 | 18.5 | 201 | 4450 | male |
| 237 | Chinstrap | Dream | 49.0 | 19.6 | 212 | 4300 | male |
| 238 | Chinstrap | Dream | 51.5 | 18.7 | 187 | 3250 | male |
| 239 | Chinstrap | Dream | 51.4 | 19.0 | 201 | 3950 | male |
| 240 | Chinstrap | Dream | 50.7 | 19.7 | 203 | 4050 | male |
| 241 | Chinstrap | Dream | 52.2 | 18.8 | 197 | 3450 | male |
| 242 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male |
| 243 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male |
| 244 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male |
| 245 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male |
| 246 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male |
| 247 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male |
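The result of the OR filter contains Adelie rows whose sex is missing (rows 4 and 9 to 12). This is consistent with Julia's three-valued logic, where missing stands for "unknown": an OR with one true side is true regardless of the other side. A minimal sketch in plain Julia (the variable names are only illustrative):

```julia
# Three-valued logic in Base Julia: `missing` stands for "unknown"
adelie_unknown_sex = true | missing    # true: one true side settles an OR
other_unknown_sex  = false | missing   # missing: the result depends on the unknown side
adelie_and_unknown = true & missing    # missing: an AND needs both sides to be known
other_and_unknown  = false & missing   # false: one false side settles an AND
```

Rows for which the combined criterion is missing (for instance a non-Adelie penguin with unknown sex) are the ones dropped by skipmissing=true.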
+
+
+
+
+
+
+

Filtering all the rows where the flipper_length_mm is greater than the mean.

+
+ +
+
+
+
@filter penguins flipper_length_mm > mean(skipmissing(flipper_length_mm))
+
+
148×7 DataFrame
123 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Dream | 35.7 | 18.0 | 202 | 3550 | female |
| 2 | Adelie | Dream | 41.1 | 18.1 | 205 | 4300 | male |
| 3 | Adelie | Dream | 40.8 | 18.9 | 208 | 4300 | male |
| 4 | Adelie | Biscoe | 41.0 | 20.0 | 203 | 4725 | male |
| 5 | Adelie | Torgersen | 41.4 | 18.5 | 202 | 3875 | male |
| 6 | Adelie | Torgersen | 44.1 | 18.0 | 210 | 4000 | male |
| 7 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
| 8 | Gentoo | Biscoe | 46.1 | 13.2 | 211 | 4500 | female |
| 9 | Gentoo | Biscoe | 50.0 | 16.3 | 230 | 5700 | male |
| 10 | Gentoo | Biscoe | 48.7 | 14.1 | 210 | 4450 | female |
| 11 | Gentoo | Biscoe | 50.0 | 15.2 | 218 | 5700 | male |
| 12 | Gentoo | Biscoe | 47.6 | 14.5 | 215 | 5400 | male |
| 13 | Gentoo | Biscoe | 46.5 | 13.5 | 210 | 4550 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 137 | Chinstrap | Dream | 53.5 | 19.9 | 205 | 4500 | male |
| 138 | Chinstrap | Dream | 49.0 | 19.5 | 210 | 3950 | male |
| 139 | Chinstrap | Dream | 50.8 | 18.5 | 201 | 4450 | male |
| 140 | Chinstrap | Dream | 49.0 | 19.6 | 212 | 4300 | male |
| 141 | Chinstrap | Dream | 51.4 | 19.0 | 201 | 3950 | male |
| 142 | Chinstrap | Dream | 50.7 | 19.7 | 203 | 4050 | male |
| 143 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male |
| 144 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male |
| 145 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male |
| 146 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male |
| 147 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female |
| 148 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male |
+
+
+
+
+
+
+
DFM.@subset penguins :flipper_length_mm .> mean(skipmissing(:flipper_length_mm))
+
+
148×7 DataFrame
123 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Dream | 35.7 | 18.0 | 202 | 3550 | female |
| 2 | Adelie | Dream | 41.1 | 18.1 | 205 | 4300 | male |
| 3 | Adelie | Dream | 40.8 | 18.9 | 208 | 4300 | male |
| 4 | Adelie | Biscoe | 41.0 | 20.0 | 203 | 4725 | male |
| 5 | Adelie | Torgersen | 41.4 | 18.5 | 202 | 3875 | male |
| 6 | Adelie | Torgersen | 44.1 | 18.0 | 210 | 4000 | male |
| 7 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
| 8 | Gentoo | Biscoe | 46.1 | 13.2 | 211 | 4500 | female |
| 9 | Gentoo | Biscoe | 50.0 | 16.3 | 230 | 5700 | male |
| 10 | Gentoo | Biscoe | 48.7 | 14.1 | 210 | 4450 | female |
| 11 | Gentoo | Biscoe | 50.0 | 15.2 | 218 | 5700 | male |
| 12 | Gentoo | Biscoe | 47.6 | 14.5 | 215 | 5400 | male |
| 13 | Gentoo | Biscoe | 46.5 | 13.5 | 210 | 4550 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 137 | Chinstrap | Dream | 53.5 | 19.9 | 205 | 4500 | male |
| 138 | Chinstrap | Dream | 49.0 | 19.5 | 210 | 3950 | male |
| 139 | Chinstrap | Dream | 50.8 | 18.5 | 201 | 4450 | male |
| 140 | Chinstrap | Dream | 49.0 | 19.6 | 212 | 4300 | male |
| 141 | Chinstrap | Dream | 51.4 | 19.0 | 201 | 3950 | male |
| 142 | Chinstrap | Dream | 50.7 | 19.7 | 203 | 4050 | male |
| 143 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male |
| 144 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male |
| 145 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male |
| 146 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male |
| 147 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female |
| 148 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male |
+
+
+
+
+
+
+
subset(penguins, :flipper_length_mm => x -> x .> mean(skipmissing(x)), skipmissing=true)
+
+
148×7 DataFrame
123 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Dream | 35.7 | 18.0 | 202 | 3550 | female |
| 2 | Adelie | Dream | 41.1 | 18.1 | 205 | 4300 | male |
| 3 | Adelie | Dream | 40.8 | 18.9 | 208 | 4300 | male |
| 4 | Adelie | Biscoe | 41.0 | 20.0 | 203 | 4725 | male |
| 5 | Adelie | Torgersen | 41.4 | 18.5 | 202 | 3875 | male |
| 6 | Adelie | Torgersen | 44.1 | 18.0 | 210 | 4000 | male |
| 7 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
| 8 | Gentoo | Biscoe | 46.1 | 13.2 | 211 | 4500 | female |
| 9 | Gentoo | Biscoe | 50.0 | 16.3 | 230 | 5700 | male |
| 10 | Gentoo | Biscoe | 48.7 | 14.1 | 210 | 4450 | female |
| 11 | Gentoo | Biscoe | 50.0 | 15.2 | 218 | 5700 | male |
| 12 | Gentoo | Biscoe | 47.6 | 14.5 | 215 | 5400 | male |
| 13 | Gentoo | Biscoe | 46.5 | 13.5 | 210 | 4550 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 137 | Chinstrap | Dream | 53.5 | 19.9 | 205 | 4500 | male |
| 138 | Chinstrap | Dream | 49.0 | 19.5 | 210 | 3950 | male |
| 139 | Chinstrap | Dream | 50.8 | 18.5 | 201 | 4450 | male |
| 140 | Chinstrap | Dream | 49.0 | 19.6 | 212 | 4300 | male |
| 141 | Chinstrap | Dream | 51.4 | 19.0 | 201 | 3950 | male |
| 142 | Chinstrap | Dream | 50.7 | 19.7 | 203 | 4050 | male |
| 143 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male |
| 144 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male |
| 145 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male |
| 146 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male |
| 147 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female |
| 148 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male |
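All three versions wrap the column in skipmissing before taking the mean. The sketch below, on a hypothetical vector standing in for the flipper_length_mm column, shows why: the mean of a vector that contains missing is itself missing, and comparing against it would make every row's criterion missing.

```julia
using Statistics

# Hypothetical stand-in for a column with one missing measurement
flipper = [181, 186, missing, 195]

# A single missing value makes the plain mean unknown
plain_mean = mean(flipper)

# skipmissing iterates only over the known values: (181 + 186 + 195) / 3
known_mean = mean(skipmissing(flipper))

# The comparison is still missing for the row with no measurement;
# coalesce turns that into false, which is how the filters treat it
above = coalesce.(flipper .> known_mean, false)
```

Only the last value (195) lies above the mean of the known values, so `above` is true only in the last position.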
+
+
+
+
+
+
+
+
+

1.1.3 Filtering with a variable column name

+

Suppose the name of the column you want to filter on is stored in a variable, let’s say

+
+
my_column = :species
+
+
:species
+
+
+
+ +
+
+
+
# how to do it??
+# @filter(penguins, !!(my_column) .== "Adelie")
+
+
+
+
+
DFM.@rsubset penguins $my_column == "Adelie"
+
+
152×7 DataFrame
127 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
+
+
+
+
+
+
+
subset(penguins, my_column => x -> x .== "Adelie")
+
+
152×7 DataFrame
127 rows omitted
Row  species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
     String15   String15   Float64?        Float64?       Int64?             Int64?       String7
  1  Adelie     Torgersen  39.1            18.7           181                3750         male
  2  Adelie     Torgersen  39.5            17.4           186                3800         female
  3  Adelie     Torgersen  40.3            18.0           195                3250         female
  4  Adelie     Torgersen  missing         missing        missing            missing      missing
  5  Adelie     Torgersen  36.7            19.3           193                3450         female
  6  Adelie     Torgersen  39.3            20.6           190                3650         male
  7  Adelie     Torgersen  38.9            17.8           181                3625         female
  8  Adelie     Torgersen  39.2            19.6           195                4675         male
  9  Adelie     Torgersen  34.1            18.1           193                3475         missing
 10  Adelie     Torgersen  42.0            20.2           190                4250         missing
 11  Adelie     Torgersen  37.8            17.1           186                3300         missing
 12  Adelie     Torgersen  37.8            17.3           180                3700         missing
 13  Adelie     Torgersen  41.1            17.6           182                3200         female
  ⋮
141  Adelie     Dream      40.2            17.1           193                3400         female
142  Adelie     Dream      40.6            17.2           187                3475         male
143  Adelie     Dream      32.1            15.5           188                3050         female
144  Adelie     Dream      40.7            17.0           190                3725         male
145  Adelie     Dream      37.3            16.8           192                3000         female
146  Adelie     Dream      39.0            18.7           185                3650         male
147  Adelie     Dream      39.2            18.6           190                4250         male
148  Adelie     Dream      36.6            18.4           184                3475         female
149  Adelie     Dream      36.0            17.8           195                3450         female
150  Adelie     Dream      37.8            18.1           193                3750         male
151  Adelie     Dream      36.0            17.1           187                3700         female
152  Adelie     Dream      41.5            18.5           201                4000         male
+
+
+
+
+
+
+
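The two approaches above can be sketched on a small toy table, using only DataFrames.jl. This is a minimal illustration, not the book's own code: the table `df`, its values, and the variable names `col`, `a`, `b` and `c` are assumptions made up for the example.

```julia
using DataFrames

# Toy data standing in for `penguins` (values are made up for illustration)
df = DataFrame(species = ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
               body_mass_g = [3750, 5000, 3800, 3700])

col = :species   # the column name is held in a variable

# Whole-column predicate: the function receives the column and broadcasts over it
a = subset(df, col => x -> x .== "Adelie")

# Row-wise predicate: ByRow applies the function to one value at a time
b = subset(df, col => ByRow(==("Adelie")))

# A String column name is accepted in the same position as a Symbol
c = subset(df, String(col) => ByRow(==("Adelie")))
```

All three calls keep the same two rows. If the filtered column can contain `missing`, `subset` needs `skipmissing = true`, otherwise a predicate that returns `missing` raises an error.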

If the column name is stored as a string

+
+
my_column2 = "species"
+
+
"species"
+
+
+

instead of a symbol, we can write

+
+ +
+
+
+
DFM.@rsubset penguins $(Symbol(my_column2)) == "Adelie"
+
+
152×7 DataFrame
127 rows omitted
Row  species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
     String15   String15   Float64?        Float64?       Int64?             Int64?       String7
  1  Adelie     Torgersen  39.1            18.7           181                3750         male
  2  Adelie     Torgersen  39.5            17.4           186                3800         female
  3  Adelie     Torgersen  40.3            18.0           195                3250         female
  4  Adelie     Torgersen  missing         missing        missing            missing      missing
  5  Adelie     Torgersen  36.7            19.3           193                3450         female
  6  Adelie     Torgersen  39.3            20.6           190                3650         male
  7  Adelie     Torgersen  38.9            17.8           181                3625         female
  8  Adelie     Torgersen  39.2            19.6           195                4675         male
  9  Adelie     Torgersen  34.1            18.1           193                3475         missing
 10  Adelie     Torgersen  42.0            20.2           190                4250         missing
 11  Adelie     Torgersen  37.8            17.1           186                3300         missing
 12  Adelie     Torgersen  37.8            17.3           180                3700         missing
 13  Adelie     Torgersen  41.1            17.6           182                3200         female
  ⋮
141  Adelie     Dream      40.2            17.1           193                3400         female
142  Adelie     Dream      40.6            17.2           187                3475         male
143  Adelie     Dream      32.1            15.5           188                3050         female
144  Adelie     Dream      40.7            17.0           190                3725         male
145  Adelie     Dream      37.3            16.8           192                3000         female
146  Adelie     Dream      39.0            18.7           185                3650         male
147  Adelie     Dream      39.2            18.6           190                4250         male
148  Adelie     Dream      36.6            18.4           184                3475         female
149  Adelie     Dream      36.0            17.8           195                3450         female
150  Adelie     Dream      37.8            18.1           193                3750         male
151  Adelie     Dream      36.0            17.1           187                3700         female
152  Adelie     Dream      41.5            18.5           201                4000         male
+
+
+
+
+
+
+
subset(penguins, my_column2 => x -> x .== "Adelie")
+
+
152×7 DataFrame
127 rows omitted
Row  species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
     String15   String15   Float64?        Float64?       Int64?             Int64?       String7
  1  Adelie     Torgersen  39.1            18.7           181                3750         male
  2  Adelie     Torgersen  39.5            17.4           186                3800         female
  3  Adelie     Torgersen  40.3            18.0           195                3250         female
  4  Adelie     Torgersen  missing         missing        missing            missing      missing
  5  Adelie     Torgersen  36.7            19.3           193                3450         female
  6  Adelie     Torgersen  39.3            20.6           190                3650         male
  7  Adelie     Torgersen  38.9            17.8           181                3625         female
  8  Adelie     Torgersen  39.2            19.6           195                4675         male
  9  Adelie     Torgersen  34.1            18.1           193                3475         missing
 10  Adelie     Torgersen  42.0            20.2           190                4250         missing
 11  Adelie     Torgersen  37.8            17.1           186                3300         missing
 12  Adelie     Torgersen  37.8            17.3           180                3700         missing
 13  Adelie     Torgersen  41.1            17.6           182                3200         female
  ⋮
141  Adelie     Dream      40.2            17.1           193                3400         female
142  Adelie     Dream      40.6            17.2           187                3475         male
143  Adelie     Dream      32.1            15.5           188                3050         female
144  Adelie     Dream      40.7            17.0           190                3725         male
145  Adelie     Dream      37.3            16.8           192                3000         female
146  Adelie     Dream      39.0            18.7           185                3650         male
147  Adelie     Dream      39.2            18.6           190                4250         male
148  Adelie     Dream      36.6            18.4           184                3475         female
149  Adelie     Dream      36.0            17.8           195                3450         female
150  Adelie     Dream      37.8            18.1           193                3750         male
151  Adelie     Dream      36.0            17.1           187                3700         female
152  Adelie     Dream      41.5            18.5           201                4000         male
+
+
+
+
+
+
+
+
+
+

1.2 Arranging

+

Arranging means reordering the rows of a dataframe according to one or more criteria.

+
+
@arrange penguins body_mass_g
+
+
344×7 DataFrame
319 rows omitted
Row  species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
     String15   String15   Float64?        Float64?       Int64?             Int64?       String7
  1  Chinstrap  Dream      46.9            16.6           192                2700         female
  2  Adelie     Biscoe     36.5            16.6           181                2850         female
  3  Adelie     Biscoe     36.4            17.1           184                2850         female
  4  Adelie     Biscoe     34.5            18.1           187                2900         female
  5  Adelie     Dream      33.1            16.1           178                2900         female
  6  Adelie     Torgersen  38.6            17.0           188                2900         female
  7  Chinstrap  Dream      43.2            16.6           187                2900         female
  8  Adelie     Biscoe     37.9            18.6           193                2925         female
  9  Adelie     Dream      37.5            18.9           179                2975         missing
 10  Adelie     Dream      37.0            16.9           185                3000         female
 11  Adelie     Dream      37.3            16.8           192                3000         female
 12  Adelie     Torgersen  35.9            16.6           190                3050         female
 13  Adelie     Torgersen  35.2            15.9           186                3050         female
  ⋮
333  Gentoo     Biscoe     48.6            16.0           230                5800         male
334  Gentoo     Biscoe     48.4            14.6           213                5850         male
335  Gentoo     Biscoe     49.3            15.7           217                5850         male
336  Gentoo     Biscoe     55.1            16.0           230                5850         male
337  Gentoo     Biscoe     45.2            16.4           223                5950         male
338  Gentoo     Biscoe     49.8            15.9           229                5950         male
339  Gentoo     Biscoe     51.1            16.3           220                6000         male
340  Gentoo     Biscoe     48.8            16.2           222                6000         male
341  Gentoo     Biscoe     59.6            17.0           230                6050         male
342  Gentoo     Biscoe     49.2            15.2           221                6300         male
343  Adelie     Torgersen  missing         missing        missing            missing      missing
344  Gentoo     Biscoe     missing         missing        missing            missing      missing
+
+
+
+
+
@arrange penguins species body_mass_g
+
+
344×7 DataFrame
319 rows omitted
Row  species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
     String15   String15   Float64?        Float64?       Int64?             Int64?       String7
  1  Adelie     Biscoe     36.5            16.6           181                2850         female
  2  Adelie     Biscoe     36.4            17.1           184                2850         female
  3  Adelie     Biscoe     34.5            18.1           187                2900         female
  4  Adelie     Dream      33.1            16.1           178                2900         female
  5  Adelie     Torgersen  38.6            17.0           188                2900         female
  6  Adelie     Biscoe     37.9            18.6           193                2925         female
  7  Adelie     Dream      37.5            18.9           179                2975         missing
  8  Adelie     Dream      37.0            16.9           185                3000         female
  9  Adelie     Dream      37.3            16.8           192                3000         female
 10  Adelie     Torgersen  35.9            16.6           190                3050         female
 11  Adelie     Torgersen  35.2            15.9           186                3050         female
 12  Adelie     Torgersen  39.0            17.1           191                3050         female
 13  Adelie     Dream      32.1            15.5           188                3050         female
  ⋮
333  Gentoo     Biscoe     49.5            16.2           229                5800         male
334  Gentoo     Biscoe     48.6            16.0           230                5800         male
335  Gentoo     Biscoe     48.4            14.6           213                5850         male
336  Gentoo     Biscoe     49.3            15.7           217                5850         male
337  Gentoo     Biscoe     55.1            16.0           230                5850         male
338  Gentoo     Biscoe     45.2            16.4           223                5950         male
339  Gentoo     Biscoe     49.8            15.9           229                5950         male
340  Gentoo     Biscoe     51.1            16.3           220                6000         male
341  Gentoo     Biscoe     48.8            16.2           222                6000         male
342  Gentoo     Biscoe     59.6            17.0           230                6050         male
343  Gentoo     Biscoe     49.2            15.2           221                6300         male
344  Gentoo     Biscoe     missing         missing        missing            missing      missing
+
+
+
+
+
@arrange penguins island desc(body_mass_g)
+
+
344×7 DataFrame
319 rows omitted
Row  species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
     String15   String15   Float64?        Float64?       Int64?             Int64?       String7
  1  Gentoo     Biscoe     missing         missing        missing            missing      missing
  2  Gentoo     Biscoe     49.2            15.2           221                6300         male
  3  Gentoo     Biscoe     59.6            17.0           230                6050         male
  4  Gentoo     Biscoe     51.1            16.3           220                6000         male
  5  Gentoo     Biscoe     48.8            16.2           222                6000         male
  6  Gentoo     Biscoe     45.2            16.4           223                5950         male
  7  Gentoo     Biscoe     49.8            15.9           229                5950         male
  8  Gentoo     Biscoe     48.4            14.6           213                5850         male
  9  Gentoo     Biscoe     49.3            15.7           217                5850         male
 10  Gentoo     Biscoe     55.1            16.0           230                5850         male
 11  Gentoo     Biscoe     49.5            16.2           229                5800         male
 12  Gentoo     Biscoe     48.6            16.0           230                5800         male
 13  Gentoo     Biscoe     50.4            15.7           222                5750         male
  ⋮
333  Adelie     Torgersen  41.1            18.6           189                3325         male
334  Adelie     Torgersen  38.5            17.9           190                3325         female
335  Adelie     Torgersen  37.8            17.1           186                3300         missing
336  Adelie     Torgersen  38.8            17.6           191                3275         female
337  Adelie     Torgersen  40.3            18.0           195                3250         female
338  Adelie     Torgersen  41.1            17.6           182                3200         female
339  Adelie     Torgersen  34.6            17.2           189                3200         female
340  Adelie     Torgersen  36.2            17.2           187                3150         female
341  Adelie     Torgersen  35.9            16.6           190                3050         female
342  Adelie     Torgersen  35.2            15.9           186                3050         female
343  Adelie     Torgersen  39.0            17.1           191                3050         female
344  Adelie     Torgersen  38.6            17.0           188                2900         female
+
+
+
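For comparison, the same kind of ordering can be sketched with plain DataFrames.jl, using `sort` together with `order(...; rev = true)` for the descending column. This is a minimal sketch on a made-up toy table; `df`, `asc` and `by_island_desc` are hypothetical names, not part of the book.

```julia
using DataFrames

# Toy data standing in for `penguins` (values are made up for illustration)
df = DataFrame(island = ["Dream", "Biscoe", "Biscoe", "Dream"],
               body_mass_g = [3400, missing, 5000, 3900])

# Ascending by one column; `missing` sorts after every number
asc = sort(df, :body_mass_g)

# Ascending by island, then descending by body mass within each island;
# with rev = true the `missing` value comes first
by_island_desc = sort(df, [:island, order(:body_mass_g, rev = true)])
```

The placement of `missing` matches the outputs above: last when the order is ascending, first when it is descending.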
\ No newline at end of file
diff --git a/docs/dataframes.html b/docs/dataframes.html
index 47eab8e..3966e0d 100644
--- a/docs/dataframes.html
+++ b/docs/dataframes.html
@@ -64,7 +64,7 @@
@@ -172,8 +172,8 @@
@@ -213,8 +213,16 @@

Table of contents

@@ -244,7 +252,7 @@

Part 2: Dataframes

We will use the Palmer Penguins dataset as a toy example for the remainder of the chapter.

using DataFrames, PalmerPenguins
-using Tidier
+using Tidier, Chain
 import DataFramesMeta as DFM
 
 penguins = PalmerPenguins.load() |> DataFrame
@@ -548,23 +556,62 @@

Part 2: Dataframes

DataFrames.jl is the main package for dealing with dataframes in Julia. You can use it directly to manipulate tables, but we also have two alternatives: DataFramesMeta and Tidier.

DataFramesMeta is a collection of macros based on DataFrames.

Tidier is inspired by the tidyverse ecosystem in R. Tidier uses macros to rewrite your code into DataFrames.jl code. Because of this “tidy” heritage, we will often talk about the R packages that inspired the Julia ones (like dplyr, tidyr and many others).

-

In this book, whenever possible, we will show the different approaches in a tabset so you can compare them.

+

In this book, whenever possible, we will show the different approaches in a tabset so you can compare them, with more emphasis on Tidier.

Operations

-

Let’s start with some operations that take only one dataframe as input.1. Here is the basic terminology:

+

Let’s start with some unary operations, i.e. operations that take only one dataframe as input and return one dataframe as output.¹ We can divide these operations into a few categories:

+
+

Row operations

+

These are operations that only affect rows, leaving all columns untouched.

+
+
+

Column operations

+

These are operations that only affect columns, leaving all rows untouched.

+ +
+
+

Reshaping operations

+

These operations change the shape of a dataframe, making it wider or longer.

+ +
+
+

Grouping operations

+ +
+
+

Mixed operations

+

These operations can possibly change rows and columns at the same time.

+ +

??? keep grouping and summarising together?

Since all these functions return a dataframe (or an array of dataframes, in the case of grouping), we can chain these operations together, with the convention that on a grouped dataframe the function is applied to each group separately.

-

Let’s see each operation with more details.

+

Now for binary operations (i.e. operations that take two dataframes), we have all the joins:

+ +
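As a small illustration of a binary operation, here is a sketch of two joins in plain DataFrames.jl. `innerjoin` and `leftjoin` are two of the join functions that DataFrames.jl provides; the tables `birds` and `weights` and their values are made up for this example.

```julia
using DataFrames

# Two toy tables sharing an `id` key (values are made up for illustration)
birds = DataFrame(id = [1, 2, 3], species = ["Adelie", "Gentoo", "Chinstrap"])
weights = DataFrame(id = [2, 3, 4], body_mass_g = [5000, 3700, 4100])

# Inner join: keeps only the ids present in both tables
inner = innerjoin(birds, weights, on = :id)

# Left join: keeps every row of `birds`, filling unmatched columns with `missing`
left = leftjoin(birds, weights, on = :id)
```

Here `inner` has the two rows with ids 2 and 3, while `left` has all three rows of `birds`, with a `missing` body mass for id 1.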

Comparing Tidier with DataFramesMeta

@@ -586,44 +633,58 @@ Compa
+filter     @filter     @subset / @rsubset        subset
+arrange    @arrange    @orderby / @rorderby      sort!
 select     @select     @select                   array syntax
-filter     @filter     @subset / @rsubset        filter
 mutate     @mutate     @transform / @rtransform  array syntax
 group_by   @group_by   @groupby                  groupby
 summarise  @summarise  @combine                  combine
-arrange    @arrange    @orderby / @rorderby      sort!

It is clear that for those coming from R, Tidier will look like the most natural approach.

Notice that we have a name clash with @select: that is why we import DataFramesMeta as DFM at the beginning.

+

We will look at each operation in more detail in the following chapters.

+

+
+

Chaining operations

+

We can chain (or pipe) dataframe operations as follows with the @chain macro:

+
+
@chain penguins begin
+    @filter !ismissing(sex)
+    @group_by sex
+    @summarise mean = mean(bill_length_mm)
+    @arrange mean
+end
+
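The same pipeline can be sketched with plain DataFrames.jl functions inside `@chain`, which passes the result of each line as the first argument of the next. This is a minimal sketch on a made-up toy table, not the book's code; `df` and `result` are hypothetical names.

```julia
using DataFrames, Chain, Statistics

# Toy data standing in for `penguins` (values are made up for illustration)
df = DataFrame(sex = ["male", "female", missing, "female"],
               bill_length_mm = [40.0, 36.0, 38.0, 38.0])

result = @chain df begin
    dropmissing(:sex)                          # drop rows where sex is missing
    groupby(:sex)                              # one group per sex
    combine(:bill_length_mm => mean => :mean)  # mean bill length per group
    sort(:mean)                                # order groups by that mean
end
```

Each step returns a dataframe (or a grouped dataframe), which is what makes the steps composable in this way.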
@@ -1062,8 +1123,8 @@ Compa
diff --git a/docs/index.html b/docs/index.html
index f333ebf..bcb5182 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -119,6 +119,7 @@