diff --git a/_freeze/dataframes-filtering/execute-results/html.json b/_freeze/dataframes-filtering/execute-results/html.json new file mode 100644 index 0000000..f54a557 --- /dev/null +++ b/_freeze/dataframes-filtering/execute-results/html.json @@ -0,0 +1,16 @@ +{ + "hash": "624eb1f4f3a818d3fe8104583f3d8cca", + "result": { + "engine": "julia", + "markdown": "---\n# jupyter: julia-1.10\nengine: julia\n---\n\n\n\n\n\n# Filtering\n\n\n\n\n\n::: {#2 .cell execution_count=1}\n``` {.julia .cell-code}\nusing DataFrames, PalmerPenguins\nusing Tidier\nimport DataFramesMeta as DFM\n\npenguins = PalmerPenguins.load() |> DataFrame;\n@slice_head(penguins, n = 15)\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
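<table>
<caption>15×7 DataFrame</caption>
<thead>
<tr><th>Row</th><th>species</th><th>island</th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>sex</th></tr>
<tr><th></th><th>String15</th><th>String15</th><th>Float64?</th><th>Float64?</th><th>Int64?</th><th>Int64?</th><th>String7</th></tr>
</thead>
<tbody>
<tr><td>1</td><td>Adelie</td><td>Torgersen</td><td>39.1</td><td>18.7</td><td>181</td><td>3750</td><td>male</td></tr>
<tr><td>2</td><td>Adelie</td><td>Torgersen</td><td>39.5</td><td>17.4</td><td>186</td><td>3800</td><td>female</td></tr>
<tr><td>3</td><td>Adelie</td><td>Torgersen</td><td>40.3</td><td>18.0</td><td>195</td><td>3250</td><td>female</td></tr>
<tr><td>4</td><td>Adelie</td><td>Torgersen</td><td>missing</td><td>missing</td><td>missing</td><td>missing</td><td>missing</td></tr>
<tr><td>5</td><td>Adelie</td><td>Torgersen</td><td>36.7</td><td>19.3</td><td>193</td><td>3450</td><td>female</td></tr>
<tr><td>6</td><td>Adelie</td><td>Torgersen</td><td>39.3</td><td>20.6</td><td>190</td><td>3650</td><td>male</td></tr>
<tr><td>7</td><td>Adelie</td><td>Torgersen</td><td>38.9</td><td>17.8</td><td>181</td><td>3625</td><td>female</td></tr>
<tr><td>8</td><td>Adelie</td><td>Torgersen</td><td>39.2</td><td>19.6</td><td>195</td><td>4675</td><td>male</td></tr>
<tr><td>9</td><td>Adelie</td><td>Torgersen</td><td>34.1</td><td>18.1</td><td>193</td><td>3475</td><td>missing</td></tr>
<tr><td>10</td><td>Adelie</td><td>Torgersen</td><td>42.0</td><td>20.2</td><td>190</td><td>4250</td><td>missing</td></tr>
<tr><td>11</td><td>Adelie</td><td>Torgersen</td><td>37.8</td><td>17.1</td><td>186</td><td>3300</td><td>missing</td></tr>
<tr><td>12</td><td>Adelie</td><td>Torgersen</td><td>37.8</td><td>17.3</td><td>180</td><td>3700</td><td>missing</td></tr>
<tr><td>13</td><td>Adelie</td><td>Torgersen</td><td>41.1</td><td>17.6</td><td>182</td><td>3200</td><td>female</td></tr>
<tr><td>14</td><td>Adelie</td><td>Torgersen</td><td>38.6</td><td>21.2</td><td>191</td><td>3800</td><td>male</td></tr>
<tr><td>15</td><td>Adelie</td><td>Torgersen</td><td>34.6</td><td>21.1</td><td>198</td><td>4400</td><td>male</td></tr>
</tbody>
</table>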
\n```\n:::\n:::\n\n\n\n\n\n\n\nTo filter a dataframe in Tidier, we use the macro `@filter`. You can use it in the form\n\n\n\n\n\n::: {#4 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter(penguins, species == \"Adelie\")\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
\n```\n:::\n:::\n\n\n\n\n\n\n\nor without parentheses, as in\n\n\n\n\n\n::: {#6 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
\n```\n:::\n:::\n\n\n\n\n\n\n\nNotice that the columns are written as if they were variables in the Julia environment. This is inspired by the `tidyverse` behaviour of data-masking: inside a tidyverse verb, column names are treated as \"statistical variables\" that live inside the dataframe.\n\nIn DataFramesMeta, we have two macros for filtering: `@subset` and `@rsubset`. Use the first when your criterion needs a whole column at once, for example:\n\n\n\n\n\n::: {#8 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@subset penguins :body_mass_g .>= mean(skipmissing(:body_mass_g))\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
149×7 DataFrame
124 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.219.61954675male
2AdelieTorgersen42.020.21904250missing
3AdelieTorgersen34.621.11984400male
4AdelieTorgersen42.520.71974500male
5AdelieDream39.819.11844650male
6AdelieDream44.119.71964400male
7AdelieDream39.618.81904600male
8AdelieBiscoe40.118.91884300male
9AdelieBiscoe41.321.11954400male
10AdelieTorgersen41.819.41984450male
11AdelieTorgersen42.818.51954250male
12AdelieTorgersen42.917.61964700male
13AdelieDream41.118.12054300male
138GentooBiscoe47.213.72144925female
139GentooBiscoe46.814.32154850female
140GentooBiscoe50.415.72225750male
141GentooBiscoe45.214.82125200female
142GentooBiscoe49.916.12135400male
143ChinstrapDream49.218.21954400male
144ChinstrapDream52.820.02054550male
145ChinstrapDream54.220.82014300male
146ChinstrapDream52.020.72104800male
147ChinstrapDream53.519.92054500male
148ChinstrapDream50.818.52014450male
149ChinstrapDream49.019.62124300male
\n```\n:::\n:::\n\n\n\n\n\n\n\nNotice the broadcast on `>=`. We need it because *each column is interpreted as an array*. Also, notice that we refer to columns as _symbols_ (i.e. we prepend `:` to the column name).\n\nIn the above example, we needed the whole column `body_mass_g` to take the mean and then filter the rows based on that. If, however, your filtering criterion only uses information from each row, then `@rsubset` (row subset) is easier to use: it interprets each column as a single value (not an array), so no broadcasting is needed:\n\n\n\n\n\n::: {#10 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@rsubset penguins :species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
\n```\n:::\n:::\n\n\n\n\n\n\n\nIn both Tidier and DataFramesMeta, only the rows for which the criterion is `true` are returned. This means that you don't need to worry about `missing` values, that is, cases where the criterion returns neither `false` nor `true`.\n\n## Filtering with one criterion\n\nFiltering all the rows with `species` = \"Adelie\".\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#12 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFramesMeta\n\n\n\n\n\n::: {#14 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@rsubset penguins :species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFrames\n\n\n\n\n\n::: {#16 .cell execution_count=1}\n``` {.julia .cell-code}\nfilter(r -> r.species == \"Adelie\", penguins)\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n## Filtering with several criteria\n\nFiltering all the rows with `species` = \"Adelie\", `sex` = \"male\" and `body_mass_g` > 4000.\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#18 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins species == \"Adelie\" sex == \"male\" body_mass_g > 4000\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
34×7 DataFrame
9 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.219.61954675male
2AdelieTorgersen34.621.11984400male
3AdelieTorgersen42.520.71974500male
4AdelieTorgersen46.021.51944200male
5AdelieDream39.221.11964150male
6AdelieDream39.819.11844650male
7AdelieDream44.119.71964400male
8AdelieDream39.618.81904600male
9AdelieDream42.321.21914150male
10AdelieBiscoe40.118.91884300male
11AdelieBiscoe42.019.52004050male
12AdelieBiscoe41.321.11954400male
13AdelieBiscoe41.118.21924050male
23AdelieDream40.318.51964350male
24AdelieDream43.218.51924100male
25AdelieBiscoe41.020.02034725male
26AdelieBiscoe37.820.01904250male
27AdelieBiscoe43.219.01974775male
28AdelieBiscoe45.620.31914600male
29AdelieBiscoe42.219.51974275male
30AdelieBiscoe42.718.31964075male
31AdelieTorgersen41.518.31954300male
32AdelieDream37.518.51994475male
33AdelieDream39.717.91934250male
34AdelieDream39.218.61904250male
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFramesMeta\n\n\n\n\n\n::: {#20 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@rsubset penguins :species == \"Adelie\" :sex == \"male\" :body_mass_g > 4000\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
34×7 DataFrame
9 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.219.61954675male
2AdelieTorgersen34.621.11984400male
3AdelieTorgersen42.520.71974500male
4AdelieTorgersen46.021.51944200male
5AdelieDream39.221.11964150male
6AdelieDream39.819.11844650male
7AdelieDream44.119.71964400male
8AdelieDream39.618.81904600male
9AdelieDream42.321.21914150male
10AdelieBiscoe40.118.91884300male
11AdelieBiscoe42.019.52004050male
12AdelieBiscoe41.321.11954400male
13AdelieBiscoe41.118.21924050male
23AdelieDream40.318.51964350male
24AdelieDream43.218.51924100male
25AdelieBiscoe41.020.02034725male
26AdelieBiscoe37.820.01904250male
27AdelieBiscoe43.219.01974775male
28AdelieBiscoe45.620.31914600male
29AdelieBiscoe42.219.51974275male
30AdelieBiscoe42.718.31964075male
31AdelieTorgersen41.518.31954300male
32AdelieDream37.518.51994475male
33AdelieDream39.717.91934250male
34AdelieDream39.218.61904250male
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFrames\n\n\n\n\n\n::: {#22 .cell execution_count=1}\n``` {.julia .cell-code}\nfilter(r -> ((r.species == \"Adelie\") & (r.sex == \"male\") & (r.body_mass_g > 4000)) === true, penguins)\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
34×7 DataFrame
9 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.219.61954675male
2AdelieTorgersen34.621.11984400male
3AdelieTorgersen42.520.71974500male
4AdelieTorgersen46.021.51944200male
5AdelieDream39.221.11964150male
6AdelieDream39.819.11844650male
7AdelieDream44.119.71964400male
8AdelieDream39.618.81904600male
9AdelieDream42.321.21914150male
10AdelieBiscoe40.118.91884300male
11AdelieBiscoe42.019.52004050male
12AdelieBiscoe41.321.11954400male
13AdelieBiscoe41.118.21924050male
23AdelieDream40.318.51964350male
24AdelieDream43.218.51924100male
25AdelieBiscoe41.020.02034725male
26AdelieBiscoe37.820.01904250male
27AdelieBiscoe43.219.01974775male
28AdelieBiscoe45.620.31914600male
29AdelieBiscoe42.219.51974275male
30AdelieBiscoe42.718.31964075male
31AdelieTorgersen41.518.31954300male
32AdelieDream37.518.51994475male
33AdelieDream39.717.91934250male
34AdelieDream39.218.61904250male
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n\nFiltering all the rows where the `flipper_length_mm` is greater than the mean.\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#24 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins flipper_length_mm > mean(skipmissing(flipper_length_mm))\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
148×7 DataFrame
123 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieDream35.718.02023550female
2AdelieDream41.118.12054300male
3AdelieDream40.818.92084300male
4AdelieBiscoe41.020.02034725male
5AdelieTorgersen41.418.52023875male
6AdelieTorgersen44.118.02104000male
7AdelieDream41.518.52014000male
8GentooBiscoe46.113.22114500female
9GentooBiscoe50.016.32305700male
10GentooBiscoe48.714.12104450female
11GentooBiscoe50.015.22185700male
12GentooBiscoe47.614.52155400male
13GentooBiscoe46.513.52104550female
137ChinstrapDream53.519.92054500male
138ChinstrapDream49.019.52103950male
139ChinstrapDream50.818.52014450male
140ChinstrapDream49.019.62124300male
141ChinstrapDream51.419.02013950male
142ChinstrapDream50.719.72034050male
143ChinstrapDream49.319.92034050male
144ChinstrapDream50.218.82023800male
145ChinstrapDream51.919.52063950male
146ChinstrapDream55.819.82074000male
147ChinstrapDream43.518.12023400female
148ChinstrapDream50.819.02104100male
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFramesMeta\n\n\n\n\n\n::: {#26 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@subset penguins :flipper_length_mm .> mean(skipmissing(:flipper_length_mm))\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
148×7 DataFrame
123 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieDream35.718.02023550female
2AdelieDream41.118.12054300male
3AdelieDream40.818.92084300male
4AdelieBiscoe41.020.02034725male
5AdelieTorgersen41.418.52023875male
6AdelieTorgersen44.118.02104000male
7AdelieDream41.518.52014000male
8GentooBiscoe46.113.22114500female
9GentooBiscoe50.016.32305700male
10GentooBiscoe48.714.12104450female
11GentooBiscoe50.015.22185700male
12GentooBiscoe47.614.52155400male
13GentooBiscoe46.513.52104550female
137ChinstrapDream53.519.92054500male
138ChinstrapDream49.019.52103950male
139ChinstrapDream50.818.52014450male
140ChinstrapDream49.019.62124300male
141ChinstrapDream51.419.02013950male
142ChinstrapDream50.719.72034050male
143ChinstrapDream49.319.92034050male
144ChinstrapDream50.218.82023800male
145ChinstrapDream51.919.52063950male
146ChinstrapDream55.819.82074000male
147ChinstrapDream43.518.12023400female
148ChinstrapDream50.819.02104100male
\n```\n:::\n:::\n\n\n\n\n\n\n\n## DataFrames\n\n\n\n\n\n::: {#28 .cell execution_count=1}\n``` {.julia .cell-code}\nfilter(r -> (r.flipper_length_mm > mean(skipmissing(penguins.flipper_length_mm))) === true, penguins)\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
148×7 DataFrame
123 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieDream35.718.02023550female
2AdelieDream41.118.12054300male
3AdelieDream40.818.92084300male
4AdelieBiscoe41.020.02034725male
5AdelieTorgersen41.418.52023875male
6AdelieTorgersen44.118.02104000male
7AdelieDream41.518.52014000male
8GentooBiscoe46.113.22114500female
9GentooBiscoe50.016.32305700male
10GentooBiscoe48.714.12104450female
11GentooBiscoe50.015.22185700male
12GentooBiscoe47.614.52155400male
13GentooBiscoe46.513.52104550female
137ChinstrapDream53.519.92054500male
138ChinstrapDream49.019.52103950male
139ChinstrapDream50.818.52014450male
140ChinstrapDream49.019.62124300male
141ChinstrapDream51.419.02013950male
142ChinstrapDream50.719.72034050male
143ChinstrapDream49.319.92034050male
144ChinstrapDream50.218.82023800male
145ChinstrapDream51.919.52063950male
146ChinstrapDream55.819.82074000male
147ChinstrapDream43.518.12023400female
148ChinstrapDream50.819.02104100male
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\n## Filtering with a variable column name\n\nSuppose the name of the column you want to filter on is stored in a variable, say\n\n\n\n\n\n::: {#30 .cell execution_count=1}\n``` {.julia .cell-code}\n# filter_column = \"species\"\ncolumn_symbol = :species\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```\n:species\n```\n:::\n:::\n\n\n\n\n\n\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n::: {#32 .cell execution_count=1}\n``` {.julia .cell-code}\n# @chain penguins begin\n# @filter(!!filter_column == \"Adelie\")\n# # @select(!!filter_column)\n# end\n# @filter(penguins, !!filter_column == \"Adelie\")\n```\n:::\n\n\n\n\n\n\n\n## DataFramesMeta\n\n\n\n\n\n::: {#34 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@rsubset penguins $column_symbol == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
\n```\n:::\n:::\n\n\n\n\n\n\n\n:::\n\nIf the column name is stored as a string instead of a symbol, we can write\n\n\n\n\n\n::: {#36 .cell execution_count=1}\n``` {.julia .cell-code}\ncolumn_string = \"species\"\n\nDFM.@rsubset penguins $(Symbol(column_string)) == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
\n```\n:::\n:::\n\n\n", + "supporting": [ + "dataframes-filtering_files" + ], + "filters": [], + "includes": { + "include-in-header": [ + "\n\n\n" + ] + } + } +} \ No newline at end of file diff --git a/_freeze/dataframes/execute-results/html.json b/_freeze/dataframes/execute-results/html.json index 5338562..ea227e6 100644 --- a/_freeze/dataframes/execute-results/html.json +++ b/_freeze/dataframes/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "07497448379d045b81637577fc049da2", + "hash": "103b5252701e836620eb447a28e1e311", "result": { "engine": "julia", - "markdown": "---\n# jupyter: julia-1.10\nengine: julia\n---\n\n\n\n\n\n\n# Dataframes\n\nDataframes are one of the most important objects in data science. A dataframe is a table where each row is an observation and each column is a variable.\n\nWe will use the Palmer Penguin dataset as a toy example for the remaining of the chapter.\n\n\n\n\n\n\n::: {#2 .cell execution_count=1}\n``` {.julia .cell-code}\nusing DataFrames, PalmerPenguins\nusing Tidier\nimport DataFramesMeta as DFM\n\npenguins = PalmerPenguins.load() |> DataFrame\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
344×7 DataFrame
319 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
333ChinstrapDream45.216.61913250female
334ChinstrapDream49.319.92034050male
335ChinstrapDream50.218.82023800male
336ChinstrapDream45.619.41943525female
337ChinstrapDream51.919.52063950male
338ChinstrapDream46.816.51893650female
339ChinstrapDream45.717.01953650female
340ChinstrapDream55.819.82074000male
341ChinstrapDream43.518.12023400female
342ChinstrapDream49.618.21933775male
343ChinstrapDream50.819.02104100male
344ChinstrapDream50.218.71983775female
\n```\n:::\n:::\n\n\n\n\n\n\n\n\n::: {.callout-note}\n\n`DataFrames.jl` is the main package for dealing with dataframes in Julia. You can use it directly to manipulate tables, but we also have two alternatives: DataFramesMeta and Tidier.\n\nDataFramesMeta is a collection of macros built on top of DataFrames.jl.\n\nTidier is inspired by the `tidyverse` ecosystem in R. Both use macros to rewrite your code into DataFrames.jl code.\n\nIn this book, whenever reasonable, we will show the different approaches in a tabset so you can compare them!\n:::\n\n## Operations\n\nIn this chapter, we will see some unary operations on dataframes. These functions take just 1 dataframe. Joins are binary operations and will be seen later.\n\n- *Selecting* is when we select some columns of a dataframe, while keeping all the rows. Example: select the `species` and `sex` columns.\n\n- *Filtering* or *subsetting* is when we select a subset of rows based on some criteria. Example: all male penguins of species Adelie. The output is a dataframe with the exact same columns, but possibly fewer rows.\n\n- *Mutating* is when we create new columns. Example: The body mass in kg is obtained by dividing the column `body_mass_g` by 1000.\n\n- *Grouping* is when we split the dataframe into a collection (array) of dataframes using some criteria. Example: grouping by `species` gives us 3 dataframes, each with only one species.\n\n- *Summarising* is when we apply some function to some columns in order to reduce the number of rows with some kind of summary (like a mean, median, max, and so on). Example: for each `species`, apply the `mean` function to the column `body_mass_g`. This will yield a dataframe with 3 rows, one for each species. 
Summarising is usually done after a grouping, so the summary is calculated for each of the groups.\n\n- *Arranging* or *ordering* is when we reorder the rows of a dataframe using some criteria.\n\nSince all these functions return a dataframe (or an array of dataframes, in the case of grouping), we can chain these operations together, with the convention that on grouped dataframes we apply the function to each one of the groups.\n\nLet's see each operation in more detail.\n\n## Comparing Tidier with DataFramesMeta\n\nThe following table lists the operations in each package:\n\n| dplyr | Tidier | DataFramesMeta | DataFrames |\n|-------------|--------------|------------------------------|--------------|\n| `select` | `@select` | `@select` | array syntax |\n| `filter` | `@filter` | `@subset` / `@rsubset` | `filter` |\n| `mutate` | `@mutate` | `@transform` / `@rtransform` | array syntax |\n| `group_by` | `@group_by` | `@groupby` | `groupby` |\n| `summarise` | `@summarise` | `@combine` | `combine` |\n| `arrange` | `@arrange` | `@orderby` / `@rorderby` | `sort!` |\n\n\nNotice that we have a name clash with `@select`: that is why we `import DataFramesMeta as DFM` at the beginning.\n\n## Filtering / subsetting\n\nTo filter a dataframe in Tidier, we use the macro `@filter`. You can use it in the form\n\n\n\n\n\n\n::: {#4 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter(penguins, species == \"Adelie\")\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
\n```\n:::\n:::\n\n\n\n\n\n\n\n\nor without parentheses, as in\n\n\n\n\n\n\n::: {#6 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
152×7 DataFrame
127 rows omitted
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
\n```\n:::\n:::\n\n\n\n\n\n\n\n\nNotice that the columns are written as if they were variables in the Julia environment. This is inspired by the `tidyverse` behaviour of data-masking: inside a tidyverse verb, column names are treated as \"statistical variables\" that live inside the dataframe.\n\nIn DataFramesMeta, we have two macros for filtering: `@subset` and `@rsubset`. Use the first when your criterion needs a whole column at once, for example:\n\n\n\n\n\n\n::: {#8 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@subset penguins :body_mass_g .>= mean(skipmissing(:body_mass_g))\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
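149×7 DataFrame (124 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 2 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 3 | Adelie | Torgersen | 34.6 | 21.1 | 198 | 4400 | male |
| 4 | Adelie | Torgersen | 42.5 | 20.7 | 197 | 4500 | male |
| 5 | Adelie | Dream | 39.8 | 19.1 | 184 | 4650 | male |
| 6 | Adelie | Dream | 44.1 | 19.7 | 196 | 4400 | male |
| 7 | Adelie | Dream | 39.6 | 18.8 | 190 | 4600 | male |
| 8 | Adelie | Biscoe | 40.1 | 18.9 | 188 | 4300 | male |
| 9 | Adelie | Biscoe | 41.3 | 21.1 | 195 | 4400 | male |
| 10 | Adelie | Torgersen | 41.8 | 19.4 | 198 | 4450 | male |
| 11 | Adelie | Torgersen | 42.8 | 18.5 | 195 | 4250 | male |
| 12 | Adelie | Torgersen | 42.9 | 17.6 | 196 | 4700 | male |
| 13 | Adelie | Dream | 41.1 | 18.1 | 205 | 4300 | male |
| 138 | Gentoo | Biscoe | 47.2 | 13.7 | 214 | 4925 | female |
| 139 | Gentoo | Biscoe | 46.8 | 14.3 | 215 | 4850 | female |
| 140 | Gentoo | Biscoe | 50.4 | 15.7 | 222 | 5750 | male |
| 141 | Gentoo | Biscoe | 45.2 | 14.8 | 212 | 5200 | female |
| 142 | Gentoo | Biscoe | 49.9 | 16.1 | 213 | 5400 | male |
| 143 | Chinstrap | Dream | 49.2 | 18.2 | 195 | 4400 | male |
| 144 | Chinstrap | Dream | 52.8 | 20.0 | 205 | 4550 | male |
| 145 | Chinstrap | Dream | 54.2 | 20.8 | 201 | 4300 | male |
| 146 | Chinstrap | Dream | 52.0 | 20.7 | 210 | 4800 | male |
| 147 | Chinstrap | Dream | 53.5 | 19.9 | 205 | 4500 | male |
| 148 | Chinstrap | Dream | 50.8 | 18.5 | 201 | 4450 | male |
| 149 | Chinstrap | Dream | 49.0 | 19.6 | 212 | 4300 | male |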
\n```\n:::\n:::\n\n\n\n\n\n\n\n\nNotice the broadcast on `>=`. We need it because each *column is interpreted as an array*. Also, notice that we refer to columns as _symbols_ (i.e. we prepend `:` to the column name).\n\nIn this case, we need the whole column `body_mass_g` to take the mean and then filter the rows based on that. If, however, your filtering criterion only uses information from each row, then `@rsubset` (row subset) is easier to use: it interprets each column as a value (not an array), so no broadcasting is needed:\n\n\n\n\n\n\n::: {#10 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@rsubset penguins :species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
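152×7 DataFrame (127 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |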
\n```\n:::\n:::\n\n\n\n\n\n\n\n\nIn both Tidier and DataFramesMeta, only the rows for which the criterion evaluates to `true` are returned. This means that you don't need to worry about `missing` values, i.e. cases where the criterion returns neither `false` nor `true`.\n\n### Filtering with one criterion\n\nFiltering all the rows with `species` = \"Adelie\".\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n\n::: {#12 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
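152×7 DataFrame (127 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |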
\n```\n:::\n:::\n\n\n\n\n\n\n\n\n## DataFramesMeta\n\n\n\n\n\n\n::: {#14 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@rsubset penguins :species == \"Adelie\"\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
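152×7 DataFrame (127 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |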
\n```\n:::\n:::\n\n\n\n\n\n\n\n\n## DataFrames\n\n\n\n\n\n\n::: {#16 .cell execution_count=1}\n``` {.julia .cell-code}\nfilter(r -> r.species == \"Adelie\", penguins)\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
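152×7 DataFrame (127 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |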
\n```\n:::\n:::\n\n\n\n\n\n\n\n\n:::\n\n### Filtering with several criteria\n\nFiltering all the rows with `species` = \"Adelie\", `sex` = \"male\" and `body_mass_g` > 4000.\n\n::: {.panel-tabset}\n\n## Tidier\n\n\n\n\n\n\n::: {#18 .cell execution_count=1}\n``` {.julia .cell-code}\n@filter penguins species == \"Adelie\" sex == \"male\" body_mass_g > 4000\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
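34×7 DataFrame (9 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 2 | Adelie | Torgersen | 34.6 | 21.1 | 198 | 4400 | male |
| 3 | Adelie | Torgersen | 42.5 | 20.7 | 197 | 4500 | male |
| 4 | Adelie | Torgersen | 46.0 | 21.5 | 194 | 4200 | male |
| 5 | Adelie | Dream | 39.2 | 21.1 | 196 | 4150 | male |
| 6 | Adelie | Dream | 39.8 | 19.1 | 184 | 4650 | male |
| 7 | Adelie | Dream | 44.1 | 19.7 | 196 | 4400 | male |
| 8 | Adelie | Dream | 39.6 | 18.8 | 190 | 4600 | male |
| 9 | Adelie | Dream | 42.3 | 21.2 | 191 | 4150 | male |
| 10 | Adelie | Biscoe | 40.1 | 18.9 | 188 | 4300 | male |
| 11 | Adelie | Biscoe | 42.0 | 19.5 | 200 | 4050 | male |
| 12 | Adelie | Biscoe | 41.3 | 21.1 | 195 | 4400 | male |
| 13 | Adelie | Biscoe | 41.1 | 18.2 | 192 | 4050 | male |
| 23 | Adelie | Dream | 40.3 | 18.5 | 196 | 4350 | male |
| 24 | Adelie | Dream | 43.2 | 18.5 | 192 | 4100 | male |
| 25 | Adelie | Biscoe | 41.0 | 20.0 | 203 | 4725 | male |
| 26 | Adelie | Biscoe | 37.8 | 20.0 | 190 | 4250 | male |
| 27 | Adelie | Biscoe | 43.2 | 19.0 | 197 | 4775 | male |
| 28 | Adelie | Biscoe | 45.6 | 20.3 | 191 | 4600 | male |
| 29 | Adelie | Biscoe | 42.2 | 19.5 | 197 | 4275 | male |
| 30 | Adelie | Biscoe | 42.7 | 18.3 | 196 | 4075 | male |
| 31 | Adelie | Torgersen | 41.5 | 18.3 | 195 | 4300 | male |
| 32 | Adelie | Dream | 37.5 | 18.5 | 199 | 4475 | male |
| 33 | Adelie | Dream | 39.7 | 17.9 | 193 | 4250 | male |
| 34 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |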
\n```\n:::\n:::\n\n\n\n\n\n\n\n\n## DataFramesMeta\n\n\n\n\n\n\n::: {#20 .cell execution_count=1}\n``` {.julia .cell-code}\nDFM.@rsubset penguins :species == \"Adelie\" :sex == \"male\" :body_mass_g > 4000\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
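34×7 DataFrame (9 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 2 | Adelie | Torgersen | 34.6 | 21.1 | 198 | 4400 | male |
| 3 | Adelie | Torgersen | 42.5 | 20.7 | 197 | 4500 | male |
| 4 | Adelie | Torgersen | 46.0 | 21.5 | 194 | 4200 | male |
| 5 | Adelie | Dream | 39.2 | 21.1 | 196 | 4150 | male |
| 6 | Adelie | Dream | 39.8 | 19.1 | 184 | 4650 | male |
| 7 | Adelie | Dream | 44.1 | 19.7 | 196 | 4400 | male |
| 8 | Adelie | Dream | 39.6 | 18.8 | 190 | 4600 | male |
| 9 | Adelie | Dream | 42.3 | 21.2 | 191 | 4150 | male |
| 10 | Adelie | Biscoe | 40.1 | 18.9 | 188 | 4300 | male |
| 11 | Adelie | Biscoe | 42.0 | 19.5 | 200 | 4050 | male |
| 12 | Adelie | Biscoe | 41.3 | 21.1 | 195 | 4400 | male |
| 13 | Adelie | Biscoe | 41.1 | 18.2 | 192 | 4050 | male |
| 23 | Adelie | Dream | 40.3 | 18.5 | 196 | 4350 | male |
| 24 | Adelie | Dream | 43.2 | 18.5 | 192 | 4100 | male |
| 25 | Adelie | Biscoe | 41.0 | 20.0 | 203 | 4725 | male |
| 26 | Adelie | Biscoe | 37.8 | 20.0 | 190 | 4250 | male |
| 27 | Adelie | Biscoe | 43.2 | 19.0 | 197 | 4775 | male |
| 28 | Adelie | Biscoe | 45.6 | 20.3 | 191 | 4600 | male |
| 29 | Adelie | Biscoe | 42.2 | 19.5 | 197 | 4275 | male |
| 30 | Adelie | Biscoe | 42.7 | 18.3 | 196 | 4075 | male |
| 31 | Adelie | Torgersen | 41.5 | 18.3 | 195 | 4300 | male |
| 32 | Adelie | Dream | 37.5 | 18.5 | 199 | 4475 | male |
| 33 | Adelie | Dream | 39.7 | 17.9 | 193 | 4250 | male |
| 34 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |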
\n```\n:::\n:::\n\n\n\n\n\n\n\n\n## DataFrames\n\n\n\n\n\n\n::: {#22 .cell execution_count=1}\n``` {.julia .cell-code}\nfilter(r -> ((r.species == \"Adelie\") & (r.sex == \"male\") & (r.body_mass_g > 4000)) === true, penguins)\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
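34×7 DataFrame (9 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 2 | Adelie | Torgersen | 34.6 | 21.1 | 198 | 4400 | male |
| 3 | Adelie | Torgersen | 42.5 | 20.7 | 197 | 4500 | male |
| 4 | Adelie | Torgersen | 46.0 | 21.5 | 194 | 4200 | male |
| 5 | Adelie | Dream | 39.2 | 21.1 | 196 | 4150 | male |
| 6 | Adelie | Dream | 39.8 | 19.1 | 184 | 4650 | male |
| 7 | Adelie | Dream | 44.1 | 19.7 | 196 | 4400 | male |
| 8 | Adelie | Dream | 39.6 | 18.8 | 190 | 4600 | male |
| 9 | Adelie | Dream | 42.3 | 21.2 | 191 | 4150 | male |
| 10 | Adelie | Biscoe | 40.1 | 18.9 | 188 | 4300 | male |
| 11 | Adelie | Biscoe | 42.0 | 19.5 | 200 | 4050 | male |
| 12 | Adelie | Biscoe | 41.3 | 21.1 | 195 | 4400 | male |
| 13 | Adelie | Biscoe | 41.1 | 18.2 | 192 | 4050 | male |
| 23 | Adelie | Dream | 40.3 | 18.5 | 196 | 4350 | male |
| 24 | Adelie | Dream | 43.2 | 18.5 | 192 | 4100 | male |
| 25 | Adelie | Biscoe | 41.0 | 20.0 | 203 | 4725 | male |
| 26 | Adelie | Biscoe | 37.8 | 20.0 | 190 | 4250 | male |
| 27 | Adelie | Biscoe | 43.2 | 19.0 | 197 | 4775 | male |
| 28 | Adelie | Biscoe | 45.6 | 20.3 | 191 | 4600 | male |
| 29 | Adelie | Biscoe | 42.2 | 19.5 | 197 | 4275 | male |
| 30 | Adelie | Biscoe | 42.7 | 18.3 | 196 | 4075 | male |
| 31 | Adelie | Torgersen | 41.5 | 18.3 | 195 | 4300 | male |
| 32 | Adelie | Dream | 37.5 | 18.5 | 199 | 4475 | male |
| 33 | Adelie | Dream | 39.7 | 17.9 | 193 | 4250 | male |
| 34 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |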
\n```\n:::\n:::\n\n\n\n\n\n\n\n\n:::\n\n\n## Creating columns\n\n::: {.panel-tabset}\n\n## Tidier\n\n## DataFramesMeta\n\n## DataFrames\n\n:::\n\n", + "markdown": "---\n# jupyter: julia-1.10\nengine: julia\n---\n\n\n\n\n\n# Part 2: Dataframes\n\nDataframes are one of the most important objects in data science. A dataframe is a table where each row is an observation and each column is a variable.\n\nWe will use the Palmer Penguins dataset as a toy example for the remainder of the chapter.\n\n\n\n\n\n::: {#2 .cell execution_count=1}\n``` {.julia .cell-code}\nusing DataFrames, PalmerPenguins\nusing Tidier\nimport DataFramesMeta as DFM\n\npenguins = PalmerPenguins.load() |> DataFrame\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
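344×7 DataFrame (319 rows omitted)

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| 333 | Chinstrap | Dream | 45.2 | 16.6 | 191 | 3250 | female |
| 334 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male |
| 335 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male |
| 336 | Chinstrap | Dream | 45.6 | 19.4 | 194 | 3525 | female |
| 337 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male |
| 338 | Chinstrap | Dream | 46.8 | 16.5 | 189 | 3650 | female |
| 339 | Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female |
| 340 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male |
| 341 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female |
| 342 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male |
| 343 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male |
| 344 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female |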
\n```\n:::\n:::\n\n\n\n\n\n\n\n::: {.callout-note}\n\n`DataFrames.jl` is the main package for dealing with dataframes in Julia. You can use it directly to manipulate tables, but we also have 2 alternatives: DataFramesMeta and Tidier. \n\nDataFramesMeta is a collection of macros based on DataFrames.\n\nTidier is inspired by the `tidyverse` ecosystem in R. Tidier uses macros to rewrite your code into DataFrames.jl code. Because of this \"tidy\" heritage, we will often talk about the R packages that inspired the Julia ones (like `dplyr`, `tidyr` and many others).\n\nIn this book, whenever possible, we will show the different approaches in a tabset so you can compare them.\n:::\n\n## Operations\n\nLet's start with some operations that take only one dataframe as input.^[Join operations will be dealt with later.] Here is the basic terminology:\n\n- *Selecting* is when we select some columns of a dataframe, while keeping all the rows. Example: select the `species` and `sex` columns.\n\n- *Filtering* or *subsetting* is when we select a subset of rows based on some criteria. Example: all male penguins of species Adelie. The output is a dataframe with the exact same columns, but possibly fewer rows.\n\n- *Mutating* or *transforming* is when we create new columns. Example: a new column `body_mass_kg` can be obtained by dividing the column `body_mass_g` by 1000.\n\n- *Grouping* is when we split the dataframe into a collection (array) of dataframes using some criteria. Example: grouping by `species` gives us 3 dataframes, each with only one species.\n\n- *Summarising* or *combining* is when we apply some function to some columns in order to reduce the number of rows with some kind of summary (like a mean, median, max, and so on). Example: for each `species`, apply the `mean` function to the column `body_mass_g`. This will yield a dataframe with 3 rows, one for each species. 
Summarising is usually done after a grouping, so the summary is calculated within each of the groups.\n\n- *Arranging* or *ordering* is when we reorder the rows of a dataframe using some criteria.\n\nSince all these functions return a dataframe (or an array of dataframes, in the case of grouping), we can chain these operations together, with the convention that on grouped dataframes we apply the function to each one of the groups.\n\nLet's see each operation in more detail.\n\n## Comparing Tidier with DataFramesMeta\n\nThe following table lists the operations in each package:\n\n| dplyr | Tidier | DataFramesMeta | DataFrames |\n|-------------|--------------|------------------------------|--------------|\n| `select` | `@select` | `@select` | array syntax |\n| `filter` | `@filter` | `@subset` / `@rsubset` | `filter` |\n| `mutate` | `@mutate` | `@transform` / `@rtransform` | array syntax |\n| `group_by` | `@group_by` | `@groupby` | `groupby` |\n| `summarise` | `@summarise` | `@combine` | `combine` |\n| `arrange` | `@arrange` | `@orderby` / `@rorderby` | `sort!` |\n\n\nNotice that we have a name clash with `@select`: that is why we `import DataFramesMeta as DFM` at the beginning.\n\n", "supporting": [ "dataframes_files" ], diff --git a/_quarto.yml b/_quarto.yml index 2412282..b14d4f8 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -16,14 +16,26 @@ book: title: "Tidier Data Science with Julia" author: "Guilherme Vituri and Christoph Scheuch" date: "15/08/2024" + repo-url: https://github.com/vituri/TidierBook2 + + page-navigation: true reader-mode: true + page-footer: + left: | + This book is part of the Tidier organization, bringing joy to + data science in Julia. + right: | + This book was built with Quarto. 
chapters: - index.qmd - part: "Part 1: Julia basics" + # chapters: + # - dataframes.qmd + - part: dataframes.qmd chapters: - - dataframes.qmd - - part: "Part 2: Manipulating data" + - dataframes-filtering.qmd + # - part: "Part 2: Dataframes" - part: "Part 3: Reading data" - part: "Part 4: Plotting data" - part: "Part 5: Applications" diff --git a/dataframes-filtering.qmd b/dataframes-filtering.qmd new file mode 100644 index 0000000..a0c616f --- /dev/null +++ b/dataframes-filtering.qmd @@ -0,0 +1,159 @@ +--- +# jupyter: julia-1.10 +engine: julia +--- + +# Filtering + +```{julia} +using DataFrames, PalmerPenguins +using Tidier +import DataFramesMeta as DFM + +penguins = PalmerPenguins.load() |> DataFrame; +@slice_head(penguins, n = 15) +``` + +To filter a dataframe in Tidier, we use the macro `@filter`. You can use it in the form + +```{julia} +@filter(penguins, species == "Adelie") +``` + +or without parentheses, as in + +```{julia} +@filter penguins species == "Adelie" +``` + +Notice that the columns are written as if they were variables in the Julia environment. This is inspired by the `tidyverse` behaviour of data-masking: inside a tidyverse verb, column names are treated as "statistical variables" that live inside the dataframe. + +In DataFramesMeta, we have two macros for filtering: `@subset` and `@rsubset`. Use the first when your criterion needs a whole column at once, for example: + +```{julia} +DFM.@subset penguins :body_mass_g .>= mean(skipmissing(:body_mass_g)) +``` + +Notice the broadcast on `>=`. We need it because *each column is interpreted as an array*. Also, notice that we refer to columns as _symbols_ (i.e. we prepend `:` to the column name). + +In the above example, we needed the whole column `body_mass_g` to take the mean and then filter the rows based on that. 
If, however, your filtering criterion only uses information from each row, then `@rsubset` (row subset) is easier to use: it interprets each column as a value (not an array), so no broadcasting is needed: + +```{julia} +DFM.@rsubset penguins :species == "Adelie" +``` + +In both Tidier and DataFramesMeta, only the rows for which the criterion evaluates to `true` are returned. This means that you don't need to worry about `missing` values, i.e. cases where the criterion returns neither `false` nor `true`. + +## Filtering with one criterion + +Filtering all the rows with `species` = "Adelie". + +::: {.panel-tabset} + +## Tidier + +```{julia} +@filter penguins species == "Adelie" +``` + +## DataFramesMeta + +```{julia} +DFM.@rsubset penguins :species == "Adelie" +``` + +## DataFrames + +```{julia} +filter(r -> r.species == "Adelie", penguins) +``` + +::: + +## Filtering with several criteria + +Filtering all the rows with `species` = "Adelie", `sex` = "male" and `body_mass_g` > 4000. + +::: {.panel-tabset} + +## Tidier + +```{julia} +@filter penguins species == "Adelie" sex == "male" body_mass_g > 4000 +``` + +## DataFramesMeta + +```{julia} +DFM.@rsubset penguins :species == "Adelie" :sex == "male" :body_mass_g > 4000 +``` + +## DataFrames + +```{julia} +filter(r -> ((r.species == "Adelie") & (r.sex == "male") & (r.body_mass_g > 4000)) === true, penguins) +``` + +::: + + +Filtering all the rows where the `flipper_length_mm` is greater than the mean. 
+ +::: {.panel-tabset} + +## Tidier + +```{julia} +@filter penguins flipper_length_mm > mean(skipmissing(flipper_length_mm)) +``` + +## DataFramesMeta + +```{julia} +DFM.@subset penguins :flipper_length_mm .> mean(skipmissing(:flipper_length_mm)) +``` + +## DataFrames + +```{julia} +filter(r -> (r.flipper_length_mm > mean(skipmissing(penguins.flipper_length_mm))) === true, penguins) +``` + +::: + +## Filtering with a variable column name + +Suppose the name of the column you want to filter on is stored in a variable, say + +```{julia} +# filter_column = "species" +column_symbol = :species +``` + +::: {.panel-tabset} + +## Tidier + +```{julia} +# @chain penguins begin +# @filter(!!filter_column == "Adelie") +# # @select(!!filter_column) +# end +# @filter(penguins, !!filter_column == "Adelie") +``` + +## DataFramesMeta + +```{julia} +DFM.@rsubset penguins $column_symbol == "Adelie" +``` + +::: + +If the column name is a string instead of a symbol, we can write + +```{julia} +column_string = "species" + +DFM.@rsubset penguins $(Symbol(column_string)) == "Adelie" +``` \ No newline at end of file diff --git a/dataframes-mutating.qmd b/dataframes-mutating.qmd new file mode 100644 index 0000000..e58845d --- /dev/null +++ b/dataframes-mutating.qmd @@ -0,0 +1,16 @@ +--- +# jupyter: julia-1.10 +engine: julia +--- + +## Creating columns + +::: {.panel-tabset} + +## Tidier + +## DataFramesMeta + +## DataFrames + +::: \ No newline at end of file diff --git a/dataframes.qmd b/dataframes.qmd index 36c7526..0ce7baa 100644 --- a/dataframes.qmd +++ b/dataframes.qmd @@ -3,7 +3,7 @@ engine: julia --- -# Dataframes +# Part 2: Dataframes Dataframes are one of the most important objects in data science. A dataframe is a table where each row is an observation and each column is a variable. 
@@ -11,7 +11,7 @@ We will use the Palmer Penguin dataset as a toy example for the remaining of the ```{julia} using DataFrames, PalmerPenguins -using Tidier +using Tidier, Chain import DataFramesMeta as DFM penguins = PalmerPenguins.load() |> DataFrame @@ -21,33 +21,31 @@ penguins = PalmerPenguins.load() |> DataFrame `Dataframes.jl` is the main package for dealing with dataframes in Julia. You can use it directly to manipulate tables, but we also have 2 alternatives: DataFramesMeta and Tidier. -DataFramesMeta is a collection of macros +DataFramesMeta is a collection of macros based on DataFrames. -Tidier is inspired by the `tidyverse` ecosystem in R. They use macros to rewrite your code into DataFrames.jl code. +Tidier is inspired by the `tidyverse` ecosystem in R. Tidier uses macros to rewrite your code into DataFrames.jl code. Because of this "tidy" heritage, we will often talk about the R packages that inspired the Julia ones (like `dplyr`, `tidyr` and many others). -In this book, whenever reasonable, we will show the different approaches in a tabset so you can compare them! +In this book, whenever possible, we will show the different approaches in a tabset so you can compare them. ::: ## Operations -In this chapter, we will see some unary operations on dataframes. These functions take just 1 dataframe. Joins are binary operations and will be seen later. +Let's start with some operations that take only one dataframe as input.^[Join operations will be dealt with later.] Here is the basic terminology: - *Selecting* is when we select some columns of a dataframe, while keeping all the rows. Example: select the `species` and `sex` columns. - *Filtering* or *subsetting* is when we select a subset of rows based on some criteria. Example: all male penguins of species Adelie. The output is a dataframe with the exact same columns, but possibly fewer rows. -- *Mutating* is when we create new columns. 
Example: The body mass in kg is obtained dividing the column `body_mass_g` by 1000. +- *Mutating* or *transforming* is when we create new columns. Example: a new column `body_mass_kg` can be obtained by dividing the column `body_mass_g` by 1000. - *Grouping* is when we split the dataframe into a collection (array) of dataframes using some criteria. Example: grouping by `species` gives us 3 dataframes, each with only one species. -- *Summarising* is when we apply some function to some columns in order to reduce the amount of rows with some kind of summary (like a mean, median, max, and so on). Example: for each `species`, apply the `mean` function to the columns `body_mass_g`. This will yield a dataframe with 3 rows, one for each species. Summarising is usually done after a grouping, so the summary is calculated with relation to each of the groups. +- *Summarising* or *combining* is when we apply some function to some columns in order to reduce the number of rows with some kind of summary (like a mean, median, max, and so on). Example: for each `species`, apply the `mean` function to the column `body_mass_g`. This will yield a dataframe with 3 rows, one for each species. Summarising is usually done after a grouping, so the summary is calculated within each of the groups. - *Arranging* or *ordering* is when we reorder the rows of a dataframe using some criteria. Since all these functions return a dataframe (or an array of dataframes, in the case of grouping), we can chain these operations together, with the convention that on grouped dataframes we apply the function in each one of the groups. -Let's see each operation with more details. 
- ## Comparing Tidier with DataFramesMeta The following table list the operations on each package: @@ -61,102 +59,21 @@ The following table list the operations on each package: | `summarise` | `@summarise` | `@combine` | `combine` | | `arrange` | `@arrange` | `@orderby` / `@rorderby` | `sort!` | +It is clear that for those coming from `R`, Tidier will look like the most natural approach. Notice that we have a name clash with `@select`: that is why we `import DataFramesMeta as DFM` at the beginning. -## Filtering / subsetting - -To filter a dataframe in Tidier, we use the macro `@filter`. You can use it in the form +We will see each operation with more details in the following chapters. -```{julia} -@filter(penguins, species == "Adelie") -``` +## Chaining operations -or without parentesis as in +We can chain (or pipe) dataframe operations as follows with the `@chain` macro: ```{julia} -@filter penguins species == "Adelie" -``` - -Notice that the columns are typed as if they were variables on the Julia environment. This is inspired by the `tidyverse` behaviour of data-masking: inside a tidyverse verb, the columns are taken as "statistical variables" that exist inside the dataframe. - -In DataFramesMeta, we have two macros for filtering: `@subset` and `@rsubset`. Use the first when you have some criteria that uses the whole dataframe, for example: - -```{julia} -DFM.@subset penguins :body_mass_g .>= mean(skipmissing(:body_mass_g)) -``` - -Notice the broadcast on >=. We need it because each *row is interpreted as an array*. Also, notice that we call columns as _symbols_ (i.e. we append `:` to it). - -In this case, we need the whole column `body_mass_g` to take the mean and then filter the rows based on that. 
If, however, your filtering criteria only uses information about each row, then `@rsubset` (row subset) is easier to use: it interprets each columns as a value (not an array), so no broadcasting is needed: - -```{julia} -DFM.@rsubset penguins :species == "Adelie" -``` - -In both Tidier and DataFramesMeta, only the rows to which the criteria is `true` are returned. This means that you don't need to worry about `missing` values in cases where the criteria do not return `false` nor `true. - -### Filtering with one criteria - -Filtering all the rows with `species` = "Adelie". - -::: {.panel-tabset} - -## Tidier - -```{julia} -@filter penguins species == "Adelie" -``` - -## DataFramesMeta - -```{julia} -DFM.@rsubset penguins :species == "Adelie" -``` - -## DataFrames - -```{julia} -filter(r -> r.species == "Adelie", penguins) -``` - -::: - -### Filtering with several criteria - -Filtering all the rows with `species` = "Adelie", `sex` = "male" and `body_mass_g` > 4000. - -::: {.panel-tabset} - -## Tidier - -```{julia} -@filter penguins species == "Adelie" sex == "male" body_mass_g > 4000 -``` - -## DataFramesMeta - -```{julia} -DFM.@rsubset penguins :species == "Adelie" :sex == "male" :body_mass_g > 4000 -``` - -## DataFrames - -```{julia} -filter(r -> ((r.species == "Adelie") & (r.sex == "male") & (r.body_mass_g > 4000)) === true, penguins) -``` - -::: - - -## Creating columns - -::: {.panel-tabset} - -## Tidier - -## DataFramesMeta - -## DataFrames - -::: \ No newline at end of file +@chain penguins begin + @filter !ismissing(sex) + @group_by sex + @summarise mean = mean(bill_length_mm) + @arrange mean +end +``` \ No newline at end of file diff --git a/dataframes.quarto_ipynb b/dataframes.quarto_ipynb deleted file mode 100644 index 89bc9f7..0000000 --- a/dataframes.quarto_ipynb +++ /dev/null @@ -1,319 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---\n", - "jupyter: julia-1.10\n", - "---\n", - "\n", - "\n", - "\n", - "\n", - 
"\n", - "\n", - "# Dataframes\n", - "\n", - "Dataframes are on of the most important objects in data science. A dataframe is a table where each row is an observation and each column is a variable.\n", - "\n", - "We will use the Palmer Penguin dataset as a toy example for the remaining of the chapter.\n" - ], - "id": "ff50af05" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "using DataFrames, PalmerPenguins\n", - "using Tidier\n", - "import DataFramesMeta as DFM\n", - "\n", - "penguins = PalmerPenguins.load() |> DataFrame" - ], - "id": "68ceafc8", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "::: {.callout-note}\n", - "\n", - "`Dataframes.jl` is the main package for dealing with dataframes in Julia. You can use it directly to manipulate tables, but we also have 2 alternatives: DataFramesMeta and Tidier. \n", - "\n", - "DataFramesMeta is a collection of macros \n", - "\n", - "Tidier is inspired by the `tidyverse` ecosystem in R. They use macros to rewrite your code into DataFrames.jl code.\n", - "\n", - "In this book, whenever reasonable, we will show the different approaches in a tabset so you can compare them!\n", - ":::\n", - "\n", - "## Operations\n", - "\n", - "In this chapter, we will see some unary operations on dataframes. These functions take just 1 dataframe. Joins are binary operations and will be seen later.\n", - "\n", - "- *Selecting* is when we select some columns of a dataframe, while keeping all the rows. Example: select the `species` and `sex` columns.\n", - "\n", - "- *Filtering* or *subsetting* is when we select a subset of rows based on some criteria. Example: all male penguins of species Adelie. The output is a dataframe with the exact same columns, but possibly fewer rows.\n", - "\n", - "- *Mutating* is when we create new columns. 
Example: The body mass in kg is obtained dividing the column `body_mass_g` by 1000.\n", - "\n", - "- *Grouping* is when we split the dataframe into a collection (array) of dataframes using some criteria. Example: grouping by `species` gives us 3 dataframes, each with only one species.\n", - "\n", - "- *Summarising* is when we apply some function to some columns in order to reduce the amount of rows with some kind of summary (like a mean, median, max, and so on). Example: for each `species`, apply the `mean` function to the columns `body_mass_g`. This will yield a dataframe with 3 rows, one for each species. Summarising is usually done after a grouping, so the summary is calculated with relation to each of the groups.\n", - "\n", - "- *Arranging* or *ordering* is when we reorder the rows of a dataframe using some criteria.\n", - "\n", - "Since all these functions return a dataframe (or an array of dataframes, in the case of grouping), we can chain these operations together, with the convention that on grouped dataframes we apply the function in each one of the groups.\n", - "\n", - "Let's see each operation with more details.\n", - "\n", - "## Comparing Tidier with DataFramesMeta\n", - "\n", - "The following table list the operations on each package:\n", - "\n", - "| dplyr | Tidier | DataFramesMeta | DataFrames |\n", - "|-------------|--------------|------------------------------|--------------|\n", - "| `select` | `@select` | `@select` | array sintax |\n", - "| `filter` | `@filter` | `@subset` / `@rsubset` | `filter` |\n", - "| `mutate` | `@mutate` | `@transform` / `@rtransform` | array sintax |\n", - "| `group_by` | `@group_by` | `@groupby` | `groupby` |\n", - "| `summarise` | `@summarise` | `@combine` | `combine` |\n", - "| `arrange` | `@arrange` | `@orderby` / `@rorderby` | `sort!` |\n", - "\n", - "\n", - "Notice that we have a name clash with `@select`: that is why we `import DataFramesMeta as DFM`.\n", - "\n", - "## Filtering / subsetting\n", - "\n", - "To 
filter a dataframe in Tidier, we use the macro `@filter`. You can use it in the form\n" - ], - "id": "b62dba04" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "@filter(penguins, species == \"Adelie\")" - ], - "id": "aa87bf63", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "or without parentesis as in \n" - ], - "id": "e54568e0" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "@filter penguins species == \"Adelie\"" - ], - "id": "447cc245", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice that the columns are typed as if they were variables on the Julia environment. This is inspired by the `tidyverse` behaviour of data-masking: inside a tidyverse verb, the columns are taken as \"statistical variables\" that exist inside the dataframe.\n", - "\n", - "In DataFramesMeta, we have two macros for filtering: `@subset` and `@rsubset`. Use the first when you have some criteria that uses the whole dataframe, for example:\n" - ], - "id": "1ac1b7a6" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "DFM.@subset penguins :body_mass_g .>= mean(skipmissing(:body_mass_g))" - ], - "id": "996bd089", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice the broadcast on >=. We need it because each *row is interpreted as an array*. Also, notice that we call columns as _symbols_ (i.e. we append `:` to it).\n", - "\n", - "In this case, we need the whole column `body_mass_g` to take the mean and then filter the rows based on that. 
If, however, your filtering criteria only uses information about each row, then `@rsubset` (row subset) is easier to use: it interprets each columns as a value (not an array), so no broadcasting is needed:\n" - ], - "id": "c1fa3b5d" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "DFM.@rsubset penguins :species == \"Adelie\"" - ], - "id": "2bfff26f", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In both Tidier and DataFramesMeta, only the rows to which the criteria is `true` are returned. This means that you don't need to worry about `missing` values in cases where the criteria do not return `false` nor `true.\n", - "\n", - "### Filtering with one criteria\n", - "\n", - "Filtering all the rows with `species` = \"Adelie\".\n", - "\n", - "::: {.panel-tabset}\n", - "\n", - "## Tidier\n" - ], - "id": "3ab3f109" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "@filter penguins species == \"Adelie\"" - ], - "id": "145b4dbe", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## DataFramesMeta\n" - ], - "id": "ec5ac109" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "DFM.@rsubset penguins :species == \"Adelie\"" - ], - "id": "23034937", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## DataFrames\n" - ], - "id": "2e1677c6" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "filter(r -> r.species == \"Adelie\", penguins)" - ], - "id": "db6373f2", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - ":::\n", - "\n", - "### Filtering with several criteria\n", - "\n", - "Filtering all the rows with `species` = \"Adelie\", `sex` = \"male\" and `body_mass_g` > 4000.\n", - "\n", - "::: {.panel-tabset}\n", - "\n", - "## Tidier\n" - ], - "id": "34324ff3" - }, 
- { - "cell_type": "code", - "metadata": {}, - "source": [ - "@filter penguins species == \"Adelie\" sex == \"male\" body_mass_g > 4000" - ], - "id": "d09aae2f", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## DataFramesMeta\n" - ], - "id": "b62d6a8f" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "DFM.@rsubset penguins :species == \"Adelie\" :sex == \"male\" :body_mass_g > 4000" - ], - "id": "859fc6c6", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## DataFrames\n" - ], - "id": "94de5470" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "filter(r -> ((r.species == \"Adelie\") & (r.sex == \"male\") & (r.body_mass_g > 4000)) == true, penguins)" - ], - "id": "8b2476e0", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - ":::\n", - "\n", - "\n", - "## Creating columns\n", - "\n", - "::: {.panel-tabset}\n", - "\n", - "## Tidier\n", - "\n", - "## DataFramesMeta\n", - "\n", - "## DataFrames\n", - "\n", - ":::" - ], - "id": "34a91faa" - } - ], - "metadata": { - "kernelspec": { - "name": "julia-1.10", - "language": "julia", - "display_name": "Julia 1.10.4", - "path": "/home/vituri/.local/share/jupyter/kernels/julia-1.10" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} \ No newline at end of file diff --git a/docs/.nojekyll b/docs/.nojekyll new file mode 100644 index 0000000..e69de29 diff --git a/docs/dataframes-filtering.html b/docs/dataframes-filtering.html new file mode 100644 index 0000000..94ff961 --- /dev/null +++ b/docs/dataframes-filtering.html @@ -0,0 +1,5360 @@ + + + + + + + + + +1  Filtering – Tidier Data Science with Julia + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+ +
+ +
+ + +
+ + + +
+ +
+
+

1  Filtering

+
+ + + +
+ + + + +
+ + + +
+ + +
+
using DataFrames, PalmerPenguins
+using Tidier
+import DataFramesMeta as DFM
+
+penguins = PalmerPenguins.load() |> DataFrame;
+@slice_head(penguins, n = 15)
+
+
15×7 DataFrame
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
14AdelieTorgersen38.621.21913800male
15AdelieTorgersen34.621.11984400male
+
+
+
+

To filter a dataframe in Tidier, we use the macro @filter. You can use it in the form

+
+
@filter(penguins, species == "Adelie")
+
+
152×7 DataFrame
127 rows omitted
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
+
+
+
+

or without parentheses, as in

+
+
@filter penguins species == "Adelie"
+
+
152×7 DataFrame
127 rows omitted
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
+
+
+
+

Notice that the columns are typed as if they were variables in the Julia environment. This is inspired by the tidyverse behaviour of data-masking: inside a tidyverse verb, column names are treated as “statistical variables” that live inside the dataframe.
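To see what data-masking saves us from typing, here is a minimal sketch of the same kind of filter written without any macro. The table `toy` and its values are made up for illustration and only stand in for `penguins`.

```julia
using DataFrames

# Hypothetical toy table standing in for `penguins`
toy = DataFrame(species = ["Adelie", "Gentoo", "Adelie"],
                body_mass_g = [3750, 5000, 3800])

# Without data-masking, the column has to be reached through the dataframe itself
adelie = toy[toy.species .== "Adelie", :]
```

Inside `@filter`, the bare name `species` plays the role of `toy.species` here.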

+

In DataFramesMeta, we have two macros for filtering: @subset and @rsubset. Use the first when your criterion needs a whole column at once, for example:

+
+
DFM.@subset penguins :body_mass_g .>= mean(skipmissing(:body_mass_g))
+
+
149×7 DataFrame
124 rows omitted
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.219.61954675male
2AdelieTorgersen42.020.21904250missing
3AdelieTorgersen34.621.11984400male
4AdelieTorgersen42.520.71974500male
5AdelieDream39.819.11844650male
6AdelieDream44.119.71964400male
7AdelieDream39.618.81904600male
8AdelieBiscoe40.118.91884300male
9AdelieBiscoe41.321.11954400male
10AdelieTorgersen41.819.41984450male
11AdelieTorgersen42.818.51954250male
12AdelieTorgersen42.917.61964700male
13AdelieDream41.118.12054300male
138GentooBiscoe47.213.72144925female
139GentooBiscoe46.814.32154850female
140GentooBiscoe50.415.72225750male
141GentooBiscoe45.214.82125200female
142GentooBiscoe49.916.12135400male
143ChinstrapDream49.218.21954400male
144ChinstrapDream52.820.02054550male
145ChinstrapDream54.220.82014300male
146ChinstrapDream52.020.72104800male
147ChinstrapDream53.519.92054500male
148ChinstrapDream50.818.52014450male
149ChinstrapDream49.019.62124300male
+
+
+
+

Notice the broadcast on >=. We need it because each column is interpreted as an array, so the comparison has to be applied element by element. Also, notice that we refer to columns as symbols (i.e. we prepend : to the column name).
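The role of the broadcast can be checked on a plain vector, outside any dataframe. This is a small sketch; the vector `body_mass` is a made-up stand-in for the `body_mass_g` column.

```julia
using Statistics  # standard library, provides `mean`

# Hypothetical stand-in for the `body_mass_g` column
body_mass = [3750, missing, 4675, 3250]

# One number computed from the whole column, ignoring missing entries
threshold = mean(skipmissing(body_mass))

# Element-wise comparison: one result per row, `missing` where the value is unknown
keep = body_mass .>= threshold
```

Dropping the dot (`body_mass >= threshold`) would ask Julia to compare a whole vector with a single number, which is not defined and is expected to fail.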

+

In the above example, we needed the whole column body_mass_g to take the mean and then filter the rows based on that. If, however, your filtering criterion only uses information about each row, then @rsubset (row subset) is easier to use: it interprets each column as a value (not an array), so no broadcasting is needed:

+
+
DFM.@rsubset penguins :species == "Adelie"
+
+
152×7 DataFrame
127 rows omitted
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
+
+
+
+

In both Tidier and DataFramesMeta, only the rows for which the criterion is true are returned. This means that you don’t need to worry about missing values, which make the criterion return neither false nor true.
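The three possible outcomes of a comparison can be seen in plain Julia. The sketch below uses a made-up value `mass` to show what happens when the data is missing, and two common ways of turning the result into a strict `true`/`false`.

```julia
# Hypothetical row value: the body mass was not recorded
mass = missing

cond = mass > 4000                 # comparing with `missing` gives `missing`, not `false`
kept = cond === true               # only a literal `true` keeps the row
kept_alt = coalesce(cond, false)   # alternative: treat `missing` as `false`
```

As far as we know, the plain `filter` function from DataFrames does not accept a predicate that returns `missing`, which is why the DataFrames versions on this page end their condition with `=== true`.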

+
+

1.1 Filtering with one criterion

+

Filtering all the rows with species = “Adelie”.

+
+ +
+
+
+
@filter penguins species == "Adelie"
+
+
152×7 DataFrame
127 rows omitted
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
+
+
+
+
+
+
+
DFM.@rsubset penguins :species == "Adelie"
+
+
152×7 DataFrame
127 rows omitted
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
+
+
+
+
+
+
+
filter(r -> r.species == "Adelie", penguins)
+
+
152×7 DataFrame
127 rows omitted
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.118.71813750male
2AdelieTorgersen39.517.41863800female
3AdelieTorgersen40.318.01953250female
4AdelieTorgersenmissingmissingmissingmissingmissing
5AdelieTorgersen36.719.31933450female
6AdelieTorgersen39.320.61903650male
7AdelieTorgersen38.917.81813625female
8AdelieTorgersen39.219.61954675male
9AdelieTorgersen34.118.11933475missing
10AdelieTorgersen42.020.21904250missing
11AdelieTorgersen37.817.11863300missing
12AdelieTorgersen37.817.31803700missing
13AdelieTorgersen41.117.61823200female
141AdelieDream40.217.11933400female
142AdelieDream40.617.21873475male
143AdelieDream32.115.51883050female
144AdelieDream40.717.01903725male
145AdelieDream37.316.81923000female
146AdelieDream39.018.71853650male
147AdelieDream39.218.61904250male
148AdelieDream36.618.41843475female
149AdelieDream36.017.81953450female
150AdelieDream37.818.11933750male
151AdelieDream36.017.11873700female
152AdelieDream41.518.52014000male
+
+
+
+
+
+
+
+
+

1.2 Filtering with several criteria

+

Filtering all the rows with species = “Adelie”, sex = “male” and body_mass_g > 4000.

+
+ +
+
+
+
@filter penguins species == "Adelie" sex == "male" body_mass_g > 4000
+
+
34×7 DataFrame
9 rows omitted
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.219.61954675male
2AdelieTorgersen34.621.11984400male
3AdelieTorgersen42.520.71974500male
4AdelieTorgersen46.021.51944200male
5AdelieDream39.221.11964150male
6AdelieDream39.819.11844650male
7AdelieDream44.119.71964400male
8AdelieDream39.618.81904600male
9AdelieDream42.321.21914150male
10AdelieBiscoe40.118.91884300male
11AdelieBiscoe42.019.52004050male
12AdelieBiscoe41.321.11954400male
13AdelieBiscoe41.118.21924050male
23AdelieDream40.318.51964350male
24AdelieDream43.218.51924100male
25AdelieBiscoe41.020.02034725male
26AdelieBiscoe37.820.01904250male
27AdelieBiscoe43.219.01974775male
28AdelieBiscoe45.620.31914600male
29AdelieBiscoe42.219.51974275male
30AdelieBiscoe42.718.31964075male
31AdelieTorgersen41.518.31954300male
32AdelieDream37.518.51994475male
33AdelieDream39.717.91934250male
34AdelieDream39.218.61904250male
+
+
+
+
+
+
+
DFM.@rsubset penguins :species == "Adelie" :sex == "male" :body_mass_g > 4000
+
+
34×7 DataFrame
9 rows omitted
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.219.61954675male
2AdelieTorgersen34.621.11984400male
3AdelieTorgersen42.520.71974500male
4AdelieTorgersen46.021.51944200male
5AdelieDream39.221.11964150male
6AdelieDream39.819.11844650male
7AdelieDream44.119.71964400male
8AdelieDream39.618.81904600male
9AdelieDream42.321.21914150male
10AdelieBiscoe40.118.91884300male
11AdelieBiscoe42.019.52004050male
12AdelieBiscoe41.321.11954400male
13AdelieBiscoe41.118.21924050male
23AdelieDream40.318.51964350male
24AdelieDream43.218.51924100male
25AdelieBiscoe41.020.02034725male
26AdelieBiscoe37.820.01904250male
27AdelieBiscoe43.219.01974775male
28AdelieBiscoe45.620.31914600male
29AdelieBiscoe42.219.51974275male
30AdelieBiscoe42.718.31964075male
31AdelieTorgersen41.518.31954300male
32AdelieDream37.518.51994475male
33AdelieDream39.717.91934250male
34AdelieDream39.218.61904250male
+
+
+
+
+
+
+
filter(r -> ((r.species == "Adelie") & (r.sex == "male") & (r.body_mass_g > 4000)) === true, penguins)
+
+
34×7 DataFrame
9 rows omitted
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieTorgersen39.219.61954675male
2AdelieTorgersen34.621.11984400male
3AdelieTorgersen42.520.71974500male
4AdelieTorgersen46.021.51944200male
5AdelieDream39.221.11964150male
6AdelieDream39.819.11844650male
7AdelieDream44.119.71964400male
8AdelieDream39.618.81904600male
9AdelieDream42.321.21914150male
10AdelieBiscoe40.118.91884300male
11AdelieBiscoe42.019.52004050male
12AdelieBiscoe41.321.11954400male
13AdelieBiscoe41.118.21924050male
23AdelieDream40.318.51964350male
24AdelieDream43.218.51924100male
25AdelieBiscoe41.020.02034725male
26AdelieBiscoe37.820.01904250male
27AdelieBiscoe43.219.01974775male
28AdelieBiscoe45.620.31914600male
29AdelieBiscoe42.219.51974275male
30AdelieBiscoe42.718.31964075male
31AdelieTorgersen41.518.31954300male
32AdelieDream37.518.51994475male
33AdelieDream39.717.91934250male
34AdelieDream39.218.61904250male
+
+
+
+
+
+
+

Filtering all the rows where the flipper_length_mm is greater than the mean.

+
+ +
+
+
+
@filter penguins flipper_length_mm > mean(skipmissing(flipper_length_mm))
+
+
148×7 DataFrame
123 rows omitted
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieDream35.718.02023550female
2AdelieDream41.118.12054300male
3AdelieDream40.818.92084300male
4AdelieBiscoe41.020.02034725male
5AdelieTorgersen41.418.52023875male
6AdelieTorgersen44.118.02104000male
7AdelieDream41.518.52014000male
8GentooBiscoe46.113.22114500female
9GentooBiscoe50.016.32305700male
10GentooBiscoe48.714.12104450female
11GentooBiscoe50.015.22185700male
12GentooBiscoe47.614.52155400male
13GentooBiscoe46.513.52104550female
137ChinstrapDream53.519.92054500male
138ChinstrapDream49.019.52103950male
139ChinstrapDream50.818.52014450male
140ChinstrapDream49.019.62124300male
141ChinstrapDream51.419.02013950male
142ChinstrapDream50.719.72034050male
143ChinstrapDream49.319.92034050male
144ChinstrapDream50.218.82023800male
145ChinstrapDream51.919.52063950male
146ChinstrapDream55.819.82074000male
147ChinstrapDream43.518.12023400female
148ChinstrapDream50.819.02104100male
+
+
+
+
+
+
+
DFM.@subset penguins :flipper_length_mm .> mean(skipmissing(:flipper_length_mm))
+
+
148×7 DataFrame
123 rows omitted
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieDream35.718.02023550female
2AdelieDream41.118.12054300male
3AdelieDream40.818.92084300male
4AdelieBiscoe41.020.02034725male
5AdelieTorgersen41.418.52023875male
6AdelieTorgersen44.118.02104000male
7AdelieDream41.518.52014000male
8GentooBiscoe46.113.22114500female
9GentooBiscoe50.016.32305700male
10GentooBiscoe48.714.12104450female
11GentooBiscoe50.015.22185700male
12GentooBiscoe47.614.52155400male
13GentooBiscoe46.513.52104550female
137ChinstrapDream53.519.92054500male
138ChinstrapDream49.019.52103950male
139ChinstrapDream50.818.52014450male
140ChinstrapDream49.019.62124300male
141ChinstrapDream51.419.02013950male
142ChinstrapDream50.719.72034050male
143ChinstrapDream49.319.92034050male
144ChinstrapDream50.218.82023800male
145ChinstrapDream51.919.52063950male
146ChinstrapDream55.819.82074000male
147ChinstrapDream43.518.12023400female
148ChinstrapDream50.819.02104100male
+
+
+
+
+
+
+
filter(r -> (r.flipper_length_mm > mean(skipmissing(penguins.flipper_length_mm))) === true, penguins)
+
+
148×7 DataFrame
123 rows omitted
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Rowspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
String15String15Float64?Float64?Int64?Int64?String7
1AdelieDream35.718.02023550female
2AdelieDream41.118.12054300male
3AdelieDream40.818.92084300male
4AdelieBiscoe41.020.02034725male
5AdelieTorgersen41.418.52023875male
6AdelieTorgersen44.118.02104000male
7AdelieDream41.518.52014000male
8GentooBiscoe46.113.22114500female
9GentooBiscoe50.016.32305700male
10GentooBiscoe48.714.12104450female
11GentooBiscoe50.015.22185700male
12GentooBiscoe47.614.52155400male
13GentooBiscoe46.513.52104550female
137ChinstrapDream53.519.92054500male
138ChinstrapDream49.019.52103950male
139ChinstrapDream50.818.52014450male
140ChinstrapDream49.019.62124300male
141ChinstrapDream51.419.02013950male
142ChinstrapDream50.719.72034050male
143ChinstrapDream49.319.92034050male
144ChinstrapDream50.218.82023800male
145ChinstrapDream51.919.52063950male
146ChinstrapDream55.819.82074000male
147ChinstrapDream43.518.12023400female
148ChinstrapDream50.819.02104100male
+
+
+
+
+
+
+
+
+

1.3 Filtering with a variable column name

+

Suppose the name of the column you want to filter on is stored in a variable, say

+
+
# filter_column = "species"
+column_symbol = :species
+
+
:species
+
+
+
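For comparison, plain DataFrames can also take the column name from a variable, using the `column => predicate` form of `filter`. This is a minimal sketch on a made-up table `toy`, not on `penguins`.

```julia
using DataFrames

# Hypothetical toy table standing in for `penguins`
toy = DataFrame(species = ["Adelie", "Gentoo", "Adelie"],
                body_mass_g = [3750, 5000, 3800])

column_symbol = :species

# `==("Adelie")` builds a one-argument function that tests equality with "Adelie"
by_variable = filter(column_symbol => ==("Adelie"), toy)
```

The pair form passes only the chosen column to the predicate, so the predicate receives a value instead of a whole row.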
+ +
+
+
+
# @chain penguins begin
+#     @filter(!!filter_column == "Adelie")
+#     # @select(!!filter_column)
+# end
+# @filter(penguins, !!filter_column == "Adelie")
+
+
+
+
+
DFM.@rsubset penguins $column_symbol == "Adelie"
+
+
152×7 DataFrame
127 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
+
+
+
+
+
+
+
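For comparison, the same kind of filter can be written without macros, because the DataFrames.jl functions `subset` and `filter` take the column name as an ordinary value. The following is a minimal sketch, assuming a small hand-made table `toy` (a hypothetical stand-in for `penguins`, so that the example runs without PalmerPenguins):

```julia
using DataFrames

# Hypothetical stand-in for `penguins`; not the real dataset.
toy = DataFrame(
    species = ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    body_mass_g = [3750, 5700, 3800, 4050],
)

column_symbol = :species

# `subset` takes `column => function` pairs; `ByRow` applies the predicate to each row.
adelie_subset = subset(toy, column_symbol => ByRow(==("Adelie")))

# `filter` passes each row's value of the chosen column to the predicate.
adelie_filter = filter(column_symbol => ==("Adelie"), toy)
```

Both calls keep only the rows whose `species` equals `"Adelie"`. Since the column name is just a value here, no interpolation syntax such as `$` is needed.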

If the column name is stored as a string instead of a symbol, we can convert it with Symbol before interpolating it:

+
+
column_string = "species"
+
+DFM.@rsubset penguins $(Symbol(column_string)) == "Adelie"
+
+
152×7 DataFrame
127 rows omitted

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7 |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 141 | Adelie | Dream | 40.2 | 17.1 | 193 | 3400 | female |
| 142 | Adelie | Dream | 40.6 | 17.2 | 187 | 3475 | male |
| 143 | Adelie | Dream | 32.1 | 15.5 | 188 | 3050 | female |
| 144 | Adelie | Dream | 40.7 | 17.0 | 190 | 3725 | male |
| 145 | Adelie | Dream | 37.3 | 16.8 | 192 | 3000 | female |
| 146 | Adelie | Dream | 39.0 | 18.7 | 185 | 3650 | male |
| 147 | Adelie | Dream | 39.2 | 18.6 | 190 | 4250 | male |
| 148 | Adelie | Dream | 36.6 | 18.4 | 184 | 3475 | female |
| 149 | Adelie | Dream | 36.0 | 17.8 | 195 | 3450 | female |
| 150 | Adelie | Dream | 37.8 | 18.1 | 193 | 3750 | male |
| 151 | Adelie | Dream | 36.0 | 17.1 | 187 | 3700 | female |
| 152 | Adelie | Dream | 41.5 | 18.5 | 201 | 4000 | male |
+
+
+
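When the column name may arrive either as a string or as a symbol, one option is to wrap the filter in a small function. The sketch below is only an illustration: `filter_equal` and the table `toy` are hypothetical names, not part of Tidier or DataFramesMeta. It uses `isequal` so that `missing` entries, like those in the `sex` column above, are dropped rather than causing an error.

```julia
using DataFrames

# Hypothetical stand-in for `penguins`; `sex` contains a missing value on purpose.
toy = DataFrame(
    species = ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    sex = ["male", missing, "female", "male"],
)

# Hypothetical helper. `column` may be a Symbol or a string;
# DataFrames.jl accepts both in `column => function` pairs.
function filter_equal(df::AbstractDataFrame, column::Union{Symbol, AbstractString}, value)
    # `isequal(missing, value)` is `false` for a non-missing `value`, so rows with a
    # missing entry are excluded instead of making `subset` throw an error.
    return subset(df, column => ByRow(x -> isequal(x, value)))
end

males_by_string = filter_equal(toy, "sex", "male")
males_by_symbol = filter_equal(toy, :sex, "male")
```

Both calls return the same two rows, so callers do not need to convert the name with `Symbol` first.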
\ No newline at end of file
diff --git a/docs/dataframes.html b/docs/dataframes.html
index 0314771..47eab8e 100644
--- a/docs/dataframes.html
+++ b/docs/dataframes.html
@@ -7,7 +7,7 @@
-1  Dataframes – Tidier Data Science with Julia
+Part 2: Dataframes – Tidier Data Science with Julia