add dataframes part
vituri committed Sep 10, 2024
1 parent 68bb43c commit 2fb752e
Showing 12 changed files with 5,903 additions and 3,664 deletions.
16 changes: 16 additions & 0 deletions _freeze/dataframes-filtering/execute-results/html.json

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions _freeze/dataframes/execute-results/html.json

Large diffs are not rendered by default.

16 changes: 14 additions & 2 deletions _quarto.yml
@@ -16,14 +16,26 @@ book:
title: "Tidier Data Science with Julia"
author: "Guilherme Vituri and Christoph Scheuch"
date: "15/08/2024"
repo-url: https://github.com/vituri/TidierBook2

page-navigation: true
reader-mode: true
page-footer:
left: |
This book is part of the <a href="https://github.com/TidierOrg/Tidier.jl">Tidier organization</a>, bringing joy to
data science in Julia.
right: |
This book was built with <a href="https://quarto.org/">Quarto</a>.
chapters:
- index.qmd
- part: "Part 1: Julia basics"
# chapters:
# - dataframes.qmd
- part: dataframes.qmd
chapters:
- dataframes.qmd
- part: "Part 2: Manipulating data"
- dataframes-filtering.qmd
# - part: "Part 2: Dataframes"
- part: "Part 3: Reading data"
- part: "Part 4: Plotting data"
- part: "Part 5: Applications"
159 changes: 159 additions & 0 deletions dataframes-filtering.qmd
@@ -0,0 +1,159 @@
---
# jupyter: julia-1.10
engine: julia
---

# Filtering

```{julia}
using DataFrames, PalmerPenguins
using Tidier
import DataFramesMeta as DFM
penguins = PalmerPenguins.load() |> DataFrame;
@slice_head(penguins, n = 15)
```

To filter a dataframe in Tidier, we use the macro `@filter`. You can use it in the form

```{julia}
@filter(penguins, species == "Adelie")
```

or without parentheses, as in

```{julia}
@filter penguins species == "Adelie"
```

Notice that the columns are written as if they were variables in the Julia environment. This is inspired by the `tidyverse` behaviour of data-masking: inside a tidyverse verb, column names are treated as "statistical variables" that live inside the dataframe.

In DataFramesMeta, we have two macros for filtering: `@subset` and `@rsubset`. Use the first when the criterion needs a whole column at once, for example:

```{julia}
DFM.@subset penguins :body_mass_g .>= mean(skipmissing(:body_mass_g))
```

Notice the broadcast on `>=`. We need it because *each column is interpreted as an array*. Also, notice that we refer to columns as _symbols_ (i.e. we prepend `:` to the column name).
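
The role of the dot can be seen without any dataframe at all. The following is a minimal sketch in plain Julia, where a hand-written vector stands in for the column; the values are made up for illustration and are not taken from the penguins data.

```julia
using Statistics  # standard library; provides `mean`

# Hypothetical stand-in for a column that contains one missing entry.
body_mass_g = [3750, 3800, missing, 4675, 3250]

# `mean` needs the whole column; `skipmissing` drops the missing entry first.
threshold = mean(skipmissing(body_mass_g))

# The dot broadcasts `>=` over the column, giving one result per element.
# The element that is `missing` stays `missing` in the result.
keep = body_mass_g .>= threshold
```

Without the dot, `body_mass_g >= threshold` would try to compare an entire vector with a single number, which Julia rejects with an error.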

In the above example, we needed the whole column `body_mass_g` to take the mean and then filter the rows based on that. If, however, your filtering criterion only uses information from each row, then `@rsubset` (row subset) is easier to use: it interprets each column as a single value (not an array), so no broadcasting is needed:

```{julia}
DFM.@rsubset penguins :species == "Adelie"
```

In both Tidier and DataFramesMeta, only the rows for which the criterion evaluates to `true` are returned. This means that you don't need to worry about `missing` values, for which the criterion returns neither `false` nor `true`.
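
To see why `missing` needs attention at all, here is a small sketch of Julia's three-valued logic using only Base, with no dataframe package involved. It also shows what the `=== true` in the plain DataFrames examples later in this chapter is for.

```julia
# Comparing `missing` with anything yields `missing`, not `false`.
result = missing == "male"

# `&` follows three-valued logic: a definite `false` wins,
# otherwise `missing` propagates.
false_and_missing = false & missing
true_and_missing = true & missing

# `=== true` maps every outcome other than `true` (including `missing`)
# to `false`, so the predicate always returns a proper Bool.
keep_row = (true & missing) === true
```

Here `result` and `true_and_missing` are `missing`, while `false_and_missing` and `keep_row` are `false`. A predicate passed to the plain `filter` function is expected to return a `Bool`, which is why the DataFrames versions wrap the whole condition in `=== true`.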

## Filtering with one criterion

Filtering all the rows with `species == "Adelie"`.

::: {.panel-tabset}

## Tidier

```{julia}
@filter penguins species == "Adelie"
```

## DataFramesMeta

```{julia}
DFM.@rsubset penguins :species == "Adelie"
```

## DataFrames

```{julia}
filter(r -> r.species == "Adelie", penguins)
```

:::

## Filtering with several criteria

Filtering all the rows with `species == "Adelie"`, `sex == "male"` and `body_mass_g > 4000`.

::: {.panel-tabset}

## Tidier

```{julia}
@filter penguins species == "Adelie" sex == "male" body_mass_g > 4000
```

## DataFramesMeta

```{julia}
DFM.@rsubset penguins :species == "Adelie" :sex == "male" :body_mass_g > 4000
```

## DataFrames

```{julia}
filter(r -> ((r.species == "Adelie") & (r.sex == "male") & (r.body_mass_g > 4000)) === true, penguins)
```

:::


Filtering all the rows where the `flipper_length_mm` is greater than the mean.

::: {.panel-tabset}

## Tidier

```{julia}
@filter penguins flipper_length_mm > mean(skipmissing(flipper_length_mm))
```

## DataFramesMeta

```{julia}
DFM.@subset penguins :flipper_length_mm .> mean(skipmissing(:flipper_length_mm))
```

## DataFrames

```{julia}
filter(r -> (r.flipper_length_mm > mean(skipmissing(penguins.flipper_length_mm))) === true, penguins)
```

:::

## Filtering with a variable column name

Suppose the name of the column you want to filter on is stored in a variable, say

```{julia}
# filter_column = "species"
column_symbol = :species
```

::: {.panel-tabset}

## Tidier

```{julia}
# @chain penguins begin
# @filter(!!filter_column == "Adelie")
# # @select(!!filter_column)
# end
# @filter(penguins, !!filter_column == "Adelie")
```

## DataFramesMeta

```{julia}
DFM.@rsubset penguins $column_symbol == "Adelie"
```

:::

If the column name is stored as a string instead of a symbol, we can write

```{julia}
column_string = "species"
DFM.@rsubset penguins $(Symbol(column_string)) == "Adelie"
```
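
The conversion from a string to a symbol, and the use of that symbol to pick a field dynamically, can be sketched in plain Julia. The rows below are hypothetical named tuples standing in for dataframe rows (the values are made up), and `getproperty` is the function form of the dot syntax `r.species`.

```julia
# Hypothetical rows standing in for a dataframe; the values are made up.
rows = [
    (species = "Adelie", island = "Torgersen"),
    (species = "Gentoo", island = "Biscoe"),
    (species = "Adelie", island = "Dream"),
]

column_string = "species"
column_symbol = Symbol(column_string)  # converts the string to :species

# `getproperty(r, column_symbol)` looks up the field named by the symbol,
# so the column can be chosen at run time.
adelie_rows = filter(r -> getproperty(r, column_symbol) == "Adelie", rows)
```

The same pattern should carry over to a `filter` predicate on a DataFrames row, since rows there also support `getproperty` with a symbol.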
16 changes: 16 additions & 0 deletions dataframes-mutating.qmd
@@ -0,0 +1,16 @@
---
# jupyter: julia-1.10
engine: julia
---

## Creating columns

::: {.panel-tabset}

## Tidier

## DataFramesMeta

## DataFrames

:::
121 changes: 19 additions & 102 deletions dataframes.qmd
@@ -3,15 +3,15 @@
engine: julia
---

# Dataframes
# Part 2: Dataframes

Dataframes are one of the most important objects in data science. A dataframe is a table where each row is an observation and each column is a variable.

We will use the Palmer Penguins dataset as a toy example for the remainder of the chapter.

```{julia}
using DataFrames, PalmerPenguins
using Tidier
using Tidier, Chain
import DataFramesMeta as DFM
penguins = PalmerPenguins.load() |> DataFrame
@@ -21,33 +21,31 @@ penguins = PalmerPenguins.load() |> DataFrame

`DataFrames.jl` is the main package for dealing with dataframes in Julia. You can use it directly to manipulate tables, but we also have two alternatives: DataFramesMeta and Tidier.

DataFramesMeta is a collection of macros
DataFramesMeta is a collection of macros based on DataFrames.

Tidier is inspired by the `tidyverse` ecosystem in R. They use macros to rewrite your code into DataFrames.jl code.
Tidier is inspired by the `tidyverse` ecosystem in R. Tidier uses macros to rewrite your code into DataFrames.jl code. Because of this "tidy" heritage, we will often talk about the R packages that inspired the Julia ones (like `dplyr`, `tidyr` and many others).

In this book, whenever reasonable, we will show the different approaches in a tabset so you can compare them!
In this book, whenever possible, we will show the different approaches in a tabset so you can compare them.
:::

## Operations

In this chapter, we will see some unary operations on dataframes. These functions take just 1 dataframe. Joins are binary operations and will be seen later.
Let's start with some operations that take only one dataframe as input.^[Join operations will be dealt with later.] Here is the basic terminology:

- *Selecting* is when we select some columns of a dataframe, while keeping all the rows. Example: select the `species` and `sex` columns.

- *Filtering* or *subsetting* is when we select a subset of rows based on some criteria. Example: all male penguins of species Adelie. The output is a dataframe with the exact same columns, but possibly fewer rows.

- *Mutating* is when we create new columns. Example: The body mass in kg is obtained dividing the column `body_mass_g` by 1000.
- *Mutating* or *transforming* is when we create new columns. Example: a new column `body_mass_kg` can be obtained by dividing the column `body_mass_g` by 1000.

- *Grouping* is when we split the dataframe into a collection (array) of dataframes using some criteria. Example: grouping by `species` gives us 3 dataframes, each with only one species.

- *Summarising* is when we apply some function to some columns in order to reduce the amount of rows with some kind of summary (like a mean, median, max, and so on). Example: for each `species`, apply the `mean` function to the columns `body_mass_g`. This will yield a dataframe with 3 rows, one for each species. Summarising is usually done after a grouping, so the summary is calculated with relation to each of the groups.
- *Summarising* or *combining* is when we apply some function to some columns in order to reduce the number of rows with some kind of summary (like a mean, median, max, and so on). Example: for each `species`, apply the `mean` function to the column `body_mass_g`. This will yield a dataframe with 3 rows, one for each species. Summarising is usually done after a grouping, so the summary is calculated with respect to each of the groups.

- *Arranging* or *ordering* is when we reorder the rows of a dataframe using some criteria.

Since all these functions return a dataframe (or an array of dataframes, in the case of grouping), we can chain these operations together, with the convention that on grouped dataframes we apply the function in each one of the groups.

Let's see each operation with more details.

## Comparing Tidier with DataFramesMeta

The following table lists the operations in each package:
@@ -61,102 +59,21 @@ The following table list the operations on each package:
| `summarise` | `@summarise` | `@combine` | `combine` |
| `arrange` | `@arrange` | `@orderby` / `@rorderby` | `sort!` |

It is clear that for those coming from `R`, Tidier will look like the most natural approach.

Notice that we have a name clash with `@select`: that is why we `import DataFramesMeta as DFM` at the beginning.

## Filtering / subsetting

To filter a dataframe in Tidier, we use the macro `@filter`. You can use it in the form
We will see each operation in more detail in the following chapters.

```{julia}
@filter(penguins, species == "Adelie")
```
## Chaining operations

or without parentesis as in
We can chain (or pipe) dataframe operations as follows with the `@chain` macro:

```{julia}
@filter penguins species == "Adelie"
```

Notice that the columns are typed as if they were variables on the Julia environment. This is inspired by the `tidyverse` behaviour of data-masking: inside a tidyverse verb, the columns are taken as "statistical variables" that exist inside the dataframe.

In DataFramesMeta, we have two macros for filtering: `@subset` and `@rsubset`. Use the first when you have some criteria that uses the whole dataframe, for example:

```{julia}
DFM.@subset penguins :body_mass_g .>= mean(skipmissing(:body_mass_g))
```

Notice the broadcast on >=. We need it because each *row is interpreted as an array*. Also, notice that we call columns as _symbols_ (i.e. we append `:` to it).

In this case, we need the whole column `body_mass_g` to take the mean and then filter the rows based on that. If, however, your filtering criteria only uses information about each row, then `@rsubset` (row subset) is easier to use: it interprets each columns as a value (not an array), so no broadcasting is needed:

```{julia}
DFM.@rsubset penguins :species == "Adelie"
```

In both Tidier and DataFramesMeta, only the rows to which the criteria is `true` are returned. This means that you don't need to worry about `missing` values in cases where the criteria do not return `false` nor `true.

### Filtering with one criteria

Filtering all the rows with `species` = "Adelie".

::: {.panel-tabset}

## Tidier

```{julia}
@filter penguins species == "Adelie"
```

## DataFramesMeta

```{julia}
DFM.@rsubset penguins :species == "Adelie"
```

## DataFrames

```{julia}
filter(r -> r.species == "Adelie", penguins)
```

:::

### Filtering with several criteria

Filtering all the rows with `species` = "Adelie", `sex` = "male" and `body_mass_g` > 4000.

::: {.panel-tabset}

## Tidier

```{julia}
@filter penguins species == "Adelie" sex == "male" body_mass_g > 4000
```

## DataFramesMeta

```{julia}
DFM.@rsubset penguins :species == "Adelie" :sex == "male" :body_mass_g > 4000
```

## DataFrames

```{julia}
filter(r -> ((r.species == "Adelie") & (r.sex == "male") & (r.body_mass_g > 4000)) === true, penguins)
```

:::


## Creating columns

::: {.panel-tabset}

## Tidier

## DataFramesMeta

## DataFrames

:::
@chain penguins begin
@filter !ismissing(sex)
@group_by sex
@summarise mean = mean(bill_length_mm)
@arrange mean
end
```