add eval to interpolate in tidier
vituri committed Oct 13, 2024
1 parent cab7a58 commit 1870874
Showing 17 changed files with 2,822 additions and 831 deletions.
3 changes: 3 additions & 0 deletions Project.toml
@@ -2,7 +2,10 @@
Chain = "8be319e6-bccf-4806-a6f7-6fae938471bc"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
DataFramesMeta = "1313f7d8-7da2-5740-9ea0-a2ca25f37964"
IJulia = "7073ff75-c697-5162-941a-fcdaad2a7d2a"
PalmerPenguins = "8b842266-38fa-440a-9b57-31493939ab85"
QuartoNotebookRunner = "4c0109c6-14e9-4c88-93f0-2b974d3468f4"
REPL = "3fa0cd96-eef1-5676-8a61-b3b8758bbffb"
Tidier = "f0413319-3358-4bb0-8e7c-0c83523a93bd"
TidierData = "fe2206b3-d496-4ee9-a338-6a095c4ece80"
TidierFiles = "8ae5e7a9-bdd3-4c93-9cc3-9df4d5d947db"
4 changes: 2 additions & 2 deletions _freeze/dataframes-columns/execute-results/html.json

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions _freeze/dataframes-rows/execute-results/html.json

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions _freeze/dataframes/execute-results/html.json

Large diffs are not rendered by default.

8 changes: 6 additions & 2 deletions dataframes-columns.qmd
@@ -18,6 +18,8 @@ penguins = PalmerPenguins.load() |> DataFrame;

### Selecting `n` columns

**Problem:** Select only some columns.

::: {.panel-tabset}

## Tidier
@@ -42,6 +44,8 @@ DFM.select(penguins, [:species, :body_mass_g])

### Selecting columns from a variable

**Problem:** Select only some columns whose names are stored in a variable.

::: {.panel-tabset}

```{julia}
@@ -51,7 +55,7 @@ my_columns = [:species, :body_mass_g];
## Tidier

```{julia}
@select penguins !!my_columns
@eval @select penguins $my_columns...
```

## DataFramesMeta
@@ -72,7 +76,7 @@ DFM.select(penguins, my_columns)

### Creating one column based on another one

Create the column `body_mass_kg` by dividing `body_mass_g` by 1000.
**Problem:** Create the column `body_mass_kg` by dividing `body_mass_g` by 1000.

::: {.panel-tabset}

52 changes: 33 additions & 19 deletions dataframes-rows.qmd
@@ -5,6 +5,10 @@ engine: julia

# Operations on rows

In this chapter we will see operations that deal with rows, whether reordering them or throwing some of them away.

The following is necessary to run all examples:

```{julia}
using DataFrames, PalmerPenguins
using Tidier
@@ -14,11 +18,11 @@ penguins = PalmerPenguins.load() |> DataFrame;
@slice_head(penguins, n = 10)
```

## Filtering (or: throwing lines away)
## Filtering (or: throwing rows away)

To filter a dataframe means keeping only the rows that satisfy a certain criteria (ie. a boolean condition).
To *filter* a dataframe means keeping only the rows that satisfy a certain criterion (i.e. a boolean condition).

To filter a dataframe in Tidier, we use the macro `@filter`. You can use it in the form
To filter in Tidier, we use the macro `@filter`. You can use it in the form

```{julia}
@filter(penguins, species == "Adelie")
@@ -40,7 +44,7 @@ DFM.@subset penguins :body_mass_g .>= mean(skipmissing(:body_mass_g))

Notice the broadcast on `>=`. We need it because *each variable is interpreted as a vector (the whole column)*. Also, notice that we refer to columns as _symbols_ (i.e. we prepend `:` to the column name).

In the above example, we needed the whole column `body_mass_g` to take the mean and then filter the rows based on that. If, however, your filtering criteria only uses information about each row (without needing to see it in context of the whole column), then `@rsubset` (row subset) is easier to use: it interprets each columns as a value (not an array), so no broadcasting is needed:
In the above example, we needed the whole column `body_mass_g` to take the mean and then filter the rows based on that. If, however, your filtering criterion only uses information about each row (without needing to see it in the context of the whole column), then `@rsubset` (**r**ow subset) is easier to use: it interprets each column as a value (not an array), so no broadcasting is needed:

```{julia}
DFM.@rsubset penguins :species == "Adelie"
@@ -57,11 +61,11 @@ subset(penguins, :column => boolean_function)
```

where `boolean_function` is a boolean (with possibly `missing` values) function on 1 variable. Add the kwarg `skipmissing=true` if you want to get rid of missing values.
where `boolean_function` is a function of one variable (the `:column` you passed) that returns booleans (and possibly `missing` values). Add the kwarg `skipmissing=true` if you want the rows where the result is `missing` to be dropped.
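
As a minimal sketch of the `subset` form described above, here is the same pattern on a small made-up dataframe instead of `penguins` (the `id` column and all values are assumptions chosen for illustration):

```julia
using DataFrames

# Toy data standing in for `penguins`; the values are invented.
df = DataFrame(id = 1:4, species = ["Adelie", missing, "Gentoo", "Adelie"])

# The function receives the whole column, so the comparison is broadcast.
# Comparing with `missing` yields `missing`; `skipmissing = true` drops those rows.
adelie = subset(df, :species => x -> x .== "Adelie", skipmissing = true)
```

Without `skipmissing = true`, the `missing` produced by the comparison makes `subset` raise an error instead of silently dropping the row.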

### Filtering with one criteria

Filtering all the rows with `species` == "Adelie".
**Problem:** Filtering all the rows with `species` == "Adelie".

::: {.panel-tabset}

@@ -87,7 +91,7 @@ subset(penguins, :species => x -> x .== "Adelie", skipmissing=true)

### Filtering with several criteria

Filtering all the rows with `species` == "Adelie", `sex` == "male" and `body_mass_g` > 4000.
**Problem:** Filtering all the rows with `species` == "Adelie", `sex` == "male" and `body_mass_g` > 4000.

::: {.panel-tabset}

@@ -116,8 +120,7 @@ subset(

:::


Filtering all the rows with `species` == "Adelie" OR `sex` == "male".
**Problem:** Filtering all the rows with `species` == "Adelie" OR `sex` == "male".

::: {.panel-tabset}

@@ -141,8 +144,11 @@ subset(penguins, [:species, :sex] => (x, y) -> (x .== "Adelie") .| (y .== "male"

:::

### Filtering with metadata

Filtering all the rows where the `flipper_length_mm` is greater than the mean.
By metadata here we mean data derived from the dataframe itself, such as the mean/max/min of a column.

**Problem:** Filtering all the rows where the `flipper_length_mm` is greater than the mean.
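
Before looking at the package-specific syntax, the underlying computation can be sketched with plain vectors. This is only an illustration on invented numbers, not the `penguins` data:

```julia
using Statistics

# Invented flipper lengths, with one missing measurement.
flipper_length_mm = [181, 186, missing, 210]

# `skipmissing` ignores the missing entry when computing the mean.
col_mean = mean(skipmissing(flipper_length_mm))

# Broadcasting compares each element with the scalar mean;
# the missing entry stays `missing` in the resulting mask.
mask = flipper_length_mm .> col_mean
```

The mask has one entry per row, and each approach in the tabs has to decide what to do with the rows where it is `missing`.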

::: {.panel-tabset}

@@ -168,14 +174,22 @@ subset(penguins, :flipper_length_mm => x -> x .> mean(skipmissing(x)), skipmissi

### Filtering with a variable column name

Suppose the column you want to filter is a variable, let's say
Suppose the name of the column you want to filter on is stored in a variable, let's say as a symbol

```{julia}
my_column = :species;
```

**Problem:** Filtering all the rows where the column stored in `my_column` is "Adelie".

::: {.panel-tabset}

## Tidier

```{julia}
@eval @filter penguins $my_column == "Adelie"
```
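
A rough way to see what `@eval` with `$` changes, using only base Julia. The macro `@received` below is a hypothetical helper written for this illustration; it is not part of Tidier and says nothing about how Tidier processes its arguments internally:

```julia
# Hypothetical helper macro: returns the expression it was given, unevaluated.
macro received(ex)
    return QuoteNode(ex)
end

my_column = :species

# A macro sees source code, so here it receives the variable's *name*.
plain = @received my_column

# `@eval` splices the *value* of `my_column` into the expression first,
# so the macro receives `species`, as if it had been typed literally.
spliced = @eval @received $my_column
```

Here `plain` is the symbol `:my_column`, while `spliced` is `:species`. Note that `@eval` evaluates in the module's global scope, so this pattern is best suited to top-level code such as notebook cells.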

## DataFramesMeta

```{julia}
@@ -196,16 +210,17 @@ In case the column is a string
my_column_string = "species";
```

instead of a symbol, we can write in the same way
instead of a symbol, we can write it in the same way, taking care in Tidier to convert the string to a symbol first

::: {.panel-tabset}

## Tidier

```{julia}
# @filter(penguins, !!my_column == "Adelie")
@eval @filter penguins $(Symbol(my_column_string)) == "Adelie"
```
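
To see why the conversion matters, the following base-Julia sketch compares what gets spliced into a macro call in each case. The macro `@received_arg` is a hypothetical helper for this illustration only, not a Tidier function:

```julia
# Hypothetical helper macro: returns the expression it was given, unevaluated.
macro received_arg(ex)
    return QuoteNode(ex)
end

my_column_string = "species"

# Interpolating the string directly splices in a string literal...
as_string = @eval @received_arg $my_column_string

# ...while converting first splices in an identifier, like a typed column name.
as_symbol = @eval @received_arg $(Symbol(my_column_string))
```

A column name typed directly inside a macro call reaches the macro as a symbol, so `as_symbol` matches that case, whereas `as_string` would look like the literal `"species"`.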


## DataFramesMeta

```{julia}
@@ -222,11 +237,11 @@ subset(penguins, my_column_string => x -> x .== "Adelie")

## Arranging

Arranging is when we reorder the rows of a dataframe according to some columns. The rows are first arranged by the first column, then by the second (if any), and so on. In Tidier, when we want to invert the ordering, just put the column name inside a `desc()` call.
To *arrange* a dataframe means to reorder its rows according to the values of some columns. The rows are first arranged by the first column, then by the second (if any), and so on. In Tidier, to invert the ordering, wrap the column name in a `desc()` call.

### Arranging by one column

Arrange by `body_mass_g`.
**Problem:** Arrange by `body_mass_g`.

::: {.panel-tabset}

@@ -252,7 +267,7 @@ sort(penguins, :body_mass_g)

### Arranging by two columns, with one reversed

First arrange by `island`, then by reversed `body_mass_g`.
**Problem:** First arrange by `island`, then by reversed `body_mass_g`.

::: {.panel-tabset}

@@ -280,7 +295,7 @@ sort(penguins, [order(:island), order(:body_mass_g, rev=true)])

### Arranging by one variable column

Let's arrange the data by the following column:
**Problem:** Arrange by a column stored in a variable `my_arrange_column`.

```{julia}
my_arrange_column = :body_mass_g;
@@ -291,8 +306,7 @@ my_arrange_column = :body_mass_g;
## Tidier

```{julia}
#?? how to do it?
# @arrange penguins !!my_arrange_column
@eval @arrange penguins $my_arrange_column
```
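
For comparison, a small sketch of the ordering semantics with plain DataFrames `sort`, on a made-up dataframe (all values are invented). It shows arranging by a column held in a variable, and the "first column, then second column reversed" rule described earlier:

```julia
using DataFrames

# Toy data standing in for `penguins`; the values are invented.
df = DataFrame(
    island = ["Dream", "Biscoe", "Dream", "Biscoe"],
    body_mass_g = [3500, 4000, 4200, 3800],
)

my_arrange_column = :body_mass_g

# A symbol held in a variable can be passed to `sort` directly.
by_variable = sort(df, my_arrange_column)

# Rows are ordered by `island` first; ties are broken by descending mass.
two_keys = sort(df, [order(:island), order(:body_mass_g, rev = true)])
```

Because `sort` is an ordinary function, the variable is evaluated like any other argument and no interpolation is needed.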

## DataFramesMeta