fix: add new examples
Layalchristine24 committed Aug 23, 2024
1 parent f814f39 commit 391359a
Showing 2 changed files with 45 additions and 6 deletions.
4 changes: 2 additions & 2 deletions _freeze/posts/2024-08-23_sums/index/execute-results/html.json
@@ -1,8 +1,8 @@
{
"hash": "79a80ef224034957a04b5b5cfedd4c8c",
"hash": "95dfb400abd3ac591bf9990eab86a4f2",
"result": {
"engine": "knitr",
"markdown": "---\ntitle: What are the different ways of summing a variable?\nauthor:\n - name:\n given: Layal Christine\n family: Lettry\n orcid: 0009-0008-6396-0523\n affiliations:\n - id: cynkra\n - name: cynkra GmbH\n city: Zurich\n state: CH\n - id: unifr\n - name: University of Fribourg, Dept. of Informatics, ASAM Group\n city: Fribourg\n state: CH\ndate: 2024-08-23\ncategories: [dplyr, constructive, groups, reframe, summarise, count]\nimage: image.jpg\ncitation: \n url: https://rdiscovery.netlify.app/posts/2024-08-23_sums/\nformat:\n html:\n toc: true\n toc-depth: 6\n toc-title: Contents\n toc-location: right\n number-sections: false\neditor_options: \n chunk_output_type: console\n---\n\n\n\n*How can you summarise a tibble?*\n\n# Initial object\n\nLet's assume that we have the tibble `my_tib` where the variables are `my_chars`, `my_years`, `my_ints` and `my_nums`.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_tib <-\n tibble::tibble(\n my_chars = c(rep(LETTERS[1:2], 2), LETTERS[1]),\n my_years = rep(c(2021L, 2025L), 2:3),\n my_ints = 1L:5L,\n my_nums = 1.5:5.5\n )\n\nconstructive::construct(my_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n my_chars = c(\"A\", \"B\", \"A\", \"B\", \"A\"),\n my_years = rep(c(2021L, 2025L), 2:3),\n my_ints = 1:5,\n my_nums = seq(1.5, 5.5, by = 1),\n)\n```\n\n\n:::\n:::\n\n\n\n# Use `summarise()` from dplyr\n\nTo obtain the total sum of each of the variables for each of the letters `A` and `B` and for each year, we can use `dplyr::summarise()` (or `dplyr::summarize()`) together with `dplyr::goup_by()` before it and `dplyr::ungroup()` at the end, to ensure we have an ungrouped tibble to work with.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_summarised_tib_grouped <-\n my_tib |>\n dplyr::group_by(my_chars, my_years) |>\n dplyr::summarise(\n my_ints_sum = sum(my_ints),\n my_nums_sum = sum(my_nums)\n )\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n`summarise()` has grouped output by 'my_chars'. 
You can override using the\n`.groups` argument.\n```\n\n\n:::\n\n```{.r .cell-code}\nconstructive::construct(my_summarised_tib_grouped)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n my_chars = rep(c(\"A\", \"B\"), each = 2L),\n my_years = rep(c(2021L, 2025L), 2),\n my_ints_sum = c(1L, 8L, 2L, 4L),\n my_nums_sum = c(1.5, 9, 2.5, 4.5),\n) |>\n dplyr::group_by(my_chars)\n```\n\n\n:::\n:::\n\n\n\nTo remove the groups, use `dplyr::ungroup()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_summarised_tib <-\n my_summarised_tib_grouped |>\n dplyr::ungroup() |>\n dplyr::arrange(my_chars)\nconstructive::construct(my_summarised_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n my_chars = rep(c(\"A\", \"B\"), each = 2L),\n my_years = rep(c(2021L, 2025L), 2),\n my_ints_sum = c(1L, 8L, 2L, 4L),\n my_nums_sum = c(1.5, 9, 2.5, 4.5),\n)\n```\n\n\n:::\n:::\n\n\n\n# Use `reframe()` from dplyr\n\nWe can also use `dplyr::reframe()`. We obtain directly an ungrouped tibble. \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_reframed_tib <-\n my_tib |>\n dplyr::reframe(\n my_ints_sum = sum(my_ints),\n my_nums_sum = sum(my_nums),\n .by = c(my_chars, my_years)\n ) |>\n dplyr::arrange(my_chars)\nconstructive::construct(my_reframed_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n my_chars = rep(c(\"A\", \"B\"), each = 2L),\n my_years = rep(c(2021L, 2025L), 2),\n my_ints_sum = c(1L, 8L, 2L, 4L),\n my_nums_sum = c(1.5, 9, 2.5, 4.5),\n)\n```\n\n\n:::\n:::\n\n\n\n\n# Use `count()` from dplyr\n\nEventually, we can run `dplyr::count()` by specifying the frequency weights with the argument `wt`.\nIf `wt` is not `NULL`, the sum of the specified variable is returned for each group given in the first argument.\nWe obtain directly an ungrouped tibble. 
\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_counted_tib <-\n my_tib |>\n dplyr::count(my_chars, my_years, wt = my_ints, name = \"my_ints_sum\") |>\n dplyr::left_join(\n my_tib |>\n dplyr::count(my_chars, my_years, wt = my_nums, name = \"my_nums_sum\"),\n by = dplyr::join_by(my_chars, my_years)\n ) |>\n dplyr::arrange(my_chars)\n\nconstructive::construct(my_counted_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n my_chars = rep(c(\"A\", \"B\"), each = 2L),\n my_years = rep(c(2021L, 2025L), 2),\n my_ints_sum = c(1L, 8L, 2L, 4L),\n my_nums_sum = c(1.5, 9, 2.5, 4.5),\n)\n```\n\n\n:::\n:::\n\n\n\n\n# Comparison of the three solutions\n\nIn my opinion, the `reframe()` solution is the best and the easiest one because you do not need to worry about ungrouping the tibble at the end of your pipe (with `summarise()`) nor joining another summarised tibble (with `count()`) to obtain the sums of the variables you want.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwaldo::compare(my_reframed_tib, my_counted_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n✔ No differences\n```\n\n\n:::\n\n```{.r .cell-code}\nwaldo::compare(my_reframed_tib, my_summarised_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n✔ No differences\n```\n\n\n:::\n:::\n",
"markdown": "---\ntitle: What are the different ways of summing a variable?\nauthor:\n - name:\n given: Layal Christine\n family: Lettry\n orcid: 0009-0008-6396-0523\n affiliations:\n - id: cynkra\n - name: cynkra GmbH\n city: Zurich\n state: CH\n - id: unifr\n - name: University of Fribourg, Dept. of Informatics, ASAM Group\n city: Fribourg\n state: CH\ndate: 2024-08-23\ncategories: [dplyr, constructive, groups, reframe, summarise, count]\nimage: image.jpg\ncitation: \n url: https://rdiscovery.netlify.app/posts/2024-08-23_sums/\nformat:\n html:\n toc: true\n toc-depth: 6\n toc-title: Contents\n toc-location: right\n number-sections: false\neditor_options: \n chunk_output_type: console\n---\n\n\n\n*How can you summarise a tibble?*\n\n# Initial object\n\nLet's assume that we have the tibble `my_tib` where the variables are `my_chars`, `my_years`, `my_ints` and `my_nums`.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_tib <-\n tibble::tibble(\n my_chars = c(rep(LETTERS[1:2], 2), LETTERS[1]),\n my_years = rep(c(2021L, 2025L), 2:3),\n my_ints = 1L:5L,\n my_nums = 1.5:5.5\n )\n\nconstructive::construct(my_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n my_chars = c(\"A\", \"B\", \"A\", \"B\", \"A\"),\n my_years = rep(c(2021L, 2025L), 2:3),\n my_ints = 1:5,\n my_nums = seq(1.5, 5.5, by = 1),\n)\n```\n\n\n:::\n:::\n\n\n\n# Use `summarise()` from dplyr\n\nTo obtain the total sum of each of the variables for each of the letters `A` and `B` and for each year, we can use [`dplyr::summarise()`](https://dplyr.tidyverse.org/reference/summarise.html) (or `dplyr::summarize()`) together with `dplyr::goup_by()` before it and `dplyr::ungroup()` at the end, to ensure we have an ungrouped tibble to work with.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_summarised_tib_grouped <-\n my_tib |>\n dplyr::group_by(my_chars, my_years) |>\n dplyr::summarise(\n my_ints_sum = sum(my_ints),\n my_nums_sum = sum(my_nums)\n )\n```\n\n::: {.cell-output 
.cell-output-stderr}\n\n```\n`summarise()` has grouped output by 'my_chars'. You can override using the\n`.groups` argument.\n```\n\n\n:::\n\n```{.r .cell-code}\nconstructive::construct(my_summarised_tib_grouped)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n my_chars = rep(c(\"A\", \"B\"), each = 2L),\n my_years = rep(c(2021L, 2025L), 2),\n my_ints_sum = c(1L, 8L, 2L, 4L),\n my_nums_sum = c(1.5, 9, 2.5, 4.5),\n) |>\n dplyr::group_by(my_chars)\n```\n\n\n:::\n:::\n\n\n\nTo remove the groups, use `dplyr::ungroup()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_summarised_tib <-\n my_summarised_tib_grouped |>\n dplyr::ungroup() |>\n dplyr::arrange(my_chars)\nconstructive::construct(my_summarised_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n my_chars = rep(c(\"A\", \"B\"), each = 2L),\n my_years = rep(c(2021L, 2025L), 2),\n my_ints_sum = c(1L, 8L, 2L, 4L),\n my_nums_sum = c(1.5, 9, 2.5, 4.5),\n)\n```\n\n\n:::\n:::\n\n\n\nPlease note that using `.by` or `.groups = \"drop\"` will allow you to return an ungrouped tibble automatically.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_summarised_tib_alt <-\n my_tib |>\n dplyr::summarise(\n my_ints_sum = sum(my_ints),\n my_nums_sum = sum(my_nums),\n .by = c(my_chars, my_years)\n )\nconstructive::construct(my_summarised_tib_alt)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n my_chars = rep(c(\"A\", \"B\"), 2),\n my_years = rep(c(2021L, 2025L), each = 2L),\n my_ints_sum = c(1L, 2L, 8L, 4L),\n my_nums_sum = c(1.5, 2.5, 9, 4.5),\n)\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_summarised_tib_alt_drop <-\n my_tib |>\n dplyr::group_by(my_chars, my_years) |>\n dplyr::summarise(\n my_ints_sum = sum(my_ints),\n my_nums_sum = sum(my_nums),\n .groups = \"drop\"\n )\nconstructive::construct(my_summarised_tib_alt_drop)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n my_chars = rep(c(\"A\", \"B\"), each = 2L),\n my_years = 
rep(c(2021L, 2025L), 2),\n my_ints_sum = c(1L, 8L, 2L, 4L),\n my_nums_sum = c(1.5, 9, 2.5, 4.5),\n)\n```\n\n\n:::\n:::\n\n\n\n# Use `reframe()` from dplyr\n\nWe can also use [`dplyr::reframe()`](https://dplyr.tidyverse.org/reference/reframe.html). We obtain directly an ungrouped tibble. \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_reframed_tib <-\n my_tib |>\n dplyr::reframe(\n my_ints_sum = sum(my_ints),\n my_nums_sum = sum(my_nums),\n .by = c(my_chars, my_years)\n ) |>\n dplyr::arrange(my_chars)\nconstructive::construct(my_reframed_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n my_chars = rep(c(\"A\", \"B\"), each = 2L),\n my_years = rep(c(2021L, 2025L), 2),\n my_ints_sum = c(1L, 8L, 2L, 4L),\n my_nums_sum = c(1.5, 9, 2.5, 4.5),\n)\n```\n\n\n:::\n:::\n\n\n\nAn alternative using `dplyr::group_by()` also exists. Note that this solution also returns an ungrouped tibble automatically thanks to `dplyr::reframe()`.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_reframed_tib_alt <-\n my_tib |>\n dplyr::group_by(my_chars, my_years) |>\n dplyr::reframe(\n my_ints_sum = sum(my_ints),\n my_nums_sum = sum(my_nums)\n ) |>\n dplyr::arrange(my_chars)\nconstructive::construct(my_reframed_tib_alt)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n my_chars = rep(c(\"A\", \"B\"), each = 2L),\n my_years = rep(c(2021L, 2025L), 2),\n my_ints_sum = c(1L, 8L, 2L, 4L),\n my_nums_sum = c(1.5, 9, 2.5, 4.5),\n)\n```\n\n\n:::\n:::\n\n\n\n# Use `count()` from dplyr\n\nEventually, we can run [`dplyr::count()`](https://dplyr.tidyverse.org/reference/count.html) by specifying the frequency weights with the argument `wt`.\nIf `wt` is not `NULL`, the sum of the specified variable is returned for each group given in the first argument.\nWe obtain directly an ungrouped tibble. 
\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_counted_tib <-\n my_tib |>\n dplyr::count(my_chars, my_years, wt = my_ints, name = \"my_ints_sum\") |>\n dplyr::left_join(\n my_tib |>\n dplyr::count(my_chars, my_years, wt = my_nums, name = \"my_nums_sum\"),\n by = dplyr::join_by(my_chars, my_years)\n ) |>\n dplyr::arrange(my_chars)\n\nconstructive::construct(my_counted_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n my_chars = rep(c(\"A\", \"B\"), each = 2L),\n my_years = rep(c(2021L, 2025L), 2),\n my_ints_sum = c(1L, 8L, 2L, 4L),\n my_nums_sum = c(1.5, 9, 2.5, 4.5),\n)\n```\n\n\n:::\n:::\n\n\n\n\n# Comparison of the three solutions\n\nIn my opinion, the `reframe()` solution is the best and the safest one because you do not need to worry about ungrouping the tibble at the end of your pipe (happening in some cases with `summarise()`) nor joining another summarised tibble (with `count()`) to obtain the sums of the variables you want.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwaldo::compare(my_reframed_tib, my_counted_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n✔ No differences\n```\n\n\n:::\n\n```{.r .cell-code}\nwaldo::compare(my_reframed_tib, my_summarised_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n✔ No differences\n```\n\n\n:::\n:::\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
47 changes: 43 additions & 4 deletions posts/2024-08-23_sums/index.qmd
@@ -50,7 +50,7 @@ constructive::construct(my_tib)

# Use `summarise()` from dplyr

To obtain the total sum of each of the variables for each of the letters `A` and `B` and for each year, we can use `dplyr::summarise()` (or `dplyr::summarize()`) together with `dplyr::goup_by()` before it and `dplyr::ungroup()` at the end, to ensure we have an ungrouped tibble to work with.
To obtain the sum of each variable for each of the letters `A` and `B` and for each year, we can use [`dplyr::summarise()`](https://dplyr.tidyverse.org/reference/summarise.html) (or `dplyr::summarize()`) with `dplyr::group_by()` before it and `dplyr::ungroup()` at the end, to ensure we have an ungrouped tibble to work with.

```{r}
my_summarised_tib_grouped <-
@@ -72,9 +72,35 @@ my_summarised_tib <-
constructive::construct(my_summarised_tib)
```

Note that using `.by` or `.groups = "drop"` returns an ungrouped tibble automatically.

```{r}
my_summarised_tib_alt <-
my_tib |>
dplyr::summarise(
my_ints_sum = sum(my_ints),
my_nums_sum = sum(my_nums),
.by = c(my_chars, my_years)
)
constructive::construct(my_summarised_tib_alt)
```


```{r}
my_summarised_tib_alt_drop <-
my_tib |>
dplyr::group_by(my_chars, my_years) |>
dplyr::summarise(
my_ints_sum = sum(my_ints),
my_nums_sum = sum(my_nums),
.groups = "drop"
)
constructive::construct(my_summarised_tib_alt_drop)
```
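
The two alternatives above return the same sums but not the same row order: `.by` keeps the groups in order of first appearance, whereas `dplyr::group_by()` sorts them by the grouping keys. The sketch below rebuilds `my_tib` so that it runs on its own; the names `by_version`, `drop_version` and `by_version_sorted` are only illustrative and do not come from the post.

```r
# Rebuild the example tibble so that this sketch is self-contained.
my_tib <-
  tibble::tibble(
    my_chars = c("A", "B", "A", "B", "A"),
    my_years = rep(c(2021L, 2025L), 2:3),
    my_ints = 1:5,
    my_nums = seq(1.5, 5.5, by = 1)
  )

# `.by` keeps the groups in their order of first appearance in the data.
by_version <-
  my_tib |>
  dplyr::summarise(
    my_ints_sum = sum(my_ints),
    my_nums_sum = sum(my_nums),
    .by = c(my_chars, my_years)
  )

# `group_by()` sorts the groups by the grouping keys.
drop_version <-
  my_tib |>
  dplyr::group_by(my_chars, my_years) |>
  dplyr::summarise(
    my_ints_sum = sum(my_ints),
    my_nums_sum = sum(my_nums),
    .groups = "drop"
  )

# Sort the `.by` result before comparing it with the `group_by()` result.
by_version_sorted <- dplyr::arrange(by_version, my_chars, my_years)
waldo::compare(by_version_sorted, drop_version)
```

This is why the other chunks in this post call `dplyr::arrange()` before comparing results.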

# Use `reframe()` from dplyr

We can also use `dplyr::reframe()`. We obtain directly an ungrouped tibble.
We can also use [`dplyr::reframe()`](https://dplyr.tidyverse.org/reference/reframe.html), which directly returns an ungrouped tibble.

```{r}
my_reframed_tib <-
@@ -88,10 +114,23 @@ my_reframed_tib <-
constructive::construct(my_reframed_tib)
```

An alternative using `dplyr::group_by()` also exists. Note that this solution also returns an ungrouped tibble automatically thanks to `dplyr::reframe()`.

```{r}
my_reframed_tib_alt <-
my_tib |>
dplyr::group_by(my_chars, my_years) |>
dplyr::reframe(
my_ints_sum = sum(my_ints),
my_nums_sum = sum(my_nums)
) |>
dplyr::arrange(my_chars)
constructive::construct(my_reframed_tib_alt)
```

# Use `count()` from dplyr

Eventually, we can run `dplyr::count()` by specifying the frequency weights with the argument `wt`.
Finally, we can run [`dplyr::count()`](https://dplyr.tidyverse.org/reference/count.html) by specifying the frequency weights with the argument `wt`.
If `wt` is not `NULL`, the sum of the specified variable is returned for each group given in the first argument.
We directly obtain an ungrouped tibble.
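
To make the role of `wt` concrete, here is a minimal sketch contrasting `dplyr::count()` with and without weights. It rebuilds `my_tib` so that it runs on its own; `row_counts`, `weighted_counts` and `summed` are illustrative names, not objects from the post.

```r
# Rebuild the example tibble so that this sketch is self-contained.
my_tib <-
  tibble::tibble(
    my_chars = c("A", "B", "A", "B", "A"),
    my_years = rep(c(2021L, 2025L), 2:3),
    my_ints = 1:5,
    my_nums = seq(1.5, 5.5, by = 1)
  )

# Without `wt`, count() returns the number of rows per group in column `n`.
row_counts <- dplyr::count(my_tib, my_chars, my_years)

# With `wt`, count() returns the sum of the weight variable per group.
weighted_counts <-
  dplyr::count(my_tib, my_chars, my_years, wt = my_ints, name = "my_ints_sum")

# The weighted count matches a grouped sum computed with summarise().
summed <-
  my_tib |>
  dplyr::summarise(my_ints_sum = sum(my_ints), .by = c(my_chars, my_years)) |>
  dplyr::arrange(my_chars, my_years)

waldo::compare(weighted_counts, summed)
```

Because `count()` takes a single `wt` variable, summing several variables this way needs one call per variable and a join afterwards.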

@@ -112,7 +151,7 @@ constructive::construct(my_counted_tib)

# Comparison of the three solutions

In my opinion, the `reframe()` solution is the best and the easiest one because you do not need to worry about ungrouping the tibble at the end of your pipe (with `summarise()`) nor joining another summarised tibble (with `count()`) to obtain the sums of the variables you want.
In my opinion, the `reframe()` solution is the best and the safest one because you need neither ungroup the tibble at the end of your pipe (which is required in some cases with `summarise()`) nor join another summarised tibble (as with `count()`) to obtain the sums of the variables you want.
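
The grouping difference can be checked directly with `dplyr::group_vars()` and `dplyr::is_grouped_df()`. The following is a small sketch under the same example data; `summarised_result` and `reframed_result` are illustrative names.

```r
# Rebuild the example tibble so that this sketch is self-contained.
my_tib <-
  tibble::tibble(
    my_chars = c("A", "B", "A", "B", "A"),
    my_years = rep(c(2021L, 2025L), 2:3),
    my_ints = 1:5,
    my_nums = seq(1.5, 5.5, by = 1)
  )

# summarise() after group_by() on two variables drops only the last
# grouping level, so the result is still grouped by `my_chars`.
summarised_result <-
  my_tib |>
  dplyr::group_by(my_chars, my_years) |>
  dplyr::summarise(my_ints_sum = sum(my_ints))

# reframe() after the same group_by() returns an ungrouped tibble.
reframed_result <-
  my_tib |>
  dplyr::group_by(my_chars, my_years) |>
  dplyr::reframe(my_ints_sum = sum(my_ints))

dplyr::group_vars(summarised_result)
dplyr::is_grouped_df(reframed_result)
```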

```{r}
waldo::compare(my_reframed_tib, my_counted_tib)