fix: add new examples

Layalchristine24 · Aug 23, 2024 · 391359a
1 parent f814f39

Showing 2 changed files with 45 additions and 6 deletions.
diff --git a/_freeze/posts/2024-08-23_sums/index/execute-results/html.json b/_freeze/posts/2024-08-23_sums/index/execute-results/html.json
@@ -1,8 +1,8 @@
 {
-  "hash": "79a80ef224034957a04b5b5cfedd4c8c",
+  "hash": "95dfb400abd3ac591bf9990eab86a4f2",
   "result": {
     "engine": "knitr",
-    "markdown": "---\ntitle: What are the different ways of summing a variable?\nauthor:\n  - name:\n      given: Layal Christine\n      family: Lettry\n      orcid: 0009-0008-6396-0523\n    affiliations:\n      - id: cynkra\n      - name: cynkra GmbH\n        city: Zurich\n        state: CH\n      - id: unifr\n      - name: University of Fribourg, Dept. of Informatics, ASAM Group\n        city: Fribourg\n        state: CH\ndate: 2024-08-23\ncategories: [dplyr, constructive, groups, reframe, summarise, count]\nimage: image.jpg\ncitation: \n  url: https://rdiscovery.netlify.app/posts/2024-08-23_sums/\nformat:\n  html:\n    toc: true\n    toc-depth: 6\n    toc-title: Contents\n    toc-location: right\n    number-sections: false\neditor_options: \n  chunk_output_type: console\n---\n\n\n\n*How can you summarise a tibble?*\n\n# Initial object\n\nLet's assume that we have the tibble `my_tib` where the variables are `my_chars`, `my_years`, `my_ints` and `my_nums`.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_tib <-\n  tibble::tibble(\n    my_chars = c(rep(LETTERS[1:2], 2), LETTERS[1]),\n    my_years = rep(c(2021L, 2025L), 2:3),\n    my_ints = 1L:5L,\n    my_nums = 1.5:5.5\n  )\n\nconstructive::construct(my_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n  my_chars = c(\"A\", \"B\", \"A\", \"B\", \"A\"),\n  my_years = rep(c(2021L, 2025L), 2:3),\n  my_ints = 1:5,\n  my_nums = seq(1.5, 5.5, by = 1),\n)\n```\n\n\n:::\n:::\n\n\n\n# Use `summarise()` from dplyr\n\nTo obtain the total sum of each of the variables for each of the letters `A` and `B` and for each year, we can use `dplyr::summarise()` (or `dplyr::summarize()`) together with `dplyr::goup_by()` before it and `dplyr::ungroup()` at the end, to ensure we have an ungrouped tibble to work with.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_summarised_tib_grouped <-\n  my_tib |>\n  dplyr::group_by(my_chars, my_years) |>\n  dplyr::summarise(\n    my_ints_sum = sum(my_ints),\n    my_nums_sum = 
sum(my_nums)\n  )\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n`summarise()` has grouped output by 'my_chars'. You can override using the\n`.groups` argument.\n```\n\n\n:::\n\n```{.r .cell-code}\nconstructive::construct(my_summarised_tib_grouped)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n  my_chars = rep(c(\"A\", \"B\"), each = 2L),\n  my_years = rep(c(2021L, 2025L), 2),\n  my_ints_sum = c(1L, 8L, 2L, 4L),\n  my_nums_sum = c(1.5, 9, 2.5, 4.5),\n) |>\n  dplyr::group_by(my_chars)\n```\n\n\n:::\n:::\n\n\n\nTo remove the groups, use `dplyr::ungroup()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_summarised_tib <-\n  my_summarised_tib_grouped |>\n  dplyr::ungroup() |>\n  dplyr::arrange(my_chars)\nconstructive::construct(my_summarised_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n  my_chars = rep(c(\"A\", \"B\"), each = 2L),\n  my_years = rep(c(2021L, 2025L), 2),\n  my_ints_sum = c(1L, 8L, 2L, 4L),\n  my_nums_sum = c(1.5, 9, 2.5, 4.5),\n)\n```\n\n\n:::\n:::\n\n\n\n# Use `reframe()` from dplyr\n\nWe can also use `dplyr::reframe()`. We obtain directly an ungrouped tibble. \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_reframed_tib <-\n  my_tib |>\n  dplyr::reframe(\n    my_ints_sum = sum(my_ints),\n    my_nums_sum = sum(my_nums),\n    .by = c(my_chars, my_years)\n  ) |>\n  dplyr::arrange(my_chars)\nconstructive::construct(my_reframed_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n  my_chars = rep(c(\"A\", \"B\"), each = 2L),\n  my_years = rep(c(2021L, 2025L), 2),\n  my_ints_sum = c(1L, 8L, 2L, 4L),\n  my_nums_sum = c(1.5, 9, 2.5, 4.5),\n)\n```\n\n\n:::\n:::\n\n\n\n\n# Use `count()` from dplyr\n\nEventually, we can run `dplyr::count()` by specifying the frequency weights with the argument `wt`.\nIf `wt` is not `NULL`, the sum of the specified variable is returned for each group given in the first argument.\nWe obtain directly an ungrouped tibble. 
\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_counted_tib <-\n  my_tib |>\n  dplyr::count(my_chars, my_years, wt = my_ints, name = \"my_ints_sum\") |>\n  dplyr::left_join(\n    my_tib |>\n      dplyr::count(my_chars, my_years, wt = my_nums, name = \"my_nums_sum\"),\n    by = dplyr::join_by(my_chars, my_years)\n  ) |>\n  dplyr::arrange(my_chars)\n\nconstructive::construct(my_counted_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n  my_chars = rep(c(\"A\", \"B\"), each = 2L),\n  my_years = rep(c(2021L, 2025L), 2),\n  my_ints_sum = c(1L, 8L, 2L, 4L),\n  my_nums_sum = c(1.5, 9, 2.5, 4.5),\n)\n```\n\n\n:::\n:::\n\n\n\n\n# Comparison of the three solutions\n\nIn my opinion, the `reframe()` solution is the best and the easiest one because you do not need to worry about ungrouping the tibble at the end of your pipe (with `summarise()`) nor joining another summarised tibble (with `count()`) to obtain the sums of the variables you want.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwaldo::compare(my_reframed_tib, my_counted_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n✔ No differences\n```\n\n\n:::\n\n```{.r .cell-code}\nwaldo::compare(my_reframed_tib, my_summarised_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n✔ No differences\n```\n\n\n:::\n:::\n",
+    "markdown": "---\ntitle: What are the different ways of summing a variable?\nauthor:\n  - name:\n      given: Layal Christine\n      family: Lettry\n      orcid: 0009-0008-6396-0523\n    affiliations:\n      - id: cynkra\n      - name: cynkra GmbH\n        city: Zurich\n        state: CH\n      - id: unifr\n      - name: University of Fribourg, Dept. of Informatics, ASAM Group\n        city: Fribourg\n        state: CH\ndate: 2024-08-23\ncategories: [dplyr, constructive, groups, reframe, summarise, count]\nimage: image.jpg\ncitation: \n  url: https://rdiscovery.netlify.app/posts/2024-08-23_sums/\nformat:\n  html:\n    toc: true\n    toc-depth: 6\n    toc-title: Contents\n    toc-location: right\n    number-sections: false\neditor_options: \n  chunk_output_type: console\n---\n\n\n\n*How can you summarise a tibble?*\n\n# Initial object\n\nLet's assume that we have the tibble `my_tib` where the variables are `my_chars`, `my_years`, `my_ints` and `my_nums`.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_tib <-\n  tibble::tibble(\n    my_chars = c(rep(LETTERS[1:2], 2), LETTERS[1]),\n    my_years = rep(c(2021L, 2025L), 2:3),\n    my_ints = 1L:5L,\n    my_nums = 1.5:5.5\n  )\n\nconstructive::construct(my_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n  my_chars = c(\"A\", \"B\", \"A\", \"B\", \"A\"),\n  my_years = rep(c(2021L, 2025L), 2:3),\n  my_ints = 1:5,\n  my_nums = seq(1.5, 5.5, by = 1),\n)\n```\n\n\n:::\n:::\n\n\n\n# Use `summarise()` from dplyr\n\nTo obtain the total sum of each of the variables for each of the letters `A` and `B` and for each year, we can use [`dplyr::summarise()`](https://dplyr.tidyverse.org/reference/summarise.html) (or `dplyr::summarize()`) together with `dplyr::goup_by()` before it and `dplyr::ungroup()` at the end, to ensure we have an ungrouped tibble to work with.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_summarised_tib_grouped <-\n  my_tib |>\n  dplyr::group_by(my_chars, my_years) |>\n  
dplyr::summarise(\n    my_ints_sum = sum(my_ints),\n    my_nums_sum = sum(my_nums)\n  )\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n`summarise()` has grouped output by 'my_chars'. You can override using the\n`.groups` argument.\n```\n\n\n:::\n\n```{.r .cell-code}\nconstructive::construct(my_summarised_tib_grouped)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n  my_chars = rep(c(\"A\", \"B\"), each = 2L),\n  my_years = rep(c(2021L, 2025L), 2),\n  my_ints_sum = c(1L, 8L, 2L, 4L),\n  my_nums_sum = c(1.5, 9, 2.5, 4.5),\n) |>\n  dplyr::group_by(my_chars)\n```\n\n\n:::\n:::\n\n\n\nTo remove the groups, use `dplyr::ungroup()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_summarised_tib <-\n  my_summarised_tib_grouped |>\n  dplyr::ungroup() |>\n  dplyr::arrange(my_chars)\nconstructive::construct(my_summarised_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n  my_chars = rep(c(\"A\", \"B\"), each = 2L),\n  my_years = rep(c(2021L, 2025L), 2),\n  my_ints_sum = c(1L, 8L, 2L, 4L),\n  my_nums_sum = c(1.5, 9, 2.5, 4.5),\n)\n```\n\n\n:::\n:::\n\n\n\nPlease note that using `.by` or `.groups = \"drop\"` will allow you to return an ungrouped tibble automatically.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_summarised_tib_alt <-\n  my_tib |>\n  dplyr::summarise(\n    my_ints_sum = sum(my_ints),\n    my_nums_sum = sum(my_nums),\n    .by = c(my_chars, my_years)\n  )\nconstructive::construct(my_summarised_tib_alt)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n  my_chars = rep(c(\"A\", \"B\"), 2),\n  my_years = rep(c(2021L, 2025L), each = 2L),\n  my_ints_sum = c(1L, 2L, 8L, 4L),\n  my_nums_sum = c(1.5, 2.5, 9, 4.5),\n)\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_summarised_tib_alt_drop <-\n  my_tib |>\n  dplyr::group_by(my_chars, my_years) |>\n  dplyr::summarise(\n    my_ints_sum = sum(my_ints),\n    my_nums_sum = sum(my_nums),\n    .groups = \"drop\"\n  
)\nconstructive::construct(my_summarised_tib_alt_drop)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n  my_chars = rep(c(\"A\", \"B\"), each = 2L),\n  my_years = rep(c(2021L, 2025L), 2),\n  my_ints_sum = c(1L, 8L, 2L, 4L),\n  my_nums_sum = c(1.5, 9, 2.5, 4.5),\n)\n```\n\n\n:::\n:::\n\n\n\n# Use `reframe()` from dplyr\n\nWe can also use [`dplyr::reframe()`](https://dplyr.tidyverse.org/reference/reframe.html). We obtain directly an ungrouped tibble. \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_reframed_tib <-\n  my_tib |>\n  dplyr::reframe(\n    my_ints_sum = sum(my_ints),\n    my_nums_sum = sum(my_nums),\n    .by = c(my_chars, my_years)\n  ) |>\n  dplyr::arrange(my_chars)\nconstructive::construct(my_reframed_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n  my_chars = rep(c(\"A\", \"B\"), each = 2L),\n  my_years = rep(c(2021L, 2025L), 2),\n  my_ints_sum = c(1L, 8L, 2L, 4L),\n  my_nums_sum = c(1.5, 9, 2.5, 4.5),\n)\n```\n\n\n:::\n:::\n\n\n\nAn alternative using `dplyr::group_by()` also exists. 
Note that this solution also returns an ungrouped tibble automatically thanks to `dplyr::reframe()`.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_reframed_tib_alt <-\n  my_tib |>\n  dplyr::group_by(my_chars, my_years) |>\n  dplyr::reframe(\n    my_ints_sum = sum(my_ints),\n    my_nums_sum = sum(my_nums)\n  ) |>\n  dplyr::arrange(my_chars)\nconstructive::construct(my_reframed_tib_alt)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n  my_chars = rep(c(\"A\", \"B\"), each = 2L),\n  my_years = rep(c(2021L, 2025L), 2),\n  my_ints_sum = c(1L, 8L, 2L, 4L),\n  my_nums_sum = c(1.5, 9, 2.5, 4.5),\n)\n```\n\n\n:::\n:::\n\n\n\n# Use `count()` from dplyr\n\nEventually, we can run [`dplyr::count()`](https://dplyr.tidyverse.org/reference/count.html) by specifying the frequency weights with the argument `wt`.\nIf `wt` is not `NULL`, the sum of the specified variable is returned for each group given in the first argument.\nWe obtain directly an ungrouped tibble. \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_counted_tib <-\n  my_tib |>\n  dplyr::count(my_chars, my_years, wt = my_ints, name = \"my_ints_sum\") |>\n  dplyr::left_join(\n    my_tib |>\n      dplyr::count(my_chars, my_years, wt = my_nums, name = \"my_nums_sum\"),\n    by = dplyr::join_by(my_chars, my_years)\n  ) |>\n  dplyr::arrange(my_chars)\n\nconstructive::construct(my_counted_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble::tibble(\n  my_chars = rep(c(\"A\", \"B\"), each = 2L),\n  my_years = rep(c(2021L, 2025L), 2),\n  my_ints_sum = c(1L, 8L, 2L, 4L),\n  my_nums_sum = c(1.5, 9, 2.5, 4.5),\n)\n```\n\n\n:::\n:::\n\n\n\n\n# Comparison of the three solutions\n\nIn my opinion, the `reframe()` solution is the best and the safest one because you do not need to worry about ungrouping the tibble at the end of your pipe (happening in some cases with `summarise()`) nor joining another summarised tibble (with `count()`) to obtain the sums of the variables you want.\n\n\n\n::: 
{.cell}\n\n```{.r .cell-code}\nwaldo::compare(my_reframed_tib, my_counted_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n✔ No differences\n```\n\n\n:::\n\n```{.r .cell-code}\nwaldo::compare(my_reframed_tib, my_summarised_tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n✔ No differences\n```\n\n\n:::\n:::\n",
     "supporting": [],
     "filters": [
       "rmarkdown/pagebreak.lua"

diff --git a/posts/2024-08-23_sums/index.qmd b/posts/2024-08-23_sums/index.qmd
@@ -50,7 +50,7 @@ constructive::construct(my_tib)
 
 # Use `summarise()` from dplyr
 
-To obtain the total sum of each of the variables for each of the letters `A` and `B` and for each year, we can use `dplyr::summarise()` (or `dplyr::summarize()`) together with `dplyr::goup_by()` before it and `dplyr::ungroup()` at the end, to ensure we have an ungrouped tibble to work with.
+To obtain the sum of each variable for each of the letters `A` and `B` and for each year, we can use [`dplyr::summarise()`](https://dplyr.tidyverse.org/reference/summarise.html) (or `dplyr::summarize()`), preceded by `dplyr::group_by()` and followed by `dplyr::ungroup()`, so that we end up with an ungrouped tibble to work with.
 
 ```{r}
 my_summarised_tib_grouped <-
@@ -72,9 +72,35 @@ my_summarised_tib <-
 constructive::construct(my_summarised_tib)
 ``` 
 
+Note that using `.by` or `.groups = "drop"` makes `summarise()` return an ungrouped tibble automatically.
+
+```{r}
+my_summarised_tib_alt <-
+  my_tib |>
+  dplyr::summarise(
+    my_ints_sum = sum(my_ints),
+    my_nums_sum = sum(my_nums),
+    .by = c(my_chars, my_years)
+  )
+constructive::construct(my_summarised_tib_alt)
+```
+
+
+```{r}
+my_summarised_tib_alt_drop <-
+  my_tib |>
+  dplyr::group_by(my_chars, my_years) |>
+  dplyr::summarise(
+    my_ints_sum = sum(my_ints),
+    my_nums_sum = sum(my_nums),
+    .groups = "drop"
+  )
+constructive::construct(my_summarised_tib_alt_drop)
+```
+
 # Use `reframe()` from dplyr
 
-We can also use `dplyr::reframe()`. We obtain directly an ungrouped tibble. 
+We can also use [`dplyr::reframe()`](https://dplyr.tidyverse.org/reference/reframe.html), which directly returns an ungrouped tibble.
 
 ```{r}
 my_reframed_tib <-
@@ -88,10 +114,23 @@ my_reframed_tib <-
 constructive::construct(my_reframed_tib)
 ```
 
+An alternative using `dplyr::group_by()` also exists. This solution returns an ungrouped tibble automatically as well, because `dplyr::reframe()` always drops the grouping.
+
+```{r}
+my_reframed_tib_alt <-
+  my_tib |>
+  dplyr::group_by(my_chars, my_years) |>
+  dplyr::reframe(
+    my_ints_sum = sum(my_ints),
+    my_nums_sum = sum(my_nums)
+  ) |>
+  dplyr::arrange(my_chars)
+constructive::construct(my_reframed_tib_alt)
+```
 
 # Use `count()` from dplyr
 
-Eventually, we can run `dplyr::count()` by specifying the frequency weights with the argument `wt`.
+Finally, we can run [`dplyr::count()`](https://dplyr.tidyverse.org/reference/count.html) by specifying the frequency weights with the argument `wt`.
 If `wt` is not `NULL`, the sum of the specified variable is returned for each group given in the first argument.
 We directly obtain an ungrouped tibble.
 
@@ -112,7 +151,7 @@ constructive::construct(my_counted_tib)
 
 # Comparison of the three solutions
 
-In my opinion, the `reframe()` solution is the best and the easiest one because you do not need to worry about ungrouping the tibble at the end of your pipe (with `summarise()`) nor joining another summarised tibble (with `count()`) to obtain the sums of the variables you want.
+In my opinion, the `reframe()` solution is the best and the safest one: you need to worry neither about ungrouping the tibble at the end of your pipe (as is sometimes necessary with `summarise()`) nor about joining another summarised tibble (as with `count()`) to obtain the sums of the variables you want.
 
 ```{r}
 waldo::compare(my_reframed_tib, my_counted_tib)