Commit

Merge pull request #18 from nhs-r-community/tidy-joins
Tidy joins
Lextuga007 authored Jan 19, 2024
2 parents 6c26a63 + 06c3267 commit 3b11875
Showing 27 changed files with 179 additions and 168 deletions.
8 changes: 7 additions & 1 deletion _quarto.yml
@@ -1,5 +1,8 @@
project:
title: "intro-r-nhse"
render:
- "*.qmd"
- "!*.md"
author: "Zoë Turner"
format:
revealjs:
@@ -10,4 +13,7 @@ format:
preview-links: true
title-slide-attributes:
data-background-color: "#43464B"
embed-resources: true
embed-resources: true
execute:
echo: true
eval: false
2 changes: 1 addition & 1 deletion session-break-slide.html
@@ -1635,7 +1635,7 @@

</section><section class="slide level2"><div class="cell">
<div class="cell-output-display">
<div class="countdown" id="timer_6c4bfcae" data-update-every="1" tabindex="0" style="right:0;bottom:0;margin:0.9em;font-size:2em;">
<div class="countdown" id="timer_1bafdc03" data-update-every="1" tabindex="0" style="right:0;bottom:0;margin:0.9em;font-size:2em;">
<div class="countdown-controls">
<button class="countdown-bump-down">−</button><button class="countdown-bump-up">&amp;plus;</button>
</div>
14 changes: 8 additions & 6 deletions session-csv-url.html

Large diffs are not rendered by default.

17 changes: 9 additions & 8 deletions session-csv-url.qmd
@@ -2,9 +2,6 @@
title: "Introduction to R and Rstudio"
subtitle: "Session - import csv from the web"
author: "Zoë Turner"
execute:
echo: true
eval: false
---


@@ -19,19 +16,23 @@ This removes a step in the Reproducible Analytical Pipeline where a person is re

We will use the "Import Dataset" button which is the same way to get csv files from a computer:

<img src="img/session03/import-screenshot.PNG" alt="Screenshot of RStudio with Import Dataset drop down button highlighted as well as the file capacity_ae.csv in the Files tab." class="center" width="80%"/>
<img src="img/session03/import-screenshot.PNG" alt="Screenshot of RStudio with Import Dataset drop down button highlighted" class="center" width="80%"/>

## File location - using a url

Instead of selecting a file, instead type the following into the top box and press `Update`
Instead of selecting a file, copy the URL into the `File/URL` box next to `Browse`:

`https://raw.githubusercontent.com/nhs-r-community/intro_r_data/main/tb_cases.csv`
```
https://www.ethnicity-facts-figures.service.gov.uk/culture-and-community/digital/internet-use/latest/downloads/by-ethnicity.csv
```

. . .

this will give the following code in the wizard (as well as a preview of the data)
`Browse` changes to `Update`, which when pressed gives a preview of the data and the following code:

```{r}
library(readr)
tb_cases <- read_csv(url("https://raw.githubusercontent.com/nhs-r-community/intro_r_data/main/tb_cases.csv"))
by_ethnicity <- read_csv(url("https://www.ethnicity-facts-figures.service.gov.uk/culture-and-community/digital/internet-use/latest/downloads/by-ethnicity.csv"))
```

## End session
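
Reviewer note on the `read_csv(url(...))` call in this file: as far as I know, `readr::read_csv()` also accepts a web address as a plain string (the `session-janitor.qmd` chunk in this commit already calls it that way), so the `url()` wrapper generated by the Import Dataset wizard looks optional. A minimal sketch, assuming readr 2.0 or later; the inline CSV text and the object name `inline_example` are made up here so the chunk runs without network access:

```r
library(readr)

# Assumption: a tiny inline CSV stands in for the remote file.
# I() marks the string as literal data rather than a file path.
csv_text <- I("Ethnicity,%\nAll,90.8\nBangladeshi,91.9\n")
inline_example <- read_csv(csv_text, show_col_types = FALSE)

# With network access, the address can be passed directly as a string:
# by_ethnicity <- read_csv("https://www.ethnicity-facts-figures.service.gov.uk/culture-and-community/digital/internet-use/latest/downloads/by-ethnicity.csv")
```

Either form should produce the same tibble; keeping the wizard's `url()` version on the slide is reasonable if the aim is to match what learners see on screen.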
3 changes: 0 additions & 3 deletions session-csv.qmd
@@ -1,9 +1,6 @@
---
title: "Introduction to R and Rstudio"
subtitle: "Session - import csv"
execute:
echo: true
eval: false
---


22 changes: 6 additions & 16 deletions session-datapasta.html

Large diffs are not rendered by default.

10 changes: 5 additions & 5 deletions session-datapasta.qmd
@@ -4,7 +4,6 @@ subtitle: "Session - import data {datapasta}"
author: "Zoë Turner"
execute:
eval: true
echo: true
---


@@ -34,9 +33,10 @@ library(datapasta)

## Have a go!

1. Using a small table from the [NHS Workforce statistics statistics](https://www.ethnicity-facts-figures.service.gov.uk/workforce-and-business/workforce-diversity/nhs-workforce/latest#:~:text=Main%20facts%20and%20figures%20around%201.3%20million%20people,22.1%25%20were%20from%20all%20other%20ethnic%20groups%20combined) open the [csv file for number and percentage of NHS staff by ethnicity](https://www.ethnicity-facts-figures.service.gov.uk/workforce-and-business/workforce-diversity/nhs-workforce/latest/downloads/by-ethnicity.csv) 1.Highlight the table, including the column names and copy (Ctrl+C)
2. Select the `Addins/Paste as tribble`
3. Run the code using the `Run` button or Ctrl+Enter anywhere in the code
1. Using a small [csv file for number and percentage of NHS staff by ethnicity](https://www.ethnicity-facts-figures.service.gov.uk/culture-and-community/digital/internet-use/latest/downloads/by-ethnicity.csv), open the csv, highlight the data including the column names and copy (Ctrl+C)
2. In a script or Quarto R chunk, select the `Addins/Paste as tribble`
3. Add code to make this an object (hint: it needs a name and an assign operator)
4. Run the code

```{r}
#| echo: false
@@ -55,7 +55,7 @@ countdown::countdown(minutes = 10,
## A data frame in code

```{r}
tibble::tribble(
by_ethnicity <- tibble::tribble(
~Ethnicity, ~`%`, ~Headcount, ~`%.working.age.population.(2011)`,
"Asian", 10, 118396, 7.2,
"Black", 6.1, 72321, 3.4,
1 change: 0 additions & 1 deletion session-dplyr-showcase.qmd
@@ -3,7 +3,6 @@ title: "Introduction to R and Rstudio"
subtitle: "Session - Showing more {dplyr} functions"
author: "Zoë Turner"
execute:
echo: true
eval: true
---

6 changes: 3 additions & 3 deletions session-dplyr-wrangling.html
@@ -1898,7 +1898,7 @@ <h3 id="a-negative-test-of-equality">A negative test of equality</h3>
</div>
<div class="cell">
<div class="cell-output-display">
<div class="countdown" id="timer_cf2830e2" data-update-every="1" tabindex="0" style="right:0;bottom:0;margin:0.9em;font-size:2em;">
<div class="countdown" id="timer_232c8004" data-update-every="1" tabindex="0" style="right:0;bottom:0;margin:0.9em;font-size:2em;">
<div class="countdown-controls">
<button class="countdown-bump-down">−</button><button class="countdown-bump-up">&amp;plus;</button>
</div>
@@ -1962,7 +1962,7 @@ <h3 id="a-negative-test-of-equality">A negative test of equality</h3>
<p>Option to take this break before an exercise or after</p>
<div class="cell">
<div class="cell-output-display">
<div class="countdown" id="timer_15a0bda3" data-update-every="1" tabindex="0" style="right:0;bottom:0;margin:0.9em;font-size:2em;">
<div class="countdown" id="timer_14b9a9a1" data-update-every="1" tabindex="0" style="right:0;bottom:0;margin:0.9em;font-size:2em;">
<div class="countdown-controls">
<button class="countdown-bump-down">−</button><button class="countdown-bump-up">&amp;plus;</button>
</div>
@@ -1984,7 +1984,7 @@ <h3 id="a-negative-test-of-equality">A negative test of equality</h3>
</div>
<div class="cell">
<div class="cell-output-display">
<div class="countdown" id="timer_a2408125" data-update-every="1" tabindex="0" style="right:0;bottom:0;margin:0.9em;font-size:2em;">
<div class="countdown" id="timer_f09e92e7" data-update-every="1" tabindex="0" style="right:0;bottom:0;margin:0.9em;font-size:2em;">
<div class="countdown-controls">
<button class="countdown-bump-down">−</button><button class="countdown-bump-up">&amp;plus;</button>
</div>
3 changes: 0 additions & 3 deletions session-dplyr-wrangling.qmd
@@ -1,9 +1,6 @@
---
title: "Introduction to R and Rstudio"
subtitle: "Session - Cleaning data with {dplyr}"
execute:
echo: true
eval: false
---


1 change: 0 additions & 1 deletion session-intro.qmd
@@ -3,7 +3,6 @@ title: "Introduction to R and Rstudio"
subtitle: "Session - Introduction"
execute:
echo: false
eval: false
---

## Agenda - session one (about 3.5 hours)
52 changes: 42 additions & 10 deletions session-janitor.html
@@ -1228,9 +1228,9 @@
</ul>
</div>
</section><section id="example-of-cleaning-column-headers" class="slide level2"><h2>Example of cleaning column headers</h2>
<p>Getting the data from following slides</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb1"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1"></a><span class="fu">library</span>(readr)</span>
<span id="cb1-2"><a href="#cb1-2"></a>by_ethnicity <span class="ot">&lt;-</span> <span class="fu">read_csv</span>(<span class="st">&quot;https://www.ethnicity-facts-figures.service.gov.uk/culture-and-community/digital/internet-use/latest/downloads/by-ethnicity.csv&quot;</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="sourceCode cell-code" id="cb1"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1"></a>by_ethnicity <span class="ot">&lt;-</span> <span class="fu">read_csv</span>(<span class="st">&quot;https://www.ethnicity-facts-figures.service.gov.uk/culture-and-community/digital/internet-use/latest/downloads/by-ethnicity.csv&quot;</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
</section><section id="changing-the-column-names" class="slide level2"><h2>Changing the column names</h2>
<p>Removes spaces and changes the <code>%</code> to a word</p>
@@ -1244,14 +1244,46 @@
<div class="cell">
<div class="sourceCode cell-code" id="cb3"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb3-1"><a href="#cb3-1"></a><span class="co"># Add in blank row and column</span></span>
<span id="cb3-2"><a href="#cb3-2"></a></span>
<span id="cb3-3"><a href="#cb3-3"></a><span class="fu">library</span>(dplyr)</span>
<span id="cb3-4"><a href="#cb3-4"></a><span class="fu">library</span>(janitor)</span>
<span id="cb3-5"><a href="#cb3-5"></a>by_ethnicity_blank <span class="ot">&lt;-</span> by_ethnicity <span class="sc">|&gt;</span> </span>
<span id="cb3-6"><a href="#cb3-6"></a> <span class="fu">mutate</span>(<span class="at">blank_column =</span> <span class="cn">NA</span>) <span class="sc">|&gt;</span> <span class="co"># Blank column</span></span>
<span id="cb3-7"><a href="#cb3-7"></a> <span class="fu">add_row</span>() <span class="co"># Blank row</span></span>
<span id="cb3-8"><a href="#cb3-8"></a></span>
<span id="cb3-9"><a href="#cb3-9"></a>by_ethnicity_blank <span class="sc">|&gt;</span> </span>
<span id="cb3-10"><a href="#cb3-10"></a> <span class="fu">remove_empty</span>(<span class="at">which =</span> <span class="fu">c</span>(<span class="st">&quot;rows&quot;</span>, <span class="st">&quot;cols&quot;</span>))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<span id="cb3-3"><a href="#cb3-3"></a>by_ethnicity_blank <span class="ot">&lt;-</span> by_ethnicity <span class="sc">|&gt;</span> </span>
<span id="cb3-4"><a href="#cb3-4"></a> <span class="fu">mutate</span>(<span class="at">blank_column =</span> <span class="cn">NA</span>) <span class="sc">|&gt;</span> <span class="co"># Blank column</span></span>
<span id="cb3-5"><a href="#cb3-5"></a> <span class="fu">add_row</span>() <span class="co"># Blank row</span></span>
<span id="cb3-6"><a href="#cb3-6"></a></span>
<span id="cb3-7"><a href="#cb3-7"></a>by_ethnicity_blank <span class="sc">|&gt;</span> </span>
<span id="cb3-8"><a href="#cb3-8"></a> <span class="fu">remove_empty</span>(<span class="at">which =</span> <span class="fu">c</span>(<span class="st">&quot;rows&quot;</span>, <span class="st">&quot;cols&quot;</span>))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
</section><section id="getting-duplicates" class="slide level2"><h2>Getting duplicates</h2>
<p>Often code removes duplicates but sometimes you’ll want to see all the duplicated information:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb4"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb4-1"><a href="#cb4-1"></a>duplicates <span class="ot">&lt;-</span> tibble<span class="sc">::</span><span class="fu">tribble</span>(</span>
<span id="cb4-2"><a href="#cb4-2"></a> <span class="sc">~</span>Ethnicity, <span class="sc">~</span><span class="st">`</span><span class="at">%</span><span class="st">`</span>, <span class="sc">~</span><span class="st">`</span><span class="at">estimated.number.(thousands)</span><span class="st">`</span>,</span>
<span id="cb4-3"><a href="#cb4-3"></a> <span class="st">&quot;All&quot;</span>, <span class="fl">90.8</span>, <span class="dv">48098</span>,</span>
<span id="cb4-4"><a href="#cb4-4"></a> <span class="st">&quot;All&quot;</span>, <span class="fl">90.8</span>, <span class="dv">48098</span>,</span>
<span id="cb4-5"><a href="#cb4-5"></a> <span class="st">&quot;All&quot;</span>, <span class="fl">90.8</span>, <span class="dv">48098</span>,</span>
<span id="cb4-6"><a href="#cb4-6"></a> <span class="st">&quot;Bangladeshi&quot;</span>, <span class="fl">91.9</span>, <span class="dv">354</span>,</span>
<span id="cb4-7"><a href="#cb4-7"></a> <span class="st">&quot;Chinese&quot;</span>, <span class="fl">98.6</span>, <span class="dv">265</span>,</span>
<span id="cb4-8"><a href="#cb4-8"></a> <span class="st">&quot;Indian&quot;</span>, <span class="fl">90.4</span>, <span class="dv">1077</span>,</span>
<span id="cb4-9"><a href="#cb4-9"></a> <span class="st">&quot;Pakistani&quot;</span>, <span class="fl">91.1</span>, <span class="dv">767</span>,</span>
<span id="cb4-10"><a href="#cb4-10"></a> <span class="st">&quot;Asian other&quot;</span>, <span class="fl">95.6</span>, <span class="dv">620</span>,</span>
<span id="cb4-11"><a href="#cb4-11"></a> <span class="st">&quot;Black&quot;</span>, <span class="fl">92.8</span>, <span class="dv">1376</span>,</span>
<span id="cb4-12"><a href="#cb4-12"></a> <span class="st">&quot;Mixed&quot;</span>, <span class="dv">96</span>, <span class="dv">547</span>,</span>
<span id="cb4-13"><a href="#cb4-13"></a> <span class="st">&quot;White&quot;</span>, <span class="fl">90.5</span>, <span class="dv">42296</span>,</span>
<span id="cb4-14"><a href="#cb4-14"></a> <span class="st">&quot;Other&quot;</span>, <span class="fl">94.5</span>, <span class="dv">796</span></span>
<span id="cb4-15"><a href="#cb4-15"></a> )</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
<div class="fragment">
<div class="cell">
<div class="sourceCode cell-code" id="cb5"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb5-1"><a href="#cb5-1"></a><span class="fu">library</span>(janitor)</span>
<span id="cb5-2"><a href="#cb5-2"></a>duplicates <span class="sc">|&gt;</span> </span>
<span id="cb5-3"><a href="#cb5-3"></a> <span class="fu">get_dupes</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 3 × 4
Ethnicity `%` `estimated.number.(thousands)` dupe_count
&lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;
1 All 90.8 48098 3
2 All 90.8 48098 3
3 All 90.8 48098 3</code></pre>
</div>
</div>
</div>
</section><section id="end-session" class="slide level2"><h2>End session</h2>

55 changes: 43 additions & 12 deletions session-janitor.qmd
@@ -1,26 +1,24 @@
---
title: "Introduction to R and Rstudio"
subtitle: "Session - {janitor} clean data"
execute:
echo: true
eval: false
---


## Specific packages to clean data

Packages like {janitor} have functions to do a lot of the cleaning required for data like:

:::incremental
- Remove blank rows and columns
- Change Excel serial dates to real dates
- Standardise column names and remove spaces
::: incremental
- Remove blank rows and columns
- Change Excel serial dates to real dates
- Standardise column names and remove spaces
:::

## Example of cleaning column headers

Getting the data from following slides

```{r}
library(readr)
by_ethnicity <- read_csv("https://www.ethnicity-facts-figures.service.gov.uk/culture-and-community/digital/internet-use/latest/downloads/by-ethnicity.csv")
```

@@ -40,8 +38,6 @@ by_ethnicity |>
```{r}
# Add in blank row and column
library(dplyr)
library(janitor)
by_ethnicity_blank <- by_ethnicity |>
mutate(blank_column = NA) |> # Blank column
add_row() # Blank row
@@ -50,4 +46,39 @@ by_ethnicity_blank |>
remove_empty(which = c("rows", "cols"))
```

## End session
## Getting duplicates

Often code removes duplicates but sometimes you'll want to see all the duplicated information:

```{r}
#| eval: true
duplicates <- tibble::tribble(
~Ethnicity, ~`%`, ~`estimated.number.(thousands)`,
"All", 90.8, 48098,
"All", 90.8, 48098,
"All", 90.8, 48098,
"Bangladeshi", 91.9, 354,
"Chinese", 98.6, 265,
"Indian", 90.4, 1077,
"Pakistani", 91.1, 767,
"Asian other", 95.6, 620,
"Black", 92.8, 1376,
"Mixed", 96, 547,
"White", 90.5, 42296,
"Other", 94.5, 796
)
```

. . .

```{r}
#| eval: true
library(janitor)
duplicates |>
get_dupes()
```

## End session
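
Reviewer note: the three {janitor} steps this file walks through (cleaning headers, removing blank rows and columns, listing duplicates) can be tried end to end on a small table. This is only a sketch, assuming {tibble} and {janitor} are installed; the `toy` data and the object names `cleaned`, `toy_blank`, `tidied` and `dupes` are hypothetical and are not part of the slides:

```r
library(tibble)
library(janitor)

# Hypothetical toy data reusing the awkward headers from the slides
toy <- tribble(
  ~Ethnicity, ~`%`, ~`estimated.number.(thousands)`,
  "All",      90.8, 48098,
  "All",      90.8, 48098,
  "Black",    92.8, 1376
)

# Standardise the headers to snake_case (the % becomes a word)
cleaned <- clean_names(toy)

# Add a blank row and a blank column, then strip them out again
toy_blank <- add_row(toy)
toy_blank$blank_column <- NA
tidied <- remove_empty(toy_blank, which = c("rows", "cols"))

# Keep only the rows that occur more than once, with a dupe_count column
dupes <- get_dupes(toy)
```

Calling `get_dupes()` without naming columns compares whole rows, which matches how the `duplicates` chunk in the diff uses it.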