Commit

Updating tibble join notes
lucylgao committed Sep 13, 2023
1 parent e255aaa commit 6dd6442
Showing 19 changed files with 23 additions and 260 deletions.
1 change: 1 addition & 0 deletions content/notes/notes-a11/images/README.md
@@ -0,0 +1 @@
A folder for images used in the tutorial.
Binary file added content/notes/notes-a11/images/anti_join.png
Binary file added content/notes/notes-a11/images/bind_cols.png
Binary file added content/notes/notes-a11/images/bind_rows.png
Binary file added content/notes/notes-a11/images/df.png
Binary file added content/notes/notes-a11/images/full_join.png
Binary file added content/notes/notes-a11/images/inner_join.png
Binary file added content/notes/notes-a11/images/intersect.png
Binary file added content/notes/notes-a11/images/left_join.png
Binary file added content/notes/notes-a11/images/new_df.png
Binary file added content/notes/notes-a11/images/new_df2.png
Binary file added content/notes/notes-a11/images/right_join.png
Binary file added content/notes/notes-a11/images/semi_join.png
Binary file added content/notes/notes-a11/images/setdiff.png
Binary file added content/notes/notes-a11/images/three_tibbles.png
Binary file added content/notes/notes-a11/images/union.png
149 changes: 11 additions & 138 deletions content/notes/notes-a11.Rmd → content/notes/notes-a11/index.Rmd
@@ -27,9 +27,9 @@ Video lecture:

- [Tibble Joins with dplyr](https://youtu.be/YAdX9MVRY1c)

Demonstration .Rmd file:
<!-- Demonstration .Rmd file: -->

- [Tibble join demonstration with gapminder](https://raw.githubusercontent.com/UBC-STAT/stat545.stat.ubc.ca/master/content/notes/notes-a11/gapminder_demonstration.Rmd)
<!-- - [Tibble join demonstration with gapminder](https://raw.githubusercontent.com/UBC-STAT/stat545.stat.ubc.ca/master/content/notes/notes-a11/gapminder_demonstration.Rmd) -->

Other resources, in addition to the notes below:

@@ -179,15 +179,15 @@ full_join(df4, df5, by = c("FirstName" = "First_name", "LastName" = "Last_name")
```


What if you did not realize that multiple people shared the same Last Name?
```{r}
full_join(df4, df5, by = c("LastName" = "Last_name"))
```
<!-- What if you did not realize that multiple people shared the same Last Name? -->
<!-- ```{r} -->
<!-- full_join(df4, df5, by = c("LastName" = "Last_name")) -->
<!-- ``` -->

What if you did not realize that multiple people shared the same First Name?
```{r}
full_join(df4, df5, by = c("FirstName" = "First_name"))
```
<!-- What if you did not realize that multiple people shared the same First Name? -->
<!-- ```{r} -->
<!-- full_join(df4, df5, by = c("FirstName" = "First_name")) -->
<!-- ``` -->


## Set operations
@@ -255,134 +255,7 @@ We think the best way to learn the basics of tibble joins from here is to work t

There will be some class time to go over solutions if you got stuck on any questions.

If we have time left, then we'll do a case study with the `gapminder` data.


<!-- ## Demonstration with `gapminder` -->

<!-- Get an overview of `gapminder` data -->
<!-- ```{r} -->
<!-- glimpse(gapminder) -->
<!-- ``` -->

<!-- ### Part 1 -->

<!-- Obtain additional information on countries from other open data sources -->
<!-- ```{r} -->
<!-- country_data <- read.csv(file = "https://raw.githubusercontent.com/open-numbers/ddf--gapminder--geo_entity_domain/master/ddf--entities--geo--country.csv") -->

<!-- glimpse(country_data) -->
<!-- ``` -->


<!-- Narrow down information to income groups, OECD status, and religion -->
<!-- ```{r} -->
<!-- country_data <- country_data %>% -->
<!-- select(name, income_groups, g77_and_oecd_countries, main_religion_2008) -->

<!-- # Check data structure -->
<!-- glimpse(country_data) -->
<!-- ``` -->

<!-- Count how many unique country names are in `gapminder` and `country_data` -->
<!-- ```{r} -->
<!-- nlevels(gapminder$country) -->
<!-- nlevels(as.factor(country_data$name)) -->
<!-- ``` -->

<!-- Merge `gapminder` and `country_data` using `left_join()` -->
<!-- ```{r} -->
<!-- gapminder_extended <- left_join(gapminder, country_data, by=c("country"="name")) -->

<!-- head(gapminder_extended) -->
<!-- ``` -->

<!-- **Note:**: `left_join()` is probably the most useful and the most used join. It is often used when you want to expand your existing dataset with new variables from other sources. -->


<!-- Compare lifeExp for OECD, G77, and other countries -->
<!-- ```{r} -->
<!-- gapminder_extended %>% -->
<!-- ggplot(aes(x=g77_and_oecd_countries,y=lifeExp))+ -->
<!-- geom_boxplot()+ -->
<!-- geom_jitter(aes(color=continent), alpha=0.3)+ -->
<!-- labs(x="Country group") -->
<!-- ``` -->

<!-- Compare lifeExp for OECD, G77, and other countries by most common religion -->
<!-- ```{r} -->
<!-- gapminder_extended %>% -->
<!-- filter(main_religion_2008 %in% c("christian","eastern_religions","muslim")) %>% -->
<!-- ggplot(aes(x=g77_and_oecd_countries,y=lifeExp))+ -->
<!-- geom_boxplot()+ -->
<!-- geom_jitter(aes(color=continent), alpha=0.3)+ -->
<!-- labs(x="Country group")+ -->
<!-- facet_wrap(~main_religion_2008) -->
<!-- ``` -->


<!-- ### Part 2 -->


<!-- Gapminder data is only available from 1952 to 2007. What if we wanted to examine data after 2007 as well as population projections? -->

<!-- Download population size estimates by country from 1800 to 2100 -->
<!-- ```{r} -->
<!-- population <- gsheet2tbl("https://docs.google.com/spreadsheets/d/14_suWY8fCPEXV0MH7ZQMZ-KndzMVsSsA5HdR-7WqAC0/edit#gid=176703676") -->
<!-- ``` -->


<!-- See what population data looks like -->
<!-- ```{r} -->
<!-- glimpse(population) -->
<!-- ``` -->


<!-- Only retain population estimates after 2007, rename variables to match gapminder variable names -->
<!-- ```{r} -->
<!-- population <- population %>% -->
<!-- filter(time>2007) %>% -->
<!-- rename(year=time, country=name, pop=Population) %>% -->
<!-- select(-geo) -->
<!-- ``` -->


<!-- Add continent data to `population` from `gapminder` -->
<!-- ```{r} -->
<!-- # create a data frame listing continent for every country -->
<!-- continent <- gapminder %>% -->
<!-- select(country, continent) %>% -->
<!-- distinct() -->

<!-- # add continent data to population data frame -->
<!-- population <- left_join(population, continent, by = "country") -->

<!-- # see how many countries are missing continent data by continent -->
<!-- population %>% -->
<!-- group_by(year) %>% -->
<!-- summarise(missing_continent = sum(is.na(continent))) -->
<!-- ``` -->


<!-- Use `bind_rows()` to stack `population` below `gapminder` -->
<!-- ```{r} -->
<!-- gapminder_pop <- bind_rows(gapminder, population) %>% -->
<!-- arrange(country,year) -->
<!-- ``` -->


<!-- Visualize trends in population growth by continent -->
<!-- ```{r} -->
<!-- gapminder_pop %>% -->
<!-- filter(!is.na(continent)) %>% -->
<!-- group_by(continent, year) %>% -->
<!-- summarise(pop=sum(pop)/1000000) %>% -->
<!-- ggplot(aes(x=year, y=pop, fill=continent))+ -->
<!-- geom_area()+ -->
<!-- labs(title="Population projections by continent", -->
<!-- y="Population (in mil)") -->
<!-- ``` -->

<!-- If we have time, then we will do a case study with the `gapminder` data. [Link to case study here](https://raw.githubusercontent.com/UBC-STAT/stat545.stat.ubc.ca/master/content/notes/supp-a11.Rmd). -->

### Attributions

133 changes: 11 additions & 122 deletions content/notes/notes-a11.html → content/notes/notes-a11/index.html
@@ -24,10 +24,8 @@ <h2>Resources</h2>
<ul>
<li><a href="https://youtu.be/YAdX9MVRY1c">Tibble Joins with dplyr</a></li>
</ul>
<p>Demonstration .Rmd file:</p>
<ul>
<li><a href="https://raw.githubusercontent.com/UBC-STAT/stat545.stat.ubc.ca/master/content/notes/notes-a11/gapminder_demonstration.Rmd">Tibble join demonstration with gapminder</a></li>
</ul>
<!-- Demonstration .Rmd file: -->
<!-- - [Tibble join demonstration with gapminder](https://raw.githubusercontent.com/UBC-STAT/stat545.stat.ubc.ca/master/content/notes/notes-a11/gapminder_demonstration.Rmd) -->
<p>Other resources, in addition to the notes below:</p>
<ul>
<li>A comprehensive overview can be found in the <a href="https://r4ds.had.co.nz/relational-data.html">“Relational Data” chapter</a> in “R for Data Science”.</li>
@@ -190,30 +188,14 @@ <h2>Joining tibbles on multiple conditions</h2>
## 2 Josh Smith 20 167
## 3 Alex Smith 50 190
## 4 Sophie Jones NA 155</code></pre>
<p>What if you did not realize that multiple people shared the same Last Name?</p>
<pre class="r"><code>full_join(df4, df5, by = c(&quot;LastName&quot; = &quot;Last_name&quot;))</code></pre>
<pre><code>## Warning in full_join(df4, df5, by = c(LastName = &quot;Last_name&quot;)): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 2 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## &quot;many-to-many&quot;` to silence this warning.</code></pre>
<pre><code>## # A tibble: 6 × 5
## FirstName LastName Age First_name Height
## &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt;
## 1 Sophie Wang 42 &lt;NA&gt; NA
## 2 Josh Smith 20 Josh 167
## 3 Josh Smith 20 Alex 190
## 4 Alex Smith 50 Josh 167
## 5 Alex Smith 50 Alex 190
## 6 &lt;NA&gt; Jones NA Sophie 155</code></pre>
<p>What if you did not realize that multiple people shared the same First Name?</p>
<pre class="r"><code>full_join(df4, df5, by = c(&quot;FirstName&quot; = &quot;First_name&quot;))</code></pre>
<pre><code>## # A tibble: 3 × 5
## FirstName LastName Age Last_name Height
## &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt;
## 1 Sophie Wang 42 Jones 155
## 2 Josh Smith 20 Smith 167
## 3 Alex Smith 50 Smith 190</code></pre>
<!-- What if you did not realize that multiple people shared the same Last Name? -->
<!-- ```{r} -->
<!-- full_join(df4, df5, by = c("LastName" = "Last_name")) -->
<!-- ``` -->
<!-- What if you did not realize that multiple people shared the same First Name? -->
<!-- ```{r} -->
<!-- full_join(df4, df5, by = c("FirstName" = "First_name")) -->
<!-- ``` -->
</div>
<div id="set-operations" class="section level2">
<h2>Set operations</h2>
@@ -286,100 +268,7 @@ <h2>Joining tibbles with different types of variables</h2>
<h2>Your turn: learning tibble joins</h2>
<p>We think the best way to learn the basics of tibble joins from here is to work through the corresponding part of Worksheet A5.</p>
<p>There will be some class time to go over solutions if you got stuck on any questions.</p>
<p>If we have time left, then we’ll do a case study with the <code>gapminder</code> data.</p>
<!-- ## Demonstration with `gapminder` -->
<!-- Get an overview of `gapminder` data -->
<!-- ```{r} -->
<!-- glimpse(gapminder) -->
<!-- ``` -->
<!-- ### Part 1 -->
<!-- Obtain additional information on countries from other open data sources -->
<!-- ```{r} -->
<!-- country_data <- read.csv(file = "https://raw.githubusercontent.com/open-numbers/ddf--gapminder--geo_entity_domain/master/ddf--entities--geo--country.csv") -->
<!-- glimpse(country_data) -->
<!-- ``` -->
<!-- Narrow down information to income groups, OECD status, and religion -->
<!-- ```{r} -->
<!-- country_data <- country_data %>% -->
<!-- select(name, income_groups, g77_and_oecd_countries, main_religion_2008) -->
<!-- # Check data structure -->
<!-- glimpse(country_data) -->
<!-- ``` -->
<!-- Count how many unique country names are in `gapminder` and `country_data` -->
<!-- ```{r} -->
<!-- nlevels(gapminder$country) -->
<!-- nlevels(as.factor(country_data$name)) -->
<!-- ``` -->
<!-- Merge `gapminder` and `country_data` using `left_join()` -->
<!-- ```{r} -->
<!-- gapminder_extended <- left_join(gapminder, country_data, by=c("country"="name")) -->
<!-- head(gapminder_extended) -->
<!-- ``` -->
<!-- **Note:**: `left_join()` is probably the most useful and the most used join. It is often used when you want to expand your existing dataset with new variables from other sources. -->
<!-- Compare lifeExp for OECD, G77, and other countries -->
<!-- ```{r} -->
<!-- gapminder_extended %>% -->
<!-- ggplot(aes(x=g77_and_oecd_countries,y=lifeExp))+ -->
<!-- geom_boxplot()+ -->
<!-- geom_jitter(aes(color=continent), alpha=0.3)+ -->
<!-- labs(x="Country group") -->
<!-- ``` -->
<!-- Compare lifeExp for OECD, G77, and other countries by most common religion -->
<!-- ```{r} -->
<!-- gapminder_extended %>% -->
<!-- filter(main_religion_2008 %in% c("christian","eastern_religions","muslim")) %>% -->
<!-- ggplot(aes(x=g77_and_oecd_countries,y=lifeExp))+ -->
<!-- geom_boxplot()+ -->
<!-- geom_jitter(aes(color=continent), alpha=0.3)+ -->
<!-- labs(x="Country group")+ -->
<!-- facet_wrap(~main_religion_2008) -->
<!-- ``` -->
<!-- ### Part 2 -->
<!-- Gapminder data is only available from 1952 to 2007. What if we wanted to examine data after 2007 as well as population projections? -->
<!-- Download population size estimates by country from 1800 to 2100 -->
<!-- ```{r} -->
<!-- population <- gsheet2tbl("https://docs.google.com/spreadsheets/d/14_suWY8fCPEXV0MH7ZQMZ-KndzMVsSsA5HdR-7WqAC0/edit#gid=176703676") -->
<!-- ``` -->
<!-- See what population data looks like -->
<!-- ```{r} -->
<!-- glimpse(population) -->
<!-- ``` -->
<!-- Only retain population estimates after 2007, rename variables to match gapminder variable names -->
<!-- ```{r} -->
<!-- population <- population %>% -->
<!-- filter(time>2007) %>% -->
<!-- rename(year=time, country=name, pop=Population) %>% -->
<!-- select(-geo) -->
<!-- ``` -->
<!-- Add continent data to `population` from `gapminder` -->
<!-- ```{r} -->
<!-- # create a data frame listing continent for every country -->
<!-- continent <- gapminder %>% -->
<!-- select(country, continent) %>% -->
<!-- distinct() -->
<!-- # add continent data to population data frame -->
<!-- population <- left_join(population, continent, by = "country") -->
<!-- # see how many countries are missing continent data by continent -->
<!-- population %>% -->
<!-- group_by(year) %>% -->
<!-- summarise(missing_continent = sum(is.na(continent))) -->
<!-- ``` -->
<!-- Use `bind_rows()` to stack `population` below `gapminder` -->
<!-- ```{r} -->
<!-- gapminder_pop <- bind_rows(gapminder, population) %>% -->
<!-- arrange(country,year) -->
<!-- ``` -->
<!-- Visualize trends in population growth by continent -->
<!-- ```{r} -->
<!-- gapminder_pop %>% -->
<!-- filter(!is.na(continent)) %>% -->
<!-- group_by(continent, year) %>% -->
<!-- summarise(pop=sum(pop)/1000000) %>% -->
<!-- ggplot(aes(x=year, y=pop, fill=continent))+ -->
<!-- geom_area()+ -->
<!-- labs(title="Population projections by continent", -->
<!-- y="Population (in mil)") -->
<!-- ``` -->
<!-- If we have time, then we will do a case study with the `gapminder` data. [Link to case study here](https://raw.githubusercontent.com/UBC-STAT/stat545.stat.ubc.ca/master/content/notes/supp-a11.Rmd). -->
<div id="attributions" class="section level3">
<h3>Attributions</h3>
<p>Written by Albina Gibadullina, reviewed by Vincenzo Coia.</p>
