Padma, 28, has been using Python and Pandas for the last four years to analyze agricultural production in southeast Asia. She has just completed her PhD and accepted a job with a development bank whose staff use R instead. Padma wants learn how do the things she has been doing in Python: read data, clean it, explore it, model it, and present it.
Dafydd, 37, has been building performance dashboards for a logistics company using Django and D3. The company has just hired some data scientists who use R, and Dafydd would like to try building some dashboards in Shiny. He doesn't need to know much about statistics---the analysts will take care of that---but he wants to learn enough of the language to tidy up their packages.
Derived constraints:
- Learners understand loops, conditionals, writing and calling functions, lists, and dictionaries, can load and use libraries, and can use classes and objects, but do not know how to create new classes or how to use higher-order functions.
- Learners understand mean, variation, correlation, and other basic statistical concepts, but are not familiar with statistical tests.
- Learners are comfortable creating and interpreting plots of various kinds.
- Learners have experience with spreadsheets and with writing database queries that use filtering, aggregation, and joins, but have not designed database schemas.
- Learners have no experience building web applications and are not familiar with version control systems or unit testing.
- Learners know how to use a text editor but have never used an IDE.
- Learners are willing to spend half an hour on the basics of the language, both because they understand the utility and as a safety blanket, but will then be impatient to start tackling real tasks.
Here is a sample of data from infant_hiv/raw/infant_hiv.csv
,
where ...
shows values elided to make the segment readable:
"Early Infant Diagnosis: Percentage of infants born to women living with HIV...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,2009,,,2010,,,2011,,,2012,,,2013,,,2014,,,2015,,,2016,,,2017,,,
ISO3,Countries,Estimate,hi,lo,Estimate,hi,lo,Estimate,hi,lo,Estimate,hi,lo,...
AFG,Afghanistan,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,
ALB,Albania,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,
DZA,Algeria,-,-,-,-,-,-,38%,42%,35%,23%,25%,21%,55%,60%,50%,27%,30%,25%,23%,25%,21%,33%,37%,31%,61%,68%,57%,
AGO,Angola,-,-,-,3%,4%,2%,5%,7%,4%,6%,8%,5%,15%,20%,12%,10%,14%,8%,6%,8%,5%,1%,2%,1%,1%,2%,1%,
=-=-=-=-= many more rows =-=-=-=-=
YEM,Yemen,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,
ZMB,Zambia,59%,70%,53%,27%,32%,24%,70%,84%,63%,74%,88%,67%,64%,76%,57%,91%,>95%,81%,43%,52%,39%,43%,51%,39%,46%,54%,41%,
ZWE,Zimbabwe,-,-,-,12%,15%,10%,23%,28%,20%,38%,47%,33%,57%,70%,49%,54%,67%,47%,59%,73%,51%,71%,88%,62%,65%,81%,57%,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,2009,,,2010,,,2011,,,2012,,,2013,,,2014,,,2015,,,2016,,,2017,,,
,,Estimate,hi,lo,Estimate,hi,lo,Estimate,hi,lo,Estimate,hi,lo,...
Region,East Asia and the Pacific,25%,30%,22%,35%,42%,29%,30%,37%,26%,32%,38%,27%,28%,34%,24%,26%,31%,22%,31%,37%,27%,30%,35%,25%,28%,33%,24%,
,Eastern and Southern Africa,23%,29%,20%,44%,57%,37%,48%,62%,40%,54%,69%,46%,51%,65%,43%,62%,80%,53%,62%,79%,52%,54%,68%,45%,62%,80%,53%,
,Eastern Europe and Central Asia,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,
=-=-=-=-= several more rows =-=-=-=-=
,Sub-Saharan Africa,16%,22%,13%,34%,46%,28%,37%,50%,30%,43%,57%,35%,41%,54%,33%,50%,66%,41%,50%,66%,41%,45%,60%,37%,52%,69%,42%,
,Global,17%,23%,13%,33%,45%,27%,36%,49%,29%,41%,55%,34%,40%,53%,32%,48%,64%,39%,49%,64%,40%,44%,59%,36%,51%,67%,41%,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Indicator definition: Percentage of infants born to women living with HIV... ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Note: Data are not available if country did not submit data...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Data source: Global AIDS Monitoring 2018 and UNAIDS 2018 estimates,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
"For more information on this indicator, please visit the guidance:...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,
"For more information on the data, visit data.unicef.org",,,,,,,,,,,,,,,,,,,,,,,,,,,,,
One of the exercises in this training will be to develop an R script to tidy up this data. After reading R4DS and doing about half of the exercises in it, it took me about two and a half hours to create something that seems to work:
# Libraries.
library(tidyverse)
# Settings.
raw_file <- 'infant_hiv/raw/infant_hiv.csv'
tidy_file <- 'infant_hiv/tidy/infant_hiv.csv'
left_padding <- 2
first_year <- 2009
last_year <- 2017
num_years <- 1 + last_year - first_year
header_height <- 2
num_rows <- 192
column_names <- c("estimate", "hi", "lo")
width <- 3
# Get the chunk of the CSV that's of interest.
raw_data <- read_csv(raw_file, skip = header_height, na = c("-"))
of_interest <- slice(raw_data, 1:num_rows)
# Fix and store ISO3 country names.
of_interest$ISO3[of_interest$Countries == "Kosovo"] = "XKX"
countries <- of_interest$ISO3
# Process each three-column chunk (holding values for one year).
chunks <- vector("list", num_years)
for (i in 1:num_years) {
# Get the three-year chunk.
start_loc = left_padding + (width * (i - 1))
end_loc <- start_loc + width - 1
chunk <- select(of_interest, start_loc:end_loc)
# Convert textual percentages to fractions.
# Reports of ">95%" are converted to 95%. This is probably sinful.
chunk[chunk == ">95%"] <- "95%"
chunk <- mutate_if(chunk, is_character, str_replace, pattern = "%", replacement = "")
chunk <- as_tibble(map(map(chunk, as.numeric), function(x) x / 100.0))
# Create and save a tibble with the right names and the years and countries.
names(chunk) <- column_names
chunk$year <- rep(first_year + i - 1, num_rows)
chunk$country <- countries
chunk <- select(chunk, country, year, everything())
chunks[[i]] <- chunk
}
# Combine all the tibbles and sort by country and year.
tidy <- bind_rows(chunks)
tidy <- arrange(all_data, country, year)
# Save.
write_csv(tidy, tidy_file)
Here are some of the issues I encountered along the way that this training will need to cover, and some questions that I still have.
- Since the first few lines of the CSV aren't tidy, I ignore them, then select rows using row number based on inspection of the raw CSV. Is there a better way to get the data I want to tidy? For example, should I filter to get the row numbers of Afghanistan and Zimbabwe and pull out everything in between? Should I skip the first three lines when reading, accept X1, X2, and so on as column headers, and start from there?
- Once this script was working and I sorted by country and year,
I discovered that Kosovo's ISO3 country code is empty,
so I went back and patched it using the provisional "XKX" code used by NATO.
- I should include some checks during development of this script on empty/missing fields. What's the preferred way to do this?
- What's a better way to write the expression that fills in "XKX" where it's needed?
- In order to construct a tidy tibble,
I pull out three-column chunks corresponding to particular years,
clean them up,
and attach the year and country code.
Once the loop is done,
I stitch the partial tibbles together.
- Is
mutate_if
the right function to use to get rid of the % signs in my raw data? - Is there a cleaner way to replace the ">95%" text with "95%"?
- Is there a better thing to do than make this replacement?
- I shouldn't need to use two
map
calls to convert data to numeric and then divide by 100, butmap(chunk, as.numeric)/100
didn't work (presumably because arithmetic operations only work on vectors, not tables, but then why does the masked assignment that replaces ">95%" work?). - I also don't think I should have to
as_tibble
aftermap
: is there a better way to apply a function to every column while leaving things as tibbles?
- Is
- The assignment to
chunk$year
feels like it ought to work withoutrep
, i.e., that there ought to be a way to recycle the (scalar) value of year when creating a new column. - Is there a clean way to make the country and year the first columns in the chunk
rather than attaching them to the end and then rearranging with
select
? - I constantly trip over
[...]
versus[[...]]
, and am going to go back and review the differences between the two kinds of subscripting. - I keep typing
select
when I should typefilter
. - The IDE editor complains "no symbol named 'country' in scope" for the statement
chunk <- select(chunk, country, year, everything())
. This is fair (there actually isn't a variable calledcountry
), but it surprises me that the IDE would be upset by common tidyverse usage. (I get a similar complaint about uses ofyear
.) - I've committed the
tidynomicon.Rproj
file to my Git repository. Have I sinned?
- What are R's basic data types and how do they compare to Python's?
- How do conditionals and loops work in R compared to Python?
- How do indexing and attribute access work in R?
- How do I create functions and libraries?
- How do I find and install libraries and get help on them?
- How do I process text with regular expressions?
- How do I store and manipulate tabular data?
- How do I load a tabular dataset and calculate simple statistics for each column?
- How do I filter, aggregate, and join datasets?
- How do I create plots?
- How do I write readable programs in R?
These exercises will familiarize you with some of the core features of R. Do all of the work interactively in RStudio until told to do otherwise.
- Create a vector
small
containing the numbers 1 to 5. - Use it to create another vector
large
containing numbers ten times as large. - Use these two vectors to create a third vector
combined
with the sum of the two vectors. - Assign the average and standard deviation of
combined
tocombined_ave
andcombined_std
. - Create a vector
shifted
by subtracting the mean ofcombined
from each value ofcombined
. - Create a logical vector
mask
showing where values ofshifted
are within one standard deviation of zero. - Increase the magnitude of the values in
shifted
that are within one standard deviation of zero by 10%. - Write a function
fiddle
that takes a non-empty numeric vector as input and performs Steps 2-7 above. - Save your function in a file called
fiddle.R
. - Restart RStudio, load the saved function
fiddle
, and apply it to a vector containing the numbers 3 to 7. - Apply
fiddle
to an empty vector. - Modify the saved version of
fiddle
so that it returnsNULL
if its input is not a non-empty numeric vector. - Test your changes by giving
fiddle
a vector with the strings ("3", "4", "5", "6", "7").
This section explores some of the differences between the features of R and Python.
- Explain the difference between
[...]
and[[...]]
and give examples of situations in which they return different results. - Explain what a factor is, create factors, and loop over the names of the columns in a tibble.
- Create a data structure holding the type of each column in a tibble. What data structure do you have to use to hold this information and why?
- Explain what
matrix(rnorm(n * length(mu), mean = mu), ncol = n)
does and how it does it.
Read the CSV data set infant_hiv/solution/infant-hiv.csv
into a tibble.
Use tidyverse functions to answer the following questions:
- How many distinct countries are in the data?
- Which countries reported an estimate for 2009?
- Which countries reported an estimate for every year from 2009 to 2017?
- What were the average estimates for 2014, 2015, and 2016?
- How many countries didn't report estimates for one or more years?
- Which countries never reported a low figure less than 50%?
- Which countries reported a low figure greater than or equal to the estimate, and in which year(s)?
Look at the CSV file infant_hiv/raw/infant-hiv.csv
(which are taken from the Excel spreadsheet located in the same directory).
Write an R script that tidies this data and store the results in infant_hiv/tidy/infant-hiv.csv
.
Compare your result to the file used in previous exercises.
Look at the three CSV files maternal_health/raw/*.csv
(which are taken from the Excel spreadsheet located in the same directory).
Using your solution to the previous exercise as a starting point,
write some helper functions to help you tidy up CSV files that are formatted this way,
and then use those functions in an R script that tidies these three files
and stores the results in maternal_health/tidy/*.csv
.
Using the tidied maternal_health
data files from the previous exercise,
create and save a single tidy data set that has fraction of birth at health facilities,
fraction that have skilled attendants present,
and fraction of Caesarean sections
by country and year.
Using the combined data set created in the previous exercise, create a scatter plot showing the relationship in each country between the fraction of births at health facilities and the fraction that have skilled attendants present, coloring points according to the fraction of Caesarean sections.