Most time in data analysis is spent 'tidying up' data: getting it into a suitable format to get started. Data scientists have a particular definition of tidy: tidy datasets are "easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row" (Wickham 2014).

It's often not convenient for humans to enter data in a tidy way, so untidy data is probably more common than tidy data in the wild. But doing good, reproducible science demands that we document each step of our processing in a way that others can check or repeat in future. Tools like R make this easier.
Overview
In the lifesavR worksheets we used various commands from the tidyverse, like filter and group_by. If you want to recap these commands you could use the cheatsheet, especially the part on groups and summaries.

Today we will cover three additional techniques which are important when working with real datasets:

- 'Pivoting' or reshaping data from long to wide formats (or the reverse)
- Separating 'untidy' variables into tidy, long-form data
- Adding meaningful labels to categorical variables
Before you start
Make a new rmd file in your datafluency directory, called data-wrangling.rmd, and record your work in this for the next 2 sessions.
Selecting columns
The fuel data also contains variables for weight and power. We can select just these columns and save them to a smaller dataframe like this:
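A minimal sketch of what that command might look like, based on the explanation that follows (assuming the fuel dataset from psydata is loaded):

# create a smaller dataframe containing just three columns from fuel
carperformance <- fuel %>%
  select(mpg, weight, power)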
Explanation of the commands:

- On the far left we have the name of the new variable which we will create: carperformance.
- We can tell this will be a new variable because the <- symbol is just to the right, pointing at it.
- To see what carperformance contains, look to the right of the <-. We pipe the fuel data to the select command, which selects the mpg, weight, and power columns.

Explanation of the result:

When running the command you won't see any output, but a new object was created called carperformance, containing copies of the columns from fuel we selected.

We can see the first few rows of our new smaller dataframe like this:
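One way to do this, a sketch using head() on the carperformance object created above:

# show the first few rows of the new, smaller dataframe
carperformance %>% head()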
   mpg weight power
1 21.0   1188   110
2 21.0   1304   110
3 22.8   1052    93
4 21.4   1458   110
5 18.7   1560   175
6 18.1   1569   105

Try selecting columns in a dataset for yourself:

- Use any of the built-in datasets, creating a copy with just a subset of 3 of its columns.
Pivoting longer
Data is commonly stored in either wide or long format.

If you used SPSS to do a t-test or ANOVA during your undergraduate degree, you likely stored and analysed the data in wide format. In wide format, each row represents the observations from a single participant, and each measurement for a given participant is stored in a separate column.

This is often called row per subject data. An example is the built-in attitude dataset:

  rating complaints privileges learning raises critical advance
1     43         51         30       39     61       92      45
2     63         64         51       54     63       73      47
3     71         70         68       69     76       86      48
4     61         63         45       47     54       84      35
5     81         78         56       66     71       83      47
6     43         55         49       44     54       49      34

Explanation: Each row contains scores for a particular employee on various measures. To find out more about these data you can type ?attitude into the console.

Let's say we want a single plot of all these variables, something like this:

To do this we first need to convert the data to long format. In long format, each observation is saved in its own row, rather than across multiple columns. It's often called "row per observation" data.
Using pivot_longer()

Pivoting is where you take a long data file (lots of rows, few columns) and make it wider, or where you take a wide data file (lots of columns, few rows) and make it longer.

We can convert from wide to long using the pivot_longer() function, as shown in the video:

To see why the function is called 'pivot_longer', imagine trying to reshape just the first two rows of the attitude dataset:

  rating complaints privileges learning raises critical advance
1     43         51         30       39     61       92      45
2     63         64         51       54     63       73      47

If we use pivot_longer on this selection, we end up with this:
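A sketch of the command that would produce the output below (take the first two rows of attitude, then pivot every column):

# pivot all columns of the first two rows into long format
attitude %>%
  head(2) %>%
  pivot_longer(everything())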
# A tibble: 14 × 2
   name       value
   <chr>      <dbl>
 1 rating        43
 2 complaints    51
 3 privileges    30
 4 learning      39
 5 raises        61
 6 critical      92
 7 advance       45
 8 rating        63
 9 complaints    64
10 privileges    51
11 learning      54
12 raises        63
13 critical      73
14 advance       47

Explanation of the command:

- We selected a subset of columns and rows.
- Then we used pivot_longer(everything()) to make this into long form data.
- The everything() part tells R to merge values from all of the columns into a single new column called value, and to keep track of the original variable name in a new column called name.

The change works like this:
You might have spotted a problem though: we don't have a record of which participant was which in the attitude dataset. This is because the mapping to participants was implicit: each row was a different participant, but participant number was not actually recorded in the file.

We can create an explicit participant identifier by adding a new column. For this we use the mutate and row_number() functions:
attitude_with_person <- attitude %>%
  mutate(person = row_number()) %>%
  head(2)

attitude_with_person %>%
  pander()

| rating | complaints | privileges | learning | raises | critical | advance | person |
|--------|------------|------------|----------|--------|----------|---------|--------|
| 43     | 51         | 30         | 39       | 61     | 92       | 45      | 1      |
| 63     | 64         | 51         | 54       | 63     | 73       | 47      | 2      |
Now we have a column called person which stores the row number.

But this means if we pivot_longer() again, we will need to tell R which columns we would like to pivot. If we don't do this then the person column gets melted with everything else, and we lose track of which response belonged to which participant, like this:

| name       | value |
|------------|-------|
| rating     | 43    |
| complaints | 51    |
| privileges | 30    |
| learning   | 39    |
| raises     | 61    |
| critical   | 92    |
| advance    | 45    |
| person     | 1     |
| rating     | 63    |
| complaints | 64    |
| privileges | 51    |
| learning   | 54    |
| raises     | 63    |
| critical   | 73    |
| advance    | 47    |
| person     | 2     |

Explanation of the output: Because we didn't tell pivot_longer which columns we wanted to pivot, it put all the values into a single new column called value. This included our participant identifier, person, which is not what we wanted.
We can exclude person from the pivoting by writing:
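A sketch of the command, using the attitude_with_person dataframe created above (the head and pander calls are only there to match the short table shown):

# pivot everything except the person column
attitude_with_person %>%
  pivot_longer(-person) %>%
  head(6) %>%
  pander()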
| person | name       | value |
|--------|------------|-------|
| 1      | rating     | 43    |
| 1      | complaints | 51    |
| 1      | privileges | 30    |
| 1      | learning   | 39    |
| 1      | raises     | 61    |
| 1      | critical   | 92    |

Explanation of the command and output:

- Here, we still use pivot_longer but this time we put -person between the parentheses.
- The minus sign, -, means don't include this variable, so -person ends up meaning include all columns except person, which is what we wanted.
- The output now retains the person column, but pivots the other variables.
- This means we can tell which person provided each datapoint.

Use some tidyverse commands you already know (e.g. select), plus pivot_longer, to produce this plot using the attitude dataset:

- Check the cheatsheet if you get stuck
- You need to select only the three variables shown
- It's not necessary to create a person identifier for this plot (although it won't hurt if you do)
Pivoting to make summary tables
Imagine we want a table of the mean score for each question in the attitude dataset.

This would be fiddly if we just tried to use summarise on wide format data. But if we use pivot_longer, group_by and then summarise (in that order), it's possible to take the data and make a table like this with 3 instructions to R:

| Name       | Mean  | SD    |
|------------|-------|-------|
| advance    | 42.93 | 10.29 |
| complaints | 66.6  | 13.31 |
| critical   | 74.77 | 9.895 |
| learning   | 56.37 | 11.74 |
| privileges | 53.13 | 12.24 |
| raises     | 64.63 | 10.4  |
| rating     | 64.63 | 12.17 |
Combine the pivot_longer, group_by and summarise commands (in that order) to reproduce the table above.

- You want to pivot all of the variables in the attitude dataset this time
- We covered using summarise in the third lifesavR session here.

We might also want to produce summary statistics per-participant. Using the commands shown above (and remembering to make a new column to store the participant number with row_number()), reproduce this table:

| person | mean(value) |
|--------|-------------|
| 1      | 51.57       |
| 2      | 59.29       |
| 3      | 69.71       |
| 4      | 55.57       |
| 5      | 68.86       |

To make the table you will need to use the following functions, in roughly this order:

- mutate
- filter
- pivot_longer
- group_by
- summarise
- pander
Pivoting wider
Sometimes we have the opposite problem: we have long data, but want it in wide format. For example, we might want a table where it's easy to compare between different years, like this:

development %>%
  filter(year > 1990) %>%
  pivot_wider(id_cols=country,
              names_from=year,
              values_from=gdp_per_capita) %>%
  head(3) %>%
  pander::pander("GDP per-capita in 3 countries in 4 different years, from the development dataset.")

| country     | 1992  | 1997  | 2002  | 2007  |
|-------------|-------|-------|-------|-------|
| Afghanistan | 649.3 | 635.3 | 726.7 | 974.6 |
| Albania     | 2497  | 3193  | 4604  | 5937  |
| Algeria     | 5023  | 4797  | 5288  | 6223  |
Instead of making the data longer, now we want to pivot_wider.

The development data is in a fairly long format: there are multiple rows per country, corresponding to different years. We want to compare GDP in different years.

We first need to select the data we want (country, year and gdp_per_capita), for the years from 1990 onwards:

development1990s <- development %>%
  select(country, year, gdp_per_capita) %>%
  filter(year >= 1990)

Then we can pivot_wider():

development1990s %>%
  pivot_wider(
    names_from = year,
    values_from = gdp_per_capita
  ) %>%
  head() %>%
  pander()

| country     | 1992  | 1997  | 2002  | 2007  |
|-------------|-------|-------|-------|-------|
| Afghanistan | 649.3 | 635.3 | 726.7 | 974.6 |
| Albania     | 2497  | 3193  | 4604  | 5937  |
| Algeria     | 5023  | 4797  | 5288  | 6223  |
| Angola      | 2628  | 2277  | 2773  | 4797  |
| Argentina   | 9308  | 10967 | 8798  | 12779 |
| Australia   | 23425 | 26998 | 30688 | 34435 |
Explanation of the command and output:

- We started with multiple rows per country, corresponding to years.
- We used pivot_wider with names_from = year to create new columns for each year in the data.
- We used values_from = gdp_per_capita to tell pivot_wider to use the GDP numbers to populate the table.
- The resulting table helps us compare years within countries, or between countries, for a given year.

Use the funimagery dataset in psydata and perform the following:

- Use select to make a dataset with intervention and each of the kg1 to kg3 columns
- Use pivot_longer, group_by and summarise to calculate the average weight of participants at each timepoint
- Adapt the group_by function to calculate the mean at each timepoint for each group separately
- Add pivot_wider to the end of your code to create a separate column for each group.
When you finish your data should look like this:
| name | MI    | FIT   |
|------|-------|-------|
| kg1  | 89.86 | 91.46 |
| kg2  | 88.62 | 86.37 |
| kg3  | 88.46 | 84.04 |
Separating variables
Sometimes we need to separate 'untidy' variables into tidy, long-form data.

The code below generates simulated data for 100 individuals at three time points. The format is similar to the way you might record experimental data in a spreadsheet.

set.seed(1234)
N <- 100
repeatmeasuresdata <- tibble(person = 1:N,
                             time_1 = rnorm(N),
                             time_2 = rnorm(N, 1),
                             time_3 = rnorm(N, 3))

repeatmeasuresdata %>% head(8) %>%
  pander()

| person | time_1  | time_2 | time_3 |
|--------|---------|--------|--------|
| 1      | -1.207  | 1.415  | 3.485  |
| 2      | 0.2774  | 0.5253 | 3.697  |
| 3      | 1.084   | 1.066  | 3.186  |
| 4      | -2.346  | 0.4975 | 3.701  |
| 5      | 0.4291  | 0.174  | 3.312  |
| 6      | 0.5061  | 1.167  | 3.76   |
| 7      | -0.5747 | 0.1037 | 4.842  |
| 8      | -0.5466 | 1.168  | 4.112  |

This variable, repeatmeasuresdata, is in wide format. Each row contains data for one participant, and each participant has three observations.

As we saw previously, we can pivot (i.e. reshape) the data into longer format like so:

repeatmeasuresdata %>%
  pivot_longer(starts_with("time")) %>%
  arrange(person, name) %>%
  head(8) %>%
  pander()

| person | name   | value  |
|--------|--------|--------|
| 1      | time_1 | -1.207 |
| 1      | time_2 | 1.415  |
| 1      | time_3 | 3.485  |
| 2      | time_1 | 0.2774 |
| 2      | time_2 | 0.5253 |
| 2      | time_3 | 3.697  |
| 3      | time_1 | 1.084  |
| 3      | time_2 | 1.066  |

The problem we have now is that name contains text which describes at which time the observation was made. We probably want to store a number for each time point, so we can make a plot with time on the x axis.

The separate command separates a single character column (name) into multiple columns. Rather than have a column with labels of the form 'time_1', it can create two columns, with labels 'time' and '1' in each.
# convert to long form; extract the `time` part of the label as a new column
longrepeatmeasuresdata <- repeatmeasuresdata %>%
  pivot_longer(starts_with("time")) %>%
  separate(name, into = c("variable", "time"))

longrepeatmeasuresdata %>% head %>%
  pander()
| person | variable | time | value  |
|--------|----------|------|--------|
| 1      | time     | 1    | -1.207 |
| 1      | time     | 2    | 1.415  |
| 1      | time     | 3    | 3.485  |
| 2      | time     | 1    | 0.2774 |
| 2      | time     | 2    | 0.5253 |
| 2      | time     | 3    | 3.697  |

Now the data are in long format, we can plot the points over time:
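A minimal sketch of one way to draw such a plot, assuming the longrepeatmeasuresdata object created above and ggplot2 loaded via the tidyverse:

# plot each observation, with time point on the x axis
longrepeatmeasuresdata %>%
  ggplot(aes(x = time, y = value)) +
  geom_point()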
How does R know where to split the text?

In the example above, separate split data like "time_1", "time_2" etc. into two columns: variable and time.

Q: How did it know to use the underscore (_) to split the data?

A: The default is to split on anything which is not a letter or number. So _, a space, or , would all work.
Sometimes, though, we need to tell R explicitly what to use to separate the values. If we had a column of email addresses we could split ben.whalley@plymouth.ac.uk into the username (e.g. ben.whalley) and domain name (plymouth.ac.uk) using the @ symbol.

To do this we just write sep="@" when we use separate.
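A minimal sketch with made-up data showing sep= in action (the emails tibble and column names here are invented for illustration, and assume the tidyverse is loaded):

# split a hypothetical email column at the @ symbol
emails <- tibble(email = "ben.whalley@plymouth.ac.uk")
emails %>%
  separate(email, into = c("username", "domain"), sep = "@")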
The messy_exp dataset in psydata contains simulated RT data on 100 participants in 2 conditions (A and B) at three time points (1, 2, and 3).

- Use the separate() function to split up the condition variable in this dataset and draw the following plot:

This file contains sample contact and address data for 100 people: https://letterhub.com/wp-content/uploads/2018/03/100-contacts.csv

- Read the data into R (you can either use the URL above directly inside the read_csv() function, or download then re-upload the data to the server to do this)
- Use the separate function to make a new variable which contains the domain name of these contacts' email address (e.g. yahoo.com, hotmail.com)

Note: you will need to use sep="@" to split the email addresses at the @ symbol.

- Use the distinct and/or count functions on the new variable you create containing the domain name. Look them up in the help file if you don't know which to use to answer these questions:
  - How many people had a Gmail account?
  - Which domains had more than 10 users?
# read the data directly from the URL
contacts <- read_csv('https://letterhub.com/wp-content/uploads/2018/03/100-contacts.csv') %>%
  separate(email, into=c("user", "domain"), sep ="@") # uses the @ symbol as a separator

# how many _different_ domains are there?
contacts %>%
  distinct(domain) %>%
  count() %>%
  pander()

| n  |
|----|
| 39 |

# how many people use gmail
contacts %>%
  count(domain) %>%
  filter(domain=="gmail.com") %>%
  pander()

| domain    | n  |
|-----------|----|
| gmail.com | 16 |

# which domains had more than 10 users?
contacts %>%
  count(domain) %>%
  filter(n > 10) %>%
  pander()

| domain      | n  |
|-------------|----|
| aol.com     | 13 |
| gmail.com   | 16 |
| hotmail.com | 13 |
| yahoo.com   | 13 |
Questionnaire data
The file sweets.csv contains a small number of example rows of data exported from an online survey. The file is at: https://t.ly/H9sDJ

We can look at the first few rows of the data, using the glimpse command:
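A sketch of the commands implied here: read the csv directly from the URL, then pipe the result to glimpse:

sweets <- read_csv('https://t.ly/H9sDJ')
sweets %>% glimpse()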
Rows: 4
Columns: 8
$ ID                               <dbl> 6, 7, 8, 9
$ `Start time`                     <dttm> 2019-05-24 09:12:31, 2019-05-24 09:12…
$ `Completion time`                <dttm> 2019-05-24 09:12:35, 2019-05-24 09:1…
$ Email                            <chr> "anonymous", "anonymous", "anonymous…
$ Name                             <lgl> NA, NA, NA, NA
$ `How much do you like sweets?`   <chr> "I don't like them", "I'm neutral", "…
$ `How much do you like chocolate` <chr> "I don't like them", "I don't like th…
$ Gender                           <chr> "M", "F", "M", "F"

- Import the sweets data as shown above from: https://t.ly/H9sDJ
- Save it to a new variable called sweets
Tidying questionnaires
When we look at the imported data it's useful to note:

- There are extra columns we don't need (at least for now).
- Some of our variable names are very long and annoying to type (for example, How much do you like sweets? is the name of one of our columns).
- Our responses are in text format, rather than as numbers. For example, the data say "I don't like them" or "I'm neutral" rather than numbers from a 1-5 scale.

We need to sort each of these problems to make things more manageable for our analysis.
Selecting and renaming
Remember, R makes using columns with spaces or other special characters very hard. We want to avoid this.
Selecting
To use columns with spaces in them, we must 'escape' the spaces and let R know they are part of the name rather than a gap between two different names.

This video shows how (or read below):

To escape spaces and use columns with long names, we use the backtick character (the backwards-facing apostrophe) to wrap the column name. In general, if your columns contain spaces or other odd characters like hyphens or question marks then you will need to wrap them in backticks.
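For example, a minimal sketch selecting one of the long-named columns in the sweets data by wrapping its name in backticks:

sweets %>%
  select(`How much do you like sweets?`)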
Renaming
Some of the imported variable names in the sweets data are long and awkward to use. Most researchers would rename these variables, to make them more usable in R code.

You can rename variables like this:
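The general pattern is rename(new_name = old_name); a sketch with invented names:

# hypothetical example: the new name goes on the left, the old name on the right
# (wrap the old name in backticks if it contains spaces)
somedata %>%
  rename(short_name = `A very long column name`)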
So for this example:
sweets %>%
  rename(
    like_sweets = `How much do you like sweets?`,
    like_chocolate = `How much do you like chocolate`,
  )

# A tibble: 4 × 8
     ID `Start time`        `Completion time`   Email     Name  like_sweets
  <dbl> <dttm>              <dttm>              <chr>     <lgl> <chr>
1     6 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous NA    I don't like th…
2     7 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous NA    I'm neutral
3     8 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous NA    I like them
4     9 2019-05-24 09:12:31 2019-05-24 09:12:35 anonymous NA    I'm neutral
# ℹ 2 more variables: like_chocolate <chr>, Gender <chr>
Explanation of the code: We used rename to change the names of our variables. We needed to wrap the long names of the questions in 'backtick' symbols to make sure R understood each was a single column name.

You should create a new variable to save the renamed dataset (with a descriptive name for use later on):

# create a new variable containing the renamed dataset
sweets.renamed <- sweets %>%
  rename(
    like_sweets = `How much do you like sweets?`,
  )

- Create a copy of the sweets data in which you have selected only the two columns with long names.
- Create a second copy of the data where you have renamed the columns with long names to something short, and without spaces.
Renaming with the janitor package

A good alternative to renaming variables manually is to use the clean_names function in the janitor package.
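A sketch of the command described in the explanation below (calling clean_names via janitor:: without loading the whole package):

sweets %>%
  janitor::clean_names() %>%
  glimpse()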
Rows: 4
Columns: 8
$ id                             <dbl> 6, 7, 8, 9
$ start_time                     <dttm> 2019-05-24 09:12:31, 2019-05-24 09:12:3…
$ completion_time                <dttm> 2019-05-24 09:12:35, 2019-05-24 09:12:…
$ email                          <chr> "anonymous", "anonymous", "anonymous",…
$ name                           <lgl> NA, NA, NA, NA
$ how_much_do_you_like_sweets   <chr> "I don't like them", "I'm neutral", "I …
$ how_much_do_you_like_chocolate <chr> "I don't like them", "I don't like them…
$ gender                         <chr> "M", "F", "M", "F"

Explanation of the code and result: I used the clean_names function within the janitor package without using library. I did this by typing janitor:: and then the name of the function. In the result, clean_names has made a new dataset in which it has:

- Removed all special characters
- Made everything lower case (easier for R to autocomplete)
- Replaced spaces with underscores
- Made column names unique (this isn't always the case with imported data, but is important for R)

I typically use this function when importing any new data because it makes the naming and access of columns much more consistent and easier to remember.
+Recoding text
We noticed above that our responses were stored as text labels like "I don't like them" rather than on a numeric scale. This makes it hard to use in an analysis.

We need to recode the text variables into numeric versions.

How to do it

First we must tell R what number we want to use for each text label. That is, we create a mapping of numbers to labels.

This takes a few steps:

- Check exactly what the text values are which need to be mapped.
- Make a mapping variable which assigns each text value a number value.
- Use the recode function with mutate to create a new, numeric column.

This video walks you through the steps below:

Step 1: Check EXACTLY what text labels we have

To check which labels we need to recode, I select the column in question and use the unique() function.

# check exactly what text values are in the dataset?
sweets %>%
  select(`How much do you like sweets?`) %>%
  unique()

# A tibble: 3 × 1
  `How much do you like sweets?`
  <chr>
1 I don't like them
2 I'm neutral
3 I like them

Step 2: Make a mapping variable
We do this by creating what R calls a named vector, which is a special kind of list.

To make a named vector we use the c() function. The letter c here just stands for 'combine', i.e. 'combine these things into a list'.

This is a simple example:
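A minimal sketch (the 0/1 values here are just an assumption for illustration):

# a named vector mapping the labels "No" and "Yes" to numbers
yes.no.mapping <- c("No" = 0, "Yes" = 1)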
We could then use this mapping to recode a column of data which contained the words "No" or "Yes".

A useful trick when creating your own mappings is to use R to do the formatting for you (see the video above for a demo). Re-using the code from the previous step, we use unique() to show us the unique values for the question about sweets. We then pipe the result to the paste() and cat() functions, like this:

# the hack we use as a short-cut to creating a mapping variable
sweets %>%
  select(`How much do you like sweets?`) %>%
  unique() %>%
  paste() %>% cat()

c("I don't like them", "I'm neutral", "I like them")

Explanation of the output: Using paste and cat is a bit of a hack. When we run this code we see the output c("I don't like them", "I'm neutral", "I like them"). This is a list of the values in the sweets data for this question, formatted in a way that will be useful to us in the next step.
We then copy and paste this output into a NEW code block, and EDIT it to assign our mappings:
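Based on the explanation below and the recoded values shown later (-1, 0 and 1), the edited mapping would look something like this:

# assign a number to each text label
preference.mappings <- c(
  "I don't like them" = -1,
  "I'm neutral" = 0,
  "I like them" = 1
)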
Explanation of the code: We used the previous output to create a mapping. By adding the parts which read = -1 and = 0 etc., we have told R what value we want to assign for each label.

Q: How do you know what number values to assign?

A: It doesn't matter, provided:

- The intervals between each option are the same, and
- Each text value has a different number

So, if we had a Likert scale ranging from "Completely agree" to "Completely disagree" in 7 increments, we could score this from 0 - 6 or 1 - 7, or -3 - 3. These would all be fine.

Step 3: Use the mapping variable to recode the column

We can use our new mapping with the mutate and recode functions to make a new column, containing numbers rather than text:
sweets.recoded <- sweets %>%
  rename(
    like_sweets = `How much do you like sweets?`,
    like_chocolate = `How much do you like chocolate`,
  ) %>%
  # use recode to convert text response using preference.mappings
  mutate(
    like_sweets_numeric =
      recode(like_sweets, !!!preference.mappings)
  )
We can see this new column if we use glimpse:
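For example, a sketch using the sweets.recoded object created above:

sweets.recoded %>% glimpse()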
Rows: 4
Columns: 9
$ ID                  <dbl> 6, 7, 8, 9
$ `Start time`        <dttm> 2019-05-24 09:12:31, 2019-05-24 09:12:31, 2019-05-…
$ `Completion time`   <dttm> 2019-05-24 09:12:35, 2019-05-24 09:12:35, 2019-05…
$ Email               <chr> "anonymous", "anonymous", "anonymous", "anonymous"
$ Name                <lgl> NA, NA, NA, NA
$ like_sweets         <chr> "I don't like them", "I'm neutral", "I like them"…
$ like_chocolate      <chr> "I don't like them", "I don't like them", "I'm neu…
$ Gender              <chr> "M", "F", "M", "F"
$ like_sweets_numeric <dbl> -1, 0, 1, 0

Explanation of the code:

- The start of the first line is sweets.recoded <-, which means make a new variable called sweets.recoded.
- Then we use mutate to create a new column called like_sweets_numeric.
- We make this column using recode on the question about liking sweets.
- We use the preference.mappings mapping to specify what numeric score to give each of the text values.

Watch out for the exclamation marks, !!!: in the code above there are three exclamation marks, !!!, before the mapping; make sure you do the same.
Three exclamation marks: !!!

Understanding this isn't necessary to get on with the course. Only read this if you are interested!

In the code above when we used recode we used three exclamation marks just before our list.

We defined the mapping:

likert.responses <- c(
  "I hate them" = 1,
  "I don't like them" = 2,
  "I'm neutral" = 3,
  "I like them" = 4,
  "I can't live without them" = 5)

And then used it with recode, with the three exclamation marks:

liking_of_sweets_data %>%
  mutate(like_sweets_numeric = recode(like_sweets_text, !!!likert.responses)) %>%
  pander()

The reason for this is that recode actually expects us to specify the mapping for it like this:

liking_of_sweets_data %>%
  mutate(like_sweets_numeric = recode(like_sweets_text,
                                      "I hate them" = 1,
                                      "I don't like them" = 2 ...))
But this means we have to repeat the mapping for each of the questions. Because all the questions use the same mapping this gets repetitive, and can lead to errors.

The three exclamation marks, !!!, unpack the list for us. So writing !!!likert.responses saves us the bother of writing it out in full each time.

Summary/video explanation

This is one of the trickiest bits of R code we use. I've included an annotated video of just these steps as a reference below:

- Use this three-step process to create a recoded version of the like_chocolate variable.
- Remember to watch the video at the start of this section, or the short version in the green box above, if anything is unclear.
Combining scores
Often questionnaires are designed to make repeated measurements of the same phenomena, which can then be summed or averaged to create a more reliable measure.

We've already seen how mutate() creates a new column. We can use this again to create the sum of responses to both questions:

sweets.recoded %>%
  # mutate to create column containing sum of both recoded questions
  mutate(liking = like_sweets_numeric + like_chocolate_numeric) %>%
  # select the new variable we created
  select(ID, liking) %>%
  pander()

| ID | liking |
|----|--------|
| 6  | -2     |
| 7  | -1     |
| 8  | 1      |
| 9  | 0      |

Explanation of the code:

- We added both the questions which asked about 'liking' together. This created a new column containing the combined score, called liking.
- In the third line we selected only the ID column, plus the new column we made.
Using ChatGPT to automate things
In the sections above I explained how to tidy up a simple dataset. However it turns out many of these tasks are easily solved using ChatGPT.

You now have a choice you didn't have a few years ago:

- Learn how to do these steps by hand (and sometimes use ChatGPT to automate the work)
- Learn just enough to be able to get ChatGPT to write code for you (and hope it's correct)

I would recommend option 1 (at least until LLMs make improvements in code generation) because it will likely be faster, more accurate and will promote your understanding.

But as a guide, here's how I might solve this problem with ChatGPT as an assistant.

Step 1. Show ChatGPT the data

I use these commands to output some information about the data, in a format that's easy to copy into ChatGPT.
+This link shows how I asked ChatGPT to recode the sweets
+data:
https://chat.openai.com/share/325b1f77-c68c-417a-8e75-76b05a6b9633
+Note that I:
+-
+
- told it the names of the columns +
- showed it a sample of the data (pasted in the output of
+
glimpse
)
+ - explicitly listed the values in the columns I wanted to recode
+(pasted in output of
count
)
+ - directly asked for R code +
- gave instructions to use the tidyverse (this makes the output more +consistent with what you have learned) +
Step 3: Debug and validate the R code GPT provides
+ +In this example, ChatGPT provided the following code:
library(tidyverse)

# Load the sweets dataset (assuming it's already loaded)
sweets <- read_csv('https://t.ly/H9sDJ')

# Recode the text responses into numeric variables
sweets <- sweets %>%
  mutate(Like_sweets = recode(`How much do you like sweets?`,
                              "I don't like them" = 0,
                              "I like them" = 1,
                              "I'm neutral" = 2
  )) %>%
  mutate(Like_chocolate = recode(`How much do you like chocolate`,
                                 "I don't like them" = 0,
                                 "I like them" = 1,
                                 "I'm neutral" = 2
  )) %>%
  select(-`How much do you like sweets?`, -`How much do you like chocolate`)

# Rename variables with spaces to use underscores and be shorter
sweets <- sweets %>%
  rename(
    ID = `ID`,
    Start_time = `Start time`,
    Gender = `Gender`
  )

# Remove timestamp and email columns
sweets <- sweets %>%
  select(-`Start_time`, -`Email`)

# Check the modified dataset
glimpse(sweets)
Amazingly, this actually works!

However it's worth saying that it often doesn't work on the first try, and it is dependent on writing a good prompt to the model, and giving it plenty of information about the data and what you want to achieve.

Some tips:

- Think first. Be clear about what you want to achieve.
- Explain in detail what the data are like and show examples.
- Give explicit instructions.
- If code doesn't work, ask for explanations or corrections.
- Break tasks down into smaller steps (ask for less and build up your script).
- Paste in the code you already have and ask for a completion with one extra step (e.g. something you don't know how to do).
- If you can afford it, use ChatGPT 4 or try new R-specific LLMs as they are (inevitably) released.

Where there are bugs in ChatGPT's code, you can:

- fix these bugs by hand, or
- run the code, report the errors to ChatGPT and ask it to fix them for you.

The second option has some risks (ChatGPT's code might delete your data, although it's unlikely), and is probably slower. In this instance it took me quite a while to get ChatGPT to fix the simple errors, and this required understanding the code to some degree anyway.

As I recommend in the lectures, knowing how to use R well enough to fix the small errors introduced is likely to be more productive in the long run. Think of LLMs as your assistant rather than your replacement.
+Consolidation activity
Use this example dataset: https://t.ly/_Tcjo

- Read in the data.
- Rename the long column names to something short, and without spaces
- Recode at least three of the columns with data about sleep quality to be numeric
- Save the result of this work in a new variable called sleep.tidy
- Pivot the recoded variables and make a boxplot of responses to these questions
- Create a summary score called sleep_quality which is the sum of these recoded questions (use mutate)
- Create a density plot of this summary score and interpret what you see (describe the pattern in plain English)
sleep <- read_csv('https://t.ly/_Tcjo')

# used to check what response values are in each question
sleep %>%
  select(`My sleep is affected by my study commitments`) %>%
  unique() %>% paste %>% cat
c("Agree", "Somewhat agree", "Somewhat disagree", "Disagree", "Neither agree nor disagree", "Strongly agree", "Strongly disagree")
sleep %>%
  select(`My electronic device usage negatively affects my sleep`) %>%
  unique() %>% paste %>% cat
c("Disagree", "Strongly agree", "Somewhat disagree", "Agree", "Somewhat agree", "Neither agree nor disagree", "Strongly disagree")
# we will use the same mapping for both questions because they have the same responses
sleep.map <- c("Agree"=2,
               "Somewhat agree"=1,
               "Somewhat disagree"=-1,
               "Disagree"=-2,
               "Neither agree nor disagree"=0,
               "Strongly agree"=3,
               "Strongly disagree"=-3)


sleep.tidy <- sleep %>%
  # now we recode the two text variables (we only need use mutate once though)
  mutate(
    sleep_study = recode(`My sleep is affected by my study commitments`, !!!sleep.map),
    sleep_electronic = recode(`My electronic device usage negatively affects my sleep`, !!!sleep.map)
  )
# now we can pivot longer to make a plot
sleep.tidy %>%
  pivot_longer(c(sleep_study, sleep_electronic)) %>%
  ggplot(aes(name, value)) + geom_boxplot()
And make a summary score combining both questions:
sleep.tidy.withsumary <- sleep.tidy %>%
  # and create the summary score
  mutate(sleep_quality = sleep_study + sleep_electronic)

# check the result. it looks ok
sleep.tidy.withsumary %>% glimpse
Rows: 241
Columns: 13
$ uniqueid                                                                                    <chr> …
$ `Start time`                                                                                <dttm> …
$ `Completion time`                                                                           <dttm> …
$ `My sleep is affected by my study commitments`                                              <chr> …
$ `I achieve good quality sleep`                                                              <chr> …
$ `My electronic device usage negatively affects my sleep`                                    <chr> …
$ `Tiredness interferes with my concentration`                                                <chr> …
$ `My sleep is disturbed by external factors e.g. loud cars, housemates, lights, children...` <chr> …
$ `I often achieve eight hours of sleep`                                                      <chr> …
$ `I regularly stay up past 11pm`                                                             <chr> …
$ sleep_study                                                                                 <dbl> …
$ sleep_electronic                                                                            <dbl> …
$ sleep_quality                                                                               <dbl> …

Check your knowledge
+-
+
What does it mean for data to be “tidy”? Identify the three key +characteristics
+What function creates a new column in a dataset?
+Which function allows you to choose a subset of the columns in a +dataset?
+What does
pivot_longer
do?
+Why is
pivot_longer
useful when we want to make a +faceted plot?
+Why is wide data sometimes more useful than long data?
+If you have RT data with 10 groups, what tidyverse ‘verbs’ +(functions) would you use to calculate the mean for each group?
+How can you read data into a dataframe from over the +internet?
+Does
recode
convert from text to numeric, or from +numeric to text values?
+Why is it important to recode text variables into numeric values +when working with survey or questionnaire data?
+Why is it important to copy and paste exact values when making a +mapping variable for
recode
?
+When renaming, is it:
rename(oldvarname=newvarname)
+orrename(newvarname=oldvarname)
?
+What does the
clean_names
function from the janitor +package do?
+Imagine a single cell of your dataset contains the string +“conditionA_time_1”. What function should we apply to it?
+