{{< include _setup.qmd >}}
# Project management {#sec-management}
::: {.callout-note title="learning goals"}
* Manage your research projects efficiently and transparently
* Develop strategies for data organization
* Optimize sharing of research products, like data and analysis code, by ensuring they are findable, accessible, interoperable, and reusable (FAIR)
* Discuss potential ethical constraints on sharing research products
:::
> Your closest collaborator is you six months ago, but you don't reply to emails.
>
> ---Karl Broman [-@broman2015], quoting \@gonuke on Twitter
Have you ever returned to an old project folder to find a chaotic mess of files with names like `analysis-FINAL`, `analysis-FINAL-COPY`, and `analysis-FINAL-COPY-v2`? Which file is actually the final version!? Or perhaps you've spent hours searching for a data file to send to your advisor, only to realize with horror that it was *only* stored on your old laptop---the one that experienced a catastrophic hard drive failure when you spilled coffee all over it one sleepy Sunday morning. These experiences may make you sympathetic to Karl Broman's quip above. Good project management practices not only make it easier to share your research with others, they also make for a more efficient and less error-prone workflow that will avoid giving your future self a headache. This chapter is about the process of managing all of the products of your research workflow---methodological protocols, materials,^[We use the term "materials" here to cover a range of things another researcher might need in order to repeat your study---for example, stimuli, survey instruments, and code for computer-based experiments.] data, and analysis scripts. We focus especially on managing projects in ways that maximize their value to you and to the broader research community by aligning with open science\indexC{open science} practices (maximizing [transparency]{.smallcaps}).
![Poor file management creates chaos! "Documents" by xkcd (<https://xkcd.com/1459>, licensed under <https://xkcd.com/license.html>).](images/management/versions_xkcd.png){#fig-versions .column-margin width="80%" fig-alt="A comic where one figure looks over at another's computer screen, which contains a list of messy document names."}
When we talk about research products, we typically think of articles in academic journals, which have been scientists' main method of communication since the scientific revolution in the 1600s.^[The world's oldest scientific journal is the *Philosophical Transactions of the Royal Society*, first published in 1665.] But articles only provide written summaries of research; they are not the original research products. In recent years, there have been widespread calls for increased sharing of research products, such as materials, data, and analysis code [@munafo2017]. When shared appropriately, these other products can be as valuable as a summary article: Shared stimulus materials can be reused for new studies in creative ways; shared analysis scripts can allow for reproduction of reported results and become templates for new analyses; and shared data can enable new analyses or meta-analyses. Indeed, many funding agencies, and some journals, now require that research products be shared publicly, except when there are justified ethical or legal constraints, such as with sensitive medical data [@nosek2015].
Data sharing, in particular, has been the focus of intense interest. Sharing data is associated with benefits in terms of error detection [@hardwicke2021d], creative reuse that generates new discoveries [@voytek2016], increased citations [@piwowar2013], and detection of fraud [@simonsohn2013]. According to surveys, researchers are usually willing to share data in principle [@houtkoop2018], but unfortunately, in practice, they often do not, even if you directly ask them [@hardwicke2018c]. Often authors simply do not respond, but if they do, they frequently report that data have been lost because they were stored on a misplaced or damaged drive, or team members with access to the data are no longer contactable [@tenopir2020].
As we have discussed in @sec-replication, even when data are shared, they are not always formatted in a way that they can be easily understood and reused by other researchers, or even the original authors! This issue highlights the critical role of **metadata**:\indexC{metadata} information that documents the data (and other products) that you share, including README files, **codebooks**\indexC{codebook} that document datasets themselves, and licenses that provide legal restrictions on reuse. We will discuss best practices for metadata throughout the chapter.
![An illustration of the analytic chain from raw data through to research report.](images/management/chain.png){#fig-management-chain .column-margin fig-alt="A flowchart from raw data to raw digital data to processed data to quantitative results to research reports."}
Sound project management practices and sharing of research products are mutually reinforcing goals that bring benefits for you, the broader research community, and scientific progress. One particularly important benefit of good project management practices is that they enable reproducibility. As we discussed in @sec-replication, computational reproducibility involves being able to trace the provenance\indexC{provenance} of any reported analytic result in a research report back to its original source. That means being able to recreate the entire analytic chain from data collection to data files, through analytic specifications, to the research results reported in text, tables, and figures. If data collection is documented appropriately, and if data are stored, organized, and shared, then the provenance of a particular result is relatively easy to verify. But once this chain (@fig-management-chain) is broken, it can be hard to reconstruct [@hardwicke2018b]. That's why it's critical to build good project management practices into your research workflow right from the start.
In this chapter, you will learn how to manage your research project both efficiently and transparently.^[This chapter---especially the last section---draws heavily on @klein2018, an article on research transparency that several of us contributed to.] Working toward these goals can create a virtuous cycle: if you organize your research products well, they are easier to share later, and if you assume that you will be sharing, you will be motivated to organize your work better! We begin by discussing some important principles of project management, including folder structure, file naming, organization, and version control. Then we zoom in specifically on data and discuss best practices for data sharing. We end by discussing the question of what research products to share and some of the potential ethical issues that might limit your ability to share in certain circumstances.
::: {.callout-note title="case study"}
### ManyBabies, ManySpreadsheetFormats! {-}
The ManyBabies\indexC{ManyBabies} project is an example of "Big Team Science"\indexC{big team science} in psychology. A group of developmental psychology researchers (including some of us) were worried about many of the issues of reproducibility, replicability, and experimental methods that we've been discussing throughout this book, so they set up a large-scale collaboration to replicate key effects in developmental science. The first of these studies was ManyBabies 1 [@manybabies2020], a study of infants' preference for baby-talk (also known as "infant directed speech").
The core team expected a handful of labs to contribute, but after a year-long data collection period, they ended up receiving data from 69 labs around the world! The outpouring of interest signaled a lot of enthusiasm from the community for this kind of collaborative science. Unfortunately, it also made for a tremendous data management headache. All kinds of complications and hilarity ensued as the idiosyncratic data formatting preferences of the various labs were reorganized to fit into a single standardized analysis pipeline [@byers-heinlein2020].\indexC{analysis pipeline}
All of the specific formatting changes that individual labs made were reasonable---altering column names for clarity, combining templates into a single Excel file, changing units (e.g., from seconds to milliseconds)---but together they created a very challenging **data validation**\indexC{data validation} problem for the core analysis team, requiring many dozens of hours of coding and hand-checking. The data checking was critical: an error in one lab's data was flagged during validation and led to the painful decision to drop those data from the final dataset. In future ManyBabies\indexC{ManyBabies} projects, the group has committed to using shared data validation software (<https://manybabies.org/validator>) to ensure that data files uploaded by individual labs conform to a shared standard.
:::
## Principles of project management
A lot of project management problems can be avoided by following a very simple file organization system.^[We're going to talk in this chapter about managing research products, which is one important part of project management. We won't talk about some other aspects of managing projects such as calendaring, managing tasks, or project communications. These are all important; they are just a bit out of scope for a book on doing experiments!] For those researchers who "grew up" managing their files locally on their own computers and emailing colleagues versions of data files and manuscripts with names like `manuscript-FINAL-JS-rev1.docx`, a few aspects of this system may seem disconcerting. However, with a little practice, this new way of working will start to feel intuitive and have substantial benefits.
Here are the principles:
1. There should be exactly one definitive copy of each document in the project, with its name denoting what it is. For example, `fifo_manuscript.Rmd` or `fifo_manuscript.docx` is the write-up of the "fifo" project as a journal manuscript.
2. The location of each document should be within a folder that serves to uniquely identify the document's function within the project. For example,\newline `analysis/experiment1/eye_tracking_preprocessing.Rmd` is clearly the file that performs preprocessing for the analysis of eye-tracking data from experiment 1.
3. The full project should be accessible to all collaborators via the cloud, either using a version control platform (e.g., GitHub) or another cloud storage provider (e.g., Dropbox, Google Drive).
4. The revision history of all text and text-based documents (minimally, data, analysis code, and manuscript files) should be archived so that prior versions are accessible.
Keeping these principles in mind, we discuss best practices for project organization, version control, and file naming.
### Organizing your project
To the greatest extent possible, all files related to a project should be stored in the same project folder (with appropriate subfolders), and on the same storage provider. There are cases where this is impractical due to the limitations of different software packages. For example, in many cases, a team will manage its data and analysis code via GitHub but decide to write collaboratively using Google Docs, Overleaf, or another collaborative platform. (It can also be hard to ask all collaborators to use a version control system they are unfamiliar with.) In that case, the final paper should still be linked in some way to the project repository.^[The biggest issue that comes up in using a split workflow like this is the need to ensure reproducible written products, a process we cover in @sec-writing.]
@Fig-management-organization-ex shows an example project stored on the Open Science Framework.\indexC{Open Science Framework (OSF)} The top-level folder contains subfolders for analyses, materials, raw and processed data (kept separately). It also contains the paper manuscript and, critically, a README file in a text format that describes the project. A README is a great way to document any other metadata\indexC{metadata} that the authors would like to be associated with the research products, for example a license, explained below.
![Sample top-level folder structure for a project. From @klein2018. Original visible on the Open Science Framework (<https://osf.io/xf6ug>).](images/management/org-ex.png){#fig-management-organization-ex .margin-caption width=70% fig-alt="A screenshot of file structure in OSF Storage, with folders Analyses, Material, Processed data, Raw data and a README file."}
There are many reasonable ways to organize the subfolders of a research project, but the broad categories of materials, data, analysis, and writing are typically present.^[We like the scheme followed by Project TIER (<https://www.projecttier.org>), which provides very clear guidance about file structure and naming conventions. TIER is primarily designed for a copy-and-paste workflow, which is slightly different from the "dynamic documents" workflow that we primarily advocate for (e.g., using R Markdown or Quarto as in @sec-rmarkdown).] In some projects---such as those involving multiple experiments or complex data types---you may have to adopt a more complex structure. In many of our projects, it's not uncommon to find paths like `/data/raw_data/exp1/demographics`. The key principle is to create a hierarchical structure in which subfolders uniquely identify the part of the broader space of research products that are found inside them---that is, `/data/raw_data/exp1` contains all the raw data from experiment 1, and `/data/raw_data/exp1/demographics` contains all the raw *demographics* data from that particular experiment.
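A layout like this can be set up in a few lines. The sketch below is one way to do it in base R, assuming a hypothetical project called `fifo` with a single experiment. The folder names follow the examples above but are otherwise our own illustrative choice, and the skeleton is created in a temporary directory so that running it does not touch any real project.

```r
# Hypothetical project root; a temporary directory keeps the example harmless.
project_root <- file.path(tempdir(), "fifo")

# Subfolders that each identify one part of the project (names are illustrative).
subfolders <- c(
  "materials",
  "data/raw_data/exp1/demographics",
  "data/processed_data/exp1",
  "analysis/experiment1",
  "writing"
)

# recursive = TRUE creates any intermediate folders along each path.
for (folder in subfolders) {
  dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}

# A plain-text README at the top level describes what lives where.
readme_lines <- c(
  "fifo project",
  "materials/  stimuli and experiment code",
  "data/       raw and processed data, kept separately",
  "analysis/   analysis scripts",
  "writing/    manuscript files"
)
writeLines(readme_lines, file.path(project_root, "README.md"))

# List the resulting folder hierarchy.
list.dirs(project_root, full.names = FALSE)
```

Creating the folders from a script rather than by hand makes it easy to reuse the same structure across projects, which helps collaborators know where to look.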
### Versioning
Probably everyone who has ever collaborated electronically has experienced the frustration of editing a document, only to find out that you are editing the wrong version---perhaps some of the problems you are working on have already been corrected, or perhaps the section you are adding has already been written by someone else. A second common source of frustration comes when you take a wrong turn in a project, perhaps by reorganizing a manuscript in a way that doesn't work or refactoring code in a way that turns out to be short-sighted.
These two problems are solved by modern version control systems. Here we focus on the use of **Git**\indexC{Git}, which is the most widely used version control system. Git is a great general solution for version control, but many people---including several of us---don't love it for collaborative manuscript writing. We'll introduce Git and its principles here, while noting that online collaboration tools like Google Docs and Overleaf^[Overleaf is actually supported by Git on the backend!] can be easier for writing prose (as opposed to code); we cover this topic in a bit more depth in @sec-writing.
![A visualization of Git version control showing a series of commits (circles) on three different branches: the main branch (green) and two others (blue and red). Branches can be created and then merged back into the main branch.](images/management/git.png){#fig-management-git .column-margin fig-alt="A diagram of connected circles where \"your work\" and \"someone else's work\" branch off of \"main branch\" then merge back in."}
Git\indexC{Git} is a tool for creating and managing projects, which are called **repositories**. A Git repository is a directory whose revision history is tracked via a series of **commits**---snapshots of the state of the project. These commits can form a tree with different **branches**, as when two contributors to the project are working on two different parts simultaneously (@fig-management-git). These branches can later be **merged** either automatically or via manual intervention in the case of conflicting changes.
Commonly, Git\indexC{Git} repositories are hosted by an online service like [GitHub](https://github.com) to facilitate collaboration. With this workflow, a user makes changes to a local version of the repository on their own computer and **pushes** those changes to the online repository. Another user can then **pull** those changes from the online repository to their own local version. The online "origin" copy is always the definitive copy of the project, and a record is kept of all changes. @Sec-git provides a practical introduction to Git and GitHub, and there are a variety of good tutorials available online and in print [@blischak2016].
Collaboration using version control tools is designed to solve many of the problems we've been discussing:
* A remotely hosted Git\indexC{Git} repository is a cloud-based backup of your work, meaning it is less vulnerable to accidental erasure.^[In 48 BC, Julius Caesar accidentally burned down part of the Library of Alexandria where the sole copies of many ancient works were stored. To this day, many scientists have apparently retained the habit of storing single copies of important information in vulnerable locations. Even in the age of cloud computing, hard drive failure is a surprisingly common source of problems!]
* By virtue of having versioning history, you have access to previous drafts in case you find you have been following a blind alley and want to roll back your changes.
* By creating new branches, you can create another, parallel history for your project so that you can try out major changes or additions without disturbing the main branch in the process.
* A project's commit history is labeled with each commit's author and date, facilitating record-keeping and collaboration.
* Automatic merging can allow synchronous editing of different parts of a manuscript or codebase.^[Version control isn't magic, and if you and a collaborator edit the same line(s), you will have to merge your changes by hand. But Git will at least show you where the conflict is!]
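The commit, branch, and merge cycle described above can also be driven from R. The sketch below assumes that the `gert` package (an R interface to Git) is installed. The function name `demo_branch_and_merge()`, the author details, and the file and branch names are all hypothetical. It is meant only to illustrate the sequence of steps, not to stand in for a fuller introduction to Git.

```r
# A sketch of commit -> branch -> commit -> merge, assuming gert is installed.
# All names (author, files, branch) are made up for illustration.
demo_branch_and_merge <- function(repo_dir) {
  dir.create(repo_dir, recursive = TRUE, showWarnings = FALSE)
  gert::git_init(repo_dir)
  me <- gert::git_signature("Example Author", "author@example.org")

  # First commit (snapshot) on the default branch.
  writeLines("fifo project", file.path(repo_dir, "README.md"))
  gert::git_add("README.md", repo = repo_dir)
  gert::git_commit("Add README", author = me, repo = repo_dir)
  main_branch <- gert::git_branch(repo = repo_dir)

  # Try out a change on a separate branch without disturbing the main one.
  gert::git_branch_create("new-analysis", repo = repo_dir)
  writeLines("# analysis goes here", file.path(repo_dir, "analysis.R"))
  gert::git_add("analysis.R", repo = repo_dir)
  gert::git_commit("Start new analysis", author = me, repo = repo_dir)

  # Switch back to the main branch and merge the new work into it.
  gert::git_branch_checkout(main_branch, repo = repo_dir)
  gert::git_merge("new-analysis", repo = repo_dir)

  # The log records each commit's author, time, and message.
  gert::git_log(repo = repo_dir)
}
```

Calling `demo_branch_and_merge(file.path(tempdir(), "git_demo"))` should return a log with the two commits. Because the main branch did not change while the new branch was being edited, the merge here needs no manual conflict resolution.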
Organizing a project repository for collaboration and hosting on a remote platform is an important first step toward sharing! Many of our projects (like this book) are actually *born open*: we do all of our work on a publicly hosted repository for everyone to see [@rouder2015]. This philosophy of "working in the open" encourages good organization practices from the beginning. It can feel uncomfortable at first, but this discomfort soon vanishes as you realize that basically no one is looking at your in-progress project.
One concern that many people raise about sharing in-progress research openly is the possibility of "scooping"---that is, other researchers getting an idea or even data from the repository and writing a paper before you do. We have two responses to this concern. First, the empirical frequency of this sort of scooping is difficult to determine but likely very low---we don't know of any documented cases. Mostly, the problem is getting people to care about your experiment at all, not people caring so much that they would publish using your data or materials! In Gary King's words [@king2013], "The thing that matters the least is being scooped. The thing that matters the most is being ignored." Second, if you are in an area of research that you perceive to be competitive, or where there is some significant risk of this kind of shenanigans, it's very easy to keep part, or all, of a repository private among your collaborators until you are ready to share more widely. All of the benefits we described still accrue. For an appropriately organized and hosted project, often the only steps required to share materials, data, and code are (1) to make the hosted repository public and (2) to link it to an archival storage platform like the Open Science Framework.\indexC{Open Science Framework (OSF)}
### File names
As [Phil Karlton reportedly said,](https://www.karlton.org/2017/12/naming-things-hard) "There are only two hard things in Computer Science: cache invalidation and naming things." What's true for computer science is true for research in general.^[We won't talk about cache invalidation; that's a more technical problem in computer science that is beyond the scope of this book.] Naming files is hard! Some very organized people survive on systems like `INFO-r1-draft-2020-07-13-js.docx`, meaning "the INFO project revision 1 draft of July 13th, 2020, with edits by JS." But this kind of system needs a lot of rules and discipline, and it requires everyone in a project to buy in completely.
On the other hand, if you are naming a file in a hierarchically organized version control repository, the naming problem gets dramatically easier. All of a sudden, you have a context in which names make sense. `data.csv` is a terrible name for a data file on its own. But the name is actually perfectly informative---in the context of a project repository with a README that states that there is only a single experiment, a repository structure such that the file lives in a folder called `raw_data`, and a commit history that indicates the file's commit date and author.
As this example shows, naming is hard *out of context*. So here's our rule: name a file with what it contains. Don't use the name to convey the context of who edited it, when, or where it should go in a project. That is metadata\indexC{metadata} that the platform should take care of.^[The platform won't take care of it if you email it to a collaborator---precisely why you should share access to the full *platform*, not just the out-of-context file!]
## Data management
We've just discussed how to manage projects in general; in this section we zoom in on datasets specifically. Data are often the most valuable research product because they represent the evidence generated by our research. We maximize the value of the evidence when other scientists can reuse it for independent verification or generation of novel discoveries. Yet, lots of research data are not reusable, even when they are shared. In @sec-replication, we discussed Hardwicke et al.'s [-@hardwicke2018b] study of *analytic* reproducibility. But before we were even able to try to reproduce the analytic results, we had to look at the data. When we did that, we found that only 64% of shared datasets were both complete and understandable.
How can you make sure that your data are managed so as to enable effective sharing? We make four primary recommendations:
1. save your raw data
2. document your data collection process
3. organize your raw data for later analysis
4. document your data using a codebook\indexC{codebook} or other metadata\indexC{metadata}
\noindent Let's look at each in turn.
### Save your raw data
Raw data take many forms. For many of us, the raw data are those returned by the experimental software; for others, the raw data are videos of the experiment being carried out. Regardless of the form of these data, save them! They are often the only way to check issues in whatever processing pipeline brings these data from their initial state to the form you analyze. They also can be invaluable for addressing critiques or questions about your methods or results later in the process. If you need to correct something about your raw data, *do not alter the original files*. Make a copy, and make a note about how the copy differs from the original.^[Future you will thank present you for explaining why there are two copies of subject 19's data after you went back and corrected a typo.]
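As a rough sketch of this copy-then-correct habit in base R (the file names, the subject, and the typo are all invented for illustration):

```r
# Set up a toy raw data file in a temporary directory (hypothetical contents).
raw_dir <- file.path(tempdir(), "raw_data")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
raw_file <- file.path(raw_dir, "exp1.csv")
writeLines(c("subject_id,rt_ms", "18,512", "19,48O", "20,530"), raw_file)

# Copy the raw file; all corrections are made to the copy, never the original.
corrected_file <- file.path(raw_dir, "exp1_corrected.csv")
file.copy(raw_file, corrected_file, overwrite = TRUE)

# Fix the typo (letter O typed instead of a zero) in the copy only.
lines <- readLines(corrected_file)
lines <- sub("^19,48O$", "19,480", lines)
writeLines(lines, corrected_file)

# Record how the copy differs from the original.
note <- "exp1_corrected.csv: subject 19 rt_ms changed from '48O' to '480' (typo)."
writeLines(note, file.path(raw_dir, "CHANGES.txt"))
```

Making the correction in a script, rather than by hand in a spreadsheet, means the change itself is documented and can be rerun or reversed.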
Raw data are often not anonymized, or even anonymizable. Anonymizing them sometimes means altering them (e.g., in the case of downloaded logs from a service that might include IDs or IP addresses). Or in some cases, anonymization\indexC{anonymization} is difficult or impossible without significant effort and loss of some value from the data, for example, for video data or MRI data [@bischoff-grethe2007]. Unless you have specific permission for broad distribution of these identifiable data, the raw data may then need to be stored in a different way. In these cases, we recommend saving your raw data in a separate repository with the appropriate permissions. For example, in the ManyBabies\indexC{ManyBabies} 1 study we described above, the public repository does not contain the raw data contributed by participating labs, which the team could not guarantee was anonymized; these data are instead stored in a private repository.^[The precise repository you use for this task is likely to vary by the kind of data that you're trying to store and the local regulatory environment. For example, in the United States, storing non-anonymized data with certain fields requires a server that is certified for HIPAA (the relevant privacy law). Many---but by no means all---universities provide HIPAA-compliant cloud storage.\indexC{HIPAA (Health Insurance Portability and Accountability Act)}]
You can use your repository's README to describe what is and is not shared. For example, a README might state, "We provide anonymized versions of the files originally downloaded from Qualtrics\indexC{Qualtrics}" or "Participants did not provide permission for public distribution of raw video recordings, which are retained on a secure university server." Critically, if you share the derived tabular data, it should still be possible to reproduce the analytic results in your paper, even if checking the provenance\indexC{provenance} of those numbers from the raw data is not possible for every reader.^[One way we organize the raw data in some of our papers is to have three different subfolders in the `data/` directory: `raw/`, for the original data; `processed/`, for the anonymized or otherwise preprocessed data; and `scripts/`, for the code that does the preprocessing. Since these folders are in a Git repository, we can then add `data/raw/*` to the repository's `.gitignore` file, ensuring that the raw files are never added to the public version of the repository even though they sit within our local file hierarchy in the appropriate place.]
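For instance, an ignore rule for raw data could be added from R as follows. This is a sketch that assumes the `.gitignore` file sits at the top level of the repository and that raw data live in `data/raw/`. Because the pattern contains a slash, Git reads it relative to the location of the `.gitignore` file, so the path is spelled out in full. The repository path here is a temporary stand-in.

```r
# Stand-in for the repository root (hypothetical); use your own project path.
repo_root <- file.path(tempdir(), "my_project")
dir.create(repo_root, recursive = TRUE, showWarnings = FALSE)
ignore_file <- file.path(repo_root, ".gitignore")

# Pattern that excludes everything inside data/raw/ from version control.
rule <- "data/raw/*"

# Append the rule only if it is not already listed.
existing <- if (file.exists(ignore_file)) readLines(ignore_file) else character(0)
if (!rule %in% existing) {
  writeLines(c(existing, rule), ignore_file)
}
```

Files that were committed before the rule was added stay tracked, so the rule is best put in place before any raw data enter the repository.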
One common practice is the use of participant identifiers to link specific experimental data---which, if they are responses on standardized measures, rarely pose a significant identifiability risk---to demographic data sheets that might include more sensitive and potentially identifiable data.^[A word about subject identifiers. These should be anonymous identifiers, like randomly generated numbers, that cannot be linked to participant identities (like date of birth) and are unique. You laugh, but one of us was in a lab where all the subject IDs were the date of test and the initials of the participant. These were neither unique nor anonymous. One common convention is to give your study a code-name and to number participants sequentially, so your first participant in a sequence of experiments on information processing might be `INFO-1-01`.] Depending on the nature of the analyses being reported, the experimental data can then be shared with limited risk. Then a selected set of demographic variables---for example, those that do not increase privacy risks but are necessary for particular analyses---can be distributed as a separate file and joined back into the data later.
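A minimal sketch of this split-and-join pattern is shown below. It uses base R to stay dependency-free, and all of the data, the `INFO` study code, and the variable names are made up for illustration.

```r
# Toy, made-up data for four participants in a hypothetical "INFO" study.
n_participants <- 4

# Sequential, non-identifying IDs following the INFO-1-01 convention.
subject_id <- sprintf("INFO-1-%02d", seq_len(n_participants))

# Experimental data: low identifiability risk, so it can be shared widely.
experiment <- data.frame(
  subject_id = subject_id,
  rt_ms = c(512, 480, 530, 495)
)

# Demographics: only the variables needed for analysis, kept in a separate file.
demographics <- data.frame(
  subject_id = subject_id,
  age_months = c(14, 15, 13, 16)
)

# IDs must be unique in both tables for the join to be one-to-one.
stopifnot(
  !anyDuplicated(experiment$subject_id),
  !anyDuplicated(demographics$subject_id)
)

# Join the demographic variables back onto the experimental data by ID.
combined <- merge(experiment, demographics, by = "subject_id", all.x = TRUE)
```

Checking for duplicate IDs before joining is cheap insurance: a repeated identifier would silently multiply rows in the combined data.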
### Document your data collection process
To understand the meaning of the raw data, it's helpful to share as much as possible about the context in which they were collected. This practice also helps communicate the experience that participants had in your experiment. Documentation of this experience can take many forms.
If the experimental experience was a web-based questionnaire, archiving this experience can be as simple as downloading the questionnaire source.^[If it's in a proprietary format like a Qualtrics\indexC{Qualtrics} `.QSF` file, a good practice is to convert it to a simple plain text format as well so it can be opened and reused by folks who do not have access to Qualtrics (which may include future you!).] For more involved studies, it can be more difficult to reconstruct what participants went through. This kind of situation is where video data can shine [@gilmore2017]. A video recording of a typical experimental session can provide a valuable tutorial for other experimenters---as well as good context for readers of your paper. This is doubly true if there is a substantial interactive element to your experimental experience, as is often the case for experiments with children. For example, in our ManyBabies\indexC{ManyBabies} [case study]{.smallcaps}, the project shared ["walk-through" videos of experimental sessions](https://nyu.databrary.org/volume/896) for many of the participating labs, creating a repository of standard experiences for infant development studies. If nothing else, a video of an experimental session can sometimes be a very nice archive of a particular context.^[Videos of experimental sessions also are great demos to show in a presentation about your experiment, provided you have permission from the participant.]
Regardless of what specific documentation you keep, it's critical to create some record linking your data to the documentation. For a questionnaire study, for example, this documentation might be as simple as a README that says that the data in the `data/raw/` directory were collected on a particular date using the file named `experiment1.qsf`. This kind of "connective tissue" linking data to materials can be very important when you return to a project with questions. If you spot a potential error in your data, you will want to be able to examine the precise version of the materials that you used to gather those data in order to identify the source of the problem.
### Organize your data for later analysis: Spreadsheets
Data come in many forms, but chances are that at some point during your project you will end up with a spreadsheet full of information. Well-organized spreadsheets can mean the difference between project success and failure! A wonderful article by @broman2018 lays out principles of good spreadsheet design. We highlight some of their principles here (with our own, opinionated ordering):
1. *Make it a rectangle*.^[Think of your data like a well-ordered plate of sushi, neatly packed together without any gaps.] Nearly all data analysis software, like SPSS, Stata, Jamovi, and JASP (and many R packages), requires data to be in a tabular format.^[Tabular data is a precursor to "tidy" data, which we describe in more detail in @sec-tidyverse.] If you are used to analyzing data exclusively in a spreadsheet, this kind of tabular data isn't quite as readable, but readable formatting gets in the way of almost any analysis you want to do. [Figure @fig-management-broman-nonrect] gives some examples of nonrectangular spreadsheets. All of these will cause any analytic package to choke because of inconsistencies in how rows and columns are used!
![Examples of non-rectangular spreadsheet formats that are likely to cause problems in analysis. Adapted from @broman2018.](images/management/broman.png){#fig-management-broman-nonrect .margin-caption fig-alt="4 spreadsheets that are non-rectangular due to having empty rows, variables broken across rows, etc."}
2. *Choose good names for your variables*. No one convention for name formatting is best, but it's important to be consistent. We tend to follow the [tidyverse style guide](https://style.tidyverse.org) and use lowercase words separated by underscores (`_`). It's also helpful to give units where these are available---for example, whether reaction times are in seconds or milliseconds. [Table @tbl-broman-ex] gives some examples of good and bad variable names.
\clearpage
::: {.column-margin}
\scriptsize
```{r}
#| label: tbl-broman-ex
#| tbl-cap: "Examples of good and bad variable names. Adapted from @broman2018."
broman <- tribble(
~good , ~alternative , ~avoid ,
"subject_id", "SubID" , "subject #" ,
"sex" , "female" , "M/F" ,
"rt_ms" , "reaction_time", "rt (millisec.)"
)
kable(broman, col.names = c("Good name", "Good alternative", "Avoid"))
```
\normalsize
:::
\vspace{-1em}
3. *Be consistent with your cell formatting*. Each column should have one *kind* of thing in it. For example, if you have a column of numerical values, don't all of a sudden introduce text data like "missing" into one of the cells. This kind of mixing of data types can cause havoc down the road. Mixed or multiple entries also don't work, so don't write "0 (missing)" as the value of a cell. Leaving cells blank is also risky because it is ambiguous. Most software packages have a standard value for missing data\indexC{missing data} (e.g., `NA` is what R uses). If you are writing dates, please be sure to use the "global standard" (ISO 8601), which is YYYY-MM-DD. Anything else can be misinterpreted easily.^[Dates in Excel deserve special mention as a source of terribleness. Excel has an unfortunate habit of interpreting information that has nothing to do with dates as dates, destroying the original content in the process. Excel's issue with dates has caused unending horror in the genetics literature, where gene names are automatically converted to dates, sometimes without the researchers noticing [@ziemann2016]. In fact, some gene names have had to be changed in order to avoid this issue!]
4. *Decoration isn't data*. Decorating your data with bold headings or highlighting may seem useful for humans, but it isn't uniformly interpreted or even recognized by analysis software (e.g., reading an Excel spreadsheet into R will scrub all your beautiful highlighting and artistic fonts), so do not rely on it.
5. *Save data in plain text files*. The CSV (comma-separated values) file format is a common standard for data that is uniformly understood by most analysis software (it is an "interoperable" file format).^[Be aware of some interesting differences in how these files are output by European vs American versions of Microsoft Excel! You might find semicolons instead of commas in some datasets.] The advantage of CSVs is that they are not proprietary to Microsoft or another company and can be inspected in a text editor, but be careful: they do not preserve Excel formulas or formatting!
Given the points above, we recommend that you avoid analyzing your data in Excel. If it is necessary to analyze your data in a spreadsheet program, we urge you to save the raw data as a separate CSV and then create a distinct analysis spreadsheet so as to be sure to retain the raw data unaltered by your (or Excel's) manipulations.
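To make these principles concrete, here is a small base R sketch of a rectangular dataset that uses consistent lowercase names, units in the variable name, `NA` for missing values, and ISO 8601 dates, and that is saved as a plain-text CSV. The dataset, its column names, and its values are all made up for illustration.

```r
# Hypothetical trial-level data: one row per trial, one kind of thing per column.
trials <- data.frame(
  subject_id = c("s01", "s01", "s02", "s02"),
  test_date  = as.Date(c("2024-03-05", "2024-03-05", "2024-03-07", "2024-03-07")),
  condition  = c("congruent", "incongruent", "congruent", "incongruent"),
  rt_ms      = c(512, 640, NA, 587),  # NA rather than "missing" or a blank cell
  stringsAsFactors = FALSE
)

# Save as plain-text CSV; Date columns are written as YYYY-MM-DD.
csv_path <- file.path(tempdir(), "trials.csv")
write.csv(trials, csv_path, row.names = FALSE)

# Read the file back, stating the type we expect for each column.
trials_in <- read.csv(
  csv_path,
  colClasses = c("character", "Date", "character", "numeric")
)
str(trials_in)
```

Declaring the column types on import is a cheap check: if a text value had slipped into `rt_ms`, the import would fail loudly rather than silently turning the whole column into text.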
### Organize your data for later analysis: Software
Many researchers do not create data by manually entering information into a spreadsheet. Instead they receive data as the output from a web platform, software package, or device. These tools typically provide researchers limited control over the format of the resulting tabular data export. Case in point is the survey platform Qualtrics\indexC{Qualtrics}, which---at least at the moment---provides data with not one but two header rows, complicating import into almost all analysis software!^[The R package `qualtRics` [@ginn2024] can help with this.]
That said, if your platform *does* allow you to control what comes out, you can try to use the principles of good tabular data design outlined above. For example, try to give your variables (e.g., questions in Qualtrics)\indexC{Qualtrics} sensible names!
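If you cannot change the export format, you can still handle it in code. The sketch below uses only base R and a tiny made-up export file with two header rows (short question codes, then the question text); the file contents and the readable names we assign at the end are assumptions for illustration, not the layout of any particular platform's export.

```r
# Create a made-up export with two header rows, then two rows of data.
export_path <- file.path(tempdir(), "survey_export.csv")
writeLines(c(
  "ResponseId,Q1,Q2",
  "Response ID,How old are you?,How confident are you?",
  "R_001,34,5",
  "R_002,27,3"
), export_path)

# Take the column names from the first header row only.
col_names <- names(read.csv(export_path, nrows = 1))

# Skip both header rows when reading the data itself.
responses <- read.csv(export_path, skip = 2, header = FALSE,
                      col.names = col_names)

# Replace the opaque question codes with human-readable names (hypothetical).
names(responses) <- c("response_id", "age_years", "confidence_rating")
```

Reading the names and the data in two separate steps keeps the question text out of the data columns, so numeric responses are imported as numbers rather than as text.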
::: {.callout-note title="accident report"}
### Bad variable naming can lead to analytic errors! {-}
In our methods class, students often try to reproduce the original analyses from a published study before attempting to replicate the results in a new sample of participants. When Kengthsagn Louis looked at the code for the study she was interested in, she noticed that the variables in the analysis code were named horribly (presumably because they were output this way by the survey software). For example, one piece of Stata code looked like this:
\vspace{-1em}
\footnotesize
```{verbatim=TRUE}
gen recall1=.
replace recall1=0 if Q21==1
replace recall1=1 if Q21==3 | Q21==5 | Q21==6
replace recall1=2 if Q21==2 | Q21==4 | Q21==7 | Q21==8
replace recall1=0 if Q69==1
replace recall1=1 if Q69==3 | Q69==5 | Q69==6
replace recall1=2 if Q69==2 | Q69==4 | Q69==7 | Q69==8
ta recall1
```
\normalsize
\vspace{-1em}
In the process of translating this code into R in order to reproduce the analyses, Kengthsagn and a course teaching assistant, Andrew Lampinen, noticed that some participant responses had been assigned to the wrong variables. Because the variable names were not human-readable, this error was almost impossible to detect. Since the problem affected some of the inferential conclusions of the article, the article's author---to their credit---issued an immediate correction [@petersen2019].
The moral of the story: obscure variable names can hide existing errors and create opportunities for further error! Sometimes you can adjust these in your experimental software, avoiding the issue. If not, make sure to create a "key" and translate the names immediately, double checking after you are done.
:::
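One way to build such a key is sketched below in base R. The raw column names echo the `Q21` and `Q69` variables from the accident report, but the data values and the new names are placeholders we invented; we do not know what those questions actually measured.

```r
# Hypothetical raw export with opaque, software-generated column names.
raw <- data.frame(Q21 = c(1, 3, 2), Q69 = c(NA, 5, 8))

# The key: old name on the left, new human-readable name on the right.
name_key <- c(Q21 = "recall_item_a", Q69 = "recall_item_b")

# Double check the key before using it.
stopifnot(all(names(name_key) %in% names(raw)))  # every old name exists
stopifnot(anyDuplicated(name_key) == 0)          # no two columns share a new name

# Apply the key by matching on the old names, not on column position.
renamed <- raw
idx <- match(names(name_key), names(renamed))
names(renamed)[idx] <- unname(name_key)
```

Matching on names rather than positions means the renaming still works if the export later gains or reorders columns, and the two checks catch typos in the key before they propagate into your analysis.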
### Document the format of your data
Even the best-organized tabular data are not always easy for other researchers, or even yourself, to understand, especially after some time has passed. For that reason, you should make a **codebook**\indexC{codebook} (also known as a **data dictionary**) that explicitly documents what each variable is. [Figure @fig-management-mb-codebook] shows an example codebook for the trial-level data at the bottom of @fig-management-mb-datafiles. Each row represents one variable in the associated dataset. Codebooks often describe what type of variable a column is (e.g., numeric, string), and what values can appear in that column. A human-readable explanation is often given as well, providing units (e.g., "seconds") and a translation of numeric codes (e.g., "test condition is coded as 1") where relevant.
\clearpage
![Example participant (top) and trial (bottom) level data from @manybabies2020.](images/management/mb-subjects-trials.png){#fig-management-mb-datafiles .margin-caption width=95% fig-alt="2 spreadsheets: each row has one participant with ID, age, etc.; each row has one trial with trial number, looking time, etc."}
![Codebook for trial-level data (see above) from @manybabies2020.](images/management/mb-codebook.png){#fig-management-mb-codebook .margin-caption fig-alt="A spreadsheet with columns Variable Name, Type, Possible Values, Explanation."}
Creating a codebook\indexC{codebook} need not require a lot of work. Almost any documentation is better than nothing! There are also several R packages that can automatically generate a codebook for you, for example `codebook` [@arslan2019], `dataspice` [@boettiger2021], and `dataMaid` [@petersen2019data]. Adding a codebook can substantially increase the reuse value of data and prevent hours of frustration as future you and others try to decode your variable names and assumptions.
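Even without a dedicated package, you can draft a codebook skeleton in a few lines of base R and then fill in the explanations by hand. The sketch below is one possible approach, not what the packages above do; the dataset, variable names, and explanations are all hypothetical.

```r
# Hypothetical trial-level data to document.
trials <- data.frame(
  subject_id     = c("s01", "s02"),
  trial_num      = c(1L, 1L),
  trial_type     = c(1L, 2L),
  looking_time_s = c(7.2, NA),
  stringsAsFactors = FALSE
)

# One row per variable: name, storage type, and number of missing values
# are computed from the data; the explanations are written by a human.
codebook <- data.frame(
  variable    = names(trials),
  type        = unname(vapply(trials, function(col) class(col)[1], character(1))),
  n_missing   = unname(vapply(trials, function(col) sum(is.na(col)), integer(1))),
  explanation = c(
    "Anonymized participant ID",
    "Order of the trial within the session",
    "Trial type: 1 = test, 2 = training",
    "Total looking time in seconds"
  ),
  stringsAsFactors = FALSE
)

# Store the codebook next to the data as plain text.
codebook_path <- file.path(tempdir(), "trials_codebook.csv")
write.csv(codebook, codebook_path, row.names = FALSE)
```

Generating the variable list from the data itself guarantees that every column gets a row in the codebook, so nothing is left undocumented by accident.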
## Sharing research products
As we've been discussing throughout this chapter, if you've managed your research products effectively, sharing them with others is a far less daunting prospect, and usually just requires uploading them to an online repository like the Open Science Framework.\indexC{Open Science Framework (OSF)} This section addresses some potential limitations on sharing that you should bear in mind and discusses where and how to share research products.
### What you can and can't share
We've been advocating that you share all of your research products, especially your data. In practice, however, **participant privacy** (as well as a few other constraints) limits what you can share. Luckily, there are some concrete steps you can take to make sure that you protect participants and comply with your obligations while still realizing the benefits of data sharing.
Unless they explicitly waive their rights, participants in psychology experiments have the expectation of privacy---that is, no one should be able to identify them from the data they have provided. Protecting participant privacy is an important part of researchers' ethical responsibilities [@ross2018] and needs to be balanced against the ethical imperatives to share (see @sec-ethics).^[@meyer2018 gives an excellent overview of how to navigate various legal and ethical issues around data sharing in the US context.]
Furthermore, there are legal regulations that protect participants' data, though these vary from country to country. In the US, the relevant regulation is **HIPAA**, the Health Insurance Portability and Accountability Act\indexC{HIPAA (Health Insurance Portability and Accountability Act)}, which limits disclosures of protected health information (**PHI**). In the European Union, the relevant regulation is the European **GDPR** (General Data Protection Regulation)\indexC{GDPR (General Data Protection Regulation)}. It's beyond the scope of this book to give a full treatment of these regulatory frameworks; you should consult with your local ethics board regarding compliance, but here is the way we have navigated this situation while still sharing data.
Under both frameworks, **anonymization**\indexC{anonymization} (or equivalently **de-identification**\indexC{de-identification}) of data is a key concept, such that data sharing is generally just fine if the data meet the relevant standard. Under US guidelines, researchers can follow the "safe harbor" standard^[As described on the relevant DHHS page (<https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html>).] under which data are considered to be anonymized if they do not contain identifiers like names, telephone numbers, email addresses, social security numbers, dates of birth, faces, and others. Thus, data that only contain participant IDs and nothing from this list can typically be shared without a problem, even without participant consent.^[US IRBs\indexC{institutional review board (IRB)} are a very decentralized bunch, and their interpretations often vary considerably. For reasons of liability or ethics, they may not allow data sharing even though it is permitted by US law. If you feel like arguing with an IRB that takes this kind of stand, you could mention that the DHHS rule actually doesn't consider de-identified data to be "human subjects" data at all, and thus the IRB may not have regulatory authority over it. We're not lawyers, and we're not sure if you'll succeed, but it could be worth a try.]
The EU's GDPR\indexC{GDPR (General Data Protection Regulation)} also allows fully anonymized data sharing, with one big complication. Putting anonymous identifiers in a data file and removing identifiable fields does not itself suffice for GDPR anonymization\indexC{anonymization} if the data are still *in principle reidentifiable* because you have maintained documentation linking IDs to identifiable data like names or email addresses. Only when the key linking identifiers to data has been destroyed are the data truly de-identified according to this standard.
De-identification\indexC{de-identification} is not always enough. As datasets get richer, statistical reidentification risks go up substantially such that, with a little bit of outside information, data can be matched with a unique individual. These risks are especially high with linguistic, physiological, and geospatial data, but they can be present even for simple behavioral experiments. In one influential demonstration, knowing a person's location on two occasions was often enough to identify their data uniquely in a huge database of credit card transactions [@de-montjoye2015].^[For an example closer to home, many of the contributing labs in the ManyBabies project logged the date of test for each participant. This useful and seemingly innocuous piece of information is unlikely to identify any particular participant---but alongside a social media post about a lab visit or a dataset about travel records, it could easily reveal a particular participant's identity.] Thus, simply removing fields from the data is a good starting point---but if you are collecting richer data about participants' behavior you may need to consult an expert.
\clearpage
::: {.callout-note title="accident report"}
### Really anonymous? {-}
When we first began teaching Psych 251, our experimental methods course at Stanford, one of the biggest contributions of the course was simply showing students how to do experiments online. Amazon's Mechanical Turk\indexC{Amazon Mechanical Turk} crowdsourcing service was relatively new, and our IRB\indexC{institutional review board (IRB)} did not have a good sense of what this service really was. We proposed that we would share data from the class and received approval for this practice. Our datasets were downloaded directly from Mechanical Turk and included participants' MTurk IDs (long alphanumeric strings that seemed completely anonymous). Several experiences caused us to reconsider this practice!
First, we discovered that MTurk IDs were in some cases linked to study participants' public Amazon "wish lists," which could both inadvertently provide information about the participant and also even potentially provide a basis for reidentification (in rare cases). This discovery led us to consult with our IRB\indexC{institutional review board (IRB)} and provide more explicit consent language in our class experiments, linking to instructions for making Amazon profiles private.
Then, a little later we received an irate email from an MTurk participant who had discovered their data on GitHub via a search for their MTurk ID. Although they were not identified in this dataset, it convinced us that at least some participants would not like this ID shared. After another consultation with the IRB,\indexC{institutional review board (IRB)} we apologized to this individual and removed their and others' IDs from our GitHub commit histories across that and other repositories. Prior to posting data, we now take care to anonymize IDs by creating a secret mapping between the IDs we post and the actual MTurk IDs.
:::
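A secret mapping of this kind can be sketched in a few lines of base R. Everything in the example is made up: the platform IDs, the column names, and the `p001`-style format of the new IDs. The key linking old and new IDs must be stored separately from anything you post (and, under the stricter GDPR reading described above, destroyed before the data count as anonymized).

```r
# Hypothetical raw data containing platform IDs that should not be shared.
raw <- data.frame(
  worker_id = c("A1XYZ", "B2QRS", "A1XYZ"),
  rt_ms     = c(512, 640, 498),
  stringsAsFactors = FALSE
)

# Build the secret key: one randomly assigned new ID per unique platform ID.
unique_ids <- unique(raw$worker_id)
id_key <- data.frame(
  worker_id      = unique_ids,
  participant_id = sprintf("p%03d", sample(seq_along(unique_ids))),
  stringsAsFactors = FALSE
)

# Swap in the new IDs and drop the platform IDs from the shareable data.
shared <- raw
shared$participant_id <- id_key$participant_id[match(raw$worker_id, id_key$worker_id)]
shared$worker_id <- NULL
shared <- shared[, c("participant_id", "rt_ms")]
```

Only `shared` should ever be committed to a public repository. Assigning the new IDs in random order means that they reveal nothing about the order in which participants appear in the raw file.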
Privacy issues are ubiquitous in data sharing, and almost every experimental research project will need to solve them before sharing data. For simple projects, often these are the only issues that preclude data sharing. However, in more complex projects, other concerns can arise. Funders may have specific mandates regarding where your data should be shared. Data use agreements or collaborator preferences may restrict where and when you can share. And certain data types require much more sensitivity since they are more consequential than, say, the reaction times on a Stroop task. We include here a set of questions to walk through to plan your sharing (@fig-management-sharing-chart). When in doubt, it's often a good idea to consult with the relevant local authority---for example, your ethics board for ethical issues or your research management office for regulatory issues.
![A decision chart for thinking about sharing research products. Adapted from @klein2018.](images/management/kline.png){#fig-management-sharing-chart .margin-caption fig-alt="A flowchart asking can/must you share; what to share; when to share; how to share."}
\clearpage
### Where and how to share: the FAIR principles
<!-- ```{r hamilton, fig.cap="Before digital code and online services like the Open Science Framework, sharing computer code was pretty impractical! Margaret Hamilton, software engineer, with the computer code she and her MIT team wrote for the Apollo space mission (1969). Source: MIT Museum: <https://news.mit.edu/2016/scene-at-mit-margaret-hamilton-apollo-code-0817>", fig.margin=TRUE} -->
<!-- knitr::include_graphics("images/management/margaret-hamilton.jpg") -->
<!-- # DO WE NEED PERMISSION? -->
<!-- ``` -->
For shared research products^[Most of this discussion is about data, because that's where the community has focused its efforts. That said, almost everything here applies to other research products as well!] to be usable by others, they should meet the FAIR standard\indexC{FAIR (findable, accessible, interoperable, and reusable)} by being findable, accessible, interoperable, and reusable [@wilkinson2016].
* **Findable** products are easily discoverable by both humans and machines. That means linking to them in research reports using unique persistent identifiers (e.g., a digital object identifier [DOI])\indexC{digital object identifier (DOI)}^[DOIs are those long URL-like things that are often used to link to papers. Turns out they can also be associated with datasets and other research products. Critically, they are guaranteed to work to find stuff, whereas standard web URLs often go stale after several years when people refactor their website. Most online repositories, like the Open Science Framework,\indexC{Open Science Framework (OSF)} will issue DOIs for the research products you store there.] and attaching metadata\indexC{metadata} that describe what they are so they can be indexed by search engines.
* **Accessibility** means that research products need to be preserved across the long term and retrievable via their standardized identifier.
* **Interoperability** means that the research products need to be in a format that people and machines (e.g., search engines and analysis software) can understand.
* **Reusable** means that the research products need to be well organized, documented, and licensed so that others know how to use them.
If you've followed the guidance in the rest of this chapter, then you will already be well on your way to making your research products FAIR.\indexC{FAIR (findable, accessible, interoperable, and reusable)} There are a few final steps to consider. An important decision is where you are going to share the research products. We recommend uploading the files to a repository that's designed to support FAIR principles. Personal websites don't cut it, since these sites tend to go out of date and disappear. There's also no easy way to find research products on personal sites unless you know who created them. GitHub, though it's a great platform for collaboration, isn't a FAIR\indexC{FAIR (findable, accessible, interoperable, and reusable)} repository---for one thing, products there don't necessarily have DOIs^[You can get a DOI for GitHub software through a partnership with Zenodo (<https://zenodo.org>), a FAIR-compliant repository.]---and there are no archival guarantees on files that are shared there. Perhaps surprisingly for some researchers, journal supplementary materials are also not a great place to put research products. Often supplementary materials are assigned no unique DOI\indexC{digital object identifier (DOI)} or metadata,\indexC{metadata} have limited supported formats, and have no persistence guarantees [@evangelou2005].
Fortunately, there are many repositories that help you conform to FAIR standards\indexC{FAIR (findable, accessible, interoperable, and reusable)}. Zenodo, Figshare, the Open Science Framework (OSF),\indexC{Open Science Framework (OSF)} and the various Dataverse sites are designed for this purpose, though there are many other domain-specific repositories that are particularly relevant for different research fields. We often use the OSF as it makes it easy to share all research products connected to a project in one place. The OSF is FAIR\indexC{FAIR (findable, accessible, interoperable, and reusable)} compatible and allows users to assign DOIs\indexC{digital object identifier (DOI)} to their data and provide appropriate metadata.\indexC{metadata}
We recommend you attach a license to your research products. Academic culture is (usually) unburdened by discussion of intellectual property and legal rights and instead relies on scholarly norms about citation and attribution. The basic expectation is that if you rely on someone else's research, you explicitly acknowledge the relevant journal article through a citation. Although norms are still evolving, using research products created by others generally adheres to the same scholarly principle. Research products can also be useful in nonacademic contexts, however. Perhaps you created software that a company would like to use. Maybe a pediatrician would like to use a research instrument you've been working on to assess their patients. These applications (and many other reuses of the data) require a legal license. In practice, there are a number of simple, open-source licenses that permit reuse. We tend to favor [Creative Commons](https://creativecommons.org) licenses, which come in a variety of flavors such as CC0 (which allows all reuse), CC-BY (which allows reuse as long as there is attribution), and CC-BY-NC (which only allows attributed, noncommercial reuse).^[@klein2018 recommend the CC0 license, which puts no limits on what can be done with your data. At first glance, it may seem like a license that requires attribution is useful. But academic norms, rather than the threat of litigation, lead to good citation practices. In addition, more restrictive licenses can mean that some legitimate uses of your data or research can be blocked.] Regardless of what license you choose, having a license means that your products won't be in a "not sure what I'm allowed to do with this" limbo for others who are interested in reusing them.
As we have discussed, you may want to consider storing your work in a public repository from the outset of the project. If you are using GitHub to manage your project, you can link the Git\indexC{Git} repository to the Open Science Framework\indexC{Open Science Framework (OSF)} so it automatically syncs. This provides a valuable incentive to organize your work properly throughout your project and makes sharing super easy, because you've already done it! On the other hand, this way of working can feel exposed for some researchers, and it does carry some risks, however small, of "scooping" or preemption by other groups working in the same space. Fortunately you can set up the same Git-OSF workflow and keep it private until you're ready to make it public later on.
The next stage at which you should consider sharing your research products is when you submit your study to a journal. If you're still hesitant to make the project entirely public, many repositories (including OSF) will allow you to create special links that facilitate limited access to, for example, reviewers and editors. In general, the earlier you share your research products the better because there are more opportunities for others to learn from, build on, and verify your research.^[If there are errors in our work, we'd certainly love to hear about them *before* the article is published in a journal rather than after!] But if neither of these options seems appealing, please do share your research products once your paper is accepted. Doing so will increase the value (and the impact) of your publication.
## Chapter summary
All of the hard work you put into your experiments---not to mention the contributions of your participants---can be undermined by bad data and project management. As our [accident reports]{.smallcaps} and [case study]{.smallcaps} show, bad organizational practices can at a minimum cause huge headaches. Sometimes the consequences can be even worse. On the flip side, starting with a firm organizational foundation sets your experiment up for success. These practices also make it easier to share all of the products of your research, not just your findings. Such sharing is useful both for individual researchers and for the field as a whole.
<!-- TODO: Barriers to adoption of transparent practices. -->
<!-- ::: {.callout-note title="accident report"} -->
<!-- ## Security practices for databases (how not to get hit by a ransomware attack) -->
<!-- ::: -->
::: {.callout-note title="discussion questions"}
1. Find an Open Science Framework\indexC{Open Science Framework (OSF)} repository that corresponds to a published paper. What is their strategy for documenting what is shared? How easy is it to figure out where everything is and if the data and materials sharing is complete?
2. Open up the US Department of Health and Human Services "safe harbor" standards (<https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html>) and navigate to the section called "The De-identification\indexC{de-identification} Standard." Go through the list of identifiers that must be removed. Are there any on this list that you would need to include in your dataset in order to conduct your own research? Can you think of any others that do not fall on this list?
:::
::: {.callout-note title="readings"}
* A more in-depth tutorial on various aspects of scientific openness: Klein, Olivier, Tom E. Hardwicke, Frederik Aust, Johannes Breuer, Henrik Danielsson, Alicia Hofelich Mohr, Hans IJzerman, Gustav Nilsonne, Wolf Vanpaemel, and Michael C. Frank [-@klein2018]. "A Practical Guide for Transparency in Psychological Science." *Collabra: Psychology* 4 (1): 20. <https://doi.org/10.1525/collabra.158>
:::
<!-- \refs -->