This is a data package with 19 medical datasets for teaching
Reproducible Medical Research with R. The link to the pkgdown reference
website for {medicaldata} is
here and in the links at
the right. This package will be useful for anyone teaching R to medical
professionals, including doctors, nurses, pharmacists, trainees, and
students.
These datasets range from reconstructed versions of
James Lind’s scurvy dataset (1757) and the original Streptomycin for
Tuberculosis trial (1948), through a 2012 RCT of indomethacin to prevent
post-ERCP pancreatitis that I was involved in, to cohort data on
SARS-CoV-2 testing results (2020). Many of the datasets come from the
American Statistical Association’s TSHS (Teaching Statistics in the
Health Sciences) Resources
Portal, maintained by
Carol Bigelow at the
University of Massachusetts (with permission). A growing number of
datasets in the dev version were generously donated by Frank
Harrell from his website
here. These datasets are currently available only in
the dev version of the
package on GitHub; that version should make it to CRAN in June of 2023.
- Install the stable, current CRAN version with `install.packages("medicaldata")`. If you want to try out the in-development version (which may have new datasets and vignettes, but which may also be intermittently wonky), install with: `remotes::install_github("higgi13425/medicaldata")`
- Then load the package with `library(medicaldata)`
- Then you can list the datasets available with `data(package = "medicaldata")`
- Then assign a particular dataset to a named object in your environment with: `covid <- medicaldata::covid_testing` where `covid` is the name of the new object, and `covid_testing` is the name of the dataset.
- Articles (vignettes) on how to use the datasets can be found at the pkgdown website under the Articles tab.
- You can click on the links below to view the description document and/or codebook for each dataset. This information is also available under the Reference tab above, or within R by using `help(dataset_name)`.
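The steps above can be sketched as a short script. This is only an illustration: `list_datasets()` is a hypothetical helper name of my own, not a function in {medicaldata}, and the `requireNamespace()` guard is there so the sketch runs even when the package is not installed.

```r
# List the dataset names that an installed package exposes through data().
# list_datasets() is a hypothetical helper, not part of {medicaldata}.
list_datasets <- function(pkg) {
  items <- data(package = pkg)$results[, "Item"]
  sort(unname(items))
}

# Guarded so the sketch still runs when {medicaldata} is not installed.
if (requireNamespace("medicaldata", quietly = TRUE)) {
  print(list_datasets("medicaldata"))

  # Assign one dataset to a named object and check its dimensions.
  covid <- medicaldata::covid_testing
  print(dim(covid))
}
```

The helper works for any installed package, so `list_datasets("datasets")` is a quick way to try it before installing anything.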
If you have access to data from a randomized, controlled clinical trial, or a prospective cohort study, or even a case-control study, please consider obtaining the appropriate permissions, anonymizing the data, and donating the dataset for teaching purposes to add to this package. Open an issue on the GitHub page (source code link at the top right) to start the discussion of a data donation. I am happy to help with anonymization.
Click on links below for more details about the dataset itself in the
Description Document, and more details about the variables included in
the dataset in the Codebook. Note that each dataset also has a help file
that you can use within R or RStudio, by entering help("dataset_name")
in the Console pane. The fourth column of the table below (scroll to the
right or widen your browser window) describes the study design, as
requested by Dan Sjoberg of {gtsummary} fame.
Dataset | Description document | Codebook | Design |
---|---|---|---|
strep_tb | strep_tb_desc | strep_tb_codebook | Randomized Controlled Trial (RCT) |
scurvy | scurvy_desc | scurvy_codebook | RCT |
indo_rct | indo_rct_desc | indo_rct_codebook | RCT |
polyps | polyps_desc | polyps_codebook | RCT |
cervical dystonia (dev) | cdystonia_desc | cdystonia_codebook | RCT |
covid_testing | covid_desc | covid_codebook | Retrospective cross-sectional |
blood_storage | blood_storage_desc | blood_storage_codebook | Retrospective Cohort Study |
cytomegalovirus | cytomegalovirus_desc | cytomegalovirus_codebook | Retrospective Cohort Study |
esoph_ca | esoph_ca_desc | esoph_ca_codebook | Case-control study |
laryngoscope | laryngoscope_desc | laryngoscope_codebook | RCT |
licorice_gargle | licorice_gargle_desc | licorice_gargle_codebook | RCT |
opt | opt_desc | opt_codebook | RCT |
cath (dev) | cath_desc | cath_codebook | Retrospective Cohort Study |
smartpill | smartpill_desc | smartpill_codebook | Prospective Cohort Study |
supraclavicular | supraclavicular_desc | supraclavicular_codebook | RCT |
indometh | indometh_desc | indometh_codebook | Prospective Cohort Pharmacokinetic (PK) Study |
theoph | theoph_desc | theoph_codebook | Prospective Cohort PK Study |
diabetes (dev) | diabetes_desc | diabetes_codebook | Prospective Longitudinal Cohort Study |
thiomon (dev) | thiomon_desc | thiomon_codebook | Retrospective Cohort Study, suitable for ML |
abm (dev) | abm_desc | abm_codebook | Retrospective Cohort Study |
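After picking a dataset from the table, a quick first look inside R might go something like the sketch below. `first_look()` is a hypothetical helper written for this example, and `strep_tb` is used only because it is the first entry in the table; any other dataset name should work the same way.

```r
# Summarise the shape and the (first) class of each column of a data frame.
# first_look() is a hypothetical helper, not part of {medicaldata}.
first_look <- function(df) {
  list(
    n_rows  = nrow(df),
    n_cols  = ncol(df),
    classes = vapply(df, function(x) class(x)[1], character(1))
  )
}

# Guarded so the sketch still runs when {medicaldata} is not installed.
if (requireNamespace("medicaldata", quietly = TRUE)) {
  # In an interactive session, the help file can be opened with:
  # help("strep_tb", package = "medicaldata")
  print(first_look(medicaldata::strep_tb))
}
```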
I am doing a beta test of messy datasets, largely in Excel, with many annoying non-tidy and non-rectangular features that will help teach data cleaning/wrangling. These are not actually in the package itself (as they are not R files), but can be found in the GitHub repository.
You can download and open these from the GitHub repo in all of their
messy Excel glory by clicking on the URL links in the table below. You
can also find them here in the list on the GitHub
repo,
where you can click on one of the *.xlsx files, then click on the View Raw button to download it.
You can read these datasets directly into R from the URLs in the table
below with the example code found in the following code chunk, which
reads in the `messy_infarct` dataset and assigns it to the object
`infarct`. It may be easiest to copy the entire code chunk below by
hovering over the copy icon in the top right corner, then clicking to
copy.
# install.packages("openxlsx")  # run once, if not already installed
library(openxlsx)
url <- "https://github.com/higgi13425/medicaldata/raw/master/data-raw/messy_data/messy_infarct.xlsx"
# replace the filename "messy_infarct.xlsx" at the end of this long url path with the filename that you want to load.
# Or just copy the whole path from the URL column below.
infarct <- openxlsx::read.xlsx(url)
head(infarct)
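If you plan to load several of these files, swapping the filename by hand gets tedious. One possible approach, sketched below, is to build the URL from the dataset name and download to a temporary file first. `messy_url()` and `read_messy()` are hypothetical helper names of my own, and the base URL is simply the common prefix of the URLs listed in the table.

```r
# Common prefix of the messy-data URLs in the GitHub repository.
base_url <- "https://github.com/higgi13425/medicaldata/raw/master/data-raw/messy_data/"

# Build the full URL for a dataset name such as "messy_bp" (hypothetical helper).
messy_url <- function(name) {
  paste0(base_url, name, ".xlsx")
}

# Download to a temporary file, then read it with {openxlsx} (hypothetical helper).
read_messy <- function(name) {
  tmp <- tempfile(fileext = ".xlsx")
  on.exit(unlink(tmp), add = TRUE)
  # mode = "wb" keeps the binary xlsx file intact, which matters on Windows.
  download.file(messy_url(name), destfile = tmp, mode = "wb", quiet = TRUE)
  openxlsx::read.xlsx(tmp)
}

# Needs a network connection and {openxlsx}, so it is left commented out here:
# bp <- read_messy("messy_bp")
# head(bp)
```

Downloading first is a design choice rather than a requirement, since the code chunk above reads straight from the URL. A local copy can be handy when you want to open the same file in Excel to see the mess for yourself.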
Dataset | URL | Type of Messiness |
---|---|---|
messy_cirrhosis | "https://github.com/higgi13425/medicaldata/raw/master/data-raw/messy_data/messy_cirrhosis.xlsx" | Pivot Table |
messy_infarct | "https://github.com/higgi13425/medicaldata/raw/master/data-raw/messy_data/messy_infarct.xlsx" | Pivot Table |
messy_aki | "https://github.com/higgi13425/medicaldata/raw/master/data-raw/messy_data/messy_aki.xlsx" | unique ids, header and footer rows, empty rows & cols, messy varnames, no units, typos in factors, visit date in headers, dates |
messy_bp | "https://github.com/higgi13425/medicaldata/raw/master/data-raw/messy_data/messy_bp.xlsx" | unite and separate, vars without units, visit num in headers, data entry errors |
messy_glucose | "https://github.com/higgi13425/medicaldata/raw/master/data-raw/messy_data/messy_glucose.xlsx" | factors, vars without units, visit num in headers, header rows, empty rows/cols |
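As a small taste of the cleaning these files call for, here is a base R sketch for one of the problems listed above, empty rows and columns. `drop_empty()` is a hypothetical helper and the `toy` data frame is made up for illustration; it is not taken from any of the messy files. It treats a cell as empty only when it is `NA`, which is an assumption about how the spreadsheet was read in. Note also that `openxlsx::read.xlsx()` has `skipEmptyRows` and `skipEmptyCols` arguments that may already handle part of this at import time.

```r
# Drop rows and columns that contain nothing but NA (hypothetical helper).
drop_empty <- function(df) {
  keep_rows <- rowSums(!is.na(df)) > 0
  keep_cols <- colSums(!is.na(df)) > 0
  df[keep_rows, keep_cols, drop = FALSE]
}

# Made-up example: one entirely empty row and one entirely empty column.
toy <- data.frame(
  id    = c(1, NA, 2),
  blank = c(NA, NA, NA),
  sbp   = c(120, NA, 135)
)

cleaned <- drop_empty(toy)
print(cleaned)
```

The other kinds of messiness in the table (pivot tables, values in headers, typos in factors) need more than this, and are the reason these files exist as teaching material.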