This code has been developed to create a patient-level longitudinal dataset identifying times of onset (incidence), prevalence, remission or mortality for a list of common conditions in the population of England using patient records from primary and secondary care.
Measures of the prevalence of clinical conditions often rely on survey data which can be prone to reporting biases. Using administrative data to measure multimorbidity in the population is complicated by data access and the complex nature of administrative health patient records.
It uses widely used secure datasets: secondary care records from the English NHS (Hospital Episode Statistics, HES) linked with primary care patient records (CPRD Aurum) and the death register from the Office for National Statistics (ONS). These datasets are not publicly available, but access can be applied for here[Xx].
The list of conditions has been developed to match the 20 principal conditions used to construct the Cambridge Multimorbidity Score (CMS), a useful way of comparing multimorbidity across different types and intensities of conditions.
The scripts shared can be run in numerical order and act on flat text files for CPRD, HES and ONS data. The main processes are carried out in the two-digit numbered scripts (00, 01, etc.), and the others are usually called from within these (with the exception of 00.1 and 00.2). Below is a description of what each script does:
- The scripts marked 00 perform project preparation by loading the libraries and variables used in the rest of the analysis. They define the clinical conditions to be identified in the panel and set up the structure of the panel.
  - 00_project_setup.R: sets the base variables and functions used throughout the project and creates patient cohorts (as the sample is too large to process at once)
  - 00.0_file_locations.R: lists locations of files and data, to be set based on your system (we used S3 locations in an AWS environment)
  - 00.1_cprd_processing.R: processes the CPRD flat files into an arrow dataset using the aurumpipeline package (perform first and only once)
  - 00.2_linked_hes_processing.R: processes the HES flat files in the same way as above (perform first and only once; note that this also uses aurumpipeline, as the data is linked and supplied by CPRD)
  - 00.3_conditions_lookup.R: loads the required codelists for CPRD (SNOMED) and HES (ICD-10) and combines them for later use
- Script 01_clean_files.R processes CPRD and HES files into cohort-specific blocks and cleans them. The cleaning rules can be switched on or off through a system of flags (TRUE/FALSE), one for each rule.
- The 02 scripts process patient demographic data.
  - 02_patient.R: links the patient data with IMD and ethnicity results
  - 02.1_patient_ethnicity.R: uses CPRD observations and a variety of HES data to assign ethnicity to patients. The ethnicity recorded most often across a patient's observations is assigned; in the case of a tie, the most recent is assigned
- The 03 scripts identify the dates of diagnosis and remission of the chosen clinical conditions for the sample of patients in both CPRD and HES (including the different HES datasets: inpatient, outpatient and emergency admissions).
  - 03_patient_diagnosis.R: creates tables of resolving and non-resolving conditions with dates for patients from CPRD observation, HES admitted and HES outpatient data
  - 03.1_diagnosis_using_prodcodes.R: supplements the diagnosis tables with conditions that may have been missed, using certain prescriptions (based on specific product codes)
  - 03.2_spell_grouping.R: constructs spells of disease for each patient, taking into account close remission and diagnosis dates and hospital spells crossing years, and removes duplicates. The source of each diagnosis is recorded
- 04_panel_creation.R constructs a yearly panel dataset that flags whether each patient has a given condition and assigns it a weight (between 0 and 1) corresponding to the proportion of the year in which the condition is prevalent.
- The final analysis script attaches the weight from the Cambridge Multimorbidity Score (CMS) to each condition to obtain an aggregate measure of multimorbidity, and produces outputs for the report.
  - 05_analyis.R: sets the final variables of interest to report on and finalises the panel dataset before calculating the CMS for various groups over time
  - 05.1_standardisation.R: creates weights for the panel based on European standard population methods
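
The spell-grouping and yearly-weight steps described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code in 03.2_spell_grouping.R or 04_panel_creation.R: the function names, the column names (`start`, `end`), the 30-day gap used to decide that a remission and a new diagnosis are "close", and the treatment of end dates as inclusive are all hypothetical choices made for the example.

```r
# Hypothetical helper: merge the spells of ONE patient and ONE condition when
# the gap between a spell's end and the next spell's start is at most
# max_gap_days (30 is an assumed threshold, not the project's value).
# Assumes `spells` has at least one row and no missing dates.
merge_close_spells <- function(spells, max_gap_days = 30) {
  spells <- spells[order(spells$start), ]
  merged <- spells[1, , drop = FALSE]
  for (i in seq_len(nrow(spells))[-1]) {
    last <- nrow(merged)
    gap <- as.numeric(spells$start[i] - merged$end[last])  # gap in days
    if (gap <= max_gap_days) {
      # Close, overlapping or duplicate spell: extend the current spell
      merged$end[last] <- max(merged$end[last], spells$end[i])
    } else {
      merged <- rbind(merged, spells[i, , drop = FALSE])
    }
  }
  rownames(merged) <- NULL
  merged
}

# Hypothetical helper: proportion of a calendar year covered by merged spells.
# Dates are converted to day counts so that clipping uses plain numeric pmin/pmax.
year_weight <- function(spells, year) {
  year_start <- as.numeric(as.Date(paste0(year, "-01-01")))
  year_end   <- as.numeric(as.Date(paste0(year, "-12-31")))
  from <- pmax(as.numeric(spells$start), year_start)
  to   <- pmin(as.numeric(spells$end), year_end)
  days <- pmax(to - from + 1, 0)  # spells outside the year contribute 0 days
  sum(days) / (year_end - year_start + 1)
}

# Made-up spells for a single patient and condition
spells <- data.frame(
  start = as.Date(c("2019-03-01", "2019-07-10", "2020-11-01")),
  end   = as.Date(c("2019-06-30", "2019-12-31", "2021-02-28"))
)
merged <- merge_close_spells(spells)
weight_2019 <- year_weight(merged, 2019)
weight_2020 <- year_weight(merged, 2020)
```

In this made-up example the first two spells are 10 days apart, so they are merged into one spell before the weights are computed; the third spell starts much later and stays separate.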
Note that the sample is split into smaller cohorts, which makes it possible both to test the code on a single cohort and to speed up the analysis. The number and size of the cohorts are defined in the script '00_project_setup.R', and analysts can change them to fit their sample size or system requirements.
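
As a rough illustration of the cohort idea, the sketch below splits a vector of patient IDs into a chosen number of groups of near-equal size. It is not taken from 00_project_setup.R: the function name `assign_cohorts`, the round-robin assignment and the example IDs are assumptions made for this example.

```r
# Hypothetical helper: assign patient IDs to n_cohorts groups in round-robin
# order, so that cohort sizes differ by at most one patient.
assign_cohorts <- function(patient_ids, n_cohorts) {
  cohort <- (seq_along(patient_ids) - 1) %% n_cohorts + 1
  split(patient_ids, cohort)
}

patient_ids <- 1001:1010                      # made-up IDs
cohorts <- assign_cohorts(patient_ids, n_cohorts = 3)

# Run the pipeline on a single cohort first to test the code
test_cohort <- cohorts[[1]]
```

Changing `n_cohorts` is the only adjustment needed in this sketch to trade memory use per cohort against the number of passes over the data.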
Several cleaning rules used throughout these scripts can be run or skipped by setting the following flags to TRUE or FALSE:
- create_cohorts: divide sample into cohorts
- remove_dup: remove duplicate CPRD practices
- remove_hes_dup
- use_prodcodes: use prescriptions for diagnosis
- spell_grouping: run the spell grouping code
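
A minimal sketch of how such flags might be set and used is shown below. The flag names come from the list above; the values, and the way each flag gates a step (including the pairing of `use_prodcodes` and `spell_grouping` with the 03.1 and 03.2 scripts), are assumptions for illustration rather than the project's actual logic.

```r
# Flag names are taken from the list above; the values are illustrative only.
create_cohorts <- TRUE
remove_dup     <- TRUE
remove_hes_dup <- TRUE
use_prodcodes  <- FALSE
spell_grouping <- TRUE

# Hypothetical gating: collect the optional steps that would be run.
steps_to_run <- character(0)
if (use_prodcodes) {
  steps_to_run <- c(steps_to_run, "03.1_diagnosis_using_prodcodes.R")
}
if (spell_grouping) {
  steps_to_run <- c(steps_to_run, "03.2_spell_grouping.R")
}
```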
Clinical Practice Research Datalink (CPRD) Aurum is a database containing details of routinely collected primary care data from EMIS IT systems in consenting general practices across England and Northern Ireland. Regulatory approvals to use CPRD data for this analysis were granted by the CPRD Independent Scientific Advisory Committee (ISAC protocol number 20-000096).
Hospital Episode Statistics data have been obtained under license from the UK XXX.
More information on these data sources is available here:
The following R packages (available on CRAN) are needed:
- arrow
- data.table
- bit64
- tidyverse
- janitor
- lubridate
- vroom
- writexl
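
Before running the scripts, it may help to check which of these packages are not yet installed. The snippet below is a small convenience sketch, not part of the project code; the variable names are arbitrary.

```r
# Packages listed above
required <- c("arrow", "data.table", "bit64", "tidyverse",
              "janitor", "lubridate", "vroom", "writexl")

# Compare against the packages installed in the current library paths
not_installed <- setdiff(required, rownames(installed.packages()))

if (length(not_installed) > 0) {
  message("Missing packages: ", paste(not_installed, collapse = ", "))
  # install.packages(not_installed)  # uncomment to install from CRAN
}
```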
The code relies heavily on the CPRD Aurum pipeline written by Jay Hughes, which can be downloaded here
- On Jay Hughes' CPRD Aurum pipeline here
- On Anna Head's multimorbidity codelist for CPRD here
- On the development and validation of the Cambridge Multimorbidity Score here. Note that this was derived using CPRD Gold and is adapted for CPRD Aurum in our code.
- A Health Foundation report published in July 2023 uses similar analysis to understand the future patterns of illness in the population in England. This report (Health in 2040: projected patterns of illness in England, by T. Watt and co-authors) can be read here.
- Jay Hughes - GitHub
- Laurie Rachet-Jacquet - Twitter/X - GitHub
- Ann Raymond - Twitter/X - GitHub