Patterns of diagnosed illness for England using CPRD HES ONS data

This code has been developed to create a patient-level longitudinal dataset identifying times of onset (incidence), prevalence, remission or mortality for a list of common conditions in the population of England using patient records from primary and secondary care.

Project Status: [In progress]

Project Description

Background

Measures of the prevalence of clinical conditions often rely on survey data which can be prone to reporting biases. Using administrative data to measure multimorbidity in the population is complicated by data access and the complex nature of administrative health patient records.

This code has been developed to create a patient-level longitudinal dataset identifying times of onset (incidence), prevalence, remission or mortality for a list of common conditions in the population of England using patient records from primary and secondary care.

It uses commonly used secure datasets in secondary care in the English NHS (Hospital Episodes Statistics, HES) linked with patient records in primary care (CPRD Aurum version) and death register from the Office for National Statistics. These are not publicly available, but can be applied for here[Xx].

Measure of multimorbidity

The list of conditions has been developed to match the 20 principal conditions used to construct the Cambridge Multimorbidity Score (CMS), a useful way of comparing multimobidity for different types and intensity of conditions.

How does it work?

The scripts shared can be run in numerical order and act on flat text files for CPRD, HES and ONS data. The main processes are carried out in the dual numbered scripts (00, 01, etc) and the others are usually called from within these (with exception of 00.1 & 00.2). Below is a description of what each script does:

The scripts at the start marked 00 perform project preperation by loading libraries, and variables used in the rest of the analysis. They define the clinical conditions to be identify in the panel and to set up the structure of the panel
- 00_project_setup.R sets the base variables and functions used throughout the project and creates patient cohorts (As the sample is too large to process at once)
- 00.0_file_locations.R lists locations of files and data to be set based on your system (we used s3 locations in an AWS enivronment)
- 00.1_cprd_processing.R processes the CPRD flat files into an arrow dataset using the aurumpipeline package (perform first and only once)
- 00.2_linked_hes_processing.R processes the HES flat files in the same way as above (perform first and only once - note this uses the aurumpipeline as well as the data is linked and supplied by CRPD)
- 00.3_conditions_lookup.R loads in our required codelists for CPRD (snomed) and HES (ICD10) and combines for later use
Script 01_clean_files.R processes CRPD and HES files into cohort specific blocks, and cleans them. These can be amended by a system of flags (TRUE/FALSE) for each cleaning rule
02 scripts are concerned with patient demographic data processing.
- 02_patient.R links the patient data with IMD and ethnicity results
- 02.1_patient_ethnicity.R uses CPRD observations and a variety of HES data to assign ethnicity to patients. This is based on the number of observations that are made and the most common is assigned. In the case of equal numbers, the most recent is assigned
03 scripts identifies the dates of diagnosis and remission of chosen clinical conditions for the sample of patients both in CPRD and HES (including the different datasets of HES, inpatient, outpatient and emergency admissions)
- 03_patient_diagnosis.R creates tables of resolving and non-resolving conditions with dates for patients from CPRD observation, HES admitted and HES outpatient data
- 03.1_diagnosis_using_prodcodes.R adds to the diagnosis tables by adding conditions that may have been missed by certain prescriptions (based on each certain product codes)
- 03.2_spell_grouping.R constucts spells of disease for each patient taking into account close remission and diagnosis dates, hospital spells crossing years and removes duplicates. The source of each diagnosis is recoreded
04_panel_creation.R constructs a yearly panel dataset which flags whether each patient has a certain condition, and attributes it a weight (between 0 and 1) corresponding to the proportion of the year the condition is prevalent.
The final analysis script attaches the weight from the Cambridge Multimorbidity Score (CMS) to each condition to obtain an aggregate measure of multimorbidity, and produces outputs for the report.
- 05_analyis.R sets the final variables of interest to report on, and finalises the panel dataset before calculating the CMS for various groups over time
- 05.1_standardisation.R creates weights for the panel based on European standard population methods

Note, the sample is split into cohorts of smaller size which allows both to test the code on one cohort and to speed up the analysis. The number and size of the cohort are defined in the script '00_project_setup.R' and analysts can change it to fit their sample size or system's requirements.

Cleaning rules

There are several cleaning rules which are used throughout these scripts that can be chosen to be run or skipped by setting the following to TRUE/FALSE:

create_cohorts divide sample into cohorts
remove_dup remove duplicate CPRD practices
remove_hes_dup
use_prodcodes use prescriptions for diagnosis
spell_grouping run the spell grouping code

Data sources

Clinical Practice Research Datalink (CPRD) Aurum is a database containing details of routinely collected primary care data from EMIS IT systems in consenting general practices across England and Northern Ireland. Regulatory approvals to use CPRD data for this analysis were granted by the CPRD Independent Scientific Advisory Committee (ISAC protocol number 20-000096).

Hospital Episode Statistics data have been obtained under license from the UK XXX.

More information on these data sources is available here:

Requirements

The following R packages (available on CRAN) are needed:

arrow
data.table
bit64
tidyverse
janitor
lubridate
vroom
writexl

The code relies heavily on the CPRD Aurum pipeline written by Jay Hughes, which can be downloaded here

Useful references

On Jay Hughes' CPRD Aurum pipeline here
On Anna Head's multimorbidity codelist for CPRD here
On the development and validation of the Cambridge Multimorbidity Score here - Note, this has been derived using CPRD Gold and is adapted for CPRD Aurum in our code.
A Health Foundation report published in July 2023 uses similar analysis to understand the future patterns of illness in the population in England. This report (Health in 2040: projected patterns of illness in England, by T. Watt and co-authors) can be read here.

Authors

Jay Hughes - GitHub
Laurie Rachet-Jacquet - Twitter/X - GitHub
Ann Raymond - Twitter/X - GitHub

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Patterns of diagnosed illness for England using CPRD HES ONS data

Project Status: [In progress]

Project Description

Background

Measure of multimorbidity

How does it work?

Cleaning rules

Data sources

Requirements

Useful references

Authors

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
Codelists		Codelists
.gitignore		.gitignore
00.0_file_locations.R		00.0_file_locations.R
00.1_cprd_processing.R		00.1_cprd_processing.R
00.2_linked_hes_processing.R		00.2_linked_hes_processing.R
00.3_conditions_lookup.R		00.3_conditions_lookup.R
00_project_setup.R		00_project_setup.R
01_clean_files.R		01_clean_files.R
02.1_patient_ethnicity.R		02.1_patient_ethnicity.R
02_patient.R		02_patient.R
03.1_diagnoses_using_prodcodes.R		03.1_diagnoses_using_prodcodes.R
03.2_spell_grouping.R		03.2_spell_grouping.R
03_patient_diagnosis.R		03_patient_diagnosis.R
04_panel_creation.R		04_panel_creation.R
05.1_standardisation.R		05.1_standardisation.R
05_analysis.R		05_analysis.R
LICENSE		LICENSE
Patterns_of_diagnosed_illness_for_England_using_CPRD_HES_ONS_data.Rproj		Patterns_of_diagnosed_illness_for_England_using_CPRD_HES_ONS_data.Rproj
README.md		README.md

License

HFAnalyticsLab/Patterns_of_diagnosed_illness_for_England_using_CPRD_HES_ONS_data

Folders and files

Latest commit

History

Repository files navigation

Patterns of diagnosed illness for England using CPRD HES ONS data

Project Status: [In progress]

Project Description

Background

Measure of multimorbidity

How does it work?

Cleaning rules

Data sources

Requirements

Useful references

Authors

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages