Preprocess data #43

Merged on Dec 17, 2024 (32 commits)
Commits
fc8a4f6
Create preprocess_data.R
ZoeMZou Nov 11, 2024
641e0bc
Create modify_dummy_vax_data.R
ZoeMZou Dec 2, 2024
9c47b0d
Update modify_dummy_vax_data.R
ZoeMZou Dec 2, 2024
00a8c03
Update preprocess_data.R
ZoeMZou Dec 2, 2024
50364ae
Create specify_paths.R
ZoeMZou Dec 2, 2024
7460ad3
Update preprocess_data.R
ZoeMZou Dec 10, 2024
3b0c18d
Delete specify_paths.R
ZoeMZou Dec 12, 2024
20aa35b
Create specify_paths_example.R
ZoeMZou Dec 12, 2024
44e3d89
Update preprocess_data.R
ZoeMZou Dec 12, 2024
6c1d513
Delete specify_paths_example.R
ZoeMZou Dec 12, 2024
5348c85
Create post-covid-respiratory.Rproj
ZoeMZou Dec 12, 2024
9c134bb
Rproject
ZoeMZou Dec 12, 2024
2c3d57a
Update preprocess_data.R
ZoeMZou Dec 12, 2024
24ffc94
Update create_project_actions.R
ZoeMZou Dec 12, 2024
41ec521
Update dataset_definition_cohorts.py
ZoeMZou Dec 12, 2024
64ba7df
Update preprocess_data.R
ZoeMZou Dec 12, 2024
149424c
Update variables_cohorts.py
ZoeMZou Dec 12, 2024
1e8e677
Update variables_dates.py
ZoeMZou Dec 12, 2024
74082ab
Update project actions
ZoeMZou Dec 13, 2024
3f57e88
Extract vax dates/type variables to cohort datasets for later pipelines
ZoeMZou Dec 13, 2024
18b85e2
Rename death_date
ZoeMZou Dec 13, 2024
32ae8be
Extract index dates to each cohort
ZoeMZou Dec 13, 2024
f9e6c89
Update preprocess_data.R
ZoeMZou Dec 13, 2024
1daf7f7
Update variables_dates.py
ZoeMZou Dec 13, 2024
632fda9
Update preprocess_data.R
ZoeMZou Dec 13, 2024
468cd66
Update YAML
ZoeMZou Dec 13, 2024
ea76977
Update preprocess_data.R
ZoeMZou Dec 13, 2024
9bff56f
Update README.md
ZoeMZou Dec 13, 2024
0fe6452
Update preprocess_data.R
ZoeMZou Dec 16, 2024
aca510e
Update preprocess_data.R
ZoeMZou Dec 16, 2024
0cf4af3
Move scripts to directory for dataset_definition
ZoeMZou Dec 16, 2024
ff24a27
Move utility and active_analyses up a directory
venexia Dec 17, 2024
30 changes: 15 additions & 15 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,30 +9,30 @@ This repository may reflect an incomplete or incorrect analysis with no further
The content has ONLY been made public to support the OpenSAFELY [open science and transparency principles](https://www.opensafely.org/about/#contributing-to-best-practice-around-open-science) and to support the sharing of re-usable code for other subsequent users.
No clinical, policy or safety conclusions must be drawn from the contents of this repository.

# About the OpenSAFELY framework

The OpenSAFELY framework is a Trusted Research Environment (TRE) for electronic
health records research in the NHS, with a focus on public accountability and
research quality.

Read more at [OpenSAFELY.org](https://opensafely.org).

# Licences
As standard, research projects have a MIT license.

## Repository navigation

- If you are interested in how we defined our code lists, look in the [`codelists`](./codelists) folder.

- Analyses scripts are in the [`analysis`](./analysis) directory:

- If you are interested in how we defined our variables, we use the variable script [variable_helper_fuctions](analysis/variable_helper_functions.py) to define functions that generate variables. We then apply these functions in [variables_cohorts](analysis/variables_cohorts.py) to create a dictionary of variables for cohort definitions, and in [variables_dates](analysis/variables_dates.py) to create a dictionary of variables for calculating study start dates and end dates.
- If you are interested in how we defined study dates (e.g., index and end dates), these vary by cohort and are described in the protocol. We use the script [dataset_definition_dates](analysis/dataset_definition_dates.py) to generate a dataset with all required dates for each cohort. This script imported all variables generated from [variables_dates](analysis/variables_dates.py).
- If you are interested in how we defined our cohorts, we use the dataset definition script [dataset_definition_cohorts](analysis/dataset_definition_cohorts.py) to define a function that generates cohorts. This script imports all variables generated from [variables_cohorts](analysis/variables_cohorts.py) using the patient's index date, the cohort start date and the cohort end date. This approach is used to generate three cohorts: pre-vaccination, vaccinated, and unvaccinated—found in [dataset_definition_prevax](analysis/dataset_definition_prevax.py), [dataset_definition_vax](analysis/dataset_definition_vax.py), and [dataset_definition_unvax](analysis/dataset_definition_unvax.py), respectively. For each cohort, the extracted data is initially processed in the preprocess data script [preprocess data script](analysis/preprocess_data.R), which generates a flag variable for pre-existing respiratory conditions and restricts the data to relevant variables.
- If you are interested in how we defined our variables, we use the variable script [variable_helper_functions](analysis/dataset_definition/variable_helper_functions.py) to define functions that generate variables. We then apply these functions in [variables_cohorts](analysis/dataset_definition/variables_cohorts.py) to create a dictionary of variables for cohort definitions, and in [variables_dates](analysis/dataset_definition/variables_dates.py) to create a dictionary of variables for calculating study start dates and end dates.
- If you are interested in how we defined study dates (e.g., index and end dates), these vary by cohort and are described in the protocol. We use the script [dataset_definition_dates](analysis/dataset_definition/dataset_definition_dates.py) to generate a dataset with all required dates for each cohort. This script imports all variables generated from [variables_dates](analysis/dataset_definition/variables_dates.py).
- If you are interested in how we defined our cohorts, we use the dataset definition script [dataset_definition_cohorts](analysis/dataset_definition/dataset_definition_cohorts.py) to define a function that generates cohorts. This script imports all variables generated from [variables_cohorts](analysis/dataset_definition/variables_cohorts.py) using the patient's index date, the cohort start date and the cohort end date. This approach is used to generate three cohorts: pre-vaccination, vaccinated, and unvaccinated—found in [dataset_definition_prevax](analysis/dataset_definition/dataset_definition_prevax.py), [dataset_definition_vax](analysis/dataset_definition/dataset_definition_vax.py), and [dataset_definition_unvax](analysis/dataset_definition/dataset_definition_unvax.py), respectively. For each cohort, the extracted data is initially processed in the preprocess data script [preprocess data script](analysis/preprocess/preprocess_data.R), which generates a flag variable for pre-existing respiratory conditions and restricts the data to relevant variables.
- This directory also contains all the R scripts that process, describe, and analyse the extracted data.

- The [active_analyses](lib/active_analyses.rds) contains a list of active analyses.

- The [`project.yaml`](./project.yaml) defines run-order and dependencies for all the analysis scripts. This file should not be edited directly. To make changes to the yaml, edit and run the [`create_project.R`](analysis/create_project.R) script which generates all the actions.
- The [`project.yaml`](./project.yaml) defines run-order and dependencies for all the analysis scripts. This file should not be edited directly. To make changes to the yaml, edit and run the [`create_project_actions.R`](analysis/create_project_actions.R) script which generates all the actions.

- Descriptive and Model outputs, including figures and tables are in the [`released_outputs`](./release_outputs) directory.

# About the OpenSAFELY framework

The OpenSAFELY framework is a Trusted Research Environment (TRE) for electronic
health records research in the NHS, with a focus on public accountability and
research quality.

Read more at [OpenSAFELY.org](https://opensafely.org).

# Licences
As standard, research projects have a MIT license.
37 changes: 34 additions & 3 deletions analysis/create_project_actions.R
Original file line number Diff line number Diff line change
Expand Up @@ -70,7 +70,7 @@ generate_study_population <- function(cohort){
comment(glue("Generate study population - {cohort}")),
action(
name = glue("generate_study_population_{cohort}"),
run = glue("ehrql:v1 generate-dataset analysis/dataset_definition_{cohort}.py --output output/input_{cohort}.csv.gz"),
run = glue("ehrql:v1 generate-dataset analysis/dataset_definition/dataset_definition_{cohort}.py --output output/input_{cohort}.csv.gz"),
needs = list("generate_dataset_index_dates"),
highly_sensitive = list(
cohort = glue("output/input_{cohort}.csv.gz")
Expand All @@ -79,6 +79,27 @@ generate_study_population <- function(cohort){
)
}

# Create function to preprocess data -------------------------------------------

preprocess_data <- function(cohort){
splice(
comment(glue("Preprocess data - {cohort}")),
action(
name = glue("preprocess_data_{cohort}"),
run = glue("r:latest analysis/preprocess/preprocess_data.R"),
arguments = c(cohort),
needs = list("generate_dataset_index_dates",glue("generate_study_population_{cohort}")),
moderately_sensitive = list(
describe = glue("output/describe_input_{cohort}_stage0.txt"),
describe_venn = glue("output/describe_venn_{cohort}.txt")
),
highly_sensitive = list(
cohort = glue("output/input_{cohort}.rds"),
venn = glue("output/venn_{cohort}.rds")
)
)
)
}

# Define and combine all actions into a list of actions ------------------------------0

Expand All @@ -98,7 +119,7 @@ actions_list <- splice(

action(
name = glue("vax_eligibility_inputs"),
run = "r:latest analysis/metadates.R",
run = "r:latest analysis/dataset_definition/metadates.R",
highly_sensitive = list(
study_dates_json = glue("output/study_dates.json")
)
Expand All @@ -109,7 +130,7 @@ actions_list <- splice(

action(
name = "generate_dataset_index_dates",
run = "ehrql:v1 generate-dataset analysis/dataset_definition_dates.py --output output/index_dates.csv.gz",
run = "ehrql:v1 generate-dataset analysis/dataset_definition/dataset_definition_dates.py --output output/index_dates.csv.gz",
needs = list("vax_eligibility_inputs"),
highly_sensitive = list(
dataset = glue("output/index_dates.csv.gz")
Expand All @@ -123,9 +144,19 @@ actions_list <- splice(
function(x) generate_study_population(cohort = x)),
recursive = FALSE
)
),

## Preprocess data -----------------------------------------------------------

splice(
unlist(lapply(cohorts,
function(x) preprocess_data(cohort = x)),
recursive = FALSE
)
)
)


# Combine actions into project list --------------------------------------------

project_list <- splice(
File renamed without changes.
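For reviewers who want to see what the new `preprocess_data()` generator in `create_project_actions.R` is expected to emit, here is a rough Python sketch of the resulting action entries. It is not the project's code. The cohort names (`prevax`, `vax`, `unvax`) are taken from the README's dataset definition file names, and the way the cohort argument is appended to the `run` command is an assumption about the repository's `action()` helper, which is not shown in this diff.

```python
# Sketch only: mirrors the names and paths used in preprocess_data() in
# create_project_actions.R. The cohort list and the "append the cohort to
# the run command" behaviour are assumptions, not confirmed by the diff.
COHORTS = ["prevax", "vax", "unvax"]


def preprocess_data_action(cohort):
    """Return one project.yaml-style action entry for a cohort."""
    return {
        f"preprocess_data_{cohort}": {
            "run": f"r:latest analysis/preprocess/preprocess_data.R {cohort}",
            "needs": [
                "generate_dataset_index_dates",
                f"generate_study_population_{cohort}",
            ],
            "outputs": {
                "moderately_sensitive": {
                    "describe": f"output/describe_input_{cohort}_stage0.txt",
                    "describe_venn": f"output/describe_venn_{cohort}.txt",
                },
                "highly_sensitive": {
                    "cohort": f"output/input_{cohort}.rds",
                    "venn": f"output/venn_{cohort}.rds",
                },
            },
        }
    }


# Combine the per-cohort entries, as the lapply/splice call does in R.
actions = {}
for cohort in COHORTS:
    actions.update(preprocess_data_action(cohort))

for name, spec in actions.items():
    print(name, "needs", spec["needs"])
```

Each preprocessing action depends on both the shared index dates and its own cohort extract, so the three cohorts can be preprocessed independently once their extracts exist.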
86 changes: 86 additions & 0 deletions analysis/dataset_definition/dataset_definition_cohorts.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
from ehrql import (
create_dataset,
)
# Bring table definitions from the TPP backend
from ehrql.tables.tpp import (
patients,
)

from ehrql.query_language import table_from_file, PatientFrame, Series

from datetime import date

# Create dataset

def generate_dataset(index_date, end_date_exp, end_date_out):
dataset = create_dataset()

dataset.define_population(
patients.date_of_birth.is_not_null()
)

# Configure dummy data

dataset.configure_dummy_data(population_size=1000)

# Import variables function

from variables_cohorts import generate_variables

variables = generate_variables(index_date, end_date_exp, end_date_out)

# Assign each variable to the dataset

for var_name, var_value in variables.items():
setattr(dataset, var_name, var_value)

# Extract date variables for later pipelines

@table_from_file("output/index_dates.csv.gz")

class index_dates(PatientFrame):
# Vaccine category and eligibility variables
vax_cat_jcvi_group = Series(str)
vax_date_eligible = Series(date)

# General COVID vaccination dates
vax_date_covid_1 = Series(date)
vax_date_covid_2 = Series(date)
vax_date_covid_3 = Series(date)

# Pfizer vaccine-specific dates
vax_date_Pfizer_1 = Series(date)
vax_date_Pfizer_2 = Series(date)
vax_date_Pfizer_3 = Series(date)

# AstraZeneca vaccine-specific dates
vax_date_AstraZeneca_1 = Series(date)
vax_date_AstraZeneca_2 = Series(date)
vax_date_AstraZeneca_3 = Series(date)

# Moderna vaccine-specific dates
vax_date_Moderna_1 = Series(date)
vax_date_Moderna_2 = Series(date)
vax_date_Moderna_3 = Series(date)

# Censoring date due to death
cens_date_death = Series(date)

# Mapping all variables from index_dates to the dataset
dataset.vax_cat_jcvi_group = index_dates.vax_cat_jcvi_group
dataset.vax_date_eligible = index_dates.vax_date_eligible
dataset.vax_date_covid_1 = index_dates.vax_date_covid_1
dataset.vax_date_covid_2 = index_dates.vax_date_covid_2
dataset.vax_date_covid_3 = index_dates.vax_date_covid_3
dataset.vax_date_Pfizer_1 = index_dates.vax_date_Pfizer_1
dataset.vax_date_Pfizer_2 = index_dates.vax_date_Pfizer_2
dataset.vax_date_Pfizer_3 = index_dates.vax_date_Pfizer_3
dataset.vax_date_AstraZeneca_1 = index_dates.vax_date_AstraZeneca_1
dataset.vax_date_AstraZeneca_2 = index_dates.vax_date_AstraZeneca_2
dataset.vax_date_AstraZeneca_3 = index_dates.vax_date_AstraZeneca_3
dataset.vax_date_Moderna_1 = index_dates.vax_date_Moderna_1
dataset.vax_date_Moderna_2 = index_dates.vax_date_Moderna_2
dataset.vax_date_Moderna_3 = index_dates.vax_date_Moderna_3
dataset.cens_date_death = index_dates.cens_date_death

return dataset
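A possible follow-up for the explicit column mapping at the end of `generate_dataset`: the fifteen `dataset.<name> = index_dates.<name>` assignments follow a regular naming pattern, so they could be driven by a generated list of names, using the same `setattr` approach the function already applies to `variables`. The sketch below uses plain-Python stand-ins (`SimpleNamespace`) rather than ehrQL objects, and the helper names `index_date_columns` and `copy_columns` are hypothetical. Whether this reads better than the explicit list is a judgment call for the authors.

```python
from types import SimpleNamespace

# Vaccine "products" as they appear in the column names of index_dates.
VACCINE_PRODUCTS = ["covid", "Pfizer", "AstraZeneca", "Moderna"]
DOSES = [1, 2, 3]


def index_date_columns():
    """Build the list of column names copied from index_dates."""
    names = ["vax_cat_jcvi_group", "vax_date_eligible"]
    names += [
        f"vax_date_{product}_{dose}"
        for product in VACCINE_PRODUCTS
        for dose in DOSES
    ]
    names.append("cens_date_death")
    return names


def copy_columns(source, target, names):
    """Assign each named attribute of source onto target."""
    for name in names:
        setattr(target, name, getattr(source, name))
    return target


# Stand-ins for the ehrQL table and dataset (illustration only).
index_dates = SimpleNamespace(**{name: f"<{name}>" for name in index_date_columns()})
dataset = copy_columns(index_dates, SimpleNamespace(), index_date_columns())

print(len(index_date_columns()), "columns copied")
```

The `PatientFrame` class body would still need to declare each `Series` explicitly; only the assignment block is shortened by this approach.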
Original file line number Diff line number Diff line change
Expand Up @@ -64,19 +64,19 @@
dataset.index_prevax = minimum_of(pandemic_start, pandemic_start)

dataset.end_prevax_exposure = minimum_of(
dataset.death_date, dataset.vax_date_covid_1, dataset.vax_date_eligible, all_eligible
dataset.cens_date_death, dataset.vax_date_covid_1, dataset.vax_date_eligible, all_eligible
)

dataset.end_prevax_outcome = minimum_of(
dataset.death_date, omicron_date
dataset.cens_date_death, omicron_date
)

dataset.index_vax = maximum_of(
dataset.vax_date_covid_2 + days(14),
delta_date
)
dataset.end_vax_exposure = minimum_of(
dataset.death_date, omicron_date
dataset.cens_date_death, omicron_date
)

dataset.end_vax_outcome = dataset.end_vax_exposure
Expand All @@ -86,8 +86,8 @@
delta_date
)
dataset.end_unvax_exposure = minimum_of(
dataset.death_date, omicron_date, dataset.vax_date_covid_1
dataset.cens_date_death, omicron_date, dataset.vax_date_covid_1
)
dataset.end_unvax_outcome = minimum_of(
dataset.death_date, omicron_date
dataset.cens_date_death, omicron_date
)
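To make the effect of the `cens_date_death` rename on the date logic easier to check, here is a small plain-Python sketch of how the end and index dates are derived for a single patient. It assumes that `minimum_of` and `maximum_of` disregard missing values, which should be confirmed against the ehrQL reference. The study dates and the patient record are invented for illustration and are not the project's values.

```python
from datetime import date, timedelta


def earliest(*dates):
    """Smallest non-missing date, or None if all are missing."""
    known = [d for d in dates if d is not None]
    return min(known) if known else None


def latest(*dates):
    """Largest non-missing date, or None if all are missing."""
    known = [d for d in dates if d is not None]
    return max(known) if known else None


# Hypothetical study dates (illustration only).
delta_date = date(2021, 6, 1)
omicron_date = date(2021, 12, 15)

# Hypothetical patient: alive throughout, vaccinated twice.
patient = {
    "cens_date_death": None,
    "vax_date_covid_1": date(2021, 3, 1),
    "vax_date_covid_2": date(2021, 5, 25),
}

# Unvaccinated follow-up stops at death, omicron, or first vaccination.
end_unvax_exposure = earliest(
    patient["cens_date_death"], omicron_date, patient["vax_date_covid_1"]
)

# Vaccinated follow-up starts 14 days after dose 2, but not before delta.
# A missing second dose would need separate handling before the addition.
index_vax = latest(patient["vax_date_covid_2"] + timedelta(days=14), delta_date)

print(end_unvax_exposure, index_vax)
```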
Original file line number Diff line number Diff line change
Expand Up @@ -19,4 +19,8 @@ class index_dates(PatientFrame):

# Create dataset

dataset = generate_dataset(index_date, end_date_exposure, end_date_outcome)
dataset = generate_dataset(index_date, end_date_exposure, end_date_outcome)

dataset.index_date = index_date
dataset.end_date_exposure = end_date_exposure
dataset.end_date_outcome = end_date_outcome
Original file line number Diff line number Diff line change
Expand Up @@ -19,4 +19,8 @@ class index_dates(PatientFrame):

# Create dataset

dataset = generate_dataset(index_date, end_date_exposure, end_date_outcome)
dataset = generate_dataset(index_date, end_date_exposure, end_date_outcome)

dataset.index_date = index_date
dataset.end_date_exposure = end_date_exposure
dataset.end_date_outcome = end_date_outcome
Original file line number Diff line number Diff line change
Expand Up @@ -19,4 +19,8 @@ class index_dates(PatientFrame):

# Create dataset

dataset = generate_dataset(index_date, end_date_exposure, end_date_outcome)
dataset = generate_dataset(index_date, end_date_exposure, end_date_outcome)

dataset.index_date = index_date
dataset.end_date_exposure = end_date_exposure
dataset.end_date_outcome = end_date_outcome
File renamed without changes.
Original file line number Diff line number Diff line change
Expand Up @@ -531,8 +531,7 @@ def generate_variables(index_date, end_date_exp, end_date_out):
),

## Covid_19 severity

sub_date_covid19_hospital = sub_date_covid19_hospital,

# case(*when_thens, otherwise=None) the conditions are evaluated in order https://docs.opensafely.org/ehrql/reference/language/#case
sub_cat_covid19_hospital = case(
when(
64 changes: 0 additions & 64 deletions analysis/dataset_definition_cohorts.py

This file was deleted.
