opensafely · ZoeMZou · Dec 17, 2024 · Nov 11, 2024 · Dec 2, 2024 · Dec 2, 2024
diff --git a/README.md b/README.md
@@ -9,30 +9,30 @@ This repository may reflect an incomplete or incorrect analysis with no further
 The content has ONLY been made public to support the OpenSAFELY [open science and transparency principles](https://www.opensafely.org/about/#contributing-to-best-practice-around-open-science) and to support the sharing of re-usable code for other subsequent users.
 No clinical, policy or safety conclusions must be drawn from the contents of this repository.
 
-# About the OpenSAFELY framework
-
-The OpenSAFELY framework is a Trusted Research Environment (TRE) for electronic
-health records research in the NHS, with a focus on public accountability and
-research quality.
-
-Read more at [OpenSAFELY.org](https://opensafely.org).
-
-# Licences
-As standard, research projects have a MIT license. 
-
 ## Repository navigation
 
 -   If you are interested in how we defined our code lists, look in the [`codelists`](./codelists) folder.
 
 -   Analyses scripts are in the [`analysis`](./analysis) directory:
 
-    -   If you are interested in how we defined our variables, we use the variable script [variable_helper_fuctions](analysis/variable_helper_functions.py) to define functions that generate variables. We then apply these functions in [variables_cohorts](analysis/variables_cohorts.py) to create a dictionary of variables for cohort definitions, and in [variables_dates](analysis/variables_dates.py) to create a dictionary of variables for calculating study start dates and end dates.
-    -   If you are interested in how we defined study dates (e.g., index and end dates), these vary by cohort and are described in the protocol. We use the script [dataset_definition_dates](analysis/dataset_definition_dates.py) to generate a dataset with all required dates for each cohort. This script imported all variables generated from [variables_dates](analysis/variables_dates.py).
-    -   If you are interested in how we defined our cohorts, we use the dataset definition script [dataset_definition_cohorts](analysis/dataset_definition_cohorts.py) to define a function that generates cohorts. This script imports all variables generated from [variables_cohorts](analysis/variables_cohorts.py) using the patient's index date, the cohort start date and the cohort end date. This approach is used to generate three cohorts: pre-vaccination, vaccinated, and unvaccinated—found in [dataset_definition_prevax](analysis/dataset_definition_prevax.py), [dataset_definition_vax](analysis/dataset_definition_vax.py), and [dataset_definition_unvax](analysis/dataset_definition_unvax.py), respectively. For each cohort, the extracted data is initially processed in the preprocess data script [preprocess data script](analysis/preprocess_data.R), which generates a flag variable for pre-existing respiratory conditions and restricts the data to relevant variables.
+    -   If you are interested in how we defined our variables, we use the variable script [variable_helper_functions](analysis/dataset_definition/variable_helper_functions.py) to define functions that generate variables. We then apply these functions in [variables_cohorts](analysis/dataset_definition/variables_cohorts.py) to create a dictionary of variables for cohort definitions, and in [variables_dates](analysis/dataset_definition/variables_dates.py) to create a dictionary of variables for calculating study start dates and end dates.
+    -   If you are interested in how we defined study dates (e.g., index and end dates), these vary by cohort and are described in the protocol. We use the script [dataset_definition_dates](analysis/dataset_definition/dataset_definition_dates.py) to generate a dataset with all required dates for each cohort. This script imports all variables generated from [variables_dates](analysis/dataset_definition/variables_dates.py).
+    -   If you are interested in how we defined our cohorts, we use the dataset definition script [dataset_definition_cohorts](analysis/dataset_definition/dataset_definition_cohorts.py) to define a function that generates cohorts. This script imports all variables generated from [variables_cohorts](analysis/dataset_definition/variables_cohorts.py) using the patient's index date, the cohort start date and the cohort end date. This approach is used to generate three cohorts (pre-vaccination, vaccinated, and unvaccinated), found in [dataset_definition_prevax](analysis/dataset_definition/dataset_definition_prevax.py), [dataset_definition_vax](analysis/dataset_definition/dataset_definition_vax.py), and [dataset_definition_unvax](analysis/dataset_definition/dataset_definition_unvax.py), respectively. For each cohort, the extracted data is initially processed in the [preprocess data script](analysis/preprocess/preprocess_data.R), which generates a flag variable for pre-existing respiratory conditions and restricts the data to relevant variables.
     -   This directory also contains all the R scripts that process, describe, and analyse the extracted data.
 
 -   The [active_analyses](lib/active_analyses.rds) contains a list of active analyses.
 
--   The [`project.yaml`](./project.yaml) defines run-order and dependencies for all the analysis scripts. This file should not be edited directly. To make changes to the yaml, edit and run the [`create_project.R`](analysis/create_project.R) script which generates all the actions.
+-   The [`project.yaml`](./project.yaml) defines run-order and dependencies for all the analysis scripts. This file should not be edited directly. To make changes to the yaml, edit and run the [`create_project_actions.R`](analysis/create_project_actions.R) script which generates all the actions.
 
 -   Descriptive and Model outputs, including figures and tables are in the [`released_outputs`](./release_outputs) directory.
+
+# About the OpenSAFELY framework
+
+The OpenSAFELY framework is a Trusted Research Environment (TRE) for electronic
+health records research in the NHS, with a focus on public accountability and
+research quality.
+
+Read more at [OpenSAFELY.org](https://opensafely.org).
+
+# Licences
+As standard, research projects have an MIT license.
diff --git a/analysis/create_project_actions.R b/analysis/create_project_actions.R
@@ -70,7 +70,7 @@ generate_study_population <- function(cohort){
     comment(glue("Generate study population - {cohort}")),
     action(
       name = glue("generate_study_population_{cohort}"),
-      run = glue("ehrql:v1 generate-dataset analysis/dataset_definition_{cohort}.py --output output/input_{cohort}.csv.gz"),
+      run = glue("ehrql:v1 generate-dataset analysis/dataset_definition/dataset_definition_{cohort}.py --output output/input_{cohort}.csv.gz"),
       needs = list("generate_dataset_index_dates"),
       highly_sensitive = list(
         cohort = glue("output/input_{cohort}.csv.gz")
@@ -79,6 +79,27 @@ generate_study_population <- function(cohort){
   )
 }
 
+# Create function to preprocess data -------------------------------------------
+
+preprocess_data <- function(cohort){
+  splice(
+    comment(glue("Preprocess data - {cohort}")),
+    action(
+      name = glue("preprocess_data_{cohort}"),
+      run = glue("r:latest analysis/preprocess/preprocess_data.R"),
+      arguments = c(cohort),
+      needs = list("generate_dataset_index_dates",glue("generate_study_population_{cohort}")),
+      moderately_sensitive = list(
+        describe = glue("output/describe_input_{cohort}_stage0.txt"),
+        describe_venn = glue("output/describe_venn_{cohort}.txt")
+      ),
+      highly_sensitive = list(
+        cohort = glue("output/input_{cohort}.rds"),
+        venn = glue("output/venn_{cohort}.rds")
+      )
+    )
+  )
+}
 
 # Define and combine all actions into a list of actions ------------------------------0
 
@@ -98,7 +119,7 @@ actions_list <- splice(
 
   action(
     name = glue("vax_eligibility_inputs"),
-    run = "r:latest analysis/metadates.R",
+    run = "r:latest analysis/dataset_definition/metadates.R",
     highly_sensitive = list(
       study_dates_json = glue("output/study_dates.json")
     )
@@ -109,7 +130,7 @@ actions_list <- splice(
 
   action(
     name = "generate_dataset_index_dates",
-    run = "ehrql:v1 generate-dataset analysis/dataset_definition_dates.py --output output/index_dates.csv.gz",
+    run = "ehrql:v1 generate-dataset analysis/dataset_definition/dataset_definition_dates.py --output output/index_dates.csv.gz",
     needs = list("vax_eligibility_inputs"),
     highly_sensitive = list(
       dataset = glue("output/index_dates.csv.gz")
@@ -123,9 +144,19 @@ actions_list <- splice(
                   function(x) generate_study_population(cohort = x)), 
            recursive = FALSE
     )
+  ),
+
+  ## Preprocess data -----------------------------------------------------------
+
+  splice(
+    unlist(lapply(cohorts, 
+                  function(x) preprocess_data(cohort = x)), 
+           recursive = FALSE
+    )
   )
 )
 
+
 # Combine actions into project list --------------------------------------------
 
 project_list <- splice(

diff --git a/analysis/codelists.py → analysis/dataset_definition/codelists.py b/analysis/codelists.py → analysis/dataset_definition/codelists.py
diff --git a/analysis/dataset_definition/dataset_definition_cohorts.py b/analysis/dataset_definition/dataset_definition_cohorts.py
@@ -0,0 +1,86 @@
+from ehrql import (
+    create_dataset,
+)
+# Bring table definitions from the TPP backend 
+from ehrql.tables.tpp import ( 
+    patients, 
+)
+
+from ehrql.query_language import table_from_file, PatientFrame, Series
+
+from datetime import date
+
+# Create dataset
+
+def generate_dataset(index_date, end_date_exp, end_date_out):
+    dataset = create_dataset()
+
+    dataset.define_population(
+        patients.date_of_birth.is_not_null()
+    )
+
+    # Configure dummy data
+
+    dataset.configure_dummy_data(population_size=1000)
+
+    # Import variables function
+
+    from variables_cohorts import generate_variables
+
+    variables = generate_variables(index_date, end_date_exp, end_date_out)
+
+    # Assign each variable to the dataset
+
+    for var_name, var_value in variables.items():
+        setattr(dataset, var_name, var_value)
+
+    # Extract date variables for later pipelines
+
+    # Declare the columns to read from the index dates file
+    @table_from_file("output/index_dates.csv.gz")
+    class index_dates(PatientFrame):
+        # Vaccine category and eligibility variables
+        vax_cat_jcvi_group = Series(str)
+        vax_date_eligible = Series(date)
+
+        # General COVID vaccination dates
+        vax_date_covid_1 = Series(date)
+        vax_date_covid_2 = Series(date)
+        vax_date_covid_3 = Series(date)
+
+        # Pfizer vaccine-specific dates
+        vax_date_Pfizer_1 = Series(date)
+        vax_date_Pfizer_2 = Series(date)
+        vax_date_Pfizer_3 = Series(date)
+
+        # AstraZeneca vaccine-specific dates
+        vax_date_AstraZeneca_1 = Series(date)
+        vax_date_AstraZeneca_2 = Series(date)
+        vax_date_AstraZeneca_3 = Series(date)
+
+        # Moderna vaccine-specific dates
+        vax_date_Moderna_1 = Series(date)
+        vax_date_Moderna_2 = Series(date)
+        vax_date_Moderna_3 = Series(date)
+
+        # Censoring date due to death
+        cens_date_death = Series(date)
+
+    # Mapping all variables from index_dates to the dataset
+    dataset.vax_cat_jcvi_group = index_dates.vax_cat_jcvi_group
+    dataset.vax_date_eligible = index_dates.vax_date_eligible
+    dataset.vax_date_covid_1 = index_dates.vax_date_covid_1
+    dataset.vax_date_covid_2 = index_dates.vax_date_covid_2
+    dataset.vax_date_covid_3 = index_dates.vax_date_covid_3
+    dataset.vax_date_Pfizer_1 = index_dates.vax_date_Pfizer_1
+    dataset.vax_date_Pfizer_2 = index_dates.vax_date_Pfizer_2
+    dataset.vax_date_Pfizer_3 = index_dates.vax_date_Pfizer_3
+    dataset.vax_date_AstraZeneca_1 = index_dates.vax_date_AstraZeneca_1
+    dataset.vax_date_AstraZeneca_2 = index_dates.vax_date_AstraZeneca_2
+    dataset.vax_date_AstraZeneca_3 = index_dates.vax_date_AstraZeneca_3
+    dataset.vax_date_Moderna_1 = index_dates.vax_date_Moderna_1
+    dataset.vax_date_Moderna_2 = index_dates.vax_date_Moderna_2
+    dataset.vax_date_Moderna_3 = index_dates.vax_date_Moderna_3
+    dataset.cens_date_death = index_dates.cens_date_death
+
+    return dataset
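
The `generate_dataset` function above attaches every entry of a variable dictionary to the dataset with `setattr`. A minimal plain-Python sketch of that pattern follows. It does not use ehrQL: `StandInDataset`, the example variable names, and the string values are hypothetical stand-ins chosen for illustration, not the contents of `variables_cohorts.py`.

```python
class StandInDataset:
    """Hypothetical stand-in for an ehrQL dataset; it only stores attributes."""


def generate_variables(index_date):
    # Hypothetical dictionary; the real one is built in variables_cohorts.py
    return {
        "cov_num_age": f"age at {index_date}",
        "cov_cat_sex": "sex",
    }


def assign_variables(dataset, variables):
    # Same loop as in generate_dataset: one dataset attribute per dictionary key
    for var_name, var_value in variables.items():
        setattr(dataset, var_name, var_value)
    return dataset


dataset = assign_variables(StandInDataset(), generate_variables("2020-01-01"))
print(sorted(vars(dataset)))
```

Keeping the variables in a dictionary means the three cohort scripts can share one definition and differ only in the dates they pass in.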
diff --git a/analysis/dataset_definition_dates.py → ...et_definition/dataset_definition_dates.py b/analysis/dataset_definition_dates.py → ...et_definition/dataset_definition_dates.py
@@ -64,19 +64,19 @@
 dataset.index_prevax = minimum_of(pandemic_start, pandemic_start)
 
 dataset.end_prevax_exposure = minimum_of(
-    dataset.death_date, dataset.vax_date_covid_1, dataset.vax_date_eligible, all_eligible
+    dataset.cens_date_death, dataset.vax_date_covid_1, dataset.vax_date_eligible, all_eligible
 )
 
 dataset.end_prevax_outcome = minimum_of(
-    dataset.death_date, omicron_date
+    dataset.cens_date_death, omicron_date
 )
 
 dataset.index_vax = maximum_of(
     dataset.vax_date_covid_2 + days(14),
     delta_date
 )
 dataset.end_vax_exposure = minimum_of(
-    dataset.death_date, omicron_date
+    dataset.cens_date_death, omicron_date
 )
 
 dataset.end_vax_outcome = dataset.end_vax_exposure
@@ -86,8 +86,8 @@
     delta_date
 )
 dataset.end_unvax_exposure = minimum_of(
-    dataset.death_date, omicron_date, dataset.vax_date_covid_1
+    dataset.cens_date_death, omicron_date, dataset.vax_date_covid_1
 )
 dataset.end_unvax_outcome = minimum_of(
-    dataset.death_date, omicron_date
+    dataset.cens_date_death, omicron_date
 )
diff --git a/analysis/dataset_definition_prevax.py → ...t_definition/dataset_definition_prevax.py b/analysis/dataset_definition_prevax.py → ...t_definition/dataset_definition_prevax.py
@@ -19,4 +19,8 @@ class index_dates(PatientFrame):
 
 # Create dataset
 
-dataset = generate_dataset(index_date, end_date_exposure, end_date_outcome)
+dataset = generate_dataset(index_date, end_date_exposure, end_date_outcome)
+
+dataset.index_date = index_date
+dataset.end_date_exposure = end_date_exposure
+dataset.end_date_outcome = end_date_outcome
diff --git a/analysis/dataset_definition_unvax.py → ...et_definition/dataset_definition_unvax.py b/analysis/dataset_definition_unvax.py → ...et_definition/dataset_definition_unvax.py
@@ -19,4 +19,8 @@ class index_dates(PatientFrame):
 
 # Create dataset
 
-dataset = generate_dataset(index_date, end_date_exposure, end_date_outcome)
+dataset = generate_dataset(index_date, end_date_exposure, end_date_outcome)
+
+dataset.index_date = index_date
+dataset.end_date_exposure = end_date_exposure
+dataset.end_date_outcome = end_date_outcome
diff --git a/analysis/dataset_definition_vax.py → ...aset_definition/dataset_definition_vax.py b/analysis/dataset_definition_vax.py → ...aset_definition/dataset_definition_vax.py
@@ -19,4 +19,8 @@ class index_dates(PatientFrame):
 
 # Create dataset
 
-dataset = generate_dataset(index_date, end_date_exposure, end_date_outcome)
+dataset = generate_dataset(index_date, end_date_exposure, end_date_outcome)
+
+dataset.index_date = index_date
+dataset.end_date_exposure = end_date_exposure
+dataset.end_date_outcome = end_date_outcome
diff --git a/analysis/metadates.R → analysis/dataset_definition/metadates.R b/analysis/metadates.R → analysis/dataset_definition/metadates.R
diff --git a/analysis/variable_helper_functions.py → ...t_definition/variable_helper_functions.py b/analysis/variable_helper_functions.py → ...t_definition/variable_helper_functions.py
diff --git a/analysis/variables_cohorts.py → ...s/dataset_definition/variables_cohorts.py b/analysis/variables_cohorts.py → ...s/dataset_definition/variables_cohorts.py
@@ -531,8 +531,7 @@ def generate_variables(index_date, end_date_exp, end_date_out):
         ),
 
     ## Covid_19 severity
-
-        sub_date_covid19_hospital = sub_date_covid19_hospital,
+
         # case(*when_thens, otherwise=None) the conditions are evaluated in order https://docs.opensafely.org/ehrql/reference/language/#case
         sub_cat_covid19_hospital = case(
             when(

diff --git a/analysis/variables_dates.py → ...sis/dataset_definition/variables_dates.py b/analysis/variables_dates.py → ...sis/dataset_definition/variables_dates.py
diff --git a/analysis/dataset_definition_cohorts.py b/analysis/dataset_definition_cohorts.py