Preprocess data #43

ZoeMZou · 2024-12-02T16:31:34Z

I’ve completed the conversion of preprocess_data.R and modify_dummy_vax_data.R based on the mental health repository. The reason for the action test failure was that I deleted the template dataset-definition.py. I have a couple of points I’m uncertain about and would appreciate your input:

1. Data Frame Naming:
I think the data frame should be named "df" rather than "df_vax" in modify_dummy_vax_data.R to align with how it is called in preprocess_data.R?

post-covid-respiratory/analysis/preprocess/modify_dummy_vax_data.R

Lines 7 to 13 in 95555dc

    
    # Change first jab date so that they have roughly correct distribution
    df <- df %>%
      mutate(
        vax_date_Pfizer_1 = as.Date(vax_date_eligible) + days(round(rnorm(nrow(.), mean = 10, sd = 3))),
        vax_date_AstraZeneca_1 = as.Date(vax_date_eligible) + days(round(rnorm(nrow(.), mean = 10, sd = 3))),
        vax_date_Moderna_1 = as.Date(vax_date_eligible) + days(round(rnorm(nrow(.), mean = 10, sd = 3)))
      ) %>%

https://github.com/opensafely/post-covid-mentalhealth/blob/ddb50bd8fc9e4e1a9855dc9f9a03991487c908aa/analysis/modify_dummy_vax_data.R#L8
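For readers unfamiliar with the snippet above, here is a minimal base-R sketch of the same idea: shifting an eligibility date by a normally distributed number of days. The toy data frame, the eligibility date, and the seed are assumptions made for illustration only; they are not taken from the repository.

```r
# Sketch only: toy data, not the study's dummy dataset.
set.seed(1)  # assumed seed, so the illustration is reproducible

toy <- data.frame(
  patient_id = 1:1000,
  vax_date_eligible = as.Date("2021-01-01")  # hypothetical eligibility date
)

# Whole-day offsets drawn from Normal(mean = 10, sd = 3), as in the snippet above
offset_days <- round(rnorm(nrow(toy), mean = 10, sd = 3))

# Adding a number to a Date shifts it by that many days
toy$vax_date_Pfizer_1 <- toy$vax_date_eligible + offset_days

# Distribution of the simulated gap between eligibility and first dose
summary(as.numeric(toy$vax_date_Pfizer_1 - toy$vax_date_eligible))
```

The script in the PR uses lubridate's days() for the offset; plain numeric addition is used here only to keep the sketch dependency-free.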

2. Timing Code (tic/toc):
There is no tic() call before the toc() in the respiratory repo. Is this correct?

post-covid-respiratory/analysis/preprocess/preprocess_data.R

Line 158 in 95555dc

tictoc::toc()
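As background to this question, tictoc timers are normally used as a pair: tic() starts a timer and toc() stops the most recent one. The sketch below shows that pairing; the timer label and the Sys.sleep() placeholder are assumptions for illustration and do not come from preprocess_data.R.

```r
library(tictoc)  # assumes the tictoc package is installed

tic("preprocess")            # start a timer (the label is hypothetical)
Sys.sleep(0.1)               # stand-in for the real preprocessing work
timing <- toc(quiet = TRUE)  # stop the timer; start and stop times are returned

# Elapsed time in seconds between the matching tic() and toc()
elapsed <- unname(timing$toc - timing$tic)
message("Elapsed seconds: ", round(elapsed, 2))
```

If the script is meant to report its run time, a matching tic() near the top of the script would be the usual counterpart to the toc() on line 158.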

3. Death and Deregistration Dates:
I’ve removed the code for adding death_date and deregistration_date since they are already included in the dataset according to our ehrQL adjustment.

4. Rebase after merging the dataset definition branch:
The .yaml file will need to be revised to include actions for generating preprocessed data across the different cohorts. Could I revise it after the branch for dataset-definition.py has been merged?

Thanks for your review.

Best wishes,
Zoe

venexia

Hi @ZoeMZou - some further refinements needed, but otherwise looking good.

analysis/specify_paths.R

analysis/preprocess/preprocess_data.R

venexia · 2024-12-10T13:56:06Z

analysis/preprocess/preprocess_data.R

+
+message(paste0("Dataset has been read successfully with N = ", nrow(df), " rows"))
+
+# Format columns ---------------------------------------------------------------


Is this necessary given you read the data in with col_types specified? If so, it should use the vectors of variables by type rather than re-doing the grep.

True, should just use the vectors defined:

post-covid-respiratory/analysis/preprocess/preprocess_data.R

Lines 65 to 74 in 52a33bb

    # Format columns ---------------------------------------------------------------
    df <- df %>%
      mutate(across(all_of(date_cols),
                    ~ floor_date(as.Date(., format="%Y-%m-%d"), unit = "days")),
             across(contains('_birth_year'),
                    ~ format(as.Date(., origin = "1970-01-01"), "%Y")),
             across(all_of(num_cols), ~ as.numeric(.)),
             across(all_of(cat_cols), ~ as.factor(.)),
             across(all_of(bin_cols), ~ as.logical(.)))

Does it change any variables though as you read the data in with col_types specified? It might only be the dates that need updating here.

Agreed. Apart from the variables in date_cols, there is no need to update the num_, cat_ or bin_ columns.

I have a question here for _birth_year, taking qa_num_birth_year for example:

We generated it in the variables_cohort.py: qa_num_birth_year=patients.date_of_birth.year

qa_num_birth_year will be converted to numeric when identifying column classes:
num_cols <- c(grep("_num", all_cols, value = TRUE)

This code then converts qa_num_birth_year to a string:
across(contains('_birth_year'), ~ format(as.Date(., origin = "1970-01-01"), "%Y")),

Q: Do we really use qa_num_birth_year as a string, or should it be numeric, as the name implies?

across(contains('_birth_year'), ~ format(as.Date(., origin = "1970-01-01"), "%Y")) makes it a date with just the year displayed. I think we probably want to keep this line.
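To make the question above concrete, here is a small base-R check of what that expression returns for a single value. It assumes the column arrives as a numeric year (the value 1980 is hypothetical); if the column is stored differently in the real dataset, the behaviour will differ.

```r
birth_year <- 1980  # hypothetical value, assuming the column holds a numeric year

# as.Date() treats a number as days since the origin, not as a calendar year
as_date <- as.Date(birth_year, origin = "1970-01-01")

# format() returns a character string, not a Date
formatted <- format(as_date, "%Y")

print(as_date)    # 1975-06-04
class(formatted)  # "character"
print(formatted)  # "1975"
```

Under that assumption, the result is a character value, and the year it shows is not the year that was passed in, so it may be worth checking the column against the raw extract.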

Restored:

post-covid-respiratory/analysis/preprocess/preprocess_data.R

Lines 66 to 70 in aca510e

    df <- df %>%
      mutate(across(all_of(date_cols),
                    ~ floor_date(as.Date(., format="%Y-%m-%d"), unit = "days")),
             across(contains('_birth_year'),
                    ~ format(as.Date(., origin = "1970-01-01"), "%Y")))

analysis/preprocess/preprocess_data.R

ZoeMZou · 2024-12-12T15:03:49Z

Hi @venexia,

Thank you so much! I’ve completed the revisions based on your comments and the revision we made in the study-definition branch. Shall I rebase now, before further revision, so I can:

1. test whether the current code works correctly with the generated dataset, and
2. add the corresponding actions in the YAML file for this branch?

Thanks
Zoe

venexia · 2024-12-12T16:17:54Z

Hi @ZoeMZou - yes, please rebase now so we can check the pipeline runs from start to this point successfully. Sorry - I started reviewing and then got to a comment about rebase but will pick this up once you have rebased and sorted any merge conflicts.

The vax_date_eligible variable will be used to modify vaccination dummy data and include other useful variables in the cohort dataset. Index dates for different cohorts will be extracted in their respective scripts, rather than being defined here. While the values for these index dates will differ across cohorts, their variable names should remain the same in each cohort: index_date end_date_exposure end_date_outcome

Extract index dates to each cohort: index_date end_date_exposure end_date_outcome

They have been renamed and classed into either inex_ or cens_

ZoeMZou · 2024-12-13T12:42:15Z

> Hi @ZoeMZou - yes, please rebase now so we can check the pipeline runs from start to this point successfully. Sorry - I started reviewing and then got to a comment about rebase but will pick this up once you have rebased and sorted any merge conflicts.

Hi @venexia - I have rebased this PR, and the whole pipeline now runs successfully from the start to this point. In addition to the revisions I made in response to your comments above, I also made the following changes across scripts to get them running, and fixed some bugs that came up along the way:

Renaming of censoring variables:

death_date is now cens_date_death in all scripts.

Because we renamed the inex_ and cens_ variables, I have also revised how they are selected in preprocess_data.R. For example, this section is tidier now that variables already covered by the purpose-prefix groups have been removed:

post-covid-respiratory/analysis/preprocess/preprocess_data.R

Lines 91 to 107 in 632fda9

    
    # Restrict columns and save analysis dataset -----------------------------------
    df1 <- df %>%
      select(patient_id,
             starts_with("index_date"),
             starts_with("end_date_"),
             contains("sub_"), # Subgroups
             contains("exp_"), # Exposures
             contains("out_"), # Outcomes
             contains("cov_"), # Covariates
             contains("inex_"), # Inclusion/exclusion
             contains("cens_"), # Censor
             contains("qa_"), # Quality assurance
             contains("vax_date_eligible"), # Vaccination eligibility
             contains("vax_date_"), # Vaccination dates and vax type
             contains("vax_cat_") # Vaccination products
      )
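One detail of the selection above may be worth illustrating: contains("vax_date_") already matches vax_date_eligible, so the two helpers overlap. The sketch below, on a made-up tibble whose column names follow the conventions in this PR, shows how dplyr handles such an overlap; the toy values are assumptions for illustration only.

```r
library(dplyr)  # assumes dplyr is installed

# Toy data with hypothetical values; only the column names matter here
toy <- tibble(
  patient_id = 1:2,
  vax_date_eligible = as.Date("2021-01-01"),
  vax_date_covid_1 = as.Date("2021-01-15"),
  vax_cat_jcvi_group = c("01", "02"),
  cov_num_age = c(40, 65)
)

selected <- toy %>%
  select(patient_id,
         contains("cov_"),              # Covariates
         contains("vax_date_eligible"), # Vaccination eligibility
         contains("vax_date_"),         # Overlaps with the helper above
         contains("vax_cat_"))          # Vaccination products

# Each column appears once, in order of first match
names(selected)
```

Because a column matched by several helpers is kept only once, the separate contains("vax_date_eligible") line mainly documents intent and fixes the column order; it does not change which columns are kept.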

I am not sure whether we need the following vax_date_ and vax_cat_ variables in our final dataset, but they were not previously extracted from the dates dataset, so I revised the code to extract them:

post-covid-respiratory/analysis/preprocess/preprocess_data.R

Lines 105 to 106 in 1daf7f7

    contains("vax_date_"), # Vaccination dates and vax type
    contains("vax_cat_") # Vaccination products

post-covid-respiratory/analysis/dataset_definition_cohorts.py

Lines 37 to 84 in 1daf7f7

    
    # Extract date variables for later pipelines
    @table_from_file("output/index_dates.csv.gz")
    class index_dates(PatientFrame):
        # Vaccine category and eligibility variables
        vax_cat_jcvi_group = Series(str)
        vax_date_eligible = Series(date)
        # General COVID vaccination dates
        vax_date_covid_1 = Series(date)
        vax_date_covid_2 = Series(date)
        vax_date_covid_3 = Series(date)
        # Pfizer vaccine-specific dates
        vax_date_Pfizer_1 = Series(date)
        vax_date_Pfizer_2 = Series(date)
        vax_date_Pfizer_3 = Series(date)
        # AstraZeneca vaccine-specific dates
        vax_date_AstraZeneca_1 = Series(date)
        vax_date_AstraZeneca_2 = Series(date)
        vax_date_AstraZeneca_3 = Series(date)
        # Moderna vaccine-specific dates
        vax_date_Moderna_1 = Series(date)
        vax_date_Moderna_2 = Series(date)
        vax_date_Moderna_3 = Series(date)
        # Censoring date due to death
        cens_date_death = Series(date)

    # Mapping all variables from index_dates to the dataset
    dataset.vax_cat_jcvi_group = index_dates.vax_cat_jcvi_group
    dataset.vax_date_eligible = index_dates.vax_date_eligible
    dataset.vax_date_covid_1 = index_dates.vax_date_covid_1
    dataset.vax_date_covid_2 = index_dates.vax_date_covid_2
    dataset.vax_date_covid_3 = index_dates.vax_date_covid_3
    dataset.vax_date_Pfizer_1 = index_dates.vax_date_Pfizer_1
    dataset.vax_date_Pfizer_2 = index_dates.vax_date_Pfizer_2
    dataset.vax_date_Pfizer_3 = index_dates.vax_date_Pfizer_3
    dataset.vax_date_AstraZeneca_1 = index_dates.vax_date_AstraZeneca_1
    dataset.vax_date_AstraZeneca_2 = index_dates.vax_date_AstraZeneca_2
    dataset.vax_date_AstraZeneca_3 = index_dates.vax_date_AstraZeneca_3
    dataset.vax_date_Moderna_1 = index_dates.vax_date_Moderna_1
    dataset.vax_date_Moderna_2 = index_dates.vax_date_Moderna_2
    dataset.vax_date_Moderna_3 = index_dates.vax_date_Moderna_3
    dataset.cens_date_death = index_dates.cens_date_death

PS: We definitely need vax_date_eligible to modify the dummy vaccination dates in preprocess_data.R, so I extracted it here to make the code run.

I also spotted a mistake in the names of the index dates in each cohort. They should be consistent across cohorts (index_date, end_date_exposure, end_date_outcome), but previously they were cohort-specific, for example: index_prevax, end_prevax_exposure, end_prevax_outcome.

post-covid-respiratory/analysis/preprocess/preprocess_data.R

Lines 93 to 96 in 632fda9

    
    df1 <- df %>%
      select(patient_id,
             starts_with("index_date"),
             starts_with("end_date_"),

Please let me know whether these are the right decisions. Happy to revise.

Best wishes
Zoe

venexia

Hi @ZoeMZou. This looks good. I think we need to keep the formatting of qa_num_birth_year as it was but otherwise, this is ready. To answer your question re the vax variables - these are needed to define the vax and unvax cohorts in the next step of the pipeline.

venexia · 2024-12-16T16:22:44Z

[Optional] Also if you are going with a directory approach (like you have for preprocessing) then perhaps you should have one for the dataset definition for consistency?

ZoeMZou · 2024-12-16T17:20:38Z

> [Optional] Also if you are going with a directory approach (like you have for preprocessing) then perhaps you should have one for the dataset definition for consistency?

Thanks @venexia for your super prompt review.

I kept the formatting of qa_num_birth_year as it was:

post-covid-respiratory/analysis/preprocess/preprocess_data.R

Lines 66 to 70 in aca510e

    
    df <- df %>%
      mutate(across(all_of(date_cols),
                    ~ floor_date(as.Date(., format="%Y-%m-%d"), unit = "days")),
             across(contains('_birth_year'),
                    ~ format(as.Date(., origin = "1970-01-01"), "%Y")))

I moved all scripts generated from the dataset_definition PR to the dataset_definition folder, except for create_project_actions.R. I left this outside as it will need to be updated in subsequent PRs and doesn’t exclusively belong to dataset_definition.

PS: I’ve also updated the README, create_project_actions.R, and project.yaml to correct the directory for the scripts.

Best wishes,

Zoe

venexia

Hi @ZoeMZou - all approved! Note: I moved utility.R and active_analyses.R back up a directory as utility.R is called across the pipeline and active_analyses.R is the complement to create_project_actions.R. Hope this is okay!

ZoeMZou · 2024-12-17T09:33:42Z

> Hi @ZoeMZou - all approved! Note: I moved utility.R and active_analyses.R back up a directory as utility.R is called across the pipeline and active_analyses.R is the complement to create_project_actions.R. Hope this is okay!

Thank you so much, @venexia. Yes, that totally makes sense.
Zoe

ZoeMZou requested a review from venexia December 2, 2024 16:31

venexia requested changes Dec 10, 2024

venexia linked an issue Dec 12, 2024 that may be closed by this pull request

working directory for active analyses.R #56

Closed

ZoeMZou requested a review from venexia December 12, 2024 15:23

ZoeMZou added 13 commits December 12, 2024 16:35

Create preprocess_data.R

fc8a4f6

Create modify_dummy_vax_data.R

641e0bc

Update modify_dummy_vax_data.R

9c47b0d

Update preprocess_data.R

00a8c03

Create specify_paths.R

50364ae

Update preprocess_data.R

7460ad3

Delete specify_paths.R

3b0c18d

Create specify_paths_example.R

20aa35b

Update preprocess_data.R

44e3d89

Delete specify_paths_example.R

6c1d513

Create post-covid-respiratory.Rproj

5348c85

Rproject

9c134bb

Update preprocess_data.R

2c3d57a

ZoeMZou force-pushed the preprocess-data branch from 4c033f5 to 2c3d57a Compare December 12, 2024 16:36

ZoeMZou added 10 commits December 12, 2024 17:51

Update create_project_actions.R

24ffc94

Update dataset_definition_cohorts.py

41ec521

Update preprocess_data.R

64ba7df

Update variables_cohorts.py

149424c

Update variables_dates.py

1e8e677

Update project actions

74082ab

Rename death_date

18b85e2

Extract index dates to each cohort

32ae8be

Extract index dates to each cohort: index_date end_date_exposure end_date_outcome

Update preprocess_data.R

f9e6c89

They have been renamed and classed into either inex_ or cens_

ZoeMZou added 2 commits December 13, 2024 12:20

Update variables_dates.py

1daf7f7

Update preprocess_data.R

632fda9

ZoeMZou added 3 commits December 13, 2024 13:49

Update YAML

468cd66

Update preprocess_data.R

ea76977

Update README.md

9bff56f

venexia requested changes Dec 16, 2024

ZoeMZou added 3 commits December 16, 2024 16:32

Update preprocess_data.R

0fe6452

Update preprocess_data.R

aca510e

Move scripts to directory for dataset_definition

0cf4af3

ZoeMZou requested a review from venexia December 16, 2024 17:12

Move utility and active_analyses up a directory

ff24a27

venexia approved these changes Dec 17, 2024

ZoeMZou merged commit 846cc38 into main Dec 17, 2024
1 check failed

ZoeMZou deleted the preprocess-data branch December 17, 2024 09:34
