Preprocess data #43

Merged
ZoeMZou merged 32 commits into main from preprocess-data on Dec 17, 2024

ZoeMZou commented Dec 2, 2024

Hi @venexia,

I’ve completed the conversion of preprocess_data.R and modify_dummy_vax_data.R based on the mental health repository. The reason for the action test failure was that I deleted the template dataset-definition.py. I have a couple of points I’m uncertain about and would appreciate your input:

1. Data Frame Naming:
I think the data frame should be named "df" rather than "df_vax" in modify_dummy_vax_data.R to align with how it is called in preprocess_data.R?

# Change first jab date so that they have roughly correct distribution
df <- df %>%
  mutate(
    vax_date_Pfizer_1 = as.Date(vax_date_eligible) + days(round(rnorm(nrow(.), mean = 10, sd = 3))),
    vax_date_AstraZeneca_1 = as.Date(vax_date_eligible) + days(round(rnorm(nrow(.), mean = 10, sd = 3))),
    vax_date_Moderna_1 = as.Date(vax_date_eligible) + days(round(rnorm(nrow(.), mean = 10, sd = 3)))
  ) %>%

https://github.com/opensafely/post-covid-mentalhealth/blob/ddb50bd8fc9e4e1a9855dc9f9a03991487c908aa/analysis/modify_dummy_vax_data.R#L8
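
For readers less familiar with the R snippet above, here is a minimal Python sketch of the same idea: each first-dose date is set to the eligibility date plus a rounded draw from a normal distribution with a mean of 10 days and an SD of 3 days. The function name `simulate_first_dose`, the seed and the example eligibility date are illustrative assumptions, not part of the repository.

```python
import random
from datetime import date, timedelta


def simulate_first_dose(eligible, rng, mean_days=10, sd_days=3):
    """Return the eligibility date shifted by a rounded normal draw (in days)."""
    offset = round(rng.gauss(mean_days, sd_days))
    return eligible + timedelta(days=offset)


rng = random.Random(2024)  # fixed seed so the dummy data are reproducible
eligible = date(2021, 1, 18)  # hypothetical eligibility date
first_doses = [simulate_first_dose(eligible, rng) for _ in range(1000)]

# Average shift in days; should sit close to mean_days
offsets = [(d - eligible).days for d in first_doses]
mean_offset = sum(offsets) / len(offsets)
```

Because the normal distribution is unbounded, a rare draw can land before the eligibility date; the R code has the same property.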

2. Timing Code (tic/toc):
There is no tic() call before the toc() in the respiratory repo. Is this correct?

3. Death and Deregistration Dates:
I’ve removed the code for adding death_date and deregistration_date since they are already included in the dataset according to our ehrQL adjustment.

4. Rebasing After the Dataset Definition Branch Is Merged:
The .yaml file will need to be revised to include actions that generate the preprocessed data for the different cohorts. Could I revise it after the dataset-definition.py branch has been merged?

Thanks for your review.

Best wishes,
Zoe

ZoeMZou requested a review from venexia December 2, 2024 16:31

venexia left a comment

Hi @ZoeMZou - some further refinements needed, but otherwise looking good.

analysis/specify_paths.R (outdated, resolved)
analysis/preprocess/preprocess_data.R (resolved)
analysis/preprocess/preprocess_data.R (outdated, resolved)
analysis/preprocess/preprocess_data.R (resolved)

message(paste0("Dataset has been read successfully with N = ", nrow(df), " rows"))

# Format columns ---------------------------------------------------------------
venexia commented:

Is this necessary given you read the data in with col_types specified? If so, it should use the vectors of variables by type rather than re-doing the grep.

ZoeMZou commented:

True, should just use the vectors defined:

# Format columns ---------------------------------------------------------------
df <- df %>%
  mutate(across(all_of(date_cols),
                ~ floor_date(as.Date(., format="%Y-%m-%d"), unit = "days")),
         across(contains('_birth_year'),
                ~ format(as.Date(., origin = "1970-01-01"), "%Y")),
         across(all_of(num_cols), ~ as.numeric(.)),
         across(all_of(cat_cols), ~ as.factor(.)),
         across(all_of(bin_cols), ~ as.logical(.)))

venexia commented:

Does it change any variables though as you read the data in with col_types specified? It might only be the dates that need updating here.
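
The earlier suggestion in this thread, to reuse the per-type vectors rather than repeat the grep, amounts to classifying the columns once and referring to those groups everywhere else. A minimal Python sketch of that pattern follows. Only the `_num` pattern appears in this thread, so the `_date`, `_cat` and `_bin` patterns and the example column names are assumptions for illustration.

```python
def classify_columns(all_cols):
    """Group column names by the type marker embedded in each name."""
    markers = {"date": "_date", "num": "_num", "cat": "_cat", "bin": "_bin"}
    groups = {kind: [] for kind in markers}
    for col in all_cols:
        for kind, marker in markers.items():
            if marker in col:
                groups[kind].append(col)
                break  # assign each column to the first matching type only
    return groups


all_cols = ["patient_id", "vax_date_eligible", "qa_num_birth_year",
            "cov_cat_sex", "cov_bin_obesity"]  # hypothetical column names
col_types = classify_columns(all_cols)
date_cols = col_types["date"]
num_cols = col_types["num"]
```

With the groups computed once, a later formatting step only needs to touch `date_cols` if the other types were already set when the file was read.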

ZoeMZou commented:

Agreed. Apart from the variables in date_cols, nothing needs updating for the num_, cat_ or bin_ variables.

I have a question about the _birth_year variables, taking qa_num_birth_year as an example:

  1. We generate it in variables_cohort.py: qa_num_birth_year=patients.date_of_birth.year
  2. qa_num_birth_year is converted to numeric when the column classes are identified:
    num_cols <- c(grep("_num", all_cols, value = TRUE)
  3. This code then converts qa_num_birth_year to a string:
    across(contains('_birth_year'), ~ format(as.Date(., origin = "1970-01-01"), "%Y")),

Q: Do we really use qa_num_birth_year as a string, or as a numeric, as its name implies?
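
To make the question concrete, here is a small Python sketch emulating what `format(as.Date(x, origin = "1970-01-01"), "%Y")` does when `x` is a plain number. It relies on the documented R behaviour that a numeric passed to `as.Date()` is read as a count of days since `origin`. Whether qa_num_birth_year actually arrives as a number at that point in the script is an assumption, and the helper name is hypothetical.

```python
from datetime import date, timedelta


def year_label_from_day_count(x, origin=date(1970, 1, 1)):
    """Emulate R's format(as.Date(x, origin = origin), "%Y") for numeric x."""
    return (origin + timedelta(days=x)).strftime("%Y")


# 1980 is treated as a number of days after the origin, not as a calendar year
label = year_label_from_day_count(1980)
```

Under that assumption the result is a character string, and for an input of 1980 it is "1975" (the year of the date 1980 days after the origin), so it may be worth checking the column against real values.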

venexia commented:

across(contains('_birth_year'), ~ format(as.Date(., origin = "1970-01-01"), "%Y")) makes it a date with just the year displayed. I think we probably want to keep this line.

ZoeMZou commented:

Changed back:

df <- df %>%
  mutate(across(all_of(date_cols),
                ~ floor_date(as.Date(., format="%Y-%m-%d"), unit = "days")),
         across(contains('_birth_year'),
                ~ format(as.Date(., origin = "1970-01-01"), "%Y")))

analysis/preprocess/preprocess_data.R (outdated, resolved)
analysis/preprocess/preprocess_data.R (outdated, resolved)
analysis/preprocess/preprocess_data.R (resolved)
venexia linked an issue Dec 12, 2024 that may be closed by this pull request

ZoeMZou commented Dec 12, 2024

Hi @venexia,

Thank you so much! I’ve completed the revisions based on your comments and the revision we made in the study-definition branch. Shall I rebase now before further revision, so I can:

  1. test whether the current code functions correctly with the generated dataset.
  2. add the corresponding actions in the YAML file for this branch.

Thanks
Zoe

ZoeMZou requested a review from venexia December 12, 2024 15:23

venexia commented Dec 12, 2024

Hi @ZoeMZou - yes, please rebase now so we can check the pipeline runs from start to this point successfully. Sorry - I started reviewing and then got to a comment about rebase but will pick this up once you have rebased and sorted any merge conflicts.

The vax_date_eligible variable will be used to modify vaccination dummy data and include other useful variables in the cohort dataset.

Index dates for different cohorts will be extracted in their respective scripts, rather than being defined here. While the values for these index dates will differ across cohorts, their variable names should remain the same in each cohort:
  • index_date
  • end_date_exposure
  • end_date_outcome

Extract index dates to each cohort:
  • index_date
  • end_date_exposure
  • end_date_outcome

They have been renamed and classed into either inex_ or cens_
ZoeMZou commented Dec 13, 2024

Hi @venexia - I have rebased this PR, and the whole pipeline now runs successfully from the start to this point. In addition to the revisions made in response to your comments above, I have made the following revisions across the scripts so that they run correctly; some bugs were detected and fixed along the way:

  1. Rename of censoring variables:
  • death_date to cens_date_death throughout all scripts
  • As we renamed the inex_ and cens_ variables, I have further revised them in preprocess.R. For example, the column selection is tidier now that variables already included in the purpose_ groups have been deleted:
    # Restrict columns and save analysis dataset -----------------------------------
    df1 <- df %>%
      select(patient_id,
             starts_with("index_date"),
             starts_with("end_date_"),
             contains("sub_"), # Subgroups
             contains("exp_"), # Exposures
             contains("out_"), # Outcomes
             contains("cov_"), # Covariates
             contains("inex_"), # Inclusion/exclusion
             contains("cens_"), # Censor
             contains("qa_"), # Quality assurance
             contains("vax_date_eligible"), # Vaccination eligibility
             contains("vax_date_"), # Vaccination dates and vax type
             contains("vax_cat_") # Vaccination products
      )
  2. I am not sure whether we will need the following vax_date_ and vax_cat_ variables in our final dataset, but they were not previously extracted from the date dataset, so I revised the code to extract them:
    contains("vax_date_"), # Vaccination dates and vax type
    contains("vax_cat_") # Vaccination products

# Extract date variables for later pipelines
@table_from_file("output/index_dates.csv.gz")
class index_dates(PatientFrame):
    # Vaccine category and eligibility variables
    vax_cat_jcvi_group = Series(str)
    vax_date_eligible = Series(date)
    # General COVID vaccination dates
    vax_date_covid_1 = Series(date)
    vax_date_covid_2 = Series(date)
    vax_date_covid_3 = Series(date)
    # Pfizer vaccine-specific dates
    vax_date_Pfizer_1 = Series(date)
    vax_date_Pfizer_2 = Series(date)
    vax_date_Pfizer_3 = Series(date)
    # AstraZeneca vaccine-specific dates
    vax_date_AstraZeneca_1 = Series(date)
    vax_date_AstraZeneca_2 = Series(date)
    vax_date_AstraZeneca_3 = Series(date)
    # Moderna vaccine-specific dates
    vax_date_Moderna_1 = Series(date)
    vax_date_Moderna_2 = Series(date)
    vax_date_Moderna_3 = Series(date)
    # Censoring date due to death
    cens_date_death = Series(date)

# Mapping all variables from index_dates to the dataset
dataset.vax_cat_jcvi_group = index_dates.vax_cat_jcvi_group
dataset.vax_date_eligible = index_dates.vax_date_eligible
dataset.vax_date_covid_1 = index_dates.vax_date_covid_1
dataset.vax_date_covid_2 = index_dates.vax_date_covid_2
dataset.vax_date_covid_3 = index_dates.vax_date_covid_3
dataset.vax_date_Pfizer_1 = index_dates.vax_date_Pfizer_1
dataset.vax_date_Pfizer_2 = index_dates.vax_date_Pfizer_2
dataset.vax_date_Pfizer_3 = index_dates.vax_date_Pfizer_3
dataset.vax_date_AstraZeneca_1 = index_dates.vax_date_AstraZeneca_1
dataset.vax_date_AstraZeneca_2 = index_dates.vax_date_AstraZeneca_2
dataset.vax_date_AstraZeneca_3 = index_dates.vax_date_AstraZeneca_3
dataset.vax_date_Moderna_1 = index_dates.vax_date_Moderna_1
dataset.vax_date_Moderna_2 = index_dates.vax_date_Moderna_2
dataset.vax_date_Moderna_3 = index_dates.vax_date_Moderna_3
dataset.cens_date_death = index_dates.cens_date_death

PS: we definitely need vax_date_eligible to modify the dummy vaccination dates in preprocess.R, so I extracted it here to make the code run.
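
The one-to-one mapping above is repetitive, so it could in principle be generated in a loop. The sketch below shows the pattern in plain Python, using `SimpleNamespace` objects as stand-ins for the ehrQL `dataset` and the `index_dates` table. It does not call ehrQL itself, and whether the project would prefer this over the explicit assignments is an open choice.

```python
from types import SimpleNamespace

# Column names built from the naming pattern used in the snippet above
products = ["covid", "Pfizer", "AstraZeneca", "Moderna"]
columns = ["vax_cat_jcvi_group", "vax_date_eligible"]
columns += [f"vax_date_{p}_{dose}" for p in products for dose in (1, 2, 3)]
columns.append("cens_date_death")

# Stand-ins (NOT ehrQL objects): each "series" is just a labelled string here
index_dates = SimpleNamespace(**{c: f"series:{c}" for c in columns})
dataset = SimpleNamespace()

# Copy every listed column from the source table onto the dataset
for name in columns:
    setattr(dataset, name, getattr(index_dates, name))
```

The explicit version is longer but easier to search for a given variable name, which may matter more than brevity in a shared pipeline.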

  3. I also spotted a mistake in the names of the index dates in each cohort. They should be consistent across cohorts (index_date, end_date_exposure, end_date_outcome); previously they were named, for example, index_prevax, end_prevax_exposure, end_prevax_outcome:
    df1 <- df %>%
      select(patient_id,
             starts_with("index_date"),
             starts_with("end_date_"),
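
A minimal Python sketch of the two steps described in this comment, renaming cohort-specific index date columns to the shared names and then keeping columns by prefix or substring, is shown below. The prevax names come from the comment above; the helper names and the example record are assumptions for illustration.

```python
def rename_index_dates(row, cohort):
    """Map cohort-specific index date names onto the shared names."""
    mapping = {
        f"index_{cohort}": "index_date",
        f"end_{cohort}_exposure": "end_date_exposure",
        f"end_{cohort}_outcome": "end_date_outcome",
    }
    return {mapping.get(key, key): value for key, value in row.items()}


def select_columns(row, exact=(), prefixes=(), substrings=()):
    """Keep keys that match exactly, start with a prefix, or contain a substring."""
    return {
        key: value
        for key, value in row.items()
        if key in exact
        or key.startswith(tuple(prefixes))
        or any(s in key for s in substrings)
    }


row = {  # hypothetical single-patient record
    "patient_id": 1,
    "index_prevax": "2020-01-01",
    "end_prevax_exposure": "2021-06-18",
    "end_prevax_outcome": "2021-12-14",
    "cov_num_age": 54,
    "tmp_helper": "drop me",
}
renamed = rename_index_dates(row, "prevax")
kept = select_columns(
    renamed,
    exact=("patient_id",),
    prefixes=("index_date", "end_date_"),
    substrings=("cov_", "cens_", "vax_date_"),
)
```

Because substring matching is used, a pattern such as vax_date_ already matches vax_date_eligible, so listing both is harmless but redundant.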

Please let me know whether these are the right decisions. Happy to revise.

Best wishes
Zoe

venexia left a comment

Hi @ZoeMZou. This looks good. I think we need to keep the formatting of qa_num_birth_year as it was but otherwise, this is ready. To answer your question re the vax variables - these are needed to define the vax and unvax cohorts in the next step of the pipeline.

venexia commented Dec 16, 2024

[Optional] Also if you are going with a directory approach (like you have for preprocessing) then perhaps you should have one for the dataset definition for consistency?

ZoeMZou requested a review from venexia December 16, 2024 17:12

ZoeMZou commented Dec 16, 2024

Thanks @venexia for your super prompt review.

  • I kept the formatting of qa_num_birth_year as it was:

df <- df %>%
  mutate(across(all_of(date_cols),
                ~ floor_date(as.Date(., format="%Y-%m-%d"), unit = "days")),
         across(contains('_birth_year'),
                ~ format(as.Date(., origin = "1970-01-01"), "%Y")))

  • I moved all scripts generated from the dataset_definition PR to the dataset_definition folder, except for create_project_actions.R. I left this outside as it will need to be updated in subsequent PRs and doesn’t exclusively belong to dataset_definition.

PS: I’ve also updated the README, create_project_actions.R, and project.yaml so that they point to the correct script directories.

Best wishes,

Zoe

venexia left a comment

Hi @ZoeMZou - all approved! Note: I moved utility.R and active_analyses.R back up a directory as utility.R is called across the pipeline and active_analyses.R is the complement to create_project_actions.R. Hope this is okay!

ZoeMZou commented Dec 17, 2024

Thank you so much, @venexia. Yes, that totally makes sense.
Zoe

ZoeMZou merged commit 846cc38 into main Dec 17, 2024
1 check failed
ZoeMZou deleted the preprocess-data branch December 17, 2024 09:34
Successfully merging this pull request may close these issues.

working directory for active analyses.R
2 participants