-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preprocess data #43
Preprocess data #43
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ZoeMZou - some further refinements needed, but otherwise looking good.
|
||
message(paste0("Dataset has been read successfully with N = ", nrow(df), " rows")) | ||
|
||
# Format columns --------------------------------------------------------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary given you read the data in with col_types
specified? If so, it should use the vectors of variables by type rather than re-doing the grep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, should just use the vectors defined:
post-covid-respiratory/analysis/preprocess/preprocess_data.R
Lines 65 to 74 in 52a33bb
# Format columns --------------------------------------------------------------- | |
df <- df %>% | |
mutate(across(all_of(date_cols), | |
~ floor_date(as.Date(., format="%Y-%m-%d"), unit = "days")), | |
across(contains('_birth_year'), | |
~ format(as.Date(., origin = "1970-01-01"), "%Y")), | |
across(all_of(num_cols), ~ as.numeric(.)), | |
across(all_of(cat_cols), ~ as.factor(.)), | |
across(all_of(bin_cols), ~ as.logical(.))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it change any variables though as you read the data in with col_types specified? It might only be the dates that need updating here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. Other than variables in the vector of date_cols, there is no need to update for num_
, cat_
or bin_
.
I have a question here for _birth_year
, taking qa_num_birth_year
for example:
- We generated it in the variables_cohort.py:
qa_num_birth_year=patients.date_of_birth.year
qa_num_birth_year
will be converted to numeric when identifying column classes:
num_cols <- c(grep("_num", all_cols, value = TRUE)
- These codes will then convert
qa_num_birth_year
to string.
across(contains('_birth_year'), ~ format(as.Date(., origin = "1970-01-01"), "%Y")),
Q: do we really use qa_num_birth_year
as string? or actually numeric as implied from the name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
across(contains('_birth_year'), ~ format(as.Date(., origin = "1970-01-01"), "%Y"))
makes it a date with just the year displayed. I think we probably want to keep this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Back:
post-covid-respiratory/analysis/preprocess/preprocess_data.R
Lines 66 to 70 in aca510e
df <- df %>% | |
mutate(across(all_of(date_cols), | |
~ floor_date(as.Date(., format="%Y-%m-%d"), unit = "days")), | |
across(contains('_birth_year'), | |
~ format(as.Date(., origin = "1970-01-01"), "%Y"))) |
Hi @venexia, Thank you so much! I’ve completed the revisions based on your comments and the revision we made in the study-definition branch. Shall I rebase now before further revision, so I can:
Thanks |
Hi @ZoeMZou - yes, please rebase now so we can check the pipeline runs from start to this point successfully. Sorry - I started reviewing and then got to a comment about rebase but will pick this up once you have rebased and sorted any merge conflicts. |
4c033f5
to
2c3d57a
Compare
The vax_date_eligible variable will be used to modify vaccination dummy data and include other useful variables in the cohort dataset. Index dates for different cohorts will be extracted in their respective scripts, rather than being defined here. While the values for these index dates will differ across cohorts, their variable names should remain the same in each cohort: index_date end_date_exposure end_date_outcome
Extract index dates to each cohort: index_date end_date_exposure end_date_outcome
They have been renamed and classed into either inex_ or cens_
Hi @venexia - I have rebased this PR, and the whole pipeline runs successfully from the start to this point. Other than the revision I addressed according to your comments above, I have also did the following revision across scripts to make them run good, of course, some some bugs were detected and fixed:
post-covid-respiratory/analysis/dataset_definition_cohorts.py Lines 37 to 84 in 1daf7f7
ps: we would definitely need
Please let me know whether they are right decisions. Happy to revise. Best wishes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ZoeMZou. This looks good. I think we need to keep the formatting of qa_num_birth_year
as it was but otherwise, this is ready. To answer your question re the vax variables - these are needed to define the vax and unvax cohorts in the next step of the pipeline.
[Optional] Also if you are going with a directory approach (like you have for preprocessing) then perhaps you should have one for the dataset definition for consistency? |
Thanks @venexia for your super prompt review.
post-covid-respiratory/analysis/preprocess/preprocess_data.R Lines 66 to 70 in aca510e
PS: I’ve also updated the README, Best wishes, Zoe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ZoeMZou - all approved! Note: I moved utility.R and active_analyses.R back up a directory as utility.R is called across the pipeline and active_analyses.R is the complement to create_project_actions.R. Hope this is okay!
Hi @venexia ,
I’ve completed the conversion of preprocess_data.R and modify_dummy_vax_data.R based on the mental health repository. The reason for the action test failure was that I deleted the template dataset-definition.py. I have a couple of points I’m uncertain about and would appreciate your input:
1. Data Frame Naming:
I think the data frame should be named "df" rather than "df_vax" in modify_dummy_vax_data.R to align with how it is called in preprocess_data.R?
post-covid-respiratory/analysis/preprocess/modify_dummy_vax_data.R
Lines 7 to 13 in 95555dc
https://github.com/opensafely/post-covid-mentalhealth/blob/ddb50bd8fc9e4e1a9855dc9f9a03991487c908aa/analysis/modify_dummy_vax_data.R#L8
2. Timing Code (tic/toc):
There is no tic() call before the toc() in the respiratory repo. Is this correct?
post-covid-respiratory/analysis/preprocess/preprocess_data.R
Line 158 in 95555dc
3. Death and Deregistration Dates:
I’ve removed the code for adding death_date and deregistration_date since they are already included in the dataset according to our ehrQL adjustment.
4. Maybe rebase after merging the branch of dataset definition
The .yaml file will need to be revised to include actions for generating preprocess data across different cohorts. Maybe can revise after merging the branch for dataset-definition.py?
Thanks for your review.
Best wishes,
Zoe