Added Rmarkdown file to explain the bll workflow #80

Open

yulric wants to merge 2 commits into master
Conversation

@yulric (Contributor) commented May 4, 2022

No description provided.

# making sure to run it if we've updated it
targets::tar_target(
  variable_details_sheet_file,
  "../inst/extdata/bll-workflow/variable_details.csv",

Review comment:

Can the .. be replaced with an R object that stores the user's directory info? Usually this would be in the config file, right? I had to copy and paste my local directory at each step here.

yulric (Author) replied:

Yes, normally this would be in a config file, e.g. the one in the huiport repo: https://github.com/Big-Life-Lab/huiport/blob/master/config.yml.

It's weird that you had to update these paths, though; since they are all relative, they should work on all machines without modification.
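
For reference, a minimal sketch of how this could look with the config package; the key name and paths here are hypothetical, not from the huiport repo:

```r
# config.yml (hypothetical key, for illustration only)
# default:
#   bll_workflow_dir: "inst/extdata/bll-workflow"

# Read the directory from the config file instead of hardcoding relative paths
bll_workflow_dir <- config::get("bll_workflow_dir")
variable_details_path <- file.path(bll_workflow_dir, "variable_details.csv")
```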

format = "file"
),
# Read and store the variable details file into an R data frame
targets::tar_target(

Review comment:

This code ran very smoothly and brought in the diabetes, age, and sex variables.

What I am missing is where the variables are selected. Does the CSV sheet get manipulated manually so that it keeps only the analytic variables? In that case, when the sheet is read in, no variable names are needed because the sheet contains only what the investigator/analyst has specified.

yulric (Author) replied:

This part of the code simply imports the variables and variable details sheets as-is into R data frames, without modifying them in any way.

To extract the names of the variables that go into a particular step of the study, we use the role column in the variables sheet and the function recodeflow:::select_vars_by_role. If you look at lines 159-164, you can see that we wanted to remove all missing values from the outcome variable, which was identified by the role outcome in the variables sheet.
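
For illustration, a hedged sketch of role-based selection; the exact signature of recodeflow:::select_vars_by_role may differ from what is assumed here, and the file paths are placeholders:

```r
# Placeholder paths, for illustration only
variables_sheet <- read.csv("variables.csv")
study_dataset <- readRDS("study_dataset.rds")

# Assumed call pattern: return the names of variables whose role is "outcome"
outcome_vars <- recodeflow:::select_vars_by_role("outcome", variables_sheet)

# Remove rows with a missing outcome, without hardcoding the variable name
study_dataset <- study_dataset[stats::complete.cases(study_dataset[outcome_vars]), ]
```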


As mentioned previously, it has three rows: one row with our outcome variable, diabetes, and two rows with our predictors, sex and age. Notice the role column and its value for each row; it makes it very clear why each variable was included in the study.

Now that we have pre-specified all our variables and have all our sheets set, let's create the dataset for our study. One advantage of having a variables and variable details sheet is that we can use the recodeflow library to easily create our dataset.
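
As a rough sketch of that step (assuming recodeflow's `rec_with_table(data, variables, database_name, variable_details)` call pattern; the paths and database name below are placeholders):

```{r}
library(recodeflow)

variables <- read.csv("variables.csv")                # placeholder path
variable_details <- read.csv("variable_details.csv")  # placeholder path
raw_data <- read.csv("raw_data.csv")                  # placeholder source data

# Recode the raw data into the study dataset using the two sheets
study_dataset <- rec_with_table(
  data = raw_data,
  variables = variables,
  database_name = "cchs2001_p",                       # placeholder database name
  variable_details = variable_details
)
```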

Review comment:

Let's say we wanted to create 2 new variables:

- Age categorized into 5-year bins
- Age as a polynomial term

Would both the CSV sheet and the recodeflow package need to be updated to contain this information, or could this be done dynamically/on the fly by the analyst? In keeping with this workflow, I imagine it is the former.


```{r}
#DT::datatable(targets::tar_read("study_dataset"))
summary(targets::tar_read("study_dataset")$DHHGAGE_cont)
```

Review comment:

After viewing the dataset, I see that sex and diabetes are coded as the values 1 and 2. Within this workflow, when is the best time to bring in the variable formats (e.g., 1 = male, 2 = female) and variable labels (e.g., DHH_SEX = 'SEX')?

Let's further cement this concept by looking at the descriptives for our predictors. These statistics would normally go into Table 1 of a paper: for the continuous predictors, the mean, minimum, and maximum; for the categorical predictors, the proportion and number of individuals in each category.
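
As a rough illustration of those statistics (not the workflow's actual descriptives function), they could be computed with dplyr:

```{r}
library(dplyr)

study_dataset <- targets::tar_read("study_dataset")

# Continuous predictor: mean, minimum, and maximum
study_dataset %>%
  summarise(
    mean_age = mean(DHHGAGE_cont, na.rm = TRUE),
    min_age = min(DHHGAGE_cont, na.rm = TRUE),
    max_age = max(DHHGAGE_cont, na.rm = TRUE)
  )

# Categorical predictor: number and proportion of individuals per category
study_dataset %>%
  count(DHH_SEX) %>%
  mutate(proportion = n / sum(n))
```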

```{r Descriptive data}
targets::tar_script({

Review comment:

This is akin to the ICES SAS Bivariate macro concept (without actually being bivariate).
What role would the column variable need to take? For example, if we had derivation/validation datasets, that would be the column in Table 1. In some cases the outcome even forms the post-hoc Table 1 column variable, to look at crude differences in predictors.

yulric (Author) replied:

Sorry, I'm not sure I understand your question. Are you asking if we should be adding additional rows to separate the statistics for the derivation/validation datasets? And also additional rows to look at the statistics stratified by the outcome?

Review comment:

I worded this in a confusing way. Is it easy to add a column variable to Table 1 (i.e. stratify the table and report column percentages)? Which role would that need to be?

yulric (Author) replied:

It's not "easy", unfortunately. The high-level steps would be:

  1. Add a role to the stratifier variable, for example sex, in the variables sheet.
  2. Update the function that creates the data for Table 1 so that it takes the stratifiers into account; I've actually already written a function for this for HUIPoRT (a rough sketch follows below).
  3. Finally, you would have to update the function that makes the manuscript-ready table so that it involves those stratifiers.

```{r Descriptive data}
targets::tar_script({
  library(magrittr)
  library(dplyr)

Review comment:

Do packages, sheets, and data need to be loaded in every target, or is this just for demonstration purposes?

yulric (Author) replied:

Nope. Sorry, I had to do it this way due to limitations with the targets package and Rmarkdown files.

Finally, let's fit our logistic regression model. We will once again use a function that handles all the fitting logic, and use roles to select the outcome and predictor variables.
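
For context, a hedged sketch of what such a fitting function might look like (the helper name and the variable names in the example call are assumptions):

```{r}
# Illustrative fitting helper: builds the formula from role-selected
# variable names and fits a logistic regression with glm()
fit_logistic_model <- function(data, outcome_var, predictor_vars) {
  model_formula <- stats::reformulate(predictor_vars, response = outcome_var)
  stats::glm(model_formula, data = data, family = binomial())
}

# e.g. fit_logistic_model(study_dataset, "diabetes", c("DHH_SEX", "DHHGAGE_cont"))
```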

```{r Model fit}
targets::tar_script({

Review comment:

Why does every step get carried down to each target? I thought each target represents a module and gets carried over to the next target?

yulric (Author) replied:

Yes, all of this would normally be in a _targets.R file and it would just be declared once. I had to do it this way due to limitations with the targets package and Rmarkdown.
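
For reference, a minimal sketch of how the same pipeline could be declared once in a _targets.R file (target names and paths follow the chunks above, but are illustrative):

```r
# _targets.R at the project root, declared once
library(targets)

# Packages available to every target, declared a single time
tar_option_set(packages = c("magrittr", "dplyr", "recodeflow"))

list(
  # making sure to run it if we've updated it
  tar_target(
    variable_details_sheet_file,
    "inst/extdata/bll-workflow/variable_details.csv",
    format = "file"
  ),
  # Read and store the variable details file into an R data frame
  tar_target(
    variable_details_sheet,
    read.csv(variable_details_sheet_file)
  )
  # ... remaining targets: study dataset, descriptives, model fit
)
```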


* We introduced the libraries and methods to follow in the BLL workflow, as well as the reasons for using the workflow.
* We saw the use of the role column in the variables sheet and how to use it to avoid hardcoding variable names in our analysis code. Using roles also makes it easier to add or remove variables from our analysis: where before we would have to go through our code and do a find/replace, now we can just update the variables sheet and run our code again.
* We saw the importance of putting most of our analysis logic inside functions, which makes it easier to reuse portions of our analysis across studies. The targets library in fact sometimes makes it impossible to write complicated steps without functions, which is a useful side effect of its API.

Review comment:

All in all, this was really easy to run and see the output at each step. The idea with using R Markdown is to knit the output to HTML or PDF, which is portable and reproducible.

If an analysis is entirely pre-specified, this is great. However, downstream modifications are time-consuming because you cannot just write code interactively; the variables sheet, the recodeflow package, and the targets functions need to be carefully modified and tested. Is this assumption correct?
