Added Rmarkdown file to explain the bll workflow #80

Open

yulric wants to merge 2 commits into master
Conversation

@yulric (Contributor) commented May 4, 2022

No description provided.

# making sure to run it if we've updated it
targets::tar_target(
  variable_details_sheet_file,
  "../inst/extdata/bll-workflow/variable_details.csv",

Review comment:

Can the .. be replaced with an R object that stores the user's directory info? Usually this would be in the config file, right? I had to copy and paste my local directory at each step here.

yulric (Author) replied:

Yes, normally this would be in a config file, e.g. the one in the huiport repo: https://github.com/Big-Life-Lab/huiport/blob/master/config.yml.

It's weird that you had to update these paths, though; since they are all relative, they should work on all machines without modification.
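
For reference, a minimal sketch of how this could look with the config package; the key name and paths here are hypothetical, not from the huiport repo:

```r
# config.yml (hypothetical key, for illustration only)
# default:
#   bll_workflow_dir: "inst/extdata/bll-workflow"

# Read the directory from the config file instead of hardcoding relative paths
bll_workflow_dir <- config::get("bll_workflow_dir")
variable_details_path <- file.path(bll_workflow_dir, "variable_details.csv")
```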

format = "file"
),
# Read and store the variable details file into an R data frame
targets::tar_target(

Review comment:

This code ran very smoothly and brought in the diabetes, age, and sex variables.

What I am missing is where the variables are selected. Does the CSV sheet get manipulated manually so that it keeps only the analytic variables? In that case, when the sheet is read in, no variable names are needed because the sheet contains only what the investigator/analyst has specified.

yulric (Author) replied:

This part of the code simply imports the variables and variable details sheets as-is into R data frames, without modifying them in any way.

To extract the names of the variables that go into a particular step of the study, we use the role column in the variables sheet and the function recodeflow:::select_vars_by_role. If you look at lines 159-164, you can see that we wanted to remove all missing values from the outcome variable, which was identified by the role outcome in the variables sheet.
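
For illustration, a hedged sketch of role-based selection; the exact signature of recodeflow:::select_vars_by_role may differ from what is assumed here, and the file paths are placeholders:

```r
# Placeholder paths, for illustration only
variables_sheet <- read.csv("variables.csv")
study_dataset <- readRDS("study_dataset.rds")

# Assumed call pattern: return the names of variables whose role is "outcome"
outcome_vars <- recodeflow:::select_vars_by_role("outcome", variables_sheet)

# Remove rows with a missing outcome, without hardcoding the variable name
study_dataset <- study_dataset[stats::complete.cases(study_dataset[outcome_vars]), ]
```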


As mentioned previously, it has three rows: one row with our outcome variable, diabetes, and two rows with our predictors, sex and age. Notice the role column and its value for each row; it makes it very clear why each variable was included in the study.

Now that we have pre-specified all our variables and have all our sheets set, let's create the dataset for our study. One advantage of having a variables and variable details sheet is that we can use the recodeflow library to easily create our dataset.
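
As a rough sketch of that step (assuming recodeflow's `rec_with_table(data, variables, database_name, variable_details)` call pattern; the paths and database name below are placeholders):

```{r}
library(recodeflow)

variables <- read.csv("variables.csv")                # placeholder path
variable_details <- read.csv("variable_details.csv")  # placeholder path
raw_data <- read.csv("raw_data.csv")                  # placeholder source data

# Recode the raw data into the study dataset using the two sheets
study_dataset <- rec_with_table(
  data = raw_data,
  variables = variables,
  database_name = "cchs2001_p",                       # placeholder database name
  variable_details = variable_details
)
```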

Review comment:

Let's say we wanted to create 2 new variables:

- Age categorized into 5-year bins
- Age as a polynomial term

Would both the CSV sheet and the recodeflow package need to be updated to contain this information, or could this be done dynamically/on the fly by the analyst? In keeping with this workflow, I imagine it is the former.


```{r}
#DT::datatable(targets::tar_read("study_dataset"))
summary(targets::tar_read("study_dataset")$DHHGAGE_cont)
```

Review comment:

After viewing the dataset, I see that sex and diabetes are coded as the values 1 and 2. Within this workflow, when is the best time to bring in the variable formats (e.g., 1 = male, 2 = female) and variable labels (e.g., DHH_SEX = 'SEX')?

Let's further cement this concept by looking at the descriptives for our predictors. These statistics would normally go into Table 1 of a paper: for the continuous predictors, the mean, minimum, and maximum; for the categorical predictors, the proportion and number of individuals in each category.
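
As a rough illustration of those statistics (not the workflow's actual descriptives function), they could be computed with dplyr:

```{r}
library(dplyr)

study_dataset <- targets::tar_read("study_dataset")

# Continuous predictor: mean, minimum, and maximum
study_dataset %>%
  summarise(
    mean_age = mean(DHHGAGE_cont, na.rm = TRUE),
    min_age = min(DHHGAGE_cont, na.rm = TRUE),
    max_age = max(DHHGAGE_cont, na.rm = TRUE)
  )

# Categorical predictor: number and proportion of individuals per category
study_dataset %>%
  count(DHH_SEX) %>%
  mutate(proportion = n / sum(n))
```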

```{r Descriptive data}
targets::tar_script({

Review comment:

This is akin to the ICES SAS Bivariate macro concept (without actually being bivariate).
What role would the column variable need to take? For example, if we had derivation/validation datasets, that would be the column in Table 1. In some cases the outcome even forms the post-hoc Table 1 column variable, to look at crude differences in predictors.

yulric (Author) replied:

Sorry, I'm not sure I understand your question. Are you asking if we should be adding additional rows to separate the statistics for the derivation/validation datasets? And also additional rows to look at the statistics stratified by the outcome?

Review comment:

I worded this in a confusing way. Is it easy to add a column variable to Table 1 (i.e. stratify the table and report column percentages)? Which role would that need to be?

yulric (Author) replied:

It's not "easy", unfortunately. The high-level steps would be:

  1. Add a role to the stratifier variable, for example sex, in the variables sheet.
  2. Update the function that creates the data for Table 1 so that it takes the stratifiers into account; I've actually already written a function for this for HUIPoRT (a rough sketch follows below).
  3. Finally, you would have to update the function that makes the manuscript-ready table so that it involves those stratifiers.

```{r Descriptive data}
targets::tar_script({
  library(magrittr)
  library(dplyr)

Review comment:

Do packages, sheets, and data need to be loaded in every target, or is this just for demonstration purposes?

yulric (Author) replied:

Nope. Sorry, I had to do it this way due to limitations with the targets package and Rmarkdown files.

Finally, let's fit our logistic regression model. We will once again use a function that handles all the fitting logic, and use roles to select the outcome and predictor variables.
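
For context, a hedged sketch of what such a fitting function might look like (the helper name and the variable names in the example call are assumptions):

```{r}
# Illustrative fitting helper: builds the formula from role-selected
# variable names and fits a logistic regression with glm()
fit_logistic_model <- function(data, outcome_var, predictor_vars) {
  model_formula <- stats::reformulate(predictor_vars, response = outcome_var)
  stats::glm(model_formula, data = data, family = binomial())
}

# e.g. fit_logistic_model(study_dataset, "diabetes", c("DHH_SEX", "DHHGAGE_cont"))
```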

```{r Model fit}
targets::tar_script({

Review comment:

Why does every step get carried down to each target? I thought each target represents a module and gets carried over to the next target?

yulric (Author) replied:

Yes, all of this would normally be in a _targets.R file and it would just be declared once. I had to do it this way due to limitations with the targets package and Rmarkdown.
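
For reference, a minimal sketch of how the same pipeline could be declared once in a _targets.R file (target names and paths follow the chunks above, but are illustrative):

```r
# _targets.R at the project root, declared once
library(targets)

# Packages available to every target, declared a single time
tar_option_set(packages = c("magrittr", "dplyr", "recodeflow"))

list(
  # making sure to run it if we've updated it
  tar_target(
    variable_details_sheet_file,
    "inst/extdata/bll-workflow/variable_details.csv",
    format = "file"
  ),
  # Read and store the variable details file into an R data frame
  tar_target(
    variable_details_sheet,
    read.csv(variable_details_sheet_file)
  )
  # ... remaining targets: study dataset, descriptives, model fit
)
```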


* We introduced the libraries and methods to follow in the BLL workflow, as well as the reasons for using the workflow.
* We saw the use of the role column in the variables sheet and how to use it to avoid hardcoding variable names in our analysis code. Using roles also makes it easier to add or remove variables from our analysis: where before we would have to go through our code and do a find/replace, now we can just update the variables sheet and run our code again.
* We saw the importance of putting most of our analysis logic inside functions, which makes it easier to reuse portions of our analysis across studies. The targets library in fact sometimes makes it impossible to write complicated steps without functions, which is a useful side effect of its API.

Review comment:

All in all, this was really easy to run and see the output at each step. The idea with using R Markdown is to knit the output to HTML or PDF, which is portable and reproducible.

If an analysis is entirely pre-specified, this is great. However, downstream modifications are time-consuming because you cannot just write code interactively; the variables sheet, the recodeflow package, and the targets functions need to be carefully modified and tested. Is this assumption correct?
