-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
combine_checkboxes #196
combine_checkboxes #196
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This clearly does what we want but I think the code could be clearer. Maybe we can discuss a bit more together. Generally:
- Carrying around the instrument identifiers creates a lot of complication I don't think is needed. Since we're not changing the number of rows we should just be able to
bind_cols()
. - There's a lot of NSE where I think you'd be better off creating temporary columns with a set name and renaming at the end.
- The label parsing code seems duplicative with
parse_labels()
.
API Thoughts
I could see a user wanting to apply this function to many checkboxes across many instruments. I think we need to give them a way of doing this that's better than calling this function over and over.
Can we generalize to allow multiple checkbox fields in a single instrument to be done in one call?
Do we want a function (map_supertbl()
?) that can apply a transformation iteratively to rows of a supertibble? Possibly with the ability to select specific data tibbles and vary arguments across them.
Agreed and figured that would be the next part of this PR. Should be just a matter of opening up some internal mutates with some grouping/
I think that's a great idea, and can be potentially useful for other functions we come up with (or maybe that users come up with). |
@ezraporter Coming back to this I have a couple of thoughts and questions. I could see giving a custom "tidyselect" option at default for For times when users don't want to combine all checkboxes but want to combine more than one, they're most likely to use |
Actually I think we can easily update the internals so that users could just give |
Love this idea and I think
Agree
My concern here would be that My idea would be to keep the # Some dummy data
data <- tibble(
id = 1,
x___1 = TRUE,
x___2 = NA,
y___1 = NA,
y___2 = TRUE
)
metadata <- tibble(
field_name = c("x___1", "x___2", "y___1", "y___2"),
field_type = "checkbox",
select_choices_or_calculations = "1, A | 2, B"
)
suprtbl <- tibble(
redcap_form_name = "tbl",
redcap_data = list(data),
redcap_metadata = list(metadata)
) |>
as_supertbl()
# Proposed API
combine_checkboxes(
supertbl = suprtbl,
tbl = "tbl",
cols = c(starts_with("x"), starts_with("y")),
sep = "___", # make this the default?
values_to = c("x", "y")
) |>
extract_tibble("tbl")
## A tibble: 1 × 7
# id x___1 x___2 y___1 y___2 x y
# <dbl> <lgl> <lgl> <lgl> <lgl> <fct> <fct>
#1 1 TRUE NA NA TRUE A B
# possibly give a way to infer names (inspired by across(.names = ))
combine_checkboxes(
supertbl = suprtbl,
tbl = "tbl",
cols = c(starts_with("x"), starts_with("y")),
sep = "___",
values_to = "{.col}"
)
# possibly allow for more complex specifications (inspired by separate_wider_regex)
combine_checkboxes(
supertbl = suprtbl,
tbl = "tbl",
cols = c(starts_with("x"), starts_with("y")),
patterns = c(name = ".+", "___", value = ".+"),
values_to = "{.col}"
) |
One more thought. In messing around with this I'm not sure how I feel about the |
Actually, I think we kind of need to support both logicals and > suprtbl$redcap_data[[1]] %>% select(starts_with("x") | starts_with("y"))
# A tibble: 1 × 4
x___1 x___2 y___1 y___2
<lgl> <lgl> <lgl> <lgl>
1 TRUE NA NA TRUE
> suprtbl$redcap_data[[1]] %>% select(c(starts_with("x"), starts_with("y")))
# A tibble: 1 × 4
x___1 x___2 y___1 y___2
<lgl> <lgl> <lgl> <lgl>
1 TRUE NA NA TRUE Fortunately only the code below is (currently) really concerned with what gets passed to REDCapTidieR/R/combine_checkboxes.R Lines 64 to 76 in 2dfac9a
But if you feel strongly about banning use of logicals, we'd need to implement some enforced block to users doing so. I like the API suggestions, I need to give some more thought as to how they would get implemented in practice.
Hm, not sold either way. If we're trying to align with |
Sorry, my point wasn't that we shouldn't support this. Just that both methods of selecting columns should be equivalent for the user (as I think they are now).
Some more thoughts that might help with implementation. I would break out what we need to do into 3 steps. The important point is that the result of each step basically gives you everything you need to carry out the rest of the data transformation. Step 1: Select a bunch of checkbox fields from the data.
Step 2: Convert the result into a representation that maps checkbox field names to checkbox field values.
Step 3: Apply our existing logic to each group created by API-wise, you can sum up my position as: separate parameters should be responsible for doing step 1 and step 2. That is, I think the representation in step 2 is kind of nice because all the validation can happen there:
In this light, the parameters I have in my API suggestion above are just helpers to let the user define how we go from step 1 to step 2.
Yeah, this comment was just to say I'm not sold either now so don't stick with |
@ezraporter Check out the newer API when you have a moment. Probably still a bit more to go, but I think this addresses the 3 steps you outlined above. Changes made:
Things still to do pending your thoughts:
|
Also confirming that the supertibble output works as expected when using When applying |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these are great improvements! Some optional refactor ideas in there. I do think we need to change the default names_*
settings and should try to do names_glue
if we can!
R/combine_checkboxes.R
Outdated
names_prefix = "", | ||
names_suffix = NULL, | ||
names_sep = "_", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I understanding correctly that these parameters control the names of the new columns and not how variable names (ex. race___0
) are parsed into names and values?
A couple comments:
- If we go with this I think we need to make it clearer that we're assuming the checkbox fields are in
name___value
format and not giving the user any control over how that's parsed. - I don't think the defaults are right. In this example the output col is called
_race
. The default should probably just producerace
:
db_label |>
combine_checkboxes(
"demographics",
starts_with("race")
)
- I see your "Maybe" in the PR comment and also think
names_glue
would be super valuable if we can do it 😊
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I understanding correctly that these parameters control the names of the new columns and not how variable names (ex. race___0) are parsed into names and values?
They control the structure of the names, but the names themselves come from .value
in get_metadata_spec()
, i.e. the field name prior to the ___
checkbox changes.
If we go with this I think we need to make it clearer that we're assuming the checkbox fields are in name___value format and not giving the user any control over how that's parsed.
How about we just add a check_metadata_fields_exist()
in get_metadata_spec()
similar to what I have in the parent function for check_fields_exist()
? If checkbox names are changed and they don't appear in the metadata field_name
column, we can throw an error and suggestion. We should expect users don't manipulate the metadata tibbles, but we need the connection between the metadata tibble and the data tibble to remain intact. I think this supports our "if you change things, you need to take some responsibility for them" mindset.
I don't think the defaults are right. In this example the output col is called _race. The default should probably just produce race
I'm open to changing it, but when I was thinking through outputs I was worried about clashing with other possibly existing column names. See this example, if we're ok with that being the default in the event of a clash then I can rework this.
Click me
data <- tibble(
id = 1,
prefix = "prefix",
x___1 = TRUE,
x___2 = FALSE,
y___1 = TRUE,
y___2 = TRUE,
z___1 = FALSE,
x = "val"
)
metadata <- tibble(
field_name = c("id", "prefix", "x___1", "x___2", "y___1", "y___2", "z___1", "x"),
field_type = c("text", "text", rep("checkbox", 5), "text"),
select_choices_or_calculations = c(NA, NA, rep("1, A | 2, B", 4), "3, C", NA),
)
suprtbl <- tibble(
redcap_form_name = "tbl",
redcap_data = list(data),
redcap_metadata = list(metadata)
) |>
as_supertbl()
combine_checkboxes(supertbl = suprtbl,
tbl = "tbl",
cols = c(starts_with("x__"), starts_with("y"), "z___1"),
names_sep = "") %>%
pull(redcap_data)
New names:
• `x` -> `x...8`
• `x` -> `x...9`
[[1]]
# A tibble: 1 × 11
id prefix x___1 x___2 y___1 y___2 z___1 x...8 x...9 y z
<dbl> <chr> <lgl> <lgl> <lgl> <lgl> <lgl> <chr> <fct> <fct> <fct>
1 1 prefix TRUE FALSE TRUE TRUE FALSE val A Multiple NA
I see your "Maybe" in the PR comment and also think names_glue would be super valuable if we can do it 😊
ugh... Fine. I knew it was coming but figured I'll get the rest of this ironed out first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we just add a
check_metadata_fields_exist()
inget_metadata_spec()
similar to what I have in the parent function forcheck_fields_exist()
? If checkbox names are changed and they don't appear in the metadatafield_name
column, we can throw an error and suggestion. We should expect users don't manipulate the metadata tibbles, but we need the connection between the metadata tibble and the data tibble to remain intact. I think this supports our "if you change things, you need to take some responsibility for them" mindset.
Okay I'm fine with this. I would also add something to the documentation noting the pattern we're looking for.
I'm open to changing it, but when I was thinking through outputs I was worried about clashing with other possibly existing column names. See this example, if we're ok with that being the default in the event of a clash then I can rework this.
I would resolve this with a warning or possibly error if the fields already exist. pivot_longer()
for example errors and directs the user to the names_repair
parameter to provide a repair strategy:
tibble(x=1:3, y=4:6, value = 10) |>
pivot_longer(x:y)
#>Error in `tidyr::pivot_longer()`:
#>! Names must be unique.
#>✖ These names are duplicated:
#> * "value" at locations 1 and 3.
#>ℹ Use argument `names_repair` to specify repair strategy.
That may be too sophisticated for us but we may actually be able to recreate that behavior pretty easily with vctrs::vec_as_names()
which is referenced in the docs for the names_repair
parameter.
ugh... Fine. I knew it was coming but figured I'll get the rest of this ironed out first.
Haha if it ends up being too tricky that's fine!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright added support for names_repair
and names_glue
.
names_glue
I'm still iffy on, the use case in the pivot_wider()
documentation is a bit more complicated than I believe we can support here, but try the current set up out and let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that pivot_wider()
supports more but I think we're still providing a lot value with what we have.
Imagine a user has a meals
instrument with some checkboxes like this:
field_name |
---|
breakfast___apple |
breakfast___orange |
breakfast___spinach |
lunch___apple |
lunch___orange |
lunch___spinach |
dinner___apple |
dinner___orange |
dinner___spinach |
They could do:
supertbl |>
combine_checkboxes(
"meals",
matches("breakfast|lunch|dinner"),
names_glue = "checkbox_{.value}_all"
) |>
combine_checkboxes(
"meals",
matches("breakfast|lunch|dinner") & matches("apple|orange"),
names_glue = "checkbox_{.value}_fruit"
)
How slick is that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! This is basically there. Just a couple comments on how to make our names_glue
implementation a little safer
DESCRIPTION
Outdated
@@ -38,6 +38,7 @@ Imports: | |||
stats | |||
Suggests: | |||
covr, | |||
glue, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can import glue
. It's already a dependency of packages we import (tidyr
, stringr
at least) so we're not really changing anything by bumping it up from Suggests
R/combine_checkboxes.R
Outdated
.new_value = case_when(!is.null(names_suffix) ~ paste(names_prefix, .value, names_suffix, sep = names_sep), | ||
.default = paste(names_prefix, .data$.value, sep = names_sep) | ||
if (!is.null(names_glue)) { | ||
check_installed("glue", reason = "to use `names_glue` in `combine_checkboxes()`") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can remove if we import glue
check_installed("glue", reason = "to use `names_glue` in `combine_checkboxes()`") |
R/combine_checkboxes.R
Outdated
mutate( | ||
.value = sub("___.*$", "", .data$field_name), | ||
.new_value = as.character(glue::glue(names_glue)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to be more careful with this implementation of names_glue
.
Right now we're giving names_glue
access to everything in the metadata which can cause weird results:
db_label |>
combine_checkboxes("demographics", starts_with("race"), names_glue = "xyz_{field_name}") |>
extract_tibble("demographics")
In this example we actually create the wrong number of output columns silently.
I think we should:
- Use
glue_data()
rather thanglue()
to scope what the user actually has access to glue with:
# Could include more things we want to give the user access to here
glue_env <- select(out, .value)
out <- out |>
mutate(.new_value = glue_data(glue_env, names_glue))
- Enforce the constraint that
.new_value
is the same within each level of.value
. This ensures the user always gets the expected number of output columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ezraporter quick q. Happy to use glue_data()
as discussed, but in this example the output is still going to be the same since field_name
is still accessible from out
. The intention here is to have it fail instead of falsely grabbing the field_name
from the metadata (for now) right? See if this is the change we're looking for for (1):
get_metadata_spec <- function(metadata_tbl,
selected_cols,
names_prefix,
names_sep,
names_glue) {
check_metadata_fields_exist(metadata_tbl, selected_cols)
# Create a metadata reference table linking field name to raw and label values
out <- metadata_tbl %>%
filter(.data$field_name %in% selected_cols) %>%
mutate(
.value = sub("___.*$", "", .data$field_name)
)
if (!is.null(names_glue)) {
# Similar to pivot_*, use of `names_glue` overrides use of names_prefix/sep
glue_env <- select(out, .value) %>%
mutate(.new_value = as.character(glue::glue_data(., names_glue))) %>%
select(.new_value)
out <- cbind(out, glue_env)
} else {
out <- out %>%
mutate(
.new_value = case_when(names_prefix != "" ~ paste(names_prefix, .value, sep = names_sep),
.default = paste(names_prefix, .data$.value, sep = "")
)
)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Enforce the constraint that .new_value is the same within each level of .value. This ensures the user always gets the expected number of output columns.
See what you think of the most recent commit for this (1).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand why we're converting things to factors. The validation I was talking about was something like:
check <- out |>
group_by(.value) |>
summarize(n=n_distinct(.new_value)) |>
pull(n)
if (!all(check == 1)) {
# Throw an error
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, I thought you wanted more of an enforcement not a check. Can implement.
Out of curiosity what's an example with this set up of how someone would still trigger this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check function implemented, but unsure what the message should be so let me know how you'd like to tweak.
error_data <- tibble::tribble(
~"id", ~"col1", ~"col2",
1, "A", "A1",
2, "B", "B1",
3, "B", "B2"
)
check_equal_col_summaries(error_data, col1, col2)
Error in
check_equal_col_summaries()
:
✖ Encountered unequal naming outputs.
!combine_checkboxes()
call resulted in column output:A
,B
, andB
and new column output:A1
,B1
, andB2
.
Runrlang::last_trace()
to see where the error occurred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity what's an example with this set up of how someone would still trigger this?
Mmm I see. I think in our current set-up it can't get triggered so maybe it's redundant at this point? There are some weird cases like this but it's pretty contrived:
data <- tibble::tibble(.value = c("A", "A", "B"))
vector_in_env <- 1:3
data |>
mutate(.new_value = glue_data(data, "{vector_in_env}_{.value}")
All that is to say: maybe we should just drop it. If we keep it we probably want it to say something like "Checkbox field B
resulted in multiple output columns, B1
and B2
. Check that names_glue
defines only 1 output column for each checkbox field."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, yea I couldn't think of a way from the UI to trigger this but there's no real harm in keeping it in for now. I can put a comment in that says we may forgo it in the future. I'll update the error message to be closer to your suggestion.
- enforced check for new value levels - ensure failure still occurs for use of metadata col names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀 🚀 🚀
Description
This PR seeks to add out first "analytics" function as a tool for data analysts working with the supertibble after it's been created.
This first function,
combine_checkboxes()
, seeks to take the wide-form output of checkbox fields in a data tibble and do the following:TRUE/FALSE
/1/0
to showing:raw
orlabel
values associated with the checkbox if a single value is selectedmulti_value_label
)values_fill
)I implemented some additional things like
check_*
style functions, but interested in thoughts, feedback, etc. I figure this will go through a few iterations before being finalized.Proposed Changes
List changes below in bullet format:
combine_checkboxes()
check_fields_exist()
) and provide helpful error messagecheckbox
field types (check_fields_are_checkboxes()
) and provide helpful error messagecombine_checkboxes()
to pkgdown siteAdditional Changes:
revdepcheck
is no longer maintained, so I'm trying out a new workflow fromr-devel
called recheck. Another option is to usetools::check_packages_in_dir()
.revdepcheck
folder to stoprenv
issuesrenv
lockfile updated as wellIssue Addressed
Closes #194
PR Checklist
Before submitting this PR, please check and verify below that the submission meets the below criteria:
.RDS
) updated underinst/testdata/create_test_data.R
usethis::use_version()
Code Review
This section to be used by the reviewer and developers during Code Review after PR submission
Code Review Checklist