Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

combine_checkboxes #196

Merged
merged 39 commits into from
Aug 13, 2024
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
Show all changes
39 commits
Select commit Hold shift + click to select a range
9b0471b
Reduce function initial draft
Jul 10, 2024
859926f
Reduce function fixes
Jul 11, 2024
c42399d
Small fixes
Jul 11, 2024
8862ac1
Draft tests, add no_val param
Jul 12, 2024
96c3309
Add keep param
Jul 12, 2024
218fca4
Fix `keep` param
Jul 15, 2024
54e3a99
Update documentation and API
Jul 15, 2024
ec5c19d
Update combine_checkbox api and docs
Jul 15, 2024
6b1fb48
Add check for if no fields exist in selection
Jul 15, 2024
d55dc00
Add check_fields_are_checkboxes function
Jul 15, 2024
261342d
Minor cleaning
Jul 15, 2024
e1d4eb8
Update version, test recheck workflow
Jul 15, 2024
7207f09
Test recheck workflow file
Jul 15, 2024
4f861e1
Fix linting
Jul 15, 2024
eb11152
Add combine_checkboxes() to pkgdown
Jul 15, 2024
7348324
Remove revdepcheck, update renv
Jul 15, 2024
62080af
Add standard checks for params
Jul 15, 2024
522d01d
Filename update
Jul 16, 2024
3a395cf
Filename change
Jul 16, 2024
347d2a3
Rename test file
rsh52 Jul 17, 2024
cce0d12
Fix record_id_field assign, remove rowwise call
rsh52 Jul 17, 2024
c250eda
Remove instrument_identifiers, use bind_cols
rsh52 Jul 17, 2024
b0a8564
Implement parse_labels, clean code, fix tests
rsh52 Jul 17, 2024
21f8879
Remove record_id field, lint
rsh52 Jul 17, 2024
31797c6
Apply additional cleanup suggestions
rsh52 Jul 18, 2024
2dfac9a
Add extract_metadata fnctn, tests
rsh52 Jul 18, 2024
ed55292
Support multiple values_to, logicals, new checks
Jul 24, 2024
c0b3885
Linting
Jul 24, 2024
7789a22
Update API, clean up, new methods, new docs
Jul 29, 2024
c185e39
Add check_metadata_fields_exist, update details
Aug 2, 2024
abdc512
Consoldiate and rework checkbox value conversion
Aug 2, 2024
50d47d6
Add names_repair strategy support
Aug 5, 2024
a6d150d
Remove names_suffix, restructure prefix/sep
Aug 5, 2024
0f868b8
Add names_glue spec
Aug 5, 2024
abefbee
Add glue support with names_glue
Aug 5, 2024
06d1337
Make glue dependency, remove install check
rsh52 Aug 12, 2024
dcb1029
Update glue spec handling
rsh52 Aug 12, 2024
0295650
check_equal_col_summaries() implementation
rsh52 Aug 12, 2024
127dd46
Update error message check_equal_col_summaries()
rsh52 Aug 13, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@ Imports:
stats
Suggests:
covr,
glue,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can import glue. It's already a dependency of packages we import (tidyr, stringr at least) so we're not really changing anything by bumping it up from Suggests

knitr,
labelled,
lintr,
Expand Down
29 changes: 29 additions & 0 deletions R/checks.R
Original file line number Diff line number Diff line change
Expand Up @@ -684,6 +684,35 @@ check_fields_exist <- function(fields, expr, call = caller_env()) {
}
}

#' @title
#' Check metadata fields exist for checkbox combination
#'
#' @description
#' Similar to [check_fields_exist()], but instead of verifying fields that exist
#' in the data tibble this seeks to verify their existence under the metadata
#' tibble `field_name`s.
#'
#' @param metadata_tbl A metadata tibble from the supertibble generated by [read_redcap()].
#' @param cols Selected columns identified for [`combine_checkboxes()`] to be
#' cross checked against `metadata_tibble$field_name`
#' @param call The calling environment to use in the error message
#'
#' @keywords internal
check_metadata_fields_exist <- function(metadata_tbl, cols, call = caller_env()) {
if (!all(cols %in% metadata_tbl$field_name)) {
msg <- c(
x = "Fields detected not present in metadata.",
`!` = "Column{?s} {.code {cols[!cols %in% metadata_tbl$field_name]}} detected as valid in the data tibble, but not found present in the metadata tibble.", # nolint: line_length_linter
`i` = "This may occur if either the names of the data tibble or the metadata tibble `field_name`s were edited."
)

cli_abort(
msg,
class = c("missing_metadata_checkbox_fields", "REDCapTidieR_cond")
)
}
}


#' @title
#' Check fields are of checkbox field type
Expand Down
195 changes: 118 additions & 77 deletions R/combine_checkboxes.R
Original file line number Diff line number Diff line change
Expand Up @@ -6,14 +6,28 @@
#' combining multiple binary columns into a singular and informative labelled
#' factor column.
#'
#' @details
#' [combine_checkboxes()] makes use of the output names of [read_redcap()]
#' data tibbles and metadata tibbles. Changes to checkbox data names or
#' metadata `field_name`s may result in errors.
#'
#' Checkbox fields are expanded to be a variable per checkbox option, separated
#' by underscores. For example, `checkbox_var` with 2 options becomes
#' `checkbox_var___1` and `checkbox_var___2`. [combine_checkboxes()] looks for
#' these and may give a error if it cannot find them.
#'
#' @param supertbl A supertibble generated by [read_redcap()]. Required.
#' @param tbl The `redcap_form_name` of the data tibble to extract. Required.
#' @param cols <[`tidy-select`][tidyr_tidy_select]> Checkbox columns to combine to
#' single column. Required.
#' @param names_prefix String added to the start of every variable name.
#' @param names_suffix String added to the end of every variable name.
#' @param names_sep String to separate new column names from `names_prefix` and/or
#' `names_suffix`.
#' @param names_sep String to separate new column names from `names_prefix`.
#' @param names_glue Instead of `names_sep` and `names_prefix`, you can supply
#' a glue specification and the unique `.value` to create custom column names.
#' @param names_repair What happens if the output has invalid column names?
#' The default, "check_unique" is to error if the columns are duplicated.
#' Use "minimal" to allow duplicates in the output, or "unique" to de-duplicated
#' by adding numeric suffixes. See [vctrs::vec_as_names()] for more options.
#' @param multi_value_label A string specifying the value to be used when multiple
#' checkbox fields are selected. Default "Multiple".
#' @param values_fill Value to use when no checkboxes are selected. Default `NA`.
Expand Down Expand Up @@ -42,17 +56,19 @@ combine_checkboxes <- function(supertbl,
tbl,
cols,
names_prefix = "",
names_suffix = NULL,
names_sep = "_",
names_glue = NULL,
names_repair = "check_unique",
multi_value_label = "Multiple",
values_fill = NA,
raw_or_label = "label",
keep = TRUE) {
# Check args ---
check_arg_is_supertbl(supertbl, req_cols = c("redcap_data", "redcap_metadata"))
check_arg_is_character(names_prefix, len = 1)
check_arg_is_character(names_suffix, len = 1, null.ok = TRUE)
check_arg_is_character(names_sep, len = 1, any.missing = TRUE)
check_arg_is_character(names_repair, len = 1, any.missing = FALSE)
check_arg_is_character(names_glue, len = 1, any.missing = FALSE, null.ok = TRUE)
check_arg_is_character(tbl, len = 1, any.missing = FALSE)
check_arg_is_character(multi_value_label, len = 1, any.missing = TRUE)
check_arg_is_character(values_fill, len = 1, any.missing = TRUE)
Expand All @@ -72,26 +88,11 @@ combine_checkboxes <- function(supertbl,

# Get metadata reference table, check that chosen fields are checkboxes
metadata_tbl <- supertbl$redcap_metadata[supertbl$redcap_form_name == tbl][[1]]
metadata_spec <- get_metadata_spec(metadata_tbl, selected_cols, names_prefix, names_suffix, names_sep)
metadata_spec <- get_metadata_spec(metadata_tbl, selected_cols, names_prefix, names_sep, names_glue)

# Define .new_col as the count of TRUEs/1s for the given checkbox field
# Assign TRUE if multiple selections made, and FALSE if one or zero made
# Copy data_tbl to mod, data_tbl to be referenced later
data_tbl_mod <- data_tbl
.new_col <- unique(metadata_spec$.new_value)

for (i in seq_along(.new_col)) {
cols_to_sum <- metadata_spec$field_name[metadata_spec$.new_value == .new_col[i]] # nolint: object_usage_linter

data_tbl_mod <- data_tbl_mod %>%
mutate(
!!.new_col[i] := case_when(
rowSums(select(., cols_to_sum)) > 1 ~ TRUE,
.default = FALSE
)
)
}

# Replace TRUEs/1s with raw/label values from metadata
data_tbl_mod <- data_tbl_mod %>%
mutate(
across(
Expand All @@ -105,15 +106,14 @@ combine_checkboxes <- function(supertbl,
across(selected_cols, as.character) # enforce to character strings
)

# Use the metadata_spec table to fill values in .new_col
data_tbl_mod <- reduce(.new_col, function(tbl, col_item) {
convert_metadata_spec(col_item, metadata_spec, tbl, raw_or_label, multi_value_label, values_fill)
}, .init = data_tbl_mod)
new_cols <- metadata_spec %>%
nest(.by = .data$.new_value, .key = "metadata") %>%
pmap(convert_checkbox_vals,
data_tbl = data_tbl_mod,
raw_or_label = raw_or_label, multi_value_label = multi_value_label, values_fill = values_fill
)

final_tbl <- bind_cols(
data_tbl,
data_tbl_mod %>% select(!!.new_col)
)
final_tbl <- combine_and_repair_tbls(data_tbl, data_tbl_mod, new_cols, names_repair = names_repair)

# Keep or remove original multi columns
if (!keep) {
Expand All @@ -139,17 +139,30 @@ combine_checkboxes <- function(supertbl,
get_metadata_spec <- function(metadata_tbl,
selected_cols,
names_prefix,
names_suffix,
names_sep) {
names_sep,
names_glue) {
check_metadata_fields_exist(metadata_tbl, selected_cols)

# Create a metadata reference table linking field name to raw and label values
out <- metadata_tbl %>%
filter(.data$field_name %in% selected_cols) %>%
mutate(
.value = sub("___.*$", "", .data$field_name),
.new_value = case_when(!is.null(names_suffix) ~ paste(names_prefix, .value, names_suffix, sep = names_sep),
.default = paste(names_prefix, .data$.value, sep = names_sep)
if (!is.null(names_glue)) {
check_installed("glue", reason = "to use `names_glue` in `combine_checkboxes()`")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can remove if we import glue

Suggested change
check_installed("glue", reason = "to use `names_glue` in `combine_checkboxes()`")

# Similar to pivot_*, use of `names_glue` overrides use of names_prefix/sep
out <- metadata_tbl %>%
filter(.data$field_name %in% selected_cols) %>%
mutate(
.value = sub("___.*$", "", .data$field_name),
.new_value = as.character(glue::glue(names_glue))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to be more careful with this implementation of names_glue.

Right now we're giving names_glue access to everything in the metadata which can cause weird results:

db_label |>
  combine_checkboxes("demographics", starts_with("race"), names_glue = "xyz_{field_name}") |>
  extract_tibble("demographics")

In this example we actually create the wrong number of output columns silently.

I think we should:

  1. Use glue_data() rather than glue() to scope what the user actually has access to glue with:
# Could include more things we want to give the user access to here
glue_env <- select(out, .value)

out <- out |>
  mutate(.new_value = glue_data(glue_env, names_glue))
  1. Enforce the constraint that .new_value is the same within each level of .value. This ensures the user always gets the expected number of output columns.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ezraporter quick q. Happy to use glue_data() as discussed, but in this example the output is still going to be the same since field_name is still accessible from out. The intention here is to have it fail instead of falsely grabbing the field_name from the metadata (for now) right? See if this is the change we're looking for for (1):

get_metadata_spec <- function(metadata_tbl,
                              selected_cols,
                              names_prefix,
                              names_sep,
                              names_glue) {
  check_metadata_fields_exist(metadata_tbl, selected_cols)

  # Create a metadata reference table linking field name to raw and label values
  out <- metadata_tbl %>%
    filter(.data$field_name %in% selected_cols) %>%
    mutate(
      .value = sub("___.*$", "", .data$field_name)
    )

  if (!is.null(names_glue)) {
    # Similar to pivot_*, use of `names_glue` overrides use of names_prefix/sep
    glue_env <- select(out, .value) %>%
      mutate(.new_value = as.character(glue::glue_data(., names_glue))) %>%
      select(.new_value)

    out <- cbind(out, glue_env)
  } else {
    out <- out %>%
      mutate(
        .new_value = case_when(names_prefix != "" ~ paste(names_prefix, .value, sep = names_sep),
          .default = paste(names_prefix, .data$.value, sep = "")
        )
      )
  }

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Enforce the constraint that .new_value is the same within each level of .value. This ensures the user always gets the expected number of output columns.

See what you think of the most recent commit for this (1).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand why we're converting things to factors. The validation I was talking about was something like:

check <- out |>
  group_by(.value) |>
  summarize(n=n_distinct(.new_value)) |>
  pull(n)

if (!all(check == 1)) {
  # Throw an error
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, I thought you wanted more of an enforcement not a check. Can implement.

Out of curiosity what's an example with this set up of how someone would still trigger this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check function implemented, but unsure what the message should be so let me know how you'd like to tweak.

  error_data <- tibble::tribble(
    ~"id", ~"col1", ~"col2",
    1, "A", "A1",
    2, "B", "B1",
    3, "B", "B2"
  )

  check_equal_col_summaries(error_data, col1, col2)

Error in check_equal_col_summaries():
✖ Encountered unequal naming outputs.
! combine_checkboxes() call resulted in column output: A, B, and B and new column output: A1, B1, and B2.
Run rlang::last_trace() to see where the error occurred.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity what's an example with this set up of how someone would still trigger this?

Mmm I see. I think in our current set-up it can't get triggered so maybe it's redundant at this point? There are some weird cases like this but it's pretty contrived:

data <- tibble::tibble(.value = c("A", "A", "B"))
vector_in_env <- 1:3

data |>
  mutate(.new_value = glue_data(data, "{vector_in_env}_{.value}")

All that is to say: maybe we should just drop it. If we keep it we probably want it to say something like "Checkbox field B resulted in multiple output columns, B1 and B2. Check that names_glue defines only 1 output column for each checkbox field."

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, yea I couldn't think of a way from the UI to trigger this but there's no real harm in keeping it in for now. I can put a comment in that says we may forgo it in the future. I'll update the error message to be closer to your suggestion.

)
)
} else {
out <- metadata_tbl %>%
filter(.data$field_name %in% selected_cols) %>%
mutate(
.value = sub("___.*$", "", .data$field_name),
.new_value = case_when(names_prefix != "" ~ paste(names_prefix, .value, sep = names_sep),
.default = paste(names_prefix, .data$.value, sep = "")
)
)
}

# Make sure selection is checkbox metadata field type
check_fields_are_checkboxes(out)
Expand Down Expand Up @@ -192,52 +205,80 @@ replace_true <- function(col, col_name, metadata, raw_or_label) {
return(col)
}

#' @title Use metadata_spec to convert new column values
#' @title Convert a new checkbox column's values
#'
#' @description
#' [convert_metadata_spec()] uses the `metadata_spec` table provided by [get_metadata_spec()]
#' to automatically convert new column values to either:
#' @description This function takes a single column of data and converts the values
#' based on the overall data tibble cross referenced with a nested section of the
#' metadata tibble.
#'
#' - A `raw_or_label` checkbox value when only a single value is detected
#' - `mult_value_label` when multiple values are detected
#' - `values_fill` when `NA` is detected
#' [case_when()] logic helps determine whether the value is a coalesced singular
#' value or a user-specified one via `multi_value_label` or `values_fill`.
#'
#' @inheritParams combine_checkboxes
#' @param .new_col_item A character string
#' @param metadata_spec A tibble output from [convert_metadata_spec()]
#' @param data_tbl_mod A modified data tibble
#'
#' @returns a tibble
#' @details
#' This function is used in conjunction with [pmap()].
#'
#' @keywords internal
convert_metadata_spec <- function(.new_col_item,
metadata_spec,
data_tbl_mod,
raw_or_label,
multi_value_label,
values_fill) {
.col_group <- metadata_spec$field_name[metadata_spec$.new_value == .new_col_item]

metadata_overwrite <- metadata_spec %>%
filter(.data$field_name %in% .col_group) %>%
pull(raw_or_label)

data_tbl_mod <- data_tbl_mod %>%
#'
#' @param metadata A nested portion of the overall metadata tibble
#' @param data_tbl The data tibble from the original supertibble
#' @param .new_value The new column values made by [combine_checkboxes()]
#' @inheritParams combine_checkboxes
convert_checkbox_vals <- function(metadata, .new_value, data_tbl, raw_or_label, multi_value_label, values_fill) {
tibble(
!!.new_value := rowSums(!is.na(data_tbl[names(data_tbl) %in% metadata$field_name]))
) %>%
mutate(
!!.new_col_item := ifelse(!!sym(.new_col_item),
multi_value_label,
coalesce(!!!syms(.col_group))
!!.new_value := case_when(. > 1 ~ multi_value_label,
. == 1 ~ coalesce(!!!data_tbl[, names(data_tbl) %in% metadata$field_name]),
.default = values_fill
),
!!.new_col_item := ifelse(is.na(!!sym(.new_col_item)),
values_fill,
!!sym(.new_col_item)
)
) %>%
mutate(
!!.new_col_item := factor(!!sym(.new_col_item),
levels = c(metadata_overwrite, multi_value_label, values_fill)
!!.new_value := factor(!!sym(.new_value),
levels = c(metadata[[raw_or_label]], multi_value_label, values_fill)
)
)
}

#' @title Combine checkbox fields with respect to repaired outputs
#'
#' @description
#' This function seeks to preserve the original data columns and types from the
#' originally supplied data_tbl and add on the new columns from data_tbl_mod.
#'
#' If `names_repair` presents a repair strategy, the output columns will be
#' captured and updated here while dropping the original columns.
#'
#' @param data_tbl The original data table given to [combine_checkboxes()]
#' @param data_tbl_mod A modified data table from `data_tbl`
#' @param new_cols The new columns created for checkbox combination
#' @inheritParams combine_checkboxes
#'
#' @keywords internal
#'
#' @returns a tibble
combine_and_repair_tbls <- function(data_tbl, data_tbl_mod, new_cols, names_repair) {
# Perform initial column bind with repair strategy
data_tbl_mod <- bind_cols(data_tbl_mod, new_cols, .name_repair = names_repair)

# Get the column names of each table
cols_data_tbl <- names(data_tbl)
cols_data_tbl_mod <- names(data_tbl_mod)

# Identify common columns
common_cols <- intersect(cols_data_tbl, cols_data_tbl_mod)

# Identify unique columns in data_tbl_mod
unique_cols_mod <- setdiff(cols_data_tbl_mod, cols_data_tbl)

# Select common columns from data_tbl
common_data <- data_tbl %>%
select(all_of(common_cols))

# Select unique columns from data_tbl_mod
unique_data_mod <- data_tbl_mod %>%
select(all_of(unique_cols_mod))

# Combine the selected columns
combined_data <- bind_cols(common_data, unique_data_mod)

return(data_tbl_mod)
return(combined_data)
}
22 changes: 22 additions & 0 deletions man/check_metadata_fields_exist.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

31 changes: 31 additions & 0 deletions man/combine_and_repair_tbls.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading
Loading