Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] yesno fields are converted to NA when the field contains missing data codes #181

Closed
pbilodeau94 opened this issue Mar 18, 2024 · 19 comments
Assignees
Labels
bug Something isn't working
Milestone

Comments

@pbilodeau94
Copy link

Current Behavior

When using read_redcap, all yes/no variables are all listed as N/A. Dummy factor variables for multiple checkboxes get pulled correctly as TRUE/FALSE.

How to Reproduce the Bug:

nmosd <- read_redcap(url, token) %>%
make_labelled(format_labels = fmt_strip_trailing_colon) %>%
bind_tibbles()

@pbilodeau94 pbilodeau94 added the bug Something isn't working label Mar 18, 2024
@ezraporter ezraporter changed the title [BUG] [BUG] yesno variables are pulled as NA Mar 18, 2024
@ezraporter
Copy link
Collaborator

Thanks for the bug report! I wasn't able to reproduce this issue with the test REDCap projects we have which include yesno variables. Could you try to provide a minimal reproducible example? That could look like:

  1. The token and URI for an existing redcap project that shows the issue if you're willing and able to share it.
  2. The dataframe output of the following two calls on a redcap project that shows the issue:
REDCapR::redcap_metadata_read(redcap_uri = [YOUR_URI], token = [YOUR_TOKEN])

REDCapR::redcap_read_oneshot(
    redcap_uri = [YOUR_URI],
    token = [YOUR_TOKEN],
    export_survey_fields = TRUE,
    export_data_access_groups = TRUE,
    guess_max = .Machine$integer.max
)

It would also be helpful if you could provide the results of running sessionInfo() after running the code to produce the bug.

@ezraporter ezraporter self-assigned this Mar 18, 2024
@pbilodeau94
Copy link
Author

Thanks for your quick response! I am not able to share the token and URL since there is PHI in the database. When I run this, it looks like those variables do accurately get pulled, unlike when using read_redcap.

@ezraporter
Copy link
Collaborator

Are you able to share what those results look like after manually removing PHI? Alternatively, if you have the ability to create REDCap projects at your institution, are you able to create an example that demonstrates the bug without PHI?

Since it sounds like downloading the data from REDCap isn't where the bug is originating, it would be helpful to see your sessionInfo() so I can check if this is specific to particular package versions.

@pbilodeau94
Copy link
Author

pbilodeau94 commented Mar 18, 2024 via email

@ezraporter
Copy link
Collaborator

Thanks! I don't think the screenshots made it through. Can you try to resend them? You may need to use the github.com UI instead of email for it to work

@pbilodeau94
Copy link
Author

read_redcap : Screenshot 2024-03-18 at 5 22 34 PM

RedcapR: Screenshot 2024-03-18 at 5 23 46 PM

@ezraporter
Copy link
Collaborator

Ah okay the issue is with those "UNK" values, which I guess your project is getting through the Missing Data Codes customization. I have to think a little more about how we should handle these but agree the current behavior isn't right so we'll implement some fix.

If you have a second to think about it, what would your expectation as a user be for a yesno field that has UNK values? Convert the field to a factor with levels Yes, No, UNK? Convert UNK to NA and have the field remain TRUE/FALSE? Something else?

@ezraporter ezraporter changed the title [BUG] yesno variables are pulled as NA [BUG] yesno fields are converted to NA when the field contains missing data codes Mar 18, 2024
@pbilodeau94
Copy link
Author

pbilodeau94 commented Mar 18, 2024 via email

@ezraporter ezraporter added this to the 1.1 milestone Mar 18, 2024
@skadauke
Copy link
Collaborator

Since Missing Data Codes (MDCs) can be applied to any field type, I think we need to consider what we want to happen for each of the various field types if a MDC is enabled here. The trickiest would be the boolean-type fields (yes/no, true/false, multiple choice??). The two options I see are 1) we no longer handle these fields as boolean but turn them into factors or 2) keep them boolean and recode MDC values as NA.

I guess the default behavior could #2. However since that is also how an empty value is coded that could cause data loss which shouldn't happen silently. So I think we might want to have read_redcap() throw a warning when this is detected suggesting that the user define the desired behavior. This could be done with an optional arg to read_redcap(), e.g. boolean_mdc_fields_as_factor?

Thoughts?

@rsh52
Copy link
Collaborator

rsh52 commented Mar 19, 2024

So I think we might want to have read_redcap() throw a warning when this is detected suggesting that the user define the desired behavior. This could be done with an optional arg to read_redcap(), e.g. boolean_mdc_fields_as_factor?

IMO this sounds better than (1), I don't think the default should be to change the data types. Sort of similar to how we went about #177, this isn't exactly a typical setup so if a user wants to change something it's good to do so consciously.

Handling this is going to be non-trivial, to my knowledge there's no easy way to detect that missing data codes exist (unless I'm totally overlooking something!). I made a test database to verify, below is a sample data and metadata output for a yesno and a text field with an UNK MDC:

Browse[2]> db_data
  record_id yesno text form_1_complete
1         1     1 text               2
2         2     0 text               2
3         3   UNK  UNK               0
Browse[2]> db_metadata
# A tibble: 3 × 18
  field_name form_name section_header field_type field_label select_choices_or_calculat…¹ field_note
  <chr>      <chr>     <chr>          <chr>      <chr>       <chr>                        <chr>     
1 record_id  NA        NA             text       Record ID   NA                           NA        
2 yesno      form_1    NA             yesno      YesNo       NA                           NA        
3 text       form_1    NA             text       Text        NA                           NA        
# ℹ abbreviated name: ¹​select_choices_or_calculations
# ℹ 11 more variables: text_validation_type_or_show_slider_number <chr>, text_validation_min <chr>,
#   text_validation_max <chr>, identifier <chr>, branching_logic <chr>, required_field <chr>,
#   custom_alignment <chr>, question_number <chr>, matrix_group_name <chr>, matrix_ranking <chr>,
#   field_annotation <chr>

With our current setup, I suppose the best bet is to check each variable type in the metadata, then check for values that don't coincide with expected values. I'm not sure how we would do this for plain text fields. This may also add more processing time, which supports not having this as a default behavior.

@pbilodeau94
Copy link
Author

pbilodeau94 commented Mar 19, 2024 via email

@ezraporter
Copy link
Collaborator

It looks like the missing data codes are returned in the Export Project Info API method so we theoretically could get them. REDCapR::redcap_project_info_read() returns them but it's locked up in the dev version. I would very much like to avoid getting them but I'm not sure we have a good choice.

There's really only 3 cases we currently have with processing fields of different types. For the first 2 it's pretty trivial to remove missing data codes without knowing what they are. For the third it's basically impossible. That leaves us in a bind if we want a solution that:

  1. Treats all data types equally w/r/t missing data codes. Either convert them to NA everywhere or nowhere
  2. Doesn't need to know the missing data codes ahead of time

I think the least of all evils is to convert to NA where we can without knowing the codes for now (cases 1 and 2) and finish the job once REDCapR 1.2.0 is on CRAN (case 3). If we hate that transitional plan we query the API directly as a holdover.

Case 1: yesno, truefalse, checkbox

Current behavior: Convert to logical with as.logical()

Possible new behavior: Check for non-1/0/NA and warn if found. Convert those to NA then convert to logical.

Case 2: dropdown, radio

Current behavior: Convert to factor using levels read from metadata. (This implicitly converts missing codes to NA since they aren't in the metadata.)

Possible new behavior: Check for data values not in the levels read from metadata and warn if found. Convert to factor, implicitly converting missing codes to NA.

Case 3: Everything else

Current behavior: Do nothing

Possible new behavior: For now, do nothing. When REDCapR 1.2.0 is on CRAN, read the MDCs from project info and convert to NA with a warning.

We can add an option to silence warnings so users can shut them off when they don't have control of the REDCap project.

@rsh52
Copy link
Collaborator

rsh52 commented Mar 19, 2024

After further discussion some additional behavioral things to note for MDCs:

  • checkbox fields automatically get an additional column added to the output, i.e. checkbox___unk for an UNK MDC
  • Determining and applying MDCs as factor levels for dropdown / radio would be difficult, and for projects where an MDC value hasn't been used yet is impossible without using redcap_project_info_read()
  • Text fields would be impossible to fully determine, considering the example where UNK could be a project MDC, and a user could still supply "UNK" as a value

@skadauke
Copy link
Collaborator

skadauke commented Mar 20, 2024

Considering that I suspect only a small subset of REDCap databases use MDCs, possibly breaking changes to the supertibble data model, and likely performance hit with robustly supporting MDCs, I think what Ezra outlines above is a reasonable solution for now (i.e. implement cases 1 and 2). We can consider implementing case (3) after someone demonstrates the need for this by opening a feature request. We can write a warning message that states that we are not currently supporting MDCs but if this is needed they should file an issue on our GitHub page.

@ezraporter
Copy link
Collaborator

I noticed when working on this that we currently don't convert fields to logical when raw_or_label = "raw". Do we like that behavior? I don't have strong opinions either way.

@rsh52
Copy link
Collaborator

rsh52 commented Mar 21, 2024

I noticed when working on this that we currently don't convert fields to logical when raw_or_label = "raw". Do we like that behavior? I don't have strong opinions either way.

Like how yesno fields are 1/0 for "raw" but TRUE/FALSE for "label"? I also don't feel strongly, 1/0 is what REDCap itself uses for raw values for these fields so it seems ok to keep as is.

image

@ezraporter
Copy link
Collaborator

Like how yesno fields are 1/0 for "raw" but TRUE/FALSE for "label"? I also don't feel strongly, 1/0 is what REDCap itself uses for raw values for these fields so it seems ok to keep as is.

Yeah okay I agree and it actually will keep our code simpler to keep as is. PR coming soon but I'm leaning towards also keeping the missing data codes when raw_or_label = "raw". That has the advantage of keeping our code simpler and giving users a way to get the MDCs if they really want.

@ezraporter
Copy link
Collaborator

@pbilodeau94 when you have a chance let us know if #182 solved your issue. You can try it out by installing the dev version

@pbilodeau94
Copy link
Author

Yes, looks like it's all set now! Thank you!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants