[BUG] Misleading Mixed Data Structure Outputs #204

rsh52 · 2024-09-17T19:09:45Z

Expected Behavior

When joining mixed data structures with read_redcap(..., allow_mixed_structure = TRUE), we expect tables to be made with meaningful, unique primary keys.

Current Behavior

During development of #199, we found that joining tables that both "repeat together" (i.e. as events, RT) and "repeat separate" (i.e. as instruments, RS) don't provide the appropriate primary keys to allow for distinction of RS rows during joins.

How to Reproduce the Bug:

Using the REDCap for a single record as set up below:

Currently read_redcap() gives us the following output:

> sprtbl$redcap_data
[[1]]
# A tibble: 4 × 5
  record_id redcap_event       redcap_form_instance mixed_structure_1 form_status_complete
      <dbl> <chr>                             <dbl> <chr>             <fct>               
1         1 repeating_together                    1 RT1               Complete            
2         1 repeating_together                    2 RT2               Complete            
3         1 repeating_separate                    1 RS1               Complete            
4         1 repeating_separate                    2 RS2               Complete            

[[2]]
# A tibble: 3 × 5
  record_id redcap_event       redcap_form_instance mixed_structure_2 form_status_complete
      <dbl> <chr>                             <dbl> <chr>             <fct>               
1         1 repeating_together                    1 RT1               Complete            
2         1 repeating_together                    2 RT2               Complete            
3         1 repeating_separate                    1 RS1               Complete

Using join_data_tibbles() with a "full join" we get this:

join_data_tibbles(sprtbl, x = "mixed_structure_1", y = "mixed_structure_2", type = "full")
# A tibble: 4 × 8
  record_id redcap_event       redcap_form_instance mixed_structure_1 redcap_event_instance mixed_structure_2 form_status_complete.x form_status_complete.y
      <dbl> <chr>                             <dbl> <chr>             <lgl>                 <chr>             <fct>                  <fct>                 
1         1 repeating_together                    1 RT1               NA                    RT1               Complete               Complete              
2         1 repeating_together                    2 RT2               NA                    RT2               Complete               Complete              
3         1 repeating_separate                    1 RS1               NA                    RS1               Complete               Complete              
4         1 repeating_separate                    2 RS2               NA                    NA                Complete               NA

The issue here is that in row 3, data for mixed_structure_1 and mixed_structure_2 should exist on separate rows because they are RS instances. As read_redcap() is currently set up, it is impossible to separate these because the primary keys for both are identical (record_id, redcap_event, redcap_form_instance). This is a by product of how we decided to mix redcap_form_instances meaning between repeat events and instruments

Solution Proposal

To fix this we will need to do the following:

Add the repeating event type to the redcap_events column of the supertibble. This will need to identify if an event is a RS/RT for us to reference in other functions like join_data_tibbles()
Revise how redcap_form_instance is used in these examples for RS rows. This may involve revising how redcap_form_instance/redcap_event_instance is defined but boils down to needing an additional primary key to help identify separate repeating instances. @ezraporter after making an example I think redcap_form_instance/redcap_event_instance might still not address this fully since these share the same event information. I think instead we need to identify RS data and then give it an additional column with the form name it came from. This might be able to get shifted back to the join function's responsibilities.

> tibble::tribble(
+   ~"record_id", ~"redcap_event", ~"redcap_form_instance", ~"redcap_event_instance", ~"extra_rs_key", ~"mixed_structure_1", ~"mixed_structure_2", ~"form_status_complete.x", ~"form_status_complete.y",
+   1, "repeat_together", 1, NA, NA, "RT1", "RT1", "Complete", "Complete",
+   1, "repeat_together", 2, NA, NA, "RT2", "RT2", "Complete", "Complete",
+   1, "repeat_separate", 1, NA, "mixed_structure_1", "RS1", NA, "Complete", NA,
+   1, "repeat_separate", 1, NA, "mixed_structure_2", NA, "RS1", NA, "Complete",
+   1, "repeat_separate", 2, NA, "mixed_structure_1", "RS2", NA, "Complete", NA
+   )
# A tibble: 5 × 9
  record_id redcap_event    redcap_form_instance redcap_event_instance extra_rs_key      mixed_structure_1 mixed_structure_2 form_status_complete.x form_status_complete.y
      <dbl> <chr>                          <dbl> <lgl>                 <chr>             <chr>             <chr>             <chr>                  <chr>                 
1         1 repeat_together                    1 NA                    NA                RT1               RT1               Complete               Complete              
2         1 repeat_together                    2 NA                    NA                RT2               RT2               Complete               Complete              
3         1 repeat_separate                    1 NA                    mixed_structure_1 RS1               NA                Complete               NA                    
4         1 repeat_separate                    1 NA                    mixed_structure_2 NA                RS1               NA                     Complete              
5         1 repeat_separate                    2 NA                    mixed_structure_1 RS2               NA                Complete               NA

Checklist

Before submitting this issue, please check and verify below that the submission meets the below criteria:

The issue is atomic
The issue description is documented
The issue title describes the problem succinctly
Developers are assigned to the issue
Labels are assigned to the issue

The text was updated successfully, but these errors were encountered:

rsh52 · 2024-09-17T20:19:50Z

Hm, going to need to think more about this. After doing some debugging, the solution I proposed essentially is just leading us back to re-implementing redcap_repeat_instrument/redcap_event_name which we originally wanted to avoid.

db_data_long
# A tibble: 5 × 8
  record_id redcap_event_name        redcap_repeat_instrument redcap_repeat_instance mixed_structure_1 mixed_structure_1_complete mixed_structure_2 mixed_structure_2_complete
      <dbl> <chr>                    <chr>                                     <dbl> <chr>             <fct>                      <chr>             <fct>                     
1         1 repeating_together_arm_1 NA                                            1 RT1               Complete                   RT1               Complete                  
2         1 repeating_together_arm_1 NA                                            2 RT2               Complete                   RT2               Complete                  
3         1 repeating_separate_arm_1 mixed_structure_1                             1 RS1               Complete                   NA                NA                        
4         1 repeating_separate_arm_1 mixed_structure_1                             2 RS2               Complete                   NA                NA                        
5         1 repeating_separate_arm_1 mixed_structure_2                             1 NA                NA                         RS1               Complete

In trying to separate rows based on RS, what we're looking for is sparsity that we nixed on principle when consolidating repeating instances under redcap_form_instance in our data tibbles.

rsh52 · 2024-09-18T15:22:41Z

RT's need to have their values for redcap_form_instance moved over to redcap_event_instance
RS's need to be joined by record_id and redcap_event and not redcap_form_instance

This ensures RSs are joined correctly, may repeat information from the table being joined on, but that this is technically a "correct" join. Versus a "stacked" table where the observations are staggered, which is what the REDCapR/sparse output gives you.

What we need to do: update how we determine repeat events and move those values over to redcap_event_instance, then update join_data_tibbles() joining operation.

rsh52 · 2024-09-24T18:41:12Z

This ensures RSs are joined correctly, may repeat information from the table being joined on, but that this is technically a "correct" join. Versus a "stacked" table where the observations are staggered, which is what the REDCapR/sparse output gives you.

After banging my head against this for a while, I don't think this is actually correct either and am worried that getting this to work would create an extremely opaque output.

Consider this output, with a newly-implemented .repeat_type col taken from the updates in #205 where we can now access the repeat event type from the redcap_events col of the supertibble. This uses the extra mixed test db:

Browse[1]> x
# A tibble: 4 × 7
  record_id redcap_event       redcap_form_instance redcap_event_instance mixed_structure_1 form_status_complete .repeat_type   
      <dbl> <chr>                             <dbl>                 <dbl> <chr>             <fct>                <chr>          
1         1 repeating_together                   NA                     1 RT1               Complete             repeat_together
2         1 repeating_together                   NA                     2 RT2               Complete             repeat_together
3         1 repeating_separate                    1                    NA RS1               Complete             repeat_separate
4         1 repeating_separate                    2                    NA RS2               Complete             repeat_separate
Browse[1]> y
# A tibble: 3 × 7
  record_id redcap_event       redcap_form_instance redcap_event_instance mixed_structure_2 form_status_complete .repeat_type   
      <dbl> <chr>                             <dbl>                 <dbl> <chr>             <fct>                <chr>          
1         1 repeating_together                   NA                     1 RT1               Complete             repeat_together
2         1 repeating_together                   NA                     2 RT2               Complete             repeat_together
3         1 repeating_separate                    1                    NA RS1               Complete             repeat_separate
Browse[1]> dplyr::full_join(x,y,by)
# A tibble: 4 × 10
  record_id redcap_event       redcap_form_instance redcap_event_instance mixed_structure_1 form_status_complete.x .repeat_type.x  mixed_structure_2 form_status_complete.y .repeat_type.y 
      <dbl> <chr>                             <dbl>                 <dbl> <chr>             <fct>                  <chr>           <chr>             <fct>                  <chr>          
1         1 repeating_together                   NA                     1 RT1               Complete               repeat_together RT1               Complete               repeat_together
2         1 repeating_together                   NA                     2 RT2               Complete               repeat_together RT2               Complete               repeat_together
3         1 repeating_separate                    1                    NA RS1               Complete               repeat_separate RS1               Complete               repeat_separate
4         1 repeating_separate                    2                    NA RS2               Complete               repeat_separate NA                NA                     NA

Before we said that the desired output would be for those last three NAs in row 4 to be repeats of the values in row 3. But this doesn't make sense of what should be done with the extra data in y that also qualifies based on the join-by columns (i.e. repeating separate instance 2+) since we're not using redcap_form_instance as a by column. I tried out separating the behaviors of these into RT x and y tables, nonrepeating form RS event tables, and repeating form RS event tables, joining by the columns that individually mattered, and binding the rows. But this output is complicated and untraceable for a user.

I'll think some more on this, but think this warrants a larger discussion. In the end we need to make sure we "intelligently join" on columns that are transparent to the user and handling this succinctly with the various behaviors an individual form/table gives proves very challenging.

rsh52 added the bug Something isn't working label Sep 17, 2024

rsh52 self-assigned this Sep 17, 2024

rsh52 mentioned this issue Sep 20, 2024

Missing Data Fix, Partial Keys Fix #205

Merged

13 tasks

rsh52 closed this as completed in #205 Oct 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Misleading Mixed Data Structure Outputs #204

[BUG] Misleading Mixed Data Structure Outputs #204

rsh52 commented Sep 17, 2024

rsh52 commented Sep 17, 2024

rsh52 commented Sep 18, 2024 •

edited

Loading

rsh52 commented Sep 24, 2024

[BUG] Misleading Mixed Data Structure Outputs #204

[BUG] Misleading Mixed Data Structure Outputs #204

Comments

rsh52 commented Sep 17, 2024

Expected Behavior

Current Behavior

How to Reproduce the Bug:

Solution Proposal

Checklist

rsh52 commented Sep 17, 2024

rsh52 commented Sep 18, 2024 • edited Loading

rsh52 commented Sep 24, 2024

rsh52 commented Sep 18, 2024 •

edited

Loading