Misleading errors when reading sparse data without a record_id column #471

pbchase · 2023-02-21T16:17:37Z

I am seeing odd behavior from redcap_read when reading a single form of data. This is with REDCap 1.1.0, R 4.2.2, REDCap 12.4.2. If I read a sparsely-filled form--a form that does not have a row of data for every person described by the record_id column--I get this confusing error message:

> result <- REDCapR::redcap_read(
+   batch_size = 1000,
+   redcap_uri = source_credentials$redcap_uri,
+   token = source_credentials$token,
+   forms = c("next_visit_form")
+ )
The data dictionary describing 9,457 fields was read from REDCap in 1.3 seconds.  The http status code was 200.
3,515 records and 2 columns were read from REDCap in 0.8 seconds.  The http status code was 200.                                                                                                        
Starting to read 772 records  at 2023-02-21 10:51:53.
Reading batch 1 of 1, with subjects 110001 through A17-001 (ie, 772 unique subject records).
3,515 records and 5 columns were read from REDCap in 0.9 seconds.  The http status code was 200.                                                                                                        

── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
  visitdate_ideal = col_date(format = ""),
  visitdate_earliest = col_date(format = ""),
  visitdate_latest = col_date(format = ""),
  visitdate_scheduled = col_date(format = ""),
  next_visit_form_complete = col_double()
)

Error in REDCapR::redcap_read(batch_size = 1000, redcap_uri = source_credentials$redcap_uri,  : 
  There are 772 subject(s) that are missing rows in the returned dataset. REDCap's PHP code is likely trying to process too much text in one bite.

Common solutions this problem are:
  - specifying only the records you need (w/ `records`)
  - specifying only the fields you need (w/ `fields`)
  - specifying only the forms you need (w/ `forms`)
  - specifying a subset w/ `filter_logic`
  - reduce `batch_size`

The missing ids are:
110001,110002,110003,110004,110005,110006,110007,110008,110009,110010,110011,110012,110013,110014,110015,110016,110017,110018,110019,110020,110021,110022,110023,110024,110025,110026,110027,110028,110029,110030,110031,110032,110033,110034,110036,110037,110038,110039,110040,110041,110042,110043,110044,110045,110046,110047,110048,110049,110050,110051,110052,110053,110054,110055,110056,110057,110058,110059,110060,110061,110062,110063,110064,110065,110066,110067,110068,110069,110070,110071,110072,110073,110074,110075,110076,110077,110078,110079,110080,110081,

All of the suggestions are misleading. The real fix is to add the record_id column (ptid in my case) e.g.,

> result <- REDCapR::redcap_read(
+   batch_size = 1000,
+   redcap_uri = source_credentials$redcap_uri,
+   token = source_credentials$token,
+   fields = c("ptid"),
+   forms = c("next_visit_form")
+ )
The data dictionary describing 9,457 fields was read from REDCap in 1.2 seconds.  The http status code was 200.
3,516 records and 2 columns were read from REDCap in 0.6 seconds.  The http status code was 200.                                                                                                        
Starting to read 773 records  at 2023-02-21 11:08:12.
Reading batch 1 of 1, with subjects 110001 through A17-001 (ie, 773 unique subject records).
3,516 records and 7 columns were read from REDCap in 0.8 seconds.  The http status code was 200.                                                                                                        

── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
  ptid = col_character(),
  redcap_event_name = col_character(),
  visitdate_ideal = col_date(format = ""),
  visitdate_earliest = col_date(format = ""),
  visitdate_latest = col_date(format = ""),
  visitdate_scheduled = col_date(format = ""),
  next_visit_form_complete = col_double()
)

Whereas I say the error message is misleading, I'm not sure adjusting it is the fix. I feel like we used to always get the record_id column even if we never asked for it. That makes sense. Reading data without its primary key is silly unless you are anonymizing data. That hardly seems like a good default. Did that change? Was it supposed to?

The text was updated successfully, but these errors were encountered:

wibeasley · 2023-02-21T20:03:08Z

Did that change? Was it supposed to?

Yes, I think you're right, but I'm not seeing anything pop when I quickly skim the release notes since v8). I feel it happened about the same time that the record_id would be hashed if the user didn't have PHI rights.

There's always a chance that I'm wrong and REDCap is returning the field but REDCapR is not. I can look deeper if it matters.

Either way, I'm leaning in favor of letting the user decide to omit the variable (even if it's more frequently an accident than intentional).

Do you have a strong preference to always include it? Are you still in favor of just adding another bullet in the error message quoted above --it sounded like that was your first reaction.

pbchase · 2023-02-22T16:27:19Z

I would honor what the API does. If it omits the record_id, when the caller does not name it, honor that.

As for what action to take, I think you should replace the hard stop with a warning. To the warning text, add a bullet that says something like, "...or perhaps the fields and forms you requested are sparse" and also change the hard stop to a warning. If REDCapR's assessment of the situation could be wrong, the hard stop causes problems with a non-obvious fix. I found the hard stop annoying more than anything. I knew very well the form I was querying was sparse and REDCapR wouldn't let me run my query. If I could have seen the result, I could assess that sparseness, decide if the error messages were accurate or not, note that the record_id column was missing from data, add it, and even address the "problem".

wibeasley · 2023-04-23T15:24:41Z

I think the real fix is to unambiguously determine the name of record_id if it's been renamed. I've posted this feature/question on Community: https://redcap.vanderbilt.edu/community/post.php?id=203859.

This would prevent the guessing of the record_id field, which in this scenario will always fail because it is not requested/returned.

REDCapR/R/redcap-read.R

Lines 448 to 449 in 18b9663

    
           unique_ids_actual <- sort(unique(ds_stacked[[id_position]])) 
        
           ids_missing_rows  <- setdiff(unique_ids, unique_ids_actual)

simX · 2023-07-05T16:59:12Z

FWIW, I just migrated to a new Mac where I did a fresh install of R, RStudio, and REDCapR, and I am seeing this exact issue, whereas it's not happening on my older Mac (presumably because I haven't updated the REDCapR package). I troubleshooted myself and came to the same conclusion, that you need to include the name of the record_id field and include it in the list of fields you request, otherwise you get the misleading error posted in the initial issue. I do also have one project where record_id has been renamed to something else, so that threw me for a loop for a bit.

I have no real preference on whether the record_id field should always be included, though I don't think it hurts to always include it since it's the de facto id field for records in that project. But the user of REDCapR should not be required to request this field to get any data at all, as I don't think that's clear from the documentation (which makes sense since this is a recent change in behavior).

simX · 2023-07-05T19:03:59Z

FYI I just checked on my other computers: this issue does not occur with REDCapR package version 1.0.0. It does occur with version 1.1.0. I used getNamespaceVersion("REDCapR") to figure out the currently loaded package version.

pbchase · 2023-07-06T14:35:12Z

I think the real fix is to unambiguously determine the name of record_id if it's been renamed. I've posted this feature/question on Community: https://redcap.vanderbilt.edu/community/post.php?id=203859.

This would prevent the guessing of the record_id field, which in this scenario will always fail because it is not requested/returned.

@wibeasley, the record_id column is always the first row in the metadata. Don't you always query the metadata? If so, you would have the name of the record_id field name before you queried for data.

If that feels too hacky, you could write your own method to read the record_id column from the metadata, then reference that method to get the record_id column name. Then you have the easy option to revise the method when a better, REDCap-official way is released.

simX · 2023-07-06T17:40:31Z

Is it worth trying to figure out why version 1.0.0 of the REDCapR package still works in the previous fashion, without requiring you to include the record_id field name in the list of fields to query?

wibeasley · 2023-07-06T17:48:00Z

I'll have time tomorrow to look at this closely. I remember starting on this issue at least. The id_position parameter was my solution until I REDCap supports something like https://redcap.vanderbilt.edu/community/post.php?id=203859. @pbchase, does this address what you're suggesting

https://ouhscbbmc.github.io/REDCapR/reference/redcap_read.html

@simX, in the meantime, can you try the version on GitHub and see if it does what you want? Install with remotes::install_github("OuhscBbmc/REDCapR").

pbchase · 2023-07-06T18:30:06Z

@pbchase, does this address what you're suggesting

https://ouhscbbmc.github.io/REDCapR/reference/redcap_read.html

@wibeasley, yes, the id_position parameter will work fine.

wibeasley · 2023-07-09T17:55:51Z

@simX, I think this has been fixed for a while with the main version on GitHub. Here's how the two read functions operate. Is this a problem for your scenario?

uri      <- "https://bbmc.ouhsc.edu/redcap/api/"
token    <- "22C3FF1C8B08899FB6F86D91D874A159" # pid 3181

REDCapR::redcap_read_oneshot(
  redcap_uri  = uri, 
  token       = token,
  forms       = "blood_pressure",
  verbose     = FALSE 
)$data
#> # A tibble: 6 × 3
#>     sbp   dbp blood_pressure_complete
#>   <dbl> <dbl>                   <dbl>
#> 1   1.1  11.1                       2
#> 2   1.2  11.2                       2
#> 3   1.3  11.3                       2
#> 4   2.1  22.1                       2
#> 5   2.2  22.2                       2
#> 6   2.3  22.3                       2

REDCapR::redcap_read(
  redcap_uri  = uri, 
  token       = token,
  forms       = "blood_pressure",
  verbose     = FALSE 
)$data
#> # A tibble: 8 × 6
#>   record_id redcap_repeat_instrument redcap_repeat_instance   sbp   dbp
#>       <dbl> <chr>                                     <dbl> <dbl> <dbl>
#> 1         1 <NA>                                         NA  NA    NA  
#> 2         1 blood_pressure                                1   1.1  11.1
#> 3         1 blood_pressure                                2   1.2  11.2
#> 4         1 blood_pressure                                3   1.3  11.3
#> 5         2 <NA>                                         NA  NA    NA  
#> 6         2 blood_pressure                                1   2.1  22.1
#> 7         2 blood_pressure                                2   2.2  22.2
#> 8         2 blood_pressure                                3   2.3  22.3
#> # ℹ 1 more variable: blood_pressure_complete <dbl>

^{Created on 2023-07-09 with reprex v2.0.2}

simX · 2023-07-09T18:20:57Z

I’ll take a look in a couple days with the version from GitHub. It is possible it’s already fixed, I didn’t realize there were further changes beyond version 1.1.0.

simX · 2023-07-11T19:27:35Z

Looks like the GitHub version works properly; it doesn't require including the record_id field in the list of fields to retrieve anymore.

I installed the remotes package from CRAN using RStudio's [Tools --> Install Packages...] menu item, and then installed the GitHub version of REDCapR using remotes::install_github("OuhscBbmc/REDCapR"). I did have to quit and restart RStudio and not save the workspace data (presumably because I would need to unload version 1.1.0 and load the GitHub version), but all works fine now. Thanks for the assistance!

when not requested ref #471

ref #471

wibeasley · 2023-07-11T21:57:25Z

@simX, sweet. Glad to hear it's working. (In the future, I think you need to restart just R within RStudio, not restart all of RStudio.)

@pbchase, I think the modified behavior (since Feb) satisfies your first two posts in this issue. But I'd like to hear it from you (that I'm not overlooking something) before I close the issue.

In short, REDCapR::redcap_read_oneshot() doesn't add record_id but REDCapR::redcap_read() does add it (it's unavoidable, because there is so much stitching and so many meta-ish commands). But neither throw an error when it's not requested --as confirmed by the four test functions in the commit immediately above.

*crap, the message from commit ab4c58a should be "verify record_id IS returned by redcap_read".

wibeasley added a commit that referenced this issue Jul 11, 2023

verify record_id not returned by redcap_read_oneshot()

cc6da98

when not requested ref #471

wibeasley added a commit that referenced this issue Jul 11, 2023

verify record_id IS returned by redcap_read_oneshot()

ab4c58a

ref #471

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Misleading errors when reading sparse data without a record_id column #471

Misleading errors when reading sparse data without a record_id column #471

pbchase commented Feb 21, 2023

wibeasley commented Feb 21, 2023

pbchase commented Feb 22, 2023

wibeasley commented Apr 23, 2023 •

edited

Loading

simX commented Jul 5, 2023

simX commented Jul 5, 2023

pbchase commented Jul 6, 2023

simX commented Jul 6, 2023

wibeasley commented Jul 6, 2023

pbchase commented Jul 6, 2023

wibeasley commented Jul 9, 2023

simX commented Jul 9, 2023

simX commented Jul 11, 2023 •

edited

Loading

wibeasley commented Jul 11, 2023 •

edited

Loading

Misleading errors when reading sparse data without a record_id column #471

Misleading errors when reading sparse data without a record_id column #471

Comments

pbchase commented Feb 21, 2023

wibeasley commented Feb 21, 2023

pbchase commented Feb 22, 2023

wibeasley commented Apr 23, 2023 • edited Loading

simX commented Jul 5, 2023

simX commented Jul 5, 2023

pbchase commented Jul 6, 2023

simX commented Jul 6, 2023

wibeasley commented Jul 6, 2023

pbchase commented Jul 6, 2023

wibeasley commented Jul 9, 2023

simX commented Jul 9, 2023

simX commented Jul 11, 2023 • edited Loading

wibeasley commented Jul 11, 2023 • edited Loading

wibeasley commented Apr 23, 2023 •

edited

Loading

simX commented Jul 11, 2023 •

edited

Loading

wibeasley commented Jul 11, 2023 •

edited

Loading