Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

correctK2() drops samples under specific conditions #95

Open
broomej opened this issue Aug 22, 2022 · 1 comment
Open

correctK2() drops samples under specific conditions #95

broomej opened this issue Aug 22, 2022 · 1 comment

Comments

@broomej
Copy link

broomej commented Aug 22, 2022

Hi team,

I ran into a very unusual bug where calling correctK2() drops rows when called in a specific instance. I have an implementation based on your analysis pipeline, but because I replaced rbind() with dplyr::bind_rows() I ran into this issue

library(GENESIS)
sessionInfo()
packageVersion("dplyr")
# [1] ‘1.0.8’


b_kinSelf <- NULL
b_kinBtwn <- NULL
i <- 5
j <- 5
res <- readRDS(paste0("b_pcrelate_block_", i, "_", j, ".rds"))
print("block:")
print(c(i, j))
b_kinSelf <- dplyr::bind_rows(b_kinSelf, res$kinSelf)
print("before correction nrow res$kinBtwn:")
print(nrow(res$kinBtwn))
res$kinBtwn <- correctK2(kinBtwn = res$kinBtwn, kinSelf = b_kinSelf,
                         small.samp.correct = FALSE, pcs = NULL,
                         sample.include = NULL)
res$kinBtwn <- correctK0(kinBtwn = res$kinBtwn)
print("after correction nrow res$kinBtwn:")
print(nrow(res$kinBtwn))
b_kinBtwn <- dplyr::bind_rows(b_kinBtwn, res$kinBtwn)
rm(res)
cat("\n")

i <- 4
j <- 4
res <- readRDS(paste0("b_pcrelate_block_", i, "_", j, ".rds"))
print("block:")
print(c(i, j))
b_kinSelf <- dplyr::bind_rows(b_kinSelf, res$kinSelf)
print("before correction nrow res$kinBtwn:")
print(nrow(res$kinBtwn))
res$kinBtwn <- correctK2(kinBtwn = res$kinBtwn, kinSelf = b_kinSelf,
                         small.samp.correct = FALSE, pcs = NULL,
                         sample.include = NULL)
res$kinBtwn <- correctK0(kinBtwn = res$kinBtwn)
print("after correction nrow res$kinBtwn:")
print(nrow(res$kinBtwn))
b_kinBtwn <- dplyr::bind_rows(b_kinBtwn, res$kinBtwn)
rm(res)
cat("\n")

kinself_bind_rows <- b_kinSelf
R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Stream 8

Matrix products: default
BLAS/LAPACK: /home/src/intel/compilers_and_libraries_2020.1.217/linux/mkl/li
b/intel64_lin/libmkl_gf_lp64.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GENESIS_2.20.1 nvimcom_0.9-88

loaded via a namespace (and not attached):
 [1] quantsmooth_1.56.0     Rcpp_1.0.8.2           lattice_0.20-41
 [4] tidyr_1.2.0            Biostrings_2.58.0      formula.tools_1.7.1
 [7] zoo_1.8-9              assertthat_0.2.1       lmtest_0.9-38
[10] foreach_1.5.2          utf8_1.2.2             R6_2.5.1
[13] GenomeInfoDb_1.26.7    backports_1.4.1        MatrixModels_0.5-0
[16] gdsfmt_1.26.1          stats4_4.0.5           RSQLite_2.2.7
[19] pillar_1.7.0           zlibbioc_1.36.0        rlang_1.0.2
[22] data.table_1.14.0      SparseM_1.81           blob_1.2.1
[25] S4Vectors_0.28.1       Matrix_1.3-2           splines_4.0.5
[28] GWASExactHW_1.01       RCurl_1.98-1.6         bit_4.0.4
[31] broom_0.7.12           compiler_4.0.5         pkgconfig_2.0.3
[34] BiocGenerics_0.36.1    mgcv_1.8-34            tidyselect_1.1.2
[37] GenomeInfoDbData_1.2.4 tibble_3.1.6           DNAcopy_1.64.0
[40] IRanges_2.24.1         codetools_0.2-18       fansi_1.0.2
[43] crayon_1.5.0           dplyr_1.0.8            bitops_1.0-7
[46] grid_4.0.5             nlme_3.1-152           lifecycle_1.0.1
[49] DBI_1.1.2              magrittr_2.0.2         SeqVarTools_1.28.1
[52] cli_3.2.0              cachem_1.0.6           XVector_0.30.0
[55] SNPRelate_1.24.0       mice_3.13.0            ellipsis_0.3.2
[58] generics_0.1.2         vctrs_0.3.8            sandwich_3.0-1
[61] iterators_1.0.14       tools_4.0.5            bit64_4.0.5
[64] Biobase_2.50.0         glue_1.6.2             purrr_0.3.4
[67] parallel_4.0.5         fastmap_1.1.0          survival_3.2-10
[70] SeqArray_1.30.0        GenomicRanges_1.42.0   GWASTools_1.36.0
[73] operator.tools_1.6.3   memoise_2.0.1          logistf_1.24
[76] quantreg_5.88
[1] "block:"
[1] 5 5
[1] "before correction nrow res$kinBtwn:"
[1] 10335331
[1] "after correction nrow res$kinBtwn:"
[1] 10335331

[1] "block:"
[1] 4 4
[1] "before correction nrow res$kinBtwn:"
[1] 10330785
[1] "after correction nrow res$kinBtwn:"
[1] 2580151

I was able to track the problem down to lines 630 and 632 in the correctK2() function definition where it calls merge(); manually running equivalent code caused merge to behave in the same way. I think this is possibly a bug in the data.table method for merge(), but I'm not totally sure.

If I change only b_kinSelf <- dplyr::bind_rows(b_kinSelf, res$kinSelf) to b_kinSelf <- rbind(b_kinSelf, res$kinSelf), I do not trigger this bug:

# WORKING COUNTEREXAMPLE
b_kinSelf <- NULL
b_kinBtwn <- NULL
i <- 5
j <- 5
res <- readRDS(paste0("b_pcrelate_block_", i, "_", j, ".rds"))
print("block:")
print(c(i, j))
b_kinSelf <- rbind(b_kinSelf, res$kinSelf)
print("before correction nrow res$kinBtwn:")
print(nrow(res$kinBtwn))
res$kinBtwn <- correctK2(kinBtwn = res$kinBtwn, kinSelf = b_kinSelf,
                         small.samp.correct = FALSE, pcs = NULL,
                         sample.include = NULL)
res$kinBtwn <- correctK0(kinBtwn = res$kinBtwn)
print("after correction nrow res$kinBtwn:")
print(nrow(res$kinBtwn))
b_kinBtwn <- dplyr::bind_rows(b_kinBtwn, res$kinBtwn)
rm(res)
cat("\n")

i <- 4
j <- 4
res <- readRDS(paste0("b_pcrelate_block_", i, "_", j, ".rds"))
print("block:")
print(c(i, j))
b_kinSelf <- rbind(b_kinSelf, res$kinSelf)
print("before correction nrow res$kinBtwn:")
print(nrow(res$kinBtwn))
res$kinBtwn <- correctK2(kinBtwn = res$kinBtwn, kinSelf = b_kinSelf,
                         small.samp.correct = FALSE, pcs = NULL,
                         sample.include = NULL)
res$kinBtwn <- correctK0(kinBtwn = res$kinBtwn)
print("after correction nrow res$kinBtwn:")
print(nrow(res$kinBtwn))
b_kinBtwn <- dplyr::bind_rows(b_kinBtwn, res$kinBtwn)
rm(res)
cat("\n")

kinself_rbind <- b_kinSelf
[1] "block:"
[1] 5 5
[1] "before correction nrow res$kinBtwn:"
[1] 10335331
[1] "after correction nrow res$kinBtwn:"
[1] 10335331

[1] "block:"
[1] 4 4
[1] "before correction nrow res$kinBtwn:"
[1] 10330785
[1] "after correction nrow res$kinBtwn:"
[1] 10330785

The only difference I see between kinself_rbind and kinself_bind_rows is the latter has an extra attribute

> str(kinself_rbind)
Classes ‘data.table’ and 'data.frame':  9093 obs. of  3 variables:
 $ ID  : chr  [redacted]
 $ f   : num  0.000501 -0.016491 -0.0036 -0.00307 -0.008572 ...
 $ nsnp: num  197276 198380 198390 198323 198318 ...
 - attr(*, ".internal.selfref")=<externalptr>
> str(kinself_bind_rows)
Classes ‘data.table’ and 'data.frame':  9093 obs. of  3 variables:
 $ ID  : chr  [redacted]
 $ f   : num  0.000501 -0.016491 -0.0036 -0.00307 -0.008572 ...
 $ nsnp: num  197276 198380 198390 198323 198318 ...
 - attr(*, ".internal.selfref")=<externalptr>
 - attr(*, "sorted")= chr "ID"
> all.equal(kinself_rbind$ID, kinself_bind_rows$ID)
[1] TRUE
> all.equal(kinself_rbind$f, kinself_bind_rows$f)
[1] TRUE
> all.equal(kinself_rbind$nsnp, kinself_bind_rows$nsnp)
[1] TRUE

I know that's a lot of details to throw at you when the bug might be with a dependency, rather than your package, but I wanted to flag it in case there are circumstances where this might commonly crop up. Let me know if there's any other details I can give you and I have this example along with the example data on the Blue Lab servers if @smgogarten wants to take a look

@smgogarten
Copy link
Collaborator

I was able to reproduce this with a different dataset in the current R version, only in my case, using dplyr::bind_rows results in a 0-element data.frame for res$kinBtwn. I don't see a fix here other than to avoid mixing tidyverse and data.table functions, since I don't understand enough of the inner workings of either package to know how they are conflicting in this case. Possibly the sorted attribute added by dplyr:bind_rows has a different meaning in data.table.

> sessionInfo()
R version 4.2.1 (2022-06-23)
GENESIS_2.27.1
data.table_1.14.4
dplyr_1.0.10

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants