rrarefy keeping unrarefied results of samples with >depth than subsample #537
26 comments · 2 replies
-
It's impossible to tell as you haven't provided any code that you ran or the specific errors raised (lines 19 to 20 in e5f9251). Did you get that warning when you called the code shown there? That said, the check on line 20 in e5f9251 should be more like:

inputcast <- inputcast[c(rowSums(inputcast) >= sample), ]
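As a toy illustration of what that suggested check does (the matrix values here are made up; `inputcast` and `sample` just mirror the names used in that code):

```r
# Toy community matrix: 3 samples (rows) x 4 species (columns)
inputcast <- matrix(c( 5, 3, 0, 2,
                       1, 0, 1, 0,
                      10, 4, 6, 8),
                    nrow = 3, byrow = TRUE)
sample <- 10  # target rarefaction depth

# Keep only samples whose total count meets the rarefaction depth
inputcast <- inputcast[rowSums(inputcast) >= sample, , drop = FALSE]
rowSums(inputcast)  # both remaining samples have totals >= 10
```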
-
Your "depths" seem to be larger than the R integer maximum:

> x <- c(2517543402, 3066348209, 303772148)
> x > .Machine$integer.max
[1]  TRUE  TRUE FALSE
> .Machine$integer.max
[1] 2147483647

This depends on individual terms (values of cells) instead of the total, though. You can see whether you get the integer range error by calling the C code directly:

> .Call(vegan:::do_rrarefy, x, 10)
[1]  0  0 10
Warning message:
NAs introduced by coercion to integer range

If you get the integer range error, then your numbers are too large for the C code.
-
I am assuming here that the 'x' is my column? I'm sorry, I'm not familiar with calling the C code.
-
Never mind, I read your response carefully this time, and I see where you created the 'x' variable. I understand now. I'll run it and see what happens.
-
@TJrogers86 if your data is in for_rare, you can test a single row directly, e.g.:

.Call(vegan:::do_rrarefy, for_rare[3,], 10) # for row 3
-
@jarioksa Thanks again for the help, but I tried this on several rows, including the one with the largest rowSums, and it never threw the error.
-
Sorry, I meant it never threw the warning message.
-
Actually, I just ran it on the following:

test <- data.frame(for_rare_dist_matrix)
test2 <- rowSums(test)
test3 <- sort(.Call(vegan:::do_rrarefy, test2, 10))

and this did throw the error. I think you are correct. And I think I know how to fix my issue. Thanks again.
-
I suspect you shouldn't be rarefying your data like this; or, at least, expect to get some push-back from reviewers if you do rarefy data such as this. With numbers as large as these, perhaps consider dividing them by some multiple of 10 and truncating or rounding the values to integers (if you still want to do rarefied counts). Or look at other transformations, such as a variance-stabilizing transformation or a regularized log-transformation (see package DESeq2 in the Bioconductor suite for both of these), or …
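For instance, using the depths reported earlier in the thread (the divisor here is an arbitrary choice for illustration; you would have to justify one for your own data):

```r
# The very large adjusted "counts" reported in the question
big_counts <- c(2517543402, 3066348209, 303772148)

# Some of these overflow R's integer range...
any(big_counts > .Machine$integer.max)  # TRUE

# ...so divide by a multiple of 10 and round back to whole numbers
scaled <- round(big_counts / 1e3)
all(scaled <= .Machine$integer.max)     # TRUE
```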
-
@gavinsimpson Thanks for the advice and info! These are counts that have been adjusted for contig length. To keep low count numbers from becoming NA, I had to multiply by a big number (10^7) and round to the closest integer. That caused my counts to be as high as they were. Adjusting 10^7 to 10^6 and rounding brought the counts down to a level that can be used without issue. If I bring it down to 10^5, then I'll have NAs. So, I'm hoping this is acceptable. I plan to use the rarefied data frame and matrix for alpha and beta diversity measures respectively, and will use relative abundance otherwise. Any advice/suggestions are welcome. Thanks for y'all's help thus far!
-
Further to @jarioksa's answer, the contig length (whose implications I don't understand too well, as I'm not well-versed in this high-throughput world) would be something that we would normally handle via an offset term in a statistical model. The problem we have in the multivariate world of ordination is that we can't apply an offset (newer methods, so-called model-based ordination, are starting to become available which do allow offsets, but these new methods are simply not ready for use with data such as yours with a large number of variables), so people look for alternatives. I personally don't understand the logic behind … That said, I also wouldn't do that in general or in your case. As Jari said, I would try to work with the raw counts as much as possible and then use an offset (based on the size factors and other considerations) when fitting classical statistical models to your data (e.g. GLMs or GLMMs), or a transformation of the raw count data that tries to account for the extra dispersion that arises due to these methodological considerations with high-throughput data.
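A minimal sketch of the offset idea with made-up data (the object names, values, and the Poisson family are all illustrative assumptions; a real analysis of overdispersed counts would likely use a negative binomial model instead):

```r
set.seed(1)
# Made-up data: raw counts with varying contig lengths
d <- data.frame(
  count  = rpois(20, lambda = 5),
  length = sample(500:5000, 20),
  group  = rep(c("A", "B"), each = 10)
)

# The log-length offset models counts per unit contig length
# without transforming the response itself
fit <- glm(count ~ group + offset(log(length)),
           family = poisson, data = d)
coef(fit)
```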
-
As for some suggestions on how to proceed, I would suggest taking a look at some of the work of Susan Holmes and colleagues:

Sankaran, K., Holmes, S.P., 2019. Latent variable modeling for the microbiome. Biostatistics 20, 599–614. https://doi.org/10.1093/biostatistics/kxy018

Jeganathan, P., Holmes, S.P., 2021. A Statistical Perspective on the Challenges in Molecular Microbial Biology. J. Agric. Biol. Environ. Stat. 26, 131–160. https://doi.org/10.1007/s13253-021-00447-1

Sankaran, K., Holmes, S.P., 2019. Multitable Methods for Microbiome Data Integration. Front. Genet. 10, 627. https://doi.org/10.3389/fgene.2019.00627

At least these would give you a better understanding of the issues (though note Susan is firmly in the "thou shalt not rarefy" camp) and give you some ideas about how statisticians are attempting to tackle the problems.
-
@gavinsimpson Species averages of randomly rarefied samples converge to observed (non-rarefied) species frequencies adjusted to sub-sample size.
-
@jarioksa isn't that what people want, to get counts adjusted to the sub-sample size?
-
Or did you mean you could just avoid the random sampling and simply use data in proportion to the observed frequencies?
-
@gavinsimpson not really: they want equalized sample sizes, and with averages they get non-rarefied numbers. For a subsample of 50 in BCI, the species averages converge …
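That convergence is easy to see in a small simulation (simulated counts here, not the BCI data):

```r
set.seed(42)
x   <- c(120, 60, 15, 5)  # one sample's species counts (total 200)
sub <- 50                  # sub-sample size

# One rarefied draw: sample individuals without replacement, re-tabulate
one_draw <- function() {
  tabulate(sample(rep(seq_along(x), x), sub), nbins = length(x))
}

# Average many random rarefied draws of the same sample
draws <- replicate(2000, one_draw())
rowMeans(draws)   # approaches the rescaled observed frequencies

# Observed frequencies simply rescaled to the sub-sample size
x * sub / sum(x)  # 30.00 15.00 3.75 1.25
```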
-
FWIW, adding a warning on >1 minimum count in …
-
@jarioksa I see what you mean. Did you look at that test on line 20 in e5f9251? If it is suss (I was just reading code that I've never run), fixing this before the release would be good, which I can do.
-
@gavinsimpson The main author of …
-
@jarioksa OK, there are a couple of other things in that function that I want to fix, so I'll do a PR with all the changes and then ask @Microbiology to comment.
-
Hey guys. Thanks a lot for the help and suggestions. I have read some of the dislike of rarefying and was going to avoid it, but I came across Dr. Pat Schloss' YouTube channel and his videos demonstrating why he advocates for rarefying when it comes to diversity and richness analyses, and it was convincing enough to try it (I believe he was at one point the advisor of the aforementioned @Microbiology, Dr. Geoffrey Hannigan). If you're interested, I've posted links below to said videos for rationale and demonstration. If you have time to watch them, I'd like your thoughts on them. In any case, thanks again for the reading suggestions and for this great discussion!

To rarefy or not to rarefy:
Rarefy vs Normalize:
Minimizing sampling effort impact on bray curtis
-
Thanks all. I just went over to the PR and it looks good from my end.
-
@TJrogers86 I would say that in many of these discussions about rarefying counts, the discussants are talking past one another. Pat's videos (IIRC; I haven't watched all of his YouTube channel) on this topic focus on the impact of rarefying on computing dissimilarities. Many people use … That said, just because doing the … Personally, as someone who has only relatively recently started to be dragged into working with data like this, I'm not keen on rarefying counts via random draws using …
-
I felt this was likely better off as a discussion, as the original problem wasn't an issue with {vegan} per se, and the conversation had veered off into discussion territory, so it was best to close the issue but keep the discussion so it can be surfaced by Google should others have similar issues.
-
Hello developers!
I'm having a strange issue with rrarefy that I can't seem to figure out. At first, I thought it was a problem. I have a count table that is 77x10493 (samples are row names and vContigs are column names). When I use rowSums(), I find that the lowest read depth is 5892632 (sample ARS). So, when using avgdist() I set sample=5000000. However, I get this warning message: "The following sampling units were removed because they were below sampling depth: CZF, CIF, RCF". But their read depths are 2517543402, 3066348209, and 303772148 respectively. I know rrarefy is used in avgdist, so I tested rrarefy with the same cutoff. As rrarefy returns the unrarefied numbers if sample depth is lower than the selected cutoff, I looked at the rarefied counts. All of my samples had been rarefied except for CZF, CIF, and RCF. Am I doing something wrong?