rrarefy keeping unrarefied results of samples with >depth than subsample #537
26 comments · 2 replies
-
It's impossible to tell as you haven't provided any code that you ran or the specific errors raised (lines 19 to 20 in e5f9251). Did you get that warning when you called the code shown there? That said, the check on line 20 in e5f9251 should be more like:

inputcast <- inputcast[c(rowSums(inputcast) >= sample), ]
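As a toy illustration of what that suggested check does (the matrix values here are made up; `inputcast` and `sample` just mirror the names used in that code):

```r
# Toy community matrix: 3 samples (rows) x 4 species (columns)
inputcast <- matrix(c( 5, 3, 0, 2,
                       1, 0, 1, 0,
                      10, 4, 6, 8),
                    nrow = 3, byrow = TRUE)
sample <- 10  # target rarefaction depth

# Keep only samples whose total count meets the rarefaction depth
inputcast <- inputcast[rowSums(inputcast) >= sample, , drop = FALSE]
rowSums(inputcast)  # both remaining samples have totals >= 10
```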
-
Your "depths" seem to be larger than the R integer maximum:

> x <- c(2517543402, 3066348209, 303772148)
> x > .Machine$integer.max
[1]  TRUE  TRUE FALSE
> .Machine$integer.max
[1] 2147483647

This depends on individual terms (values of cells) instead of the total, though. You can see whether you get the integer range error by calling the C code directly:

> .Call(vegan:::do_rrarefy, x, 10)
[1]  0  0 10
Warning message:
NAs introduced by coercion to integer range

If you get the integer range error, then your numbers are too large for the C code.
-
I am assuming here that the 'x' is my column? I'm sorry, I'm not familiar with calling the C code.
-
Never mind, I read your response carefully this time, and I see where you created the 'x' variable. I understand now. I'll run it and see what happens.
-
@TJrogers86 if your data is in for_rare, you can test a single row directly, e.g.:

.Call(vegan:::do_rrarefy, for_rare[3,], 10) # for row 3
-
@jarioksa Thanks again for the help, but I tried this on several rows, including the one with the largest rowSums, and it never threw the error.
-
Sorry, I meant it never threw the warning message.
-
Actually, I just ran it on the following:

test <- data.frame(for_rare_dist_matrix)
test2 <- rowSums(test)
test3 <- sort(.Call(vegan:::do_rrarefy, test2, 10))

and this did throw the error. I think you are correct. And I think I know how to fix my issue. Thanks again.
-
I suspect you shouldn't be rarefying your data like this; or, at least, expect to get some push-back from reviewers if you do rarefy data such as this. With numbers as large as these, perhaps consider dividing them by some multiple of 10 and truncating or rounding the values to integers (if you still want to do rarefied counts). Or look at other transformations, such as a variance-stabilizing transformation or a regularized log-transformation (see package DESeq2 in the Bioconductor suite for both of these), or …
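For instance, using the depths reported earlier in the thread (the divisor here is an arbitrary choice for illustration; you would have to justify one for your own data):

```r
# The very large adjusted "counts" reported in the question
big_counts <- c(2517543402, 3066348209, 303772148)

# Some of these overflow R's integer range...
any(big_counts > .Machine$integer.max)  # TRUE

# ...so divide by a multiple of 10 and round back to whole numbers
scaled <- round(big_counts / 1e3)
all(scaled <= .Machine$integer.max)     # TRUE
```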
-
@gavinsimpson Thanks for the advice and info! These are counts that have been adjusted for contig length. To keep low count numbers from becoming NA, I had to multiply by a big number (10^7) and round to the closest integer. That caused my counts to be as high as they were. Adjusting 10^7 to 10^6 and rounding brought the counts down to a level that can be used without issue. If I bring it down to 10^5, then I'll have NAs. So, I'm hoping this is acceptable. I plan to use the rarefied data frame and matrix for alpha and beta diversity measures respectively, and will use relative abundance otherwise. Any advice/suggestions are welcome. Thanks for y'all's help thus far!
-
Further to @jarioksa's answer, the contig length (whose implications I don't understand too well, as I'm not well-versed in this high-throughput world) would be something that we would normally handle via an offset term in a statistical model. The problem we have in the multivariate world of ordination is that we can't apply an offset (newer methods, so-called model-based ordination, are starting to become available which do allow offsets, but these new methods are simply not ready for use with data such as yours with a large number of variables), so people look for alternatives. I personally don't understand the logic behind … That said, I also wouldn't do that in general or in your case. As Jari said, I would try to work with the raw counts as much as possible and then use an offset (based on the size factors and other considerations) when fitting classical statistical models to your data (e.g. GLMs or GLMMs), or a transformation of the raw count data that tries to account for the extra dispersion that arises due to these methodological considerations with high-throughput data.
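A minimal sketch of the offset idea with made-up data (the object names, values, and the Poisson family are all illustrative assumptions; a real analysis of overdispersed counts would likely use a negative binomial model instead):

```r
set.seed(1)
# Made-up data: raw counts with varying contig lengths
d <- data.frame(
  count  = rpois(20, lambda = 5),
  length = sample(500:5000, 20),
  group  = rep(c("A", "B"), each = 10)
)

# The log-length offset models counts per unit contig length
# without transforming the response itself
fit <- glm(count ~ group + offset(log(length)),
           family = poisson, data = d)
coef(fit)
```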
-
As for some suggestions on how to proceed, I would suggest taking a look at some of the work of Susan Holmes and colleagues:

Sankaran, K., Holmes, S.P., 2019. Latent variable modeling for the microbiome. Biostatistics 20, 599–614. https://doi.org/10.1093/biostatistics/kxy018

Jeganathan, P., Holmes, S.P., 2021. A Statistical Perspective on the Challenges in Molecular Microbial Biology. J. Agric. Biol. Environ. Stat. 26, 131–160. https://doi.org/10.1007/s13253-021-00447-1

Sankaran, K., Holmes, S.P., 2019. Multitable Methods for Microbiome Data Integration. Front. Genet. 10, 627. https://doi.org/10.3389/fgene.2019.00627

At least these would give you a better understanding of the issues (though note Susan is firmly in the "thou shalt not rarefy" camp) and give you some ideas about how statisticians are attempting to tackle the problems.
-
@gavinsimpson Species averages of randomly rarefied samples converge to observed (non-rarefied) species frequencies adjusted to sub-sample size.
-
@jarioksa isn't that what people want, to get counts adjusted to the sub-sample size?
-
Or did you mean you could just avoid the random sampling and simply use data in proportion to the observed frequencies?
-
@gavinsimpson not really: they want equalized sample sizes, and with averages they get non-rarefied numbers. For a subsample of 50 in BCI, the species averages converge …
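That convergence is easy to see in a small simulation (simulated counts here, not the BCI data):

```r
set.seed(42)
x   <- c(120, 60, 15, 5)  # one sample's species counts (total 200)
sub <- 50                  # sub-sample size

# One rarefied draw: sample individuals without replacement, re-tabulate
one_draw <- function() {
  tabulate(sample(rep(seq_along(x), x), sub), nbins = length(x))
}

# Average many random rarefied draws of the same sample
draws <- replicate(2000, one_draw())
rowMeans(draws)   # approaches the rescaled observed frequencies

# Observed frequencies simply rescaled to the sub-sample size
x * sub / sum(x)  # 30.00 15.00 3.75 1.25
```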
-
FWIW, adding a warning on >1 minimum count in …
-
@jarioksa I see what you mean. Did you look at that test on line 20 in e5f9251? If it is suss (I was just reading code that I've never run), fixing this before the release would be good, which I can do.
-
@gavinsimpson The main author of …
-
@jarioksa OK, there are a couple of other things in that function that I want to fix, so I'll do a PR with all the changes and then ask @Microbiology to comment.
-
Hey guys. Thanks a lot for the help and suggestions. I have read some of the dislike of rarefying and was going to avoid it, but I came across Dr. Pat Schloss' YouTube channel and his videos demonstrating why he advocates for rarefying when it comes to diversity and richness analyses, and it was convincing enough to try it (I believe he was at one point the advisor of the aforementioned @Microbiology, Dr. Geoffrey Hannigan). If you're interested, I've posted links below to said videos for rationale and demonstration. If you have time to watch them, I'd like your thoughts on them. In any case, thanks again for the reading suggestions and for this great discussion!

To rarefy or not to rarefy:
Rarefy vs Normalize:
Minimizing sampling effort impact on bray curtis
-
Thanks all. I just went over to the PR and it looks good from my end.
-
@TJrogers86 I would say that in many of these discussions about rarefying counts, the discussants are talking past one another. Pat's videos (IIRC; I haven't watched all of his YouTube channel) on this topic focus on the impact of rarefying on computing dissimilarities. Many people use … That said, just because doing the … Personally, as someone who has only relatively recently started to be dragged into working with data like this, I'm not keen on rarefying counts via random draws using …
-
I felt this was likely better off as a discussion, as the original problem wasn't an issue with {vegan} per se, and the conversation had veered off into discussion territory, so it was best to close the issue but keep the discussion so it can be surfaced by Google should others have similar issues.
-
Hello developers!
I'm having a strange issue with rrarefy that I can't seem to figure out. At first, I thought it was a problem. I have a count table that is 77x10493 (samples are row names and vContigs are column names). When I use rowSums(), I find that the lowest read depth is 5892632 (sample ARS). So, when using avgdist() I set sample=5000000. However, I get this warning message: "The following sampling units were removed because they were below sampling depth: CZF, CIF, RCF". But their read depths are 2517543402, 3066348209, and 303772148 respectively. I know rrarefy is used in avgdist, so I tested rrarefy with the same cutoff. As rrarefy returns the unrarefied numbers if sample depth is lower than the selected cutoff, I looked at the rarefied counts. All of my samples had been rarefied except for CZF, CIF, and RCF. Am I doing something wrong?