Elastic search limit and long term considerations #1948

SimonLi5601 · 2021-05-21T15:33:01Z

There is a hard limit: each shard can hold only about 2.1B documents, and nested documents count toward that limit. Right now, each variant has more than 300 nested documents, depending on its annotations and genotype samples. Given that a WGS dataset may have more than 40M variants, the total number of nested documents can easily grow by billions as the number of samples (genotypes) increases. The current index schema does not seem sustainable as more samples are added to one index. On the other hand, splitting the samples across different indexes makes it harder to sort and merge results across multiple searches.
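As a rough back-of-envelope check of the numbers above, here is a minimal sketch. The cap of 2,147,483,519 is Lucene's per-shard document limit; the function name `min_shards` is made up for illustration, and counting one root document per variant on top of its nested documents is an assumption about how the index is laid out.

```python
import math

# Lucene's per-shard document cap (Integer.MAX_VALUE - 128).
# Nested documents are stored as separate Lucene documents and count toward it.
MAX_DOCS_PER_SHARD = 2_147_483_519


def min_shards(num_variants: int, nested_docs_per_variant: int) -> int:
    """Lower bound on primary shards needed to stay under the per-shard cap.

    Assumes each variant is one root document plus its nested documents.
    """
    total_docs = num_variants * (1 + nested_docs_per_variant)
    return math.ceil(total_docs / MAX_DOCS_PER_SHARD)


# 40M variants with 300 nested documents each (figures from the post above).
print(40_000_000 * (1 + 300))       # total Lucene documents: 12040000000
print(min_shards(40_000_000, 300))  # 6
# Hypothetical growth to 1000 nested documents per variant.
print(min_shards(40_000_000, 1000)) # 19
```

This is only a floor: it ignores deleted documents awaiting merge and any uneven distribution of variants across shards.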
Do you have any suggestions for dealing with this issue? Increasing the number of shards might not keep up with the growth in documents.
One thought I had: is it possible to decouple the variants and genotypes into different indexes? One index would maintain only the variants, without any genotypes, and would be shared by all the samples. The genotypes would be maintained in separate indexes by project, with only minimal variant information (the variant id). How to do a join query would then be an open question. I just want to bring up the thought here.
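To make the join question concrete, here is a minimal sketch of an application-side join over the two proposed indexes. Plain Python collections stand in for the indexes, and every name and value (`VARIANTS`, `PROJECT_GENOTYPES`, `search`, the sample ids, the `af` field) is hypothetical; this is not how the project queries Elasticsearch.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in for the shared variants index: variant_id -> annotations only.
VARIANTS: Dict[str, dict] = {
    "1-1000-A-G": {"gene": "GENE1", "af": 0.001},
    "1-2000-C-T": {"gene": "GENE2", "af": 0.2},
    "2-3000-G-A": {"gene": "GENE3", "af": 0.005},
}

# Hypothetical stand-in for one project's genotypes index:
# (variant_id, sample_id, num_alt), carrying nothing about the variant but its id.
PROJECT_GENOTYPES: List[Tuple[str, str, int]] = [
    ("1-1000-A-G", "S1", 1),
    ("1-1000-A-G", "S2", 0),
    ("1-2000-C-T", "S1", 2),
    ("2-3000-G-A", "S2", 1),
]


def search(sample_ids: Set[str], max_af: float) -> List[str]:
    """Two-step join: genotype lookup first, then annotation filter and sort."""
    # Query 1 (genotypes index): ids of variants carried by any requested sample.
    carried = {
        vid
        for vid, sid, num_alt in PROJECT_GENOTYPES
        if sid in sample_ids and num_alt > 0
    }
    # Query 2 (variants index): fetch those ids, filter on annotations, sort by frequency.
    hits = [vid for vid in carried if VARIANTS[vid]["af"] < max_af]
    return sorted(hits, key=lambda vid: VARIANTS[vid]["af"])


print(search({"S1", "S2"}, 0.01))  # ['1-1000-A-G', '2-3000-G-A']
```

The hard part is visible in the intermediate `carried` set: sorting and paging only happen after the second step, so the first step has to hand over every matching variant id, and for genome data that set could be very large.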

hanars · 2021-05-21T15:47:31Z

Maintainer

So the short answer is: our long-term plan for data scaling is to switch out Elasticsearch for something else. I'm happy to discuss this with you further, but for a variety of reasons (this one included) ES is simply not going to scale for us.

So our group is not going to be investing any real effort in changing the Elasticsearch schema around, although you are more than welcome to try and play around with it. To save you some time, joins are completely impossible in Elasticsearch in any remotely performant way - I spent about a month looking into it a couple of years ago. Ultimately, I think we do want to switch to some sort of relational model with variants and genotypes being stored as separate but linked entities, which is a big part of why I think we need a different underlying DB.
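For illustration only, here is a minimal sketch of what "separate but linked entities" could look like in a relational store. It uses Python's built-in `sqlite3` with an in-memory database; the table names, column names, and sample data are all assumptions, and nothing here reflects the database or schema the project will actually adopt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per variant, one row per (variant, sample) genotype.
conn.executescript(
    """
    CREATE TABLE variant (
        variant_id TEXT PRIMARY KEY,
        gene       TEXT,
        af         REAL
    );
    CREATE TABLE genotype (
        variant_id TEXT NOT NULL REFERENCES variant(variant_id),
        sample_id  TEXT NOT NULL,
        num_alt    INTEGER NOT NULL,
        PRIMARY KEY (variant_id, sample_id)
    );
    """
)

conn.executemany(
    "INSERT INTO variant VALUES (?, ?, ?)",
    [
        ("1-1000-A-G", "GENE1", 0.001),
        ("1-2000-C-T", "GENE2", 0.2),
        ("2-3000-G-A", "GENE3", 0.005),
    ],
)
conn.executemany(
    "INSERT INTO genotype VALUES (?, ?, ?)",
    [
        ("1-1000-A-G", "S1", 1),
        ("1-1000-A-G", "S2", 0),
        ("1-2000-C-T", "S1", 2),
        ("2-3000-G-A", "S2", 1),
    ],
)

# Rare variants carried by sample S1: the join the nested-document schema avoids.
rows = conn.execute(
    """
    SELECT v.variant_id, v.gene, g.sample_id, g.num_alt
    FROM variant AS v
    JOIN genotype AS g ON g.variant_id = v.variant_id
    WHERE g.sample_id = ? AND g.num_alt > 0 AND v.af < ?
    ORDER BY v.variant_id
    """,
    ("S1", 0.01),
).fetchall()

print(rows)  # [('1-1000-A-G', 'GENE1', 'S1', 1)]
```

With this layout, adding samples adds rows to `genotype` without touching or duplicating the `variant` rows, which is the property the nested-document schema lacks.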

For what it's worth, our largest genomes index has 1000 samples in it and it has not run into these limits yet.


SimonLi5601 · 2021-05-21T16:10:47Z

Author

Thanks for your insight, Hana. I will close this for now.
