-
This can't be answered in general. Having to query the corpora one after another also adds overhead, e.g. when sorting the results. If a query already produces sorted results, the optimizer can use that to avoid additional sorting, but when multiple corpora are selected, the manual sorting might have to be done again (just one example of how multiple corpora can add overhead). ANNIS 4 partitions its data into graph edge components and only loads the components it needs. For corpora without pointing or dominance relations, in practice all components are loaded to show the token and span annotations. The "optimal" size really depends on the server the corpus is hosted on. On desktop computers, the default is to use the disk instead of main memory, but on a server with a lot of main memory I would choose a corpus size that uses as much of the main memory as possible. On the interactive graphANNIS command line (https://korpling.github.io/graphANNIS/docs/v3/cli.html#info) you can select a corpus, use the |
-
It seems that the best solution for performance is to split my corpus into 5 subcorpora. If I keep one corpus, the overall processing time for conversion into relannis/graphannis is considerably longer than if I process 5 subcorpora. When searching the 5 subcorpora, I get some initial results almost immediately, but ANNIS is then blocked for a few more seconds, probably to sort the results (even though mine should already be sorted). If I use one single 40M+ token corpus, the query takes much longer to return any results, but once it returns them, ANNIS can be used again immediately (for example, by clicking the search button). Splitting also has the advantage that if I spot a mistake in a few files, I only have to rebuild 1/5 of the corpus.
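To illustrate the sorting overhead discussed in this thread, here is a minimal sketch of why results from several corpora need an extra pass even when each corpus returns its matches already sorted. This is not how ANNIS or graphANNIS implements it; the function name `merge_sorted_matches` and the `(document, token index)` match tuples are hypothetical and only serve the illustration.

```python
import heapq


def merge_sorted_matches(per_corpus_matches):
    """Merge match lists that are each already sorted into one sorted list.

    Each input list is sorted on its own, but the combined order across
    corpora is only known after this additional merge pass.
    """
    return list(heapq.merge(*per_corpus_matches))


# Hypothetical matches as (document name, token index), sorted within each subcorpus.
sub_a = [("doc01", 3), ("doc01", 17), ("doc04", 2)]
sub_b = [("doc02", 5), ("doc03", 1)]

merged = merge_sorted_matches([sub_a, sub_b])
print(merged)
```

With a single corpus the sorted result can be passed on as-is, whereas with several corpora the merge has to look at every match once more, which may be part of the short delay after the first results appear.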
-
My corpus has 40M+ tokens. Since I see that ANNIS can query more than one corpus at the same time, is it (strongly) advisable to split it into two or more corpora to gain performance? Which corpus size (10M? 20M?) would be a good one for performance?