Corpus size #873

gcelano · 2024-11-14T13:01:08Z

My corpus has 40M+ tokens. Since ANNIS can query more than one corpus at the same time, is it (strongly) advisable to split it into two or more corpora to gain performance? Which corpus size (10M? 20M?) would be a good one for performance?

thomaskrause · 2024-11-15T10:50:37Z

thomaskrause
Nov 15, 2024
Maintainer

This can't be answered in general. Having to query the corpora one after another also adds overhead, e.g. when sorting the results. If a query already produces sorted results, the optimizer can use this to avoid additional sorting, but when multiple corpora are selected the manual sorting might have to be done again (just one example of how multiple corpora can add overhead).

ANNIS 4 partitions its data into graph edge components and only loads the components it needs. For corpora without pointing or dominance relations, in practice all components are loaded to show the token and span annotations.

The "optimal" size really depends on the server the corpus is hosted on. On desktop computers, the default is to use the disk instead of main memory, but on a server with a lot of main memory I would use a corpus size that uses the main memory as much as possible. On the interactive graphANNIS command line (https://korpling.github.io/graphANNIS/docs/v3/cli.html#info) you can select a corpus, use the preload command to make sure all components are loaded, and then check with info how large the memory consumption is.

4 replies

gcelano Nov 17, 2024
Author

I can use preload (the corpus requires ~40MB), but how can I be sure that, when the corpus is used, it is fully loaded into the RAM? In service.toml, I have disk_based= false and cache = {PercentageOfFreeMemory= 100}: is this enough?

Can the use of namespaces in PAULA file names affect indexing in ANNIS? My files all start with a string identifying an author ("tlg2200.tlgxxx"), which Pepper interprets as a namespace: i.e., all files related to a single author start with the same namespace. There is therefore no grouping, for example, of all dependency-related files across the corpus. Can this be an issue? This also relates to the sorting overhead: since my annotation files for each author all start with the same namespace, can this speed up sorting if the files in each corpus are already sorted?

thomaskrause Nov 20, 2024
Maintainer

The {PercentageOfFreeMemory= 100} setting should be changed to something that still leaves space for the operating system and other services, e.g. 75%. 40 MB seems a little too small for 40 million tokens. The disk_based configuration only takes effect when the corpus is imported via the web interface/REST service. If you change the parameter, you would have to import the corpus again, overwriting the existing one.
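As a rough back-of-the-envelope check of such a setting, the sketch below computes how much memory a given percentage would leave for the cache and whether a corpus of a given size would fit. It assumes the value is applied as a simple percentage of the memory that is free when the service starts; the function names and all figures are hypothetical and not part of graphANNIS.

```python
def cache_budget_gb(free_memory_gb: float, percentage_of_free_memory: float) -> float:
    """Return the cache budget in GB for a given percentage of free memory.

    Assumption: the percentage is applied linearly to the free memory.
    """
    if not 0 < percentage_of_free_memory <= 100:
        raise ValueError("percentage must be in the range (0, 100]")
    return free_memory_gb * percentage_of_free_memory / 100.0


def fits_in_cache(corpus_gb: float, free_memory_gb: float,
                  percentage_of_free_memory: float) -> bool:
    """Check whether a corpus of the given in-memory size fits the budget."""
    return corpus_gb <= cache_budget_gb(free_memory_gb, percentage_of_free_memory)
```

For example, with a hypothetical 64 GB of free memory and a setting of 75, the budget would be 48 GB, and the remaining 16 GB stay available to the operating system and other services.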

When using the interactive command line, the mode in which a corpus is imported is configured with the set-disk-based command, e.g. set-disk-based false for using the in-memory implementations. After running set-disk-based false, you can also execute the re-optimize command in the interactive command line to change the implementations of the currently selected corpus.

You can show which implementation is currently used for a specific graph component by using the info command after preload. E.g. this is the first part for the disk-based import of the pcc2 corpus:

pcc2> info
Status: "fully loaded"
Token search shortcut possible: true
------------
Component Coverage//: 0 annnotations
Stats: nodes=613, avg_fan_out=7.06, max_fan_out=212, max_depth=1
Implementation: DiskAdjacencyListV1
Status: "fully loaded"
------------
[...]

The coverage component uses the DiskAdjacencyListV1 implementation. In theory it is possible to mix main-memory and disk-based component implementations, but the import function will use the configured ones, except for some components whose performance would suffer extremely if they were disk-based. The same component with a main-memory implementation would be listed as:

pcc2> info
Status: "fully loaded"
Token search shortcut possible: true
------------
Component Coverage//: 0 annnotations
Stats: nodes=613, avg_fan_out=7.06, max_fan_out=212, max_depth=1
Implementation: AdjacencyListV1
Status: "fully loaded"
[...]

Here the AdjacencyListV1 implementation is used instead. There are different implementations for different types of graphs, but the names of the disk-based ones typically start with Disk.
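For a corpus with many components, checking the info output by eye gets tedious. The following is a minimal sketch that extracts the implementation of each component from pasted info output and lists those whose name starts with Disk. It assumes the output has the line layout shown in this thread; the function names are hypothetical helpers and not part of graphANNIS.

```python
def parse_implementations(info_output: str) -> dict[str, str]:
    """Map each component name to its implementation name.

    Assumes lines of the form "Component <name>: <n> annnotations"
    followed later by "Implementation: <impl>", as in the info output above.
    """
    implementations: dict[str, str] = {}
    current = None
    for raw_line in info_output.splitlines():
        line = raw_line.strip()
        if line.startswith("Component "):
            # Drop the prefix and the trailing ": <n> annnotations" part.
            current = line[len("Component "):].rsplit(": ", 1)[0]
        elif line.startswith("Implementation: ") and current is not None:
            implementations[current] = line[len("Implementation: "):]
            current = None
    return implementations


def disk_based_components(implementations: dict[str, str]) -> list[str]:
    """Return components whose implementation name starts with 'Disk'."""
    return [name for name, impl in implementations.items()
            if impl.startswith("Disk")]


SAMPLE = """\
Status: "fully loaded"
Token search shortcut possible: true
------------
Component Coverage//: 0 annnotations
Stats: nodes=613, avg_fan_out=7.06, max_fan_out=212, max_depth=1
Implementation: DiskAdjacencyListV1
Status: "fully loaded"
------------
Component Pointing/tlg2003/dep: 116625 annnotations
Stats: nodes=121817, avg_fan_out=0.96, max_fan_out=22, max_depth=16, tree
Implementation: PrePostOrderO32L8V1
Status: "fully loaded"
"""
```

Calling disk_based_components(parse_implementations(SAMPLE)) on this sample would report only the Coverage// component, since the "starts with Disk" rule is a naming convention and not a guarantee.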

thomaskrause Nov 20, 2024
Maintainer

I also just realized that the list command has changed its output. The documentation is still giving an example with explicit sizes per corpus, but since this has been difficult to estimate correctly, it was changed to only show the overall memory status when preloading/loading a corpus. E.g. if you only selected one corpus, the messages will show you how much space graphANNIS will use until corpora are unloaded from the cache and how much cache was used after the preload command (which is effectively the corpus size in main memory).

>> corpus pcc2
pcc2> preload
11:12:19 [INFO] Loaded corpus pcc2
11:12:19 [INFO] Total cache size is 11.53 MB / 6927.23 MB and loaded corpora are: pcc2.
11:12:19 [INFO] Total cache size is 15.24 MB / 6928.15 MB and loaded corpora are: pcc2.
11:12:19 [INFO] Preloaded corpus in 9 ms

In this case the corpus size is around 15 MB.
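If you want to read this figure from a saved log instead of by eye, a small sketch like the following picks the last "Total cache size" message and returns the used and maximum cache size. It assumes both values are reported in MB, as in the messages above; the pattern and function name are hypothetical and would need adjusting if the log uses other units.

```python
import re

# Matches messages such as
# "Total cache size is 15.24 MB / 6928.15 MB and loaded corpora are: pcc2."
CACHE_LINE = re.compile(r"Total cache size is ([\d.]+) MB / ([\d.]+) MB")


def last_cache_usage(log_text: str):
    """Return (used_mb, limit_mb) from the last cache message, or None."""
    matches = CACHE_LINE.findall(log_text)
    if not matches:
        return None
    used, limit = matches[-1]
    return float(used), float(limit)


LOG = """\
11:12:19 [INFO] Loaded corpus pcc2
11:12:19 [INFO] Total cache size is 11.53 MB / 6927.23 MB and loaded corpora are: pcc2.
11:12:19 [INFO] Total cache size is 15.24 MB / 6928.15 MB and loaded corpora are: pcc2.
11:12:19 [INFO] Preloaded corpus in 9 ms
"""
```

The last message is used because, per the explanation above, it reflects the cache usage after preload has finished.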

gcelano Nov 20, 2024
Author

I meant ~40G of RAM occupied by the corpus. It seems that it is using RAM, but there are different kinds of implementations and many refer to "0 annotations", so I do not know if they are in place:

Status: "fully loaded"
Token search shortcut possible: true
------------
Component Coverage/annis/: 0 annnotations
Stats: nodes=0, avg_fan_out=0.00, max_fan_out=0, max_depth=1, tree
Implementation: AdjacencyListV1
Status: "fully loaded"
------------
Component Coverage/tlg2003/: 0 annnotations
Stats: nodes=127011, avg_fan_out=0.96, max_fan_out=216, max_depth=1, tree
Implementation: AdjacencyListV1
Status: "fully loaded"

...

------------
Component Pointing/tlg2003/dep: 116625 annnotations
Stats: nodes=121817, avg_fan_out=0.96, max_fan_out=22, max_depth=16, tree
Implementation: PrePostOrderO32L8V1
Status: "fully loaded"
------------

...

------------
Component Ordering/annis/: 0 annnotations
Stats: nodes=15949606, avg_fan_out=1.00, max_fan_out=1, max_depth=1008852, tree
Implementation: LinearO32V1
Status: "fully loaded"
------------
...
------------
Component PartOf/annis/: 0 annnotations
Stats: nodes=16941622, avg_fan_out=1.00, max_fan_out=1, max_depth=3
Implementation: DenseAdjacencyListV1
Status: "fully loaded"

gcelano · 2024-11-20T15:43:25Z

gcelano
Nov 20, 2024
Author

It seems that the best solution for performance is to split my corpus into 5 subcorpora. If I keep one corpus, the overall processing time for conversion into relannis/graphannis is considerably longer than if I process 5 subcorpora. When searching 5 subcorpora, I get some initial results almost immediately, but ANNIS is then blocked for a few more seconds, probably to sort the results (even though mine should already be sorted). If I use one single 40M+ token corpus, the query takes much longer to return any results, but once it returns them, ANNIS can be used again immediately (for example, by clicking on the search button). Splitting also has the advantage that if I spot a mistake in a few files, I can rebuild only 1/5 of the corpus.

0 replies
