Possibly an optimization of concurrent fact indexer #27

ulrikrasmussen · 2024-11-07T10:30:26Z

I am using your excellent and readable implementation to learn about Datalog evaluation techniques. I was a bit puzzled by the fact indexer, which I think can be optimized to avoid some worst-case inputs. This is just a suggestion, of course :).

The concurrent fact indexer maintains a coarse and fine index of
facts, where the coarse index is on the predicate symbol and the fine
index is on the predicate symbol, argument index and constant at that
argument. The fact indexer returns an overapproximation of the set of
facts that might match the input query but tries to minimize the
returned fact set using a heuristic which picks from the fine index
the fact set maintained at the argument position for which the most
distinct constants have been seen so far. Under the assumption that
the fact sets are roughly of equal size for each constant, this is a
reasonable strategy, but it is easy to construct worst cases where
much larger fact sets than necessary are returned. For example,
consider the facts

f(a, x, b).
f(b, x, b).
f(c, x1, b).
f(c, x2, b).
...
f(c, x10000, b).
f(c, x, d).

and the query f(c, _, d)?

Two fine indexes are considered as candidates: the first and the
third (the second argument of the query is a wildcard). The first
index has three distinct constants and the third has only two, so the
heuristic picks the first. The candidate fact set returned is thus the
10,001-element set

{ f(c, x1, b), ..., f(c, x10000, b), f(c, x, d) }

rather than the singleton set

{ f(c, x, d) }
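
To make the numbers concrete, here is a minimal, self-contained sketch that rebuilds the example with plain `HashMap`s. It is not AbcDatalog's indexer: the class `HeuristicWorstCase`, its method names, and the list-of-strings fact representation are all made up for illustration, and `null` stands in for the `_` wildcard.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from AbcDatalog.
final class HeuristicWorstCase {
    /** The facts from the example; each fact f(a0, a1, a2) is a list of three constants. */
    static List<List<String>> exampleFacts() {
        List<List<String>> facts = new ArrayList<>();
        facts.add(List.of("a", "x", "b"));
        facts.add(List.of("b", "x", "b"));
        for (int i = 1; i <= 10000; i++) {
            facts.add(List.of("c", "x" + i, "b"));
        }
        facts.add(List.of("c", "x", "d"));
        return facts;
    }

    /** Fine index for a single predicate: argument position -> constant -> facts. */
    static List<Map<String, List<List<String>>>> buildFineIndex(
            List<List<String>> facts, int arity) {
        List<Map<String, List<List<String>>>> index = new ArrayList<>();
        for (int pos = 0; pos < arity; pos++) {
            index.add(new HashMap<>());
        }
        for (List<String> fact : facts) {
            for (int pos = 0; pos < arity; pos++) {
                index.get(pos).computeIfAbsent(fact.get(pos), k -> new ArrayList<>()).add(fact);
            }
        }
        return index;
    }

    /** Heuristic: use the bound position that has seen the most distinct constants. */
    static List<List<String>> byMostDistinctConstants(
            List<Map<String, List<List<String>>>> index, String[] query) {
        int bestPos = -1;
        for (int pos = 0; pos < query.length; pos++) {
            if (query[pos] != null
                    && (bestPos < 0 || index.get(pos).size() > index.get(bestPos).size())) {
                bestPos = pos;
            }
        }
        if (bestPos < 0) {
            throw new IllegalArgumentException("query has no bound argument");
        }
        return index.get(bestPos).getOrDefault(query[bestPos], List.of());
    }

    /** Alternative: compare the candidate sets themselves and return the smallest. */
    static List<List<String>> bySmallestSet(
            List<Map<String, List<List<String>>>> index, String[] query) {
        List<List<String>> best = null;
        for (int pos = 0; pos < query.length; pos++) {
            if (query[pos] == null) {
                continue; // wildcard: this position offers no candidate set
            }
            List<List<String>> bucket = index.get(pos).getOrDefault(query[pos], List.of());
            if (best == null || bucket.size() < best.size()) {
                best = bucket;
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("query has no bound argument");
        }
        return best;
    }

    public static void main(String[] args) {
        List<Map<String, List<List<String>>>> index = buildFineIndex(exampleFacts(), 3);
        String[] query = {"c", null, "d"}; // f(c, _, d)?
        System.out.println(byMostDistinctConstants(index, query).size()); // prints 10001
        System.out.println(bySmallestSet(index, query).size());           // prints 1
    }
}
```

Both strategies return an overapproximation of the matching facts; they differ only in how many non-matching facts the caller then has to filter out.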

The fact indexer cannot evaluate the size of the underlying container
since it has to work for all Iterable containers, including
ConcurrentLinkedBag which cannot implement the Container interface for
efficiency reasons. It is, however, quite easy to add a size method to
this collection type and pass it as a lambda to the fact indexer. The
fact indexer can then pick the smallest set from the fine index
instead of relying on a heuristic.
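
The idea of passing the size method in from outside can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `SizeAwareSelection`, `CountingBag`, and `smallest` are hypothetical names, and `CountingBag` is only a stand-in for `ConcurrentLinkedBag`, built here on `ConcurrentLinkedQueue` (whose own `size()` is not constant-time) plus an `AtomicInteger` counter.

```java
import java.util.Iterator;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.ToIntFunction;

// Hypothetical sketch; the names here are not AbcDatalog's.
final class SizeAwareSelection {
    /** Stand-in for a concurrent bag that is only Iterable but tracks its own size. */
    static final class CountingBag<T> implements Iterable<T> {
        private final ConcurrentLinkedQueue<T> items = new ConcurrentLinkedQueue<>();
        private final AtomicInteger size = new AtomicInteger();

        void add(T item) {
            items.add(item);
            size.incrementAndGet(); // counter may briefly lag the queue under concurrency
        }

        /** Constant-time size; good enough for choosing between candidate sets. */
        int size() {
            return size.get();
        }

        @Override
        public Iterator<T> iterator() {
            return items.iterator();
        }
    }

    /**
     * Returns the candidate with the smallest reported size, or null if there
     * are no candidates. The size function is supplied by the caller, so the
     * container type C does not need to implement any particular interface.
     */
    static <C> C smallest(Iterable<C> candidates, ToIntFunction<? super C> size) {
        C best = null;
        int bestSize = Integer.MAX_VALUE;
        for (C candidate : candidates) {
            int s = size.applyAsInt(candidate);
            if (best == null || s < bestSize) {
                best = candidate;
                bestSize = s;
            }
        }
        return best;
    }
}
```

A caller would pass the size method as a method reference, for example `SizeAwareSelection.smallest(candidates, SizeAwareSelection.CountingBag::size)`. Since the result is only used to choose among overapproximations, a size that is slightly stale under concurrent insertion affects efficiency, not correctness.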

aaronbembenek · 2024-11-08T01:38:34Z

Thanks for the contribution! And I'm glad that you're finding AbcDatalog to be a useful resource - that's great to hear! :)

AbcDatalog's data structures are pretty naive. If you are curious to see higher-performance Datalog data structures that are still built on top of Java standard library collections, you can check out Formulog, which implements optimal index selection and uses indices that support efficient range queries (the standard approach for Datalog).

Ulrik Rasmussen and others added 3 commits November 7, 2024 11:21

Add fact indexer tests

2ffe8a0

Use :: syntax for methods.

f72e1ba

aaronbembenek merged commit 10229d9 into HarvardPL:master Nov 7, 2024
1 check passed
