Possibly an optimization of concurrent fact indexer #27

ulrikrasmussen
Contributor

I am using your excellent and readable implementation to learn about Datalog evaluation techniques and was a bit puzzled by the fact indexer, which I think can be optimized to avoid some worst-case edge cases. This is just a suggestion, of course :).

The concurrent fact indexer maintains a coarse and a fine index of
facts, where the coarse index is keyed on the predicate symbol and the
fine index is keyed on the predicate symbol, the argument index, and
the constant at that argument. The fact indexer returns an
overapproximation of the set of facts that might match the input
query, but tries to minimize the returned fact set using a heuristic:
from the fine index, it picks the fact set maintained at the argument
position for which the most distinct constants have been seen so far.
Under the assumption that the fact sets are roughly of equal size for
each constant, this is a reasonable strategy, but it is easy to
imagine worst cases where the returned fact sets are much too large.
For example, consider the facts

f(a, x, b).
f(b, x, b).
f(c, x1, b).
f(c, x2, b).
...
f(c, x10000, b).
f(c, x, d).

and the query f(c, _, d)?

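Two fine indexes are considered as candidates: the first and the
third argument positions (the second is a wildcard in the query). The
first index has seen three distinct constants (a, b, c) and the third
only two (b, d), so the heuristic picks the first. The candidate fact
set returned is thus the 10,001-element set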

{ f(c, x1, b), ..., f(c, x10000, b), f(c, x, d) }

rather than the singleton set

{ f(c, x, d) }
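
To make the example concrete, here is a minimal single-threaded sketch of the heuristic as described above. It is not AbcDatalog's actual code: the class `HeuristicIndexSketch`, its methods, and the use of `null` for a wildcard are all made up for illustration, and it only handles one predicate and queries with at least one bound argument.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the fine index for a single predicate f/3.
// Names and structure are assumptions, not AbcDatalog's real classes.
final class HeuristicIndexSketch {
    // fine.get(i) maps a constant at argument position i
    // to all facts that have that constant at position i.
    private final List<Map<String, List<String[]>>> fine = new ArrayList<>();

    HeuristicIndexSketch(int arity) {
        for (int i = 0; i < arity; i++) {
            fine.add(new HashMap<>());
        }
    }

    void add(String... args) {
        for (int i = 0; i < args.length; i++) {
            fine.get(i).computeIfAbsent(args[i], k -> new ArrayList<>()).add(args);
        }
    }

    // Number of distinct constants seen so far at the given position.
    int distinctConstants(int position) {
        return fine.get(position).size();
    }

    // Heuristic: among the bound positions of the query (null = wildcard),
    // use the one at which the most distinct constants have been seen.
    List<String[]> candidatesByHeuristic(String... query) {
        int best = -1;
        for (int i = 0; i < query.length; i++) {
            if (query[i] == null) {
                continue;
            }
            if (best < 0 || distinctConstants(i) > distinctConstants(best)) {
                best = i;
            }
        }
        if (best < 0) {
            throw new IllegalArgumentException("query has no bound argument");
        }
        return fine.get(best).getOrDefault(query[best], List.of());
    }

    // Builds the fact set from the example above.
    static HeuristicIndexSketch example() {
        HeuristicIndexSketch index = new HeuristicIndexSketch(3);
        index.add("a", "x", "b");
        index.add("b", "x", "b");
        for (int i = 1; i <= 10000; i++) {
            index.add("c", "x" + i, "b");
        }
        index.add("c", "x", "d");
        return index;
    }

    public static void main(String[] args) {
        HeuristicIndexSketch index = example();
        // Position 0 has 3 distinct constants, position 2 has 2,
        // so the heuristic returns every fact with c in position 0.
        System.out.println(index.candidatesByHeuristic("c", null, "d").size()); // prints 10001
    }
}
```

In this sketch the set stored under `d` at the third position has exactly one element, but the heuristic never looks at it because it compares distinct-constant counts rather than set sizes.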

The fact indexer cannot query the size of the underlying container,
since it has to work for all Iterable containers, including
ConcurrentLinkedBag, which cannot implement the Container interface for
efficiency reasons. It is, however, quite easy to add a size method to
this collection type and pass it as a lambda to the fact indexer. The
fact indexer can then just pick the smallest set from the fine index
instead of relying on a heuristic.
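
The idea can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `CountingBag` is a hypothetical stand-in for a bag that is only `Iterable` but keeps a counter, and `SmallestSetSelector.smallest` is a made-up helper that takes the size function as a `ToIntFunction`.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.ToIntFunction;

// Hypothetical stand-in for a concurrent bag: it is only Iterable,
// but tracks how many elements have been added.
final class CountingBag<T> implements Iterable<T> {
    private final ConcurrentLinkedQueue<T> items = new ConcurrentLinkedQueue<>();
    private final AtomicInteger size = new AtomicInteger();

    void add(T item) {
        items.add(item);
        size.incrementAndGet();
    }

    // Under concurrent inserts this may briefly lag behind the contents.
    int size() {
        return size.get();
    }

    @Override
    public Iterator<T> iterator() {
        return items.iterator();
    }
}

// Hypothetical helper: picks the candidate container with the smallest
// reported size, using a size function supplied by the caller.
final class SmallestSetSelector {
    static <C> C smallest(List<C> candidates, ToIntFunction<C> sizeOf) {
        C best = null;
        int bestSize = Integer.MAX_VALUE;
        for (C candidate : candidates) {
            int s = sizeOf.applyAsInt(candidate);
            if (best == null || s < bestSize) {
                best = candidate;
                bestSize = s;
            }
        }
        return best; // null if there were no candidates
    }

    public static void main(String[] args) {
        CountingBag<String> firstArgIsC = new CountingBag<>();
        for (int i = 1; i <= 10000; i++) {
            firstArgIsC.add("f(c, x" + i + ", b)");
        }
        firstArgIsC.add("f(c, x, d)");

        CountingBag<String> thirdArgIsD = new CountingBag<>();
        thirdArgIsD.add("f(c, x, d)");

        CountingBag<String> picked =
                smallest(List.of(firstArgIsC, thirdArgIsD), CountingBag::size);
        System.out.println(picked.size()); // prints 1
    }
}
```

Since every candidate set is an overapproximation of the matching facts, a slightly stale size only affects how much work the caller does, not which answers it finds.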

Ulrik Rasmussen and others added 3 commits November 7, 2024 11:21
@aaronbembenek aaronbembenek merged commit 10229d9 into HarvardPL:master Nov 7, 2024
1 check passed
@aaronbembenek
Collaborator

Thanks for the contribution! And I'm glad that you're finding AbcDatalog to be a useful resource - that's great to hear! :)

AbcDatalog's data structures are pretty naive. If you are curious to see higher-performance Datalog data structures that are still built on top of Java standard library collections, you can check out Formulog, which implements optimal index selection and uses indices that support efficient range queries (the standard approach for Datalog).
