Possibly an optimization of concurrent fact indexer #27
I am using your excellent and readable implementation to learn about Datalog evaluation techniques, and I was a bit puzzled by the fact indexer, which I think can be optimized to avoid some worst-case edge cases. This is just a suggestion, of course :).
The concurrent fact indexer maintains a coarse and fine index of
facts, where the coarse index is on the predicate symbol and the fine
index is on the predicate symbol, argument index and constant at that
argument. The fact indexer returns an overapproximation of the set of
facts that might match the input query but tries to minimize the
returned fact set using a heuristic which picks from the fine index
the fact set maintained at the argument position for which the most
distinct constants have been seen so far. Under the assumption that
the fact sets are roughly of equal size for each constant this is a
sound strategy, but it is easy to imagine worst cases where fact sets
that are much too large are returned. E.g. consider the facts
and the query
f(c, _, d)?
Two fine indexes will be considered as candidates: the first and the
third. The first index has three distinct constants and the third has
only two, so the heuristic picks the first. The candidate fact set
returned is thus the 10,001-element set
{ f(c, x1, b), ..., f(c, x10000, b), f(c, x, d) }
rather than the singleton set
{ f(c, x, d) }
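To make the worst case concrete, here is a minimal Java sketch of the heuristic as I understand it. It is not the project's actual code: the class, the method names and the String[] fact representation are all made up for illustration. The fact set is also an assumption, chosen only to match the counts above: 10,000 facts f(c, xi, b), the fact f(c, x, d), and two extra facts f(a, x, b) and f(e, x, b) so that the first position sees three distinct constants.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HeuristicWorstCase {
    // Hypothetical fine index for a single predicate:
    // argument position -> constant -> facts with that constant at that position.
    public static List<Map<String, List<String[]>>> index(List<String[]> facts, int arity) {
        List<Map<String, List<String[]>>> idx = new ArrayList<>();
        for (int i = 0; i < arity; i++) {
            idx.add(new HashMap<>());
        }
        for (String[] fact : facts) {
            for (int i = 0; i < arity; i++) {
                idx.get(i).computeIfAbsent(fact[i], k -> new ArrayList<>()).add(fact);
            }
        }
        return idx;
    }

    // The heuristic described above: among the bound query positions
    // (null marks a wildcard), use the one with the most distinct constants.
    public static List<String[]> candidates(
            List<Map<String, List<String[]>>> idx, List<String[]> allFacts, String[] query) {
        int best = -1;
        for (int i = 0; i < query.length; i++) {
            if (query[i] == null) {
                continue;
            }
            if (best < 0 || idx.get(i).size() > idx.get(best).size()) {
                best = i;
            }
        }
        if (best < 0) {
            return allFacts; // nothing bound: fall back to the coarse index
        }
        return idx.get(best).getOrDefault(query[best], Collections.emptyList());
    }

    // Assumed fact set (see the note above); a and e are invented constants.
    public static List<String[]> exampleFacts() {
        List<String[]> facts = new ArrayList<>();
        for (int i = 1; i <= 10000; i++) {
            facts.add(new String[] {"c", "x" + i, "b"});
        }
        facts.add(new String[] {"c", "x", "d"});
        facts.add(new String[] {"a", "x", "b"});
        facts.add(new String[] {"e", "x", "b"});
        return facts;
    }

    public static void main(String[] args) {
        List<String[]> facts = exampleFacts();
        List<Map<String, List<String[]>>> idx = index(facts, 3);
        String[] query = {"c", null, "d"}; // f(c, _, d)?
        System.out.println(candidates(idx, facts, query).size()); // prints 10001
        System.out.println(idx.get(2).get("d").size());           // prints 1
    }
}
```

With these facts the first position has three distinct constants and the third has two, so the heuristic returns every fact starting with c, although the third-position entry for d holds a single fact.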
The fact indexer cannot evaluate the size of the underlying container
since it has to work for all Iterable containers, including
ConcurrentLinkedBag, which cannot implement the Container interface
for efficiency reasons. It is however quite easy to add a size method
to this collection type and pass it as a lambda to the fact indexer.
The fact indexer can then just pick the smallest set from the fine
index instead of relying on a heuristic.
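As a rough illustration of the idea (a sketch only, not the code in this PR), the selection step could look like the following. SmallestSetPicker and smallest are hypothetical names, and the size function is whatever the caller supplies, for example a reference to the size method proposed above. In a concurrent container that size would presumably be a snapshot, which should be fine here since it only guides the choice of candidate set.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToIntFunction;

public class SmallestSetPicker {
    // Returns the candidate container with the fewest elements.
    // T only has to be Iterable; its size comes from the caller-supplied
    // function, so no common Container-style interface is required.
    // Returns null if there are no candidates.
    public static <T extends Iterable<?>> T smallest(List<T> candidates, ToIntFunction<T> size) {
        T best = null;
        int bestSize = Integer.MAX_VALUE;
        for (T candidate : candidates) {
            int s = size.applyAsInt(candidate);
            if (best == null || s < bestSize) {
                best = candidate;
                bestSize = s;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Plain lists stand in for the fine-index fact sets.
        List<String> viaFirstArg = Arrays.asList("f(c, x1, b)", "f(c, x2, b)", "f(c, x, d)");
        List<String> viaThirdArg = Arrays.asList("f(c, x, d)");
        List<String> picked = smallest(Arrays.asList(viaFirstArg, viaThirdArg), List::size);
        System.out.println(picked); // prints [f(c, x, d)]
    }
}
```

Passing the size function as a parameter keeps the indexer generic over the container type, at the cost of one extra call per candidate position on each lookup.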