Performance regression for queries using OPTIONAL and VALUES in 3.7.4/4.1.0 vs 2.5.5 #4232
Comments
Can you try using the query explanation feature? The long runtime could be related to sub-optimal query optimization.
I let the query actually finish out with 3.7.4 and it took nearly 24 hours. Crazy that the query used to run efficiently in 4 seconds but now takes nearly a full day.
Here is the explanation:
If I change the query to remove one of the values in the VALUES declaration, it runs just as slowly:
But by removing the VALUES declaration and subbing in just
Reworking the query to UNION both different VALUES has it run in the original 4s
It appears that there is an issue with VALUES in the query evaluation.
Is it correct that the query you are having performance issues with is using a VALUES clause, but if you remove the VALUES clause then there is no performance difference?
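To make the claimed equivalence concrete, here is a minimal sketch in plain Python. It is not RDF4J code: the toy triples, the `match` and `evaluate` helpers, and the predicate names are all hypothetical. It only illustrates that a join seeded with VALUES bindings and a UNION of the same pattern with each value substituted as a constant should yield the same solutions.

```python
# Toy triple store: (subject, predicate, object). All data here is made up.
TRIPLES = {
    ("a", "type", "Class"),
    ("b", "type", "Class"),
    ("a", "label", "Alpha"),
    ("b", "label", "Beta"),
    ("a", "synonym", "A"),
}


def match(pattern, binding):
    """Yield extensions of `binding` under which `pattern` matches a triple.

    Terms starting with '?' are variables; anything else is a constant.
    """
    for triple in TRIPLES:
        new = dict(binding)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or reject if it is already bound differently.
                if new.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield new


def evaluate(patterns, seeds):
    """Left-to-right nested-loop join, starting from the given seed bindings."""
    solutions = list(seeds)
    for pattern in patterns:
        solutions = [ext for b in solutions for ext in match(pattern, b)]
    return solutions


def canon(solutions):
    """Order-independent form of a solution list, for comparison."""
    return sorted(tuple(sorted(s.items())) for s in solutions)


predicates = ("label", "synonym")

# Variant 1: VALUES ?p { "label" "synonym" } supplies the seed bindings.
with_values = evaluate(
    [("?s", "type", "Class"), ("?s", "?p", "?o")],
    [{"?p": p} for p in predicates],
)

# Variant 2: UNION of one branch per value, with the value inlined as a constant.
as_union = []
for p in predicates:
    for sol in evaluate([("?s", "type", "Class"), ("?s", p, "?o")], [{}]):
        as_union.append({**sol, "?p": p})

print(canon(with_values) == canon(as_union))
```

Both variants are logically equivalent, so any large runtime gap between them would point at how the engine plans the query rather than at the results it has to produce.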
@hmottestad I would say that is accurate. These two essentially equivalent queries run in 23 hours and 4 seconds respectively.
This is the generically correct way to parse VALUES clauses. An optimizer can potentially look at the ordering in the algebra to push the values clause down into the join tree (by inspecting which parts of the tree have variables bound in the VALUES clause).
I just spent a fruitless two hours on this, tricky to see where the problem is. TBC.
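As a rough illustration of why pushing the VALUES bindings down into the join tree matters, the sketch below counts intermediate bindings for a toy nested-loop join in plain Python. Everything in it is an assumption made for illustration (the generated data, the `match` and `evaluate` helpers, the two-pattern query); it does not reflect how RDF4J's optimizer or evaluation strategy is actually implemented.

```python
def make_triples(n):
    """Generate n subjects, each with one 'type' and one 'label' triple."""
    triples = []
    for i in range(n):
        s = f"s{i}"
        triples.append((s, "type", "Class"))
        triples.append((s, "label", f"label {i}"))
    return triples


def match(triples, pattern, binding):
    """Yield extensions of `binding` under which `pattern` matches a triple."""
    for triple in triples:
        new = dict(binding)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if new.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield new


def evaluate(triples, patterns, seeds):
    """Nested-loop join; also returns the total number of intermediate bindings."""
    solutions = list(seeds)
    intermediate = 0
    for pattern in patterns:
        solutions = [ext for b in solutions for ext in match(triples, pattern, b)]
        intermediate += len(solutions)
    return solutions, intermediate


triples = make_triples(200)
patterns = [("?s", "type", "Class"), ("?s", "label", "?l")]
values = [{"?s": "s1"}, {"?s": "s2"}]  # stands in for VALUES ?s { ... }

# VALUES joined last: compute the full join, then keep only compatible rows.
all_rows, work_last = evaluate(triples, patterns, [{}])
late = [r for r in all_rows if any(r["?s"] == v["?s"] for v in values)]

# VALUES pushed down: its bindings seed the join, so ?s is bound from the start.
early, work_first = evaluate(triples, patterns, values)

print("intermediate bindings, VALUES last: ", work_last)
print("intermediate bindings, VALUES first:", work_first)
```

The two strategies return the same rows, but the number of intermediate bindings grows with the dataset in the first case and with the size of the VALUES clause in the second. A real engine has indexes and cardinality estimates, so the actual gap depends on the plan it picks.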
I'm trying to construct a benchmark that reproduces the performance difference, using a synthetic dataset (that hopefully is a little easier to work with than the full dataset provided by @daltontc). Struggling to show any significant performance difference though. Will try and look closer at the shape of the dataset. @daltontc any insights from your end on how your data is structured? Ideally I'd like to generate a dataset that we can use to reproduce the performance issue, but not quite at the scale of your original set.
This benchmark uses generated data conforming to the query pattern, and executes performance tests on both the variant with a VALUES clause and (as a baseline) the simple equivalent query. Unfortunately, so far I have been unable to reproduce any significant performance difference.
There is a trig file in the repo that @daltontc linked to. It reproduces the issue for me. I think it's an issue with the join order from the query planner. I'll explain more in a bit, will also hopefully push a commit to your branch.
Without VALUES clause
With VALUES clause
Notice that the plans swap the bottom two joins. This is because I've "helped" the join optimizer by adding a lot of The join optimizer swaps the two last joins because See the inline comments marked with
@jeenbroekstra I'm not going to dig any further right now. If you don't have time yourself then maybe I'll pick it up in a few days.
@hmottestad nice work, thanks! I'll try and pick it up again later this week, but feel free to also dig in if you have the time :)
Current Behavior
In versions 3.7.4 and 4.1.0, iterating over a TupleQueryResult with large results is extremely slow compared to older RDF4J versions (2.5.5).
With newer RDF4J versions, running the following query over the CHEBI Lite ontology stored in a graph takes over an hour to process results, whereas in RDF4J 2.5.5 it takes only 4s.
See https://github.com/daltontc/rdf4jTest/tree/bug/tuplequeryresults_slow_iteration for an example
Expected Behavior
The time it takes to iterate over large query results should be relatively consistent between versions, if not better in later versions.
Steps To Reproduce
Version
3.7.4, 4.1.0
Are you interested in contributing a solution yourself?
No response
Anything else?
No response