-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JOIN LATERAL or Correlated Subquery #100
Comments
Related: "Query hints" #71 |
https://www.stardog.com/labs/blog/stored-query-service/ explains how to call stored queries in Stardog (similar to subqueries but providing better modularity).
@kasei, @afs, @jeenbroekstra and @JervenBolleman:
@rubensworks Do you agree with Onto that this issue is quite crucial for implementing GraphQL fully and correctly? |
Any example where this would be relevant that is not related to GraphQL?.. |
An example how a correlated join is a useful addition to un-correlated ones? It's exactly the same situation as in SQL and there're lots of use cases, the classical one is something like "you have a database of products and a set of offers for each product, you want to get 5 highest price offers for each product". GraphQL is one important use case here but this is a core SPARQL extension proposal which is perfectly valid outside of the GraphQL context. |
I think this issue is interesting enough to do a SEP on. Reading through this `LATERAL` proposal, I suspect that it overlaps with the `PREFER` SEP (proposed for #13) and maybe that can be extended. The syntax/usecase is different, but execution wise I suspect the overlap is much higher. As in `PREFER` is a `LATERAL` query that must have an `ORDER BY` and fixed `LIMIT` of 1. Which I had not thought of, and considering I only had my first coffee I am willing to be corrected on ;)
|
I fully agree on the importance of this issue for SPARQL in general. I wouldn't call it crucial for GraphQL per se, as different interpretations are possible on how to apply it to RDF data. But for the interpretation that Ontotext has picked, it definitely makes sense. |
ignore sql machinations related to lateral and correlated subqueries, to concentrate just on the intent to produce from some dependent sub-expression some number of solutions for each solution of some dominant subexpression. should not the general "sidewards information passing" mechanism suffice? is this best accommodated with a union variant? is it not more a matter of just declaring top-down rather than bottom-up reduction - analogous to the conflagration surrounding EXISTS?
or
|
@lisp from the perspective of a SPARQL user (rather than implementer) it would seem more intuitive to have a somewhat bottom-up phrasing for this. Based on your example:
|
given the implicit control flow which applies to the rest of the language, this would be
|
https://medium.com/agnos-ai/sql-features-not-supported-in-sparql-1-1-eb34e3519077#7d71 AGNOS blog giving some examples, but in SQL using the "Northwind" MS SQL example database |
@lisp We need a syntax showing the dependency (major-minor) explicitly (as in your #100 (comment)). A symmetric syntax like your #100 (comment) won't work because:
@jaw111 Please let's not spell it minor-major. (Do you write RTL by any chance? ;-) Here's an example of deeper nesting, and each level has a dependency on the previous level:
|
@klinovp on twitter: We added correlated subqueries to SPARQL: https://docs.stardog.com/query-stardog/stored-query-service#correlated-subqueries It's similar to LATERAL in Postgres. |
what are the circumstances that require the |
@klinovp no RTL experience here. Looking back to the original proposed solution, I struggle with the UNION clause, but could that simply be omitted and use a syntax for LATERAL something of a mash-up of VALUES and FILTER EXISTS to indicate the variable(s) for which bindings are to be passed sideways/laterally to the inner pattern/subquery. select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
{
select ?country ?city ?city_name ?population {
bind(<http://sws.geonames.org/732770/> as ?country)
?country x:city ?city.
?city x:name ?city_name.
?city x:population ?population.
}
order by desc(?population)
limit 2
}
LATERAL ?city {
select ?city ?company ?company_name ?revenue {
?city x:company ?company.
?company x:name ?company_name.
?company x:revenue ?revenue.
}
order by desc(?revenue)
limit 2
}
} One could also pass in multiple variables and make the minor part optional. The outer part need not be a subquery: select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
bind(<http://sws.geonames.org/732770/> as ?country)
?country x:city ?city.
?city x:name ?city_name.
?city x:population ?population.
optional {
LATERAL ?city ?country {
select ?city ?company ?company_name ?revenue {
?city x:company ?company.
?company x:name ?company_name.
?company x:revenue ?revenue.
?company x:headquarteredIn ?country .
}
order by desc(?revenue)
limit 2
}
}
} Similarly, what is in the lateral might just be a graph pattern rather than a subquery. This could be used to give hints to the query processor whether to pass in solutions it already has when evaluating an optional clause (nest loop join) or use bottom up approach. select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
bind(<http://sws.geonames.org/732770/> as ?country)
?country x:city ?city.
?city x:name ?city_name.
?city x:population ?population.
optional {
LATERAL ?city ?country {
?city x:company ?company.
?company x:name ?company_name.
?company x:revenue ?revenue.
?company x:headquarteredIn ?country .
}
}
} |
@jaw111 thanks for your syntax proposals!
select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
bind(<http://sws.geonames.org/732770/> as ?country)
?country x:city ?city.
?city x:name ?city_name.
?city x:population ?population.
optional {
LATERAL ?city ?country {
select ?city ?company ?company_name ?revenue {
?city x:company ?company.
?company x:name ?company_name.
?company x:revenue ?revenue.
?company x:headquarteredIn ?country .
}
order by desc(?revenue)
limit 2
}
}
} order by desc(?population) limit 2
So I'd favor syntaxes that reflect that nesting in the query structure. |
the same question applies to the LATERAL proposal as to sqs:inputs: what are the circumstances which require the declaration for the corresponding variables? |
@lisp minor should execute for each solution of major. |
For the just released Jena 4.6.0 I contributed the "service enhancer" plugin which has lateral joins as one of its features: https://jena.apache.org/documentation/query/service_enhancer.html Example: Fetch one English label for every entity: PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?s ?l {
VALUES ?s { wd:Q1686799 wd:Q54837 wd:Q54872 wd:Q54871 wd:Q108379795 }
SERVICE <loop:https://query.wikidata.org/sparql> { # SERVICE <loop:> {...} lateral joins with active dataset
SELECT ?l {
?s rdfs:label ?l
FILTER(langMatches(lang(?l), 'en'))
} ORDER BY ?l LIMIT 1
}
}
That was also my finding that its not necessary to declare the variables. In principle, lateral joins require different scoping rules as described in the linked document. |
Question: could someone comment on the semantics of multiple / consecutive lateral joins? Does lateral join n extend lateral join n-1? Or would this be based on shared variables? An example use case could be to
Example queryselect ?thread ?title
?postTitle ?postAuthor
?topAuthor
{
?thread a :Thread ;
:title ?title .
lateral {
select (?title as ?postTitle)
(?author as ?postAuthor)
{
?post :postedIn ?thread ;
:title ?title ;
:author ?author ;
:postedOn ?postedOn .
}
order by desc(?postedOn)
limit 5
}
lateral {
select (?author as ?topAuthor)
{
?post :postedIn ?thread ;
:author ?author .
}
group by ?author
order by desc(count(*))
limit 3
}
} The key here is that the two 'lateral clauses' are meant to be unrelated to each other and only be related to the top-level query. |
@frensjan - interesting example. In the current proposal, it is as @Tpt describes. The second lateral takes the results from the first as input. With no shared variables, that is in effect a cross-product. What shape of result set are you trying to produce? there can be multiple lines for the same The description feels like it is a an "details for A, details for B" (A and B themselves not connected) which suggests a |
Thanks for the responses @afs and @Tpt; understood and as I expected. For as long as the size of the 'legs' is small, the costs of such a cross-product could be acceptable. The use case would be to show related information in an entity listing / table or in an entity detail view. I.e. the name and number of posts for the OP and the number of posts per thread of a listing of threads. Using a pseudo GraphQL query this could be something like: {
threads(limit:10) {
title
creationDate
postCount # lateral join to count(?post) over ?post :parent ?thread
creator { # lateral join to get ?creator :created ?thread
name # fetch ?creator :name ?name
postCount # count(?post) over ?creator :created ?post
}
}
} A more elaborate example for vulnerabilities{
cve(id:"https://nvd.nist.gov/vuln/detail/CVE-2022-42003") {
title
description
...
link {
url
tags {
label
}
}
weaknesses {
id
name
source
}
affected {
cpe {
pattern
version {
from
to
}
instances(limit:10) {
part
vendor
product
version
update
edition
...
}
}
}
}
} Use cases where information is tracked over time it would be useful to show the latest value for a number of attributes; e.g. based on a timestamp associated with the graph such triple occurs in. If the application can assume these attributes change values atomically in this example in the same named graph, this could be solved by one lateral join to the named graph with the last changes. But reality can be stubborn in my experience.
Note that while I'm using GraphQL here as examples, it is intended as pseudo code / a means to convey intent. I think these use-cases hold regardless of whether GraphQL is part of the technology stack. |
With regards to SELECT * {
?s :value ?v
LATERAL {
SELECT DISTINCT ?o {
?o :value ?w .
FILTER(?w > ?v)
}
}
} The intent of the query is to select the However, based on SEP-0007 it would seem that SELECT * {
?s :value ?v
LATERAL {
SELECT DISTINCT ?v ?o {
?o :value ?w .
FILTER(?w > ?v)
}
}
} However, this is a bit awkward, as ?v can never have the output role of the sub-select to use the wording from Correlation and Substitution in SPARQL. |
@frensjan - nice - this brings out the difference between LATERAL and parameterised queries by variable substitution. There are two different This is a difference between LATERAL and parameterised queries by variable substitution. I believe they are both are valid use cases - the same machinery applies but there are slight differences in how it is used, here how to treat scope-hidden variables. It doesn't even have to involve LATERAL: As a parameterized query: SELECT ?o ?w {
?o :value ?w .
FILTER(?w > ?v)
} makes sense. As a query, it rejects everything because the FILTER is error. #57 and others. |
the change to scoping rules is one of the reasons for LATERAL. |
Thanks for the reactions. So would you @afs and @lisp say my second query should work as intended? If so, don't you agree that it is awkward to select a variable @afs, what did you mean by
The query that I gave as an example translates to something like: SELECT * {
?s :value ?v
?o :value ?w .
FILTER(?w > ?v)
} But that's not really the issue I wanted to address. A more involved example: PREFIX : <http://example.org>
SELECT * WHERE {
# messages and when they were created
?message1 a :Message .
?message1 :created ?referenceTimestamp .
# for each message, the last ten messages that were created before it
LATERAL {
SELECT ?referenceTimestamp ?message2 {
?message2 a :Message .
?message2 :created ?created .
FILTER( ?created < ?referenceTimestamp)
}
ORDER BY DESC( ?created )
LIMIT 10
}
} Is the consensus that (given SEP-0006 and 0007) this would work as intended (as per the comments)? The algebra according to Arq (click to expand)
I assume that the query doesn't work as intended if ?referenceTimestamp isn't selected in the sub-select. Although I would like this to work as I think this would be a lot easier to explain to my users! (click to expand)PREFIX : <http://example.org>
SELECT * WHERE {
# messages and when they were created
?message1 a :Message .
?message1 :created ?referenceTimestamp .
# for each message, the last ten messages that were created before it
LATERAL {
SELECT ?message2 {
?message2 a :Message .
?message2 :created ?created .
FILTER( ?created < ?referenceTimestamp)
}
ORDER BY DESC( ?created )
LIMIT 10
}
} |
Yes. |
Yes! That's a very important application. The "up to 10 of" for each item is very hard to do otherwise.
FWIW That's a renamed variable. |
We're fleshing out the implementation of In the Oxigraph tests there are two queries: PREFIX ex: <http://example.org/>
SELECT ?s ?o WHERE {
?s a ex:T.
OPTIONAL { LATERAL {SELECT ?s ?o WHERE { ?s ex:p ?o } ORDER BY ?o LIMIT 2} }
} and PREFIX ex: <http://example.org/>
SELECT ?s ?o WHERE {
?s a ex:T.
LATERAL { OPTIONAL {SELECT ?s ?o WHERE { ?s ex:p ?o } ORDER BY ?o LIMIT 2} }
} From the expected results, it seems that in the first query For me it's surprising that the semantics of these two queries differ. I consider both I would consider I did noticed a subtle difference in the algebra for Jena:
vs
Note the I haven't tested the actual behaviour of Jena yet. But as Jena uses variable renaming, I would expect |
Amazing!
This is the trap. SELECT * WHERE { BIND(1 AS ?a } OPTIONAL { BIND(2 AS ?b) FILTER(BOUND(?a)) } } can be rewritten SELECT * WHERE { BIND(1 AS ?a } OPTIONAL { BIND(2 AS ?b) FILTER(false) } } because SELECT * WHERE { BIND(1 AS ?a } LATERAL { BIND(2 AS ?b) FILTER(BOUND(?a)) } } can be rewritten SELECT * WHERE { BIND(1 AS ?a } LATERAL { BIND(2 AS ?b) FILTER(true) } } because This is why rewriting |
Thanks @Tpt. This maybe a bit broader of an issue in RDF4J perhaps ... I'l look into that. |
This may also be an issue with the Jena implementation then @afs ? I've tested the Oxigraph queries and they both yield the same results given the test data; see below. (I could have perhaps created an issue for Jena, but as this concerns the semantics of LATERAL somewhat I reported it here) $ arq -version
Apache Jena version 4.10.0 $ cat data.ttl
@prefix ex: <http://example.org/> .
ex:s1 a ex:T ; ex:p 11 , 12 , 13 .
ex:s2 a ex:T ; ex:p 21 , 22 , 23 .
ex:s3 a ex:T . $ arq --explain --data data.ttl --query optional-in-lateral.rq
12:44:39 INFO exec :: QUERY
PREFIX ex: <http://example.org/>
SELECT ?s ?o
WHERE
{ ?s a ex:T
LATERAL
{ OPTIONAL
{ SELECT ?s ?o
WHERE
{ ?s ex:p ?o }
ORDER BY ?o
LIMIT 2
}
}
}
12:44:39 INFO exec :: ALGEBRA
(project (?s ?o)
(lateral
(bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/T>))
(leftjoin
(table unit)
(project (?s ?o)
(top (2 ?o)
(bgp (triple ?s <http://example.org/p> ?o)))))))
12:44:39 INFO exec :: BGP :: ?s rdf:type <http://example.org/T>
12:44:39 INFO exec :: Reorder/generic :: ?s rdf:type <http://example.org/T>
12:44:39 INFO exec :: BGP :: <http://example.org/s1> <http://example.org/p> ?o
12:44:39 INFO exec :: Reorder/generic :: <http://example.org/s1> <http://example.org/p> ?o
12:44:39 INFO exec :: BGP :: <http://example.org/s3> <http://example.org/p> ?o
12:44:39 INFO exec :: Reorder/generic :: <http://example.org/s3> <http://example.org/p> ?o
12:44:39 INFO exec :: BGP :: <http://example.org/s2> <http://example.org/p> ?o
12:44:39 INFO exec :: Reorder/generic :: <http://example.org/s2> <http://example.org/p> ?o
--------------
| s | o |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s3 | |
| ex:s2 | 21 |
| ex:s2 | 22 |
-------------- $ arq --explain --data data.ttl --query lateral-in-optional.rq
12:44:45 INFO exec :: QUERY
PREFIX ex: <http://example.org/>
SELECT ?s ?o
WHERE
{ ?s a ex:T
OPTIONAL
{ LATERAL
{ SELECT ?s ?o
WHERE
{ ?s ex:p ?o }
ORDER BY ?o
LIMIT 2
}
}
}
12:44:45 INFO exec :: ALGEBRA
(project (?s ?o)
(conditional
(bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/T>))
(lateral
(table unit)
(project (?s ?o)
(top (2 ?o)
(bgp (triple ?s <http://example.org/p> ?o)))))))
12:44:45 INFO exec :: BGP :: ?s rdf:type <http://example.org/T>
12:44:45 INFO exec :: Reorder/generic :: ?s rdf:type <http://example.org/T>
12:44:45 INFO exec :: BGP :: <http://example.org/s1> <http://example.org/p> ?o
12:44:45 INFO exec :: Reorder/generic :: <http://example.org/s1> <http://example.org/p> ?o
12:44:45 INFO exec :: BGP :: <http://example.org/s3> <http://example.org/p> ?o
12:44:45 INFO exec :: Reorder/generic :: <http://example.org/s3> <http://example.org/p> ?o
12:44:45 INFO exec :: BGP :: <http://example.org/s2> <http://example.org/p> ?o
12:44:45 INFO exec :: Reorder/generic :: <http://example.org/s2> <http://example.org/p> ?o
--------------
| s | o |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s3 | |
| ex:s2 | 21 |
| ex:s2 | 22 |
-------------- |
Hi @frensjan -- thanks for providing the clear details. The results look like the same rows to me. And I get the same results locally. What's the issue you've noticed?
Renaming only happens if a variable does not remain in-scope up the evaluation - |
Well, the issue is that it's a different result than when you run these queries with Oxigraph apparently!
That's what I've read in the javadoc. I could imagine that it's backed by a different evaluation algorithm and would explain differences.
That's also my initial 'stance'. But apparently in Oxigraph this is different? |
Please could could record the Oxigraph results here for the record? Jena5 gets the same results, different order:
(there is a different default in-memory graph implementation) |
The Oxigraph tests are on Github. The relevant files are:
|
I misread that as both queries give equal results Those two are different results: subselect_inside_optional.srx :
subselect_outside_optional.srx:
|
Jena gets the same as Oxigraph those if you use the "ref" query engine ( The "ref" engine is very simple, runs exactly the algebra and materializes each intermediate step. |
Thanks for investigating this further @afs. I assume choosing the RDF4J evaluates all left-joins in this manner (bindings from the LHS are used for variables with the same name on the LHS). I've raised the issue at RDF4J (which is larger than just optional lateral). Maybe not the easiest to explain to users (n=1)I must say that this does strike me as odd ... I need to explain this to my users: PREFIX : <http://example.org/>
SELECT * WHERE {
VALUES ?x { :x }
OPTIONAL {
FILTER( BOUND(?x) )
BIND( :y as ?y )
BIND( ?x as ?z )
}
} yields
(I intentionally wrote the FILTER clause first in the optional block for dramatic effect) I understand that this is as per the specification (filters in a group are applied to the group as a whole and if a left join has a group with a filter on the RHS, this filter is applied for the merged solutions from left and right that are compatible). But it's not that intuitive (for me, n=1). BlazegraphNote that querying the Wikidata SPARQL endpoint (which is still backed by Blazegraph if I'm not mistaken) gives yet another result:
VirtuosoVirtuoso is somewhat more aligned with the specification it seems when querying dbpedia. But instead of
|
Looking at subselect_inside_optional.srx - it is symmetric in :s1 and :s2 but the answers are different. I think that LATERAL rules apply inside the subselect because
if the subselect executes with free What am I missing?
This is not to argue Jena results for subselect_inside_optional are correct. If they are, it's an accident of Unoptimized evaluation does fails to propagate |
@frensjan, in
"it's" is to mean the recommendation's algebra or the result which it produces in this instance? |
It should perhaps be noted that this error is not mandatory behavior in Virtuoso. There's a checkbox on the
I confess to not yet fully understanding this issue. As it seem that the latter solution set is an error, I would appreciate it if someone who does fully understand this issue (and so can evaluate any patch for it) would log an issue on the Virtuoso project. |
I meant the algebra, the semantics of the query (and thus the results produced). This may very well be biased by the fact that I do a lot of imperative programming. But the curly braces around the RHS of OPTIONAL for a lot of people probably provide some kind of hint that what's in it has the same 'data' available (or not). Filters in OPTIONAL are special, as they have results from both LHS and RHS 'in scope'. Variables in statement patterns on the RHS shared with the LHS form the join key, so they feel like they're 'in scope'. E.g. in But for e.g. BIND this isn't the case. So you can write: SELECT ?price ?discount {
VALUES ?price { 10 }
OPTIONAL {
VALUES ?discount { 0.10 }
FILTER( ?price * (1 - ?discount) < 10 )
}
} But this query has different semantics: SELECT ?price ?discount {
VALUES ?price { 10 }
OPTIONAL {
VALUES ?discount { 0.10 }
BIND( ?price * (1 - ?discount) AS ?effectivePrice )
FILTER( ?effectivePrice < 10 )
}
} |
But perhaps I digress too much. This issue is about LATERAL, not OPTIONAL. If RDF4J would be 'fixed' though, I would need to train my users in the difference between E.g., this query gives, for every SELECT * {
?movie a :Movie
OPTIONAL { LATERAL {
{ SELECT (AVG(?rating) as ?averageRating) {
?movie :rating ?rating
} }
} }
} That's just hard to grok for me. |
That the OPTIONAL is has normal evaluation so the RHS is
No LHS |
https://ceur-ws.org/Vol-3471/paper8.pdf by @Aklakan |
Why?
GraphQL is a hot topic amongst developers and tool vendors.
There are several implementations of GraphQL over RDF (HyperGraphQL, Comunica, TopQuadrant, StarDog), and we at Onto are also working on an implementation.
GraphQL queries are hierarchical and quite regular.
Following the logic of GraphQL resolvers, you do the parent-level query then turn to the child-level.
Assuming a simple companies graph structure and some reasonable
order
andlimit
syntax (GraphQL "input objects"), a query "Give me the top 2 BG cities, and the top 2 companies in each" could be expressed like this in GraphQL:If you try to implement this with a SPARQL subquery, you'll run into what I call the "distributed limit" problem.
limit
in the company subquery will apply globally, so even if you use a limit of 2*2 or even 50k, the first city (Sofia) will gobble up all companies, leaving none for the other city.We at Onto believe that to implement this efficiently, you need the subquery to run in a loop for every row of the parent query.
Previous work
This is a common problem in databases. Eg see StackOverflow: Grouped LIMIT in PostgreSQL: show the first N rows for each group?, which gives the following solutions:
<child-order> OVER (PARTITION BY <parent-id> ORDER BY <parent-order>)
aka using Windowing functions<parent-query> JOIN LATERAL (<child-query>)
, see PostgreSQL (FROM, Lateral, SELECT, heap.io blog Dec 2014), SQL Server (cross apply)WITH
Common Table ExpressionJOIN (COUNT...GROUP BY) WHERE <=5
A Correlated Subquery is like
LEFT JOIN LATERAL ... ON true
, see StackOverflow: Difference between Lateral and Subquery:FROM x1,x2
wherex2
is a table function (egunnest
)Somewhat related SPARQL issues were posted before: #47 (windowing), #9 (partitioning).
However, LATERAL is usually faster than Windowing functions (depending on data and indexing).
Two of the leading GraphQL implementations for RDMBS use LATERAL: Hasura and Join Monster
Proposed solution
A key question is how to return the results, to ensure that child rows don't mess up the limit on parent rows.
rdf:List
(Edited Sep 2020 to interleave the Company rows with the City rows, which makes reconstructing the nested objects easier and enables potential streaming. Previously I had all city rows, then all company rows)
Assume this construct (other syntax suggestions are welcome!):
?var
must be bound byfoo
(and exported in itsselect
if it's a subquery)?var
must also be exported bybar
bar
is iterated for every binding of?var
bar
are appended to the results offoo
Then we could use it to implement the query in question:
(It's more likely to have inverse links
?city x:country ?country
and?company x:city ?city
but for simplicity we use straight links)Considerations for backward compatibility
None?
The text was updated successfully, but these errors were encountered: