-
Notifications
You must be signed in to change notification settings - Fork 29
8. Query Planning
There are numerous ways that executing a query can occur, all of which will return a correct answer, but which can result in significantly different efficiency.
The ordering of operations for executing a query is called the query plan. By default, Asami tries to select an efficient plan, though not necessarily the most optimal. This can be overridden by appending the options :query-plan :user
as option parameters for the asami.core/q
function.
The current Asami planner uses a few principles:
- Order by Resolution Counts
- Chain by Connected Variables
- Push filters as far left as possible
- Push bindings as far right as possible
Consider a pair of constraints on a UK database which seek to find the people named "Elizabeth" who live at the town of Meryton.
[?person :name "Elizabeth"] [?person :lives-at "Meryton"]
While it varies with time, the name "Elizabeth" has been given to babies in the UK at about 0.45%. Assuming that this number has held over time, then the approximate number of people with this name would be approximately 250,000.
Meanwhile, the fictional town of Meryton in Hertfordshire would only have a few hundred people. Making a generous estimate would still only give approximately 250 people.
Constraint | Resolution Count |
---|---|
[?person :name "Elizabeth"] |
~250,000 |
[?person :lives-at "Meryton"] |
~250 |
If this is executed in the given order, then resolving the ?person
variable to people named "Elizabeth" will result in 250,000 rows. Joining this to the people who live in "Meryton" requires iterating over all 250,000 people and checking for their :lives-at
attribute, most of which will be false. Indeed, in many cases, the result of a join will have an upper bound of the count of the smaller expression result.
Executing this in reverse order will start by finding the 250 people who live in Meryton. The join operation then leads to iterating over those 250, and checking the :name
attribute for those who are named "Elizabeth". This is approximately 1000 times less work.
The difference is clearer when a constraint can be used to uniquely identify an entity. Say we want to know what Elizabeth's last name is, but this time we have her phone number:
[?person :surname ?name] [?person :phone "0844 2471 007"]
Executing these constraints in order would lead to iterating over the approximately 56,000,000 people from the first constraint, looking for every person who had the provided number. Executing in reverse order would lead to the one person registered for that number, and a single lookup to find their name.
Determining the size of a constraint resolution has its own cost. Some indices have mechanisms that can provide improvements over simple counting of a resolution, though counting in this way does not require any iterative processing like a query does.
However, the last 2 examples demonstrate constraints with resolution sizes that differ by many orders of magnitude. This sort of disparity is very common, and only requires the general magnitude of resolution sizes rather than exact counts. Since Asami has not been scaling until now, these approximations have not been necessary, but will be needed soon.
At a party we might hear: "... and my husband's sister's mother-in-law's friend's manicurist said that it's great for moisturizing your hands!"
Having missed the context, we can ask query for this advice:
[?p :speaking true] [?p :husband ?h] [?h :sister ?s] [?s :mother-in-law ?mil] [?mil :manicurist ?mn] [?mn :advice ?advice]
It's a party, so several people may be speaking at the same time, but not too many. The number of husbands, sisters and mother-in-laws will be very large. Not everyone has a manicurist, and perhaps not all of them have advice to give. For the purposes of this example, we can give database numbers that look like:
Constraint | Resolution Count |
---|---|
[?p :speaking true] |
20 |
[?p :husband ?h] |
5000 |
[?h :sister ?s] |
4000 |
[?s :mother-in-law ?mil] |
4500 |
[?mil :manicurist ?mn] |
2000 |
[?mn :advice ?advice] |
300 |
Ordering by resolution count means that we want to start with the constraint [?p :speaking true]
since it's the smallest.
The next step needs to take a different approach. If [?mn :advice ?advice]
came next, then the interim result from those 2 constraints will the result of a Cross Product with 60000 terms. This is so large because no variables are shared between the constraint terms. So we want to avoid joining to a constraint which does not share any variables with the current result.
The only constraint that shares a variable with the first one is the one that specified the :husband
property. This is an example where a join is upper bounded by the smaller expression argument, since in most schemas people may not have more than one husband. Of the 20 people speaking, say 8 of them have husbands. So joining those 2 constraints will result in an interim result of only 8 terms.
Staying with the rule of only joining with already bound variables, the next constraint will be [?h :sister ?s]
. Unlike the :husband
property, :sister
may appear multiple times for each individual, meaning that the interim answer can get larger. In our case, perhaps the result goes up to 12.
This leads to an important point: Even when choosing constraints with small resolutions that share variables, it is still possible to create interim results that are up to the size of a cross product. Predicting the size of any join will always be difficult, as it depends on the properties of the data schema, as well as the specifics of the concrete data. However, the general principles of going with the smallest first, and sticking with constraints that have already-bound variables will generally offer an efficient path.
Based on these principles, the rest of the constraints will be processed in order, leading to the final result.
However, some chains can have branches. For instance, the Mother-in-law may live in in a town that has the name Meryton:
[?mil :lives-at ?town] [?town :name "Meryton"]
This creates a choice once the ?mil
variable has been bound, and it should be chosen first, since not many people live in Meryton. However, since only a few towns can have the name "Meryton", then this constraint will have a small resolution as well, and can lead to it being the most appropriate start of the chain. So the new chain would be executed in the following order:
[?town :name "Meryton"] [?mil :lives-at ?town] [?mil :manicurist ?mn] [?mn :advice ?advice]
[?s :mother-in-law ?mil] [?h :sister ?s] [?p :husband ?h] [?p :speaking true]
While typically good, it is still possible for this approach to select a sub-optimal query. This is why the :query-plan :user
option is available.
When a query has filters, these will usually result in an interim result being reduced in size. Since the size of interim results controls the complexity of each step, this means that applying filters as soon as possible will reduce the work performed in the query. This is accomplished by applying a filter as soon as its all of its variables have been bound.
Maybe we heard that the mother-in-law wasn't yet 50. We can add that as a filter:
[?town :name "Meryton"] [?mil :lives-at ?town] [?mil :manicurist ?mn] [?mn :advice ?advice]
[?s :mother-in-law ?mil] [?h :sister ?s] [?p :husband ?h] [?p :speaking true]
[?mil :age ?age] [(< ?age 50)]
Assuming that the database only has the age information for 3000 of the people who are mother-in-laws, then the :age
constraint would come in after the :advice
constraint but before the :mother-in-law
constraint. Then as soon as this constraint has come in, the filter can be applied:
[?town :name "Meryton"] [?mil :lives-at ?town] [?mil :manicurist ?mn] [?mn :advice ?advice]
[?mil :age ?age] [(< ?age 50)]
[?s :mother-in-law ?mil] [?h :sister ?s] [?p :husband ?h] [?p :speaking true]
In general, bindings will not be needed until final projection. As a result, they do not get calculated until the end. However, bindings can also be used to define a variable which is then used by a constraint, or a filter. In this case, the binding will need to be added before those dependent expressions are added.
For instance, we can assume that a book takes a year to be published after the author has finished writing. So to find when the author finished writing, we can take 1 from the publication year:
:find ?author-name ?title ?finished
:where
[?author :wrote ?book] [?author :name ?author-name] [?book :title ?title]
[?book :published ?year] [(dec ?year) ?finished]
In this case, every person has a name, so the :name
constraint will have a large resolution. Not many people have written books, but they can write more than one book, and books are typically only written by one person, and always are published in one year. So the ordering of constraints from largest to smallest resolutions is:
[?book :title ?title] [?author :wrote ?book] [?book :published ?year] [?author :name ?author-name]
If a filter were dependent on the ?year
, then it would be inserted as soon as the ?year
was defined, but being a binding it will go on the end:
:find ?author-name ?title ?finished
:where
[?book :title ?title] [?author :wrote ?book] [?book :published ?year]
[?author :name ?author-name] [(dec ?year) ?finished]
But what if we wanted to find a book that the author started in the year they finished another book?
:find ?author-name ?title ?finished
:where
[?author :wrote ?book1] [?book1 :published ?year] [?author :name ?author-name]
[(dec ?year) ?finished] [?book2 :started ?finished] [?book2 :title ?title]
In this case, the ?finished
variable could have been bound by the :started
constraint, which would lead to the binding acting as a filter. However, this change in processing mode is a complexity that Asami does not attempt. Instead, the binding is evaluated first, meaning that the :started
constraint will operate with the ?finished
variable already bound.