-
Notifications
You must be signed in to change notification settings - Fork 14
Performance: XPath Query Evaluation
Much of this writeup won't make sense without having read the External Instances document and having an understanding of how those instances work.
It is also generally useful to understand what an XPath filter (predicate) expression is (like /data/weights[@scale = 'kg']
)) at a high level, and specifically understanding the notion of the evaluation context (IE: That inside of the predicate expression you are referring to a relative, nested evaluation frame).
You should almost certainly read the XPath Query Evaluation Primer before reading this document.
In CommCare, data is read and manipulated with XPath.
XPath Expressions can contain functions, operators, and other element to produce a result, like
min(/data/inputs[@priority > 5][0] + 15, /data/fallback)
The majority of performance considerations in CommCare come from XPath expressions which reference data models (like /data/fallback
, especially when referring to potentially large datasets, like cases or fixtures, and most specifically when referring to datasets with filters (like [@priority > 5][0]
).
This document will work through the way that those queries are performed, what performance considerations exist in the code, and how our optimizations work.
There are primary performance concerns with the native process for XPath query evaluations in practice with CommCare specifically, each of which have their own relevance for optimizations
Example:
instance('casedb')/casedb/case[@case_id='unique_id']/case_name
requires a full linear scan of each <case>
element by default.
Example: Fetching a case's siblings
count(instance('casedb')/casedb/case
[
index/parent
=
instance('casedb')/casedb/case
[
@case_id
=
'unique_id'
]
/index/parent
]
)
requires a full scan of the case database for every case in the database O(N^2), despite one set being a reference to a constant.
- Redundancy in evaluations
When optimizing for situations like nested queries there is a significant amount of redundancy in the processing, even if the final results are cached.
- The high cost of transactional I/O when using Virtual Instances.
Ideally large chunks of virtual instances could be read from the database at once (IE: All patient cases) rather than once at a time, but the scans are performed in a linear way.
Example: After a fairly simple index optimization, the query
instance('casedb')/case[@case_type='patient'][patient_status = 'new']
can eliminate required reads for the first predicate, but the remaining query will still require a fetch for a potentially large, divided set of remaining case records.
The primary entry point for optimizations occurs in virtual instances at the root level.
Along with other optimizations which occur along the way (opportunistic caching, etc), the majority of complexity in the performance code occurs when the evaluation context is about to expand the child step of a query against a root level element.
For example:
Current Working Set:
instance('casedb')/casedb[1]/
Remaining query to evaluate:
case[@case_type='patient'][@status='open'][is_enrolled='yes']/case_name
Before the evaluation context asks each member of the working set for all "candidate" children to evaluate for processing, it provides the element referenced by the current fully qualified reference the opportunity to instead provide a list of children (in the form of their 'multiplicity', or their child index in a fully qualified reference) which match one or more of the queries.
In this example the Evaluation Context informs the casedb
root node that it is looking for which of the node's children are named case
and match the predicate set [@case_type='patient'][@status='open']
, to which the node would reply with a set of fully qualified references
Current Working Set:
instance('casedb')/casedb[1]/case[23]/
instance('casedb')/casedb[1]/case[24]/
instance('casedb')/casedb[1]/case[30]/
instance('casedb')/casedb[1]/case[42]/
instance('casedb')/casedb[1]/case[49]/
Remaining query to evaluate:
[is_enrolled='yes']/case_name
which it matched however it can, along with any predicates which could not be optimized.
If any predicates remain to be evaluated, the evaluation context filters the current working set that is provided linearly by those predicates and then proceeds to the next step as usual.
One quite notable consideration to make regarding this structure is that the order of the predicates can be important.
Assume for instance that the engine is able to optimize both the [@case_type='patient']
and the [@status='open']
predicate in the <casedb>
for instance, but not a third predicate [unoptimized = 'yes']
.
If those predicates were provided in the expression
instance('casedb')/casedb/case[@case_type='patient'][@status='open'][unoptimized = 'yes']/value
the engine would be able to optimize the first two in a constant time lookup, and then proceed to do a linear scan over Open Patient cases to evaluate which meet the [unoptimized = 'yes']
condition.
If, however, the predicates were provided in the expression
instance('casedb')/casedb/case[unoptimized = 'yes'][@case_type='patient'][@status='open']/value
The latter two optimizations would not be effective, since a linear scan is performed in the first filter.
If the non-optimized filter appears in the middle
instance('casedb')/casedb/case[@case_type='patient'][unoptimized = 'yes'][@status='open']/value
The @case_type
optimization triggers, but @status
is processed with the same linear scan as [unoptimized = 'yes']
.
The most basic optimizations in the engine identify key/value pair matches which can be accomplished through an indexed lookup.
These lookups must be in the format
[MODEL_KEY = XPATH_VALUE_EXPRESSION]
or
[MODEL_KEY != XPATH_VALUE_EXPRESSION]
Where MODEL_KEY
is a single reference to an indexed value on the requested model, and XPATH_VALUE_EXPRESSION
is evaluated to produce a lookup into that index.
Generally speaking the performance of key/value queries should be logarithmic to the size of their dataset (as is consistent with a dB index), and each query will execute a single DB request.
casedb/case
@case_type
@status
@case_id
-
@owner_id
<- Since 2.36 index/INDEX_NAME
@external_id
-
@state
<- Only in formplayer -
@category
<- Only in formplayer`
ledgerdb/ledger
@entity-id
Indexed Fixtures
- Named indexes as provided by the
<schema>
block
Example:
instance('casedb')/casedb/case[ @case_id = /data/selected_case/id ]/output_value
In the case database, any keys other than index/*
can actually be combined into a single request instead of a request per-predicate. This both saves a DB query, and also prevents the need to read two large query response sets from the db, requiring only the ids of cases which meet both conditions to be returned.
This is only currently meaningful on expressions of the following pattern (pairing @case_type
and @status
)
instance('casedb')/casedb/case[@case_type='patient'][@status='open']
but is a primary optimization of the system, as that pattern is followed in most of the high-volume contexts used by apps
A set member optimization matches an key value to one of a set of potential matching values. This is currently implemented for Case indexes (IE: index/*
) which sometimes used in contexts where the index needs to be matched to one of many values.
This optimization follows the pattern
[selected('value_one value_two value_three', index/*)]
or
[selected(/path/to/space_separated_list, index/*)]
where the second argument is a valid key (currently only case indexes are supported), and the first argument is a space separated set of strings of which the index may be one.
There are more complex situations which are optimized by the engine, and generally speaking future additions and changes will be shifted into the Query Planner, including backporting the existing optimizations into that system.