-
Notifications
You must be signed in to change notification settings - Fork 14
Performance: XPath Query Evaluation
Much of this writeup won't make sense without having read the Virtual Instances document and having an understanding of how those instances work.
It is also generally useful to understand what an XPath filter (predicate) expression is (like /data/weights[@scale = 'kg']
)) at a high level, and specifically understanding the notion of the evaluation context (IE: That inside of the predicate expression you are referring to a relative, nested evaluation frame).
You should almost certainly read the XPath Query Evaluation Primer before reading this document.
In CommCare, data is read and manipulated with XPath.
XPath Expressions can contain functions, operators, and other element to produce a result, like
min(/data/inputs[@priority > 5][0] + 15, /data/fallback)
The majority of performance considerations in CommCare come from XPath expressions which reference data models (like /data/fallback
, especially when referring to potentially large datasets, like cases or fixtures, and most specifically when referring to datasets with filters (like [@priority > 5][0]
).
This document will work through the way that those queries are performed, what performance considerations exist in the code, and how our optimizations work.
There are primary performance concerns with the native process for XPath query evaluations in practice with CommCare specifically, each of which have their own relevance for optimizations
Example:
instance('casedb')/casedb/case[@case_id='unique_id']/case_name
requires a full linear scan of each <case>
element by default.
Example: Fetching a case's siblings
count(instance('casedb')/casedb/case
[
index/parent
=
instance('casedb')/casedb/case
[
@case_id
=
'unique_id'
]
/index/parent
]
)
requires a full scan of the case database for every case in the database O(N^2), despite one set being a reference to a constant.
- Redundancy in evaluations
When optimizing for situations like nested queries there is a significant amount of redundancy in the processing, even if the final results are cached.
- The high cost of transactional I/O when using Virtual Instances.
Ideally large chunks of virtual instances could be read from the database at once (IE: All patient cases) rather than once at a time, but the scans are performed in a linear way.
Example: After a fairly simple index optimization, the query
instance('casedb')/case[@case_type='patient'][patient_status = 'new']
can eliminate required reads for the first predicate, but the remaining query will require a fetch for a potentially large, divided set of remaining case records.
The primary entry point for optimizations occurs in virtual instances at the root level.
When evaluating