Skip to content

Performance: XPath Query Evaluation

ctsims edited this page Mar 29, 2017 · 15 revisions

Pre-Requisites

Much of this writeup won't make sense without having read the Virtual Instances document and having an understanding of how those instances work.

It is also generally useful to understand what an XPath filter (predicate) expression is (like /data/weights[@scale = 'kg'])) at a high level, and specifically understanding the notion of the evaluation context (IE: That inside of the predicate expression you are referring to a relative, nested evaluation frame).

You should almost certainly read the XPath Query Evaluation Primer before reading this document.

Overview

In CommCare, data is read and manipulated with XPath.

XPath Expressions can contain functions, operators, and other element to produce a result, like

min(/data/inputs[@priority > 5][0] + 15, /data/fallback)

The majority of performance considerations in CommCare come from XPath expressions which reference data models (like /data/fallback, especially when referring to potentially large datasets, like cases or fixtures, and most specifically when referring to datasets with filters (like [@priority > 5][0]).

This document will work through the way that those queries are performed, what performance considerations exist in the code, and how our optimizations work.

Performance Challenges

There are primary performance concerns with the native process for XPath query evaluations in practice with CommCare specifically, each of which have their own relevance for optimizations

The reliance on linear scans of potentially long working nodesets while evaluating queries

Example: instance('casedb')/casedb/case[@case_id='unique_id']/case_name

requires a full linear scan of each <case> element by default.

The ease with which nested queries can produce highly non-linear runtimes

Example: Fetching a case's siblings

  count(instance('casedb')/casedb/case
                                        [
                                        index/parent 
                                        = 
                                        instance('casedb')/casedb/case
                                                                      [
                                                                      @case_id
                                                                      =
                                                                      'unique_id'
                                                                      ]
                                                                      /index/parent
                                        ]
        )

requires a full scan of the case database for every case in the database O(N^2), despite one set being a reference to a constant.

  1. Redundancy in evaluations

When optimizing for situations like nested queries there is a significant amount of redundancy in the processing, even if the final results are cached.

  1. The high cost of transactional I/O when using Virtual Instances.

Ideally large chunks of virtual instances could be read from the database at once (IE: All patient cases) rather than once at a time, but the scans are performed in a linear way.

Example: After a fairly simple index optimization, the query

instance('casedb')/case[@case_type='patient'][patient_status = 'new']

can eliminate required reads for the first predicate, but the remaining query will require a fetch for a potentially large, divided set of remaining case records.

Optimization Structure

The primary entry point for optimizations occurs in virtual instances at the root level.

When evaluating