[Proposal] Entity List Extensions

ctsims edited this page Nov 14, 2017 · 3 revisions

This document outlines a proposed extension to the entity list construct.

It is not implemented, and should not be presumed to be supported in any future version.

Motivation

There are many circumstances in which an app needs to display a single choice to a user which depends on the existence of 0-N items, rather than on the existence of a single item.

Imagine the basic construct of "Display all mothers with children who need an upcoming vaccination", where each person (mother/child) is a case type, and all child cases are children of the parent mother case.

In CommCare currently, displaying this would require multiple transforms to be written, for both filtering and display. For example, selecting all child cases which require a vaccination may be expressed as

instance('casedb')/casedb/case[@case_type = 'child'][next_immunization_due < today()]

But getting a list of parent cases which meet this filter requires a significantly more convoluted transform

instance('casedb')/casedb/case[@case_type = 'parent'][
   count(
      instance('casedb')/casedb/case[@case_type = 'child']
                                    [index/parent = current()/@case_id]
                                    [next_immunization_due < today()]
   ) > 0]

This transform also requires highly complex static analysis to bulk fetch child cases in order to be performant. Without that analysis, the query requires a significant number of database lookups to track down each child case.

After the initial filter, any future displays of information which depend on the "Child Set" will also require consistent duplication of the filter set (type, index/parent, next_immunization_due), which is both error prone and difficult to optimize.

Design

Grouping

To improve the simplicity, expressiveness, and comprehensibility of this process, the entity list screen could rely on a reduction of the data set based on matching keys.

The incoming data production to the entity list would remain a core nodeset

input[]

which is iterated over a single time to produce the transformed entity objects.

The entity list would support a new internal construct, reduce.

<reduce group-by=''>
    ...
</reduce>

reduce requires an input XPath Expression (group-by) which will produce a scalar value when executed with an item in the nodeset as its context (the same as a or variable).

If a reduce is present, the entity list will only display one item per unique value in the reduced set. For the rest of this document we will refer to the resultant value as the reduction_id.

Example: With an input set case[] and a group function index/parent

nodeset:
instance('casedb')/casedb/case[@case_type = 'child'][filter = 'value']

=> input[]
instance('casedb')/casedb/case[34]
instance('casedb')/casedb/case[50]
instance('casedb')/casedb/case[52]
instance('casedb')/casedb/case[90]

reduce():
instance('casedb')/casedb/case[34]/index/parent => parent_one
instance('casedb')/casedb/case[50]/index/parent => parent_two
instance('casedb')/casedb/case[52]/index/parent => parent_one
instance('casedb')/casedb/case[90]/index/parent => parent_one

The resulting entity list would contain two values, one for each reduction_id. By default it should be assumed that the first matching result will be what is displayed in the list, so with no other changes the context for the two items used for the <field> mapping transforms would be

instance('casedb')/casedb/case[34]
instance('casedb')/casedb/case[50]
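
The grouping behavior above can be sketched in Python. This is only an illustration of the proposed semantics, not an implementation: the tuples standing in for case references and the `group_first` helper are hypothetical names introduced here.

```python
# Hypothetical stand-in for the input[] nodeset: (case reference, index/parent value).
input_nodeset = [
    ("case[34]", "parent_one"),
    ("case[50]", "parent_two"),
    ("case[52]", "parent_one"),
    ("case[90]", "parent_one"),
]


def group_first(nodeset, group_by):
    """Return {reduction_id: first item that produced it}, in encounter order."""
    entities = {}
    for item in nodeset:
        reduction_id = group_by(item)
        # Later items with an already-seen reduction_id do not replace the first.
        entities.setdefault(reduction_id, item)
    return entities


entities = group_first(input_nodeset, group_by=lambda item: item[1])
for reduction_id, (case_ref, _) in entities.items():
    print(reduction_id, "->", case_ref)
```

The sketch keeps one entry per reduction_id and uses the first matching item as the display context, which is the default the proposal assumes.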

Reduction

In addition to allowing the list to essentially be filtered, the reduction block will also allow for fold functions to be defined which can perform a vector -> scalar transform for the items being grouped.

These functions will set a variable value in the entity node

    <variable_name base="" fold=""/>

Where base is an xpath function which will be executed to initialize the variable value for the first nodeset result which matches its reduction_id, and fold is an xpath function which will be executed against later nodeset results matching that reduction_id.

Each variable is then made available to use inside of its fold function.

Two examples are provided here:

<reduce group-by='index/parent'>
    <total_amount base="./amount" fold="$total_amount + ./amount"/>
    <first_invoice_due base="./invoice_due" fold="min(date($first_invoice_due), date(./invoice_due))"/>
</reduce>

In the first example, $total_amount is provided for a given reduction_id by adding to an ongoing accumulating counter. The counter is initialized by the base expression from the first matching item's amount, which prevents the need for a check within the fold function to init the value.

In the second example, the first_invoice_due value is set to the smallest invoice_due date value among the cases matching the reduction_id.
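
A minimal Python sketch of the base/fold semantics described above, assuming each variable is a (base, fold) pair and that fold receives only its own running value. The sample invoices and the `reduce_nodeset` helper are hypothetical and exist only for illustration.

```python
from datetime import date

# Hypothetical invoice cases; the field names follow the example above.
invoices = [
    {"parent": "parent_one", "amount": 40, "invoice_due": date(2017, 12, 1)},
    {"parent": "parent_two", "amount": 100, "invoice_due": date(2017, 11, 20)},
    {"parent": "parent_one", "amount": 25, "invoice_due": date(2017, 11, 15)},
]

# name -> (base, fold). base sees only the item; fold also sees the
# variable's own running value, mirroring $total_amount in the fold XPath.
variables = {
    "total_amount": (
        lambda item: item["amount"],
        lambda acc, item: acc + item["amount"],
    ),
    "first_invoice_due": (
        lambda item: item["invoice_due"],
        lambda acc, item: min(acc, item["invoice_due"]),
    ),
}


def reduce_nodeset(nodeset, group_by, variables):
    """Fold each variable per reduction_id in one pass over the nodeset."""
    state = {}
    for item in nodeset:
        reduction_id = group_by(item)
        if reduction_id not in state:
            # First item for this reduction_id: initialize with base.
            state[reduction_id] = {
                name: base(item) for name, (base, _) in variables.items()
            }
        else:
            # Later items: apply fold to each variable's own running value.
            current = state[reduction_id]
            for name, (_, fold) in variables.items():
                current[name] = fold(current[name], item)
    return state


folded = reduce_nodeset(invoices, lambda item: item["parent"], variables)
```

Because base handles the first item, neither fold needs to check whether its variable has been initialized yet.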

Full Example

Imagine a set of cases exists of type invoice, where each invoice is recorded against a potentially large list of locations, which are represented by a fixture.

A reasonable task would be to show a list of locations with a count of invoices open against them where only locations with open invoices are included, such as

Location | Open Invoices | Total Amount
---------------------------------------
Central  |             3 |         $120
West     |             1 |         $100
North    |             2 |          $80

In CommCare currently, this would require a setup like

nodeset:
instance('locations')/locations/location[@type = 'client'][count(
    instance('casedb')/casedb/case[@case_type = 'invoice'][location_assigned = current()/@id]
                                                                ) > 0]

Location:
./name

Open Invoices:
count(instance('casedb')/casedb/case[@case_type = 'invoice'][location_assigned = current()/@id])

Total Amount:
sum(instance('casedb')/casedb/case[@case_type = 'invoice'][location_assigned = current()/@id]/amount)

This is quite repetitive (and thus easy to make mistakes in), and also very difficult to optimize, since the app needs to filter the invoices assigned to each location in 3 different places. Each of those filters requires a full walk of the invoice set.

In the new paradigm this could be expressed as

nodeset:
instance('casedb')/casedb/case[@case_type = 'invoice']

reduction step:
   group-by: ./location_assigned
   folds:
     count: count + 1
     total: total + ./amount

Location:
instance('locations')/locations/location[@id = $reduction_id]/name

Open Invoices:
$count

Total Amount:
$total

This process only requires a single linear walk over the set of invoice cases.
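
To make the linear walk concrete, here is a rough Python sketch of the reduced version of this example. The location ids, invoice amounts, and the `build_rows` helper are hypothetical, and since the outline above omits base expressions, the sketch assumes count starts at 1 and total starts at the first invoice's amount.

```python
# Hypothetical data standing in for the locations fixture and the invoice cases.
locations = {"loc_c": "Central", "loc_w": "West", "loc_n": "North", "loc_e": "East"}
invoices = [
    {"location_assigned": "loc_c", "amount": 50},
    {"location_assigned": "loc_w", "amount": 100},
    {"location_assigned": "loc_c", "amount": 30},
    {"location_assigned": "loc_n", "amount": 60},
    {"location_assigned": "loc_c", "amount": 40},
    {"location_assigned": "loc_n", "amount": 20},
]


def build_rows(invoices, locations):
    """One linear walk over the invoices; no per-location rescans."""
    folded = {}
    for invoice in invoices:
        reduction_id = invoice["location_assigned"]
        if reduction_id not in folded:
            # Assumed base values: count starts at 1, total at the first amount.
            folded[reduction_id] = {"count": 1, "total": invoice["amount"]}
        else:
            folded[reduction_id]["count"] += 1
            folded[reduction_id]["total"] += invoice["amount"]
    # Field mapping: resolve the location name from the reduction_id.
    return [
        (locations[reduction_id], variables["count"], variables["total"])
        for reduction_id, variables in folded.items()
    ]


rows = build_rows(invoices, locations)
for name, count, total in rows:
    print(f"{name:<8} | {count:>13} | {total:>12}")
```

Locations with no invoices (East in this sample data) never appear, because the nodeset is the invoice set rather than the location fixture.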

Implementation Notes/Usage Details

Since the full list will need to be evaluated at least one time before each folded variable is set, this format will require at least two passes over the full list of values: one pass over the full input[] nodeset to produce the folded variables, and then another over the elements of the nodeset which represent each unique reduction_id.

The <reduce> block variables will only be available within their own fold function, or to the entity item itself ( and tags) after the reduce step has completed. They cannot be referenced in their intermediate state by the other variable definitions during the reduce step.

Since it is very likely that there will be a transform from the reduction_id to another set of models (Say, mapping a list of child cases to a set of parent cases), after the reduce step completes, it is essential for platform implementations to register $reduction_id as a model set.

Commentary

CS: Currently the most important issue here is that we're moving to a world where we have better static assumptions about what data is represented by selections, and this potentially gets us further away. In HQ for instance we'd likely define the nodeset in the short term as a case select over children, but the output case would end up being a parent case, which would be super confusing, and likely needs to be accounted for somewhere in the data structure.

CS: We probably need to allow for a filter over the reduced set transform in this, but it's unclear precisely where it is best applied. Probably after reduction, but potentially would be relevant before :/

Performance

In order to maintain a reasonable performance, the variables being used in evaluation will need to be made available as a flavored model set.

IE: in the same way that current()/@case_id is identified as a model set which can have a transform provided against it to ensure that both child and parent cases only do a single db scan currently, it could be desirable for the reduced nodeset to be calculated for all values first (especially for indices) and have that set be retained as a case model set.
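
One way to picture the model-set idea is as a bulk lookup: collect every reduction_id first, fetch the matching cases in a single scan, and resolve each entity from that cached set. The sketch below is an assumption about how that could look, not a description of the platform's internals; `CASE_DB`, `bulk_fetch`, and the case names are hypothetical.

```python
# Hypothetical in-memory stand-in for the case database.
CASE_DB = {
    "parent_one": {"case_id": "parent_one", "name": "Asha"},
    "parent_two": {"case_id": "parent_two", "name": "Bina"},
    "parent_three": {"case_id": "parent_three", "name": "Chau"},
}
scan_count = 0


def bulk_fetch(case_ids):
    """Fetch every requested case in a single scan of the store."""
    global scan_count
    scan_count += 1
    wanted = set(case_ids)
    return {cid: case for cid, case in CASE_DB.items() if cid in wanted}


# reduction_ids produced by the reduce step, one per displayed entity.
reduction_ids = ["parent_one", "parent_two"]

# Register the whole set up front, then resolve each entity from the cache.
model_set = bulk_fetch(reduction_ids)
names = [model_set[rid]["name"] for rid in reduction_ids]
```

The alternative, one lookup per displayed entity, would grow the number of scans with the size of the list.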