Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Treat emptyPredicates as always true, instead of pruning them #17

Closed
wants to merge 2 commits into from

Conversation

acerbusace
Copy link
Collaborator

@acerbusace acerbusace commented Nov 1, 2024

Context

Currently empty predicates are ignored. Within an AND context (the normal context), that works correctly. Within an OR context (under an anyOf) it does not work correctly: empty predicates should essentially be treated as true:

  • predicate AND emptyPredicate should be like predicate AND true which reduces to predicate (and thus is equivalent to ignoring the empty predicate).

  • predicate OR emptyPredicate should be like predicate OR true which just reduces to true

Notably, filter: A and filter: {anyOf: [A]} should behave the same but right now they do not when A is an empty predicate – filter: emptyPredicate returns all results whereas filter: {anyOf: [emptyPredicate]} returns no results.

Change

This PR as mentioned above, looks to change filter: {field: nil} and filter: {field: {}} to evaluate to true. This in turns changes how anyOf behaves to be more like an OR block, as shown below:

  • filter: {anyOf: [emptyPredicate]} will evaluate to true, instead of false
  • filter: {anyOf: [emptyPredicate, {name: {equalToAnyOf: ["test"]}}]} will evaluate to true, instead of false

At the same time the behaviour of (filter: {anyOf: []}) matching no documents is still preserved.

Because of the way emptyPredicates are treated now, the following changes occured for how not behaves:

  • filter: {not: {field: nil}} will evaluate to false, instead of being ignored

@acerbusace acerbusace force-pushed the block/alexp/anyofchangesV4 branch 2 times, most recently from f6a4783 to f528b93 Compare November 1, 2024 18:19
@@ -369,6 +369,24 @@ class GraphQL

expect(parts).to target_all_widget_indices
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, the following returns all indices, which is different then how we treat anyOf when filtering a query. Is this okay, or should we make changes to make sure everything is more aligned?

              parts = search_index_expression_parts_for({"any_of" => []})

@acerbusace acerbusace force-pushed the block/alexp/anyofchangesV4 branch from f528b93 to 6512d2c Compare November 1, 2024 18:46
sub_filter = build_bool_hash do |inner_node|
process_filter_hash(inner_node, expression, field_path)
end

return unless sub_filter
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this, because in theory we should never get a nil sub filter, as all emptyPredicates are treated as match_all and code coverage doesn't like a hanging branch. This however causes problems with steep, because build_bool_hash can return nil and we are trying to access [:bool]... So to fix this, I set sub_filter type to untyped. Pointing this out, incase anyone has any other paths I can take here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my suggestion about :not and :empty precedence will solve this in a more robust way.

@acerbusace acerbusace force-pushed the block/alexp/anyofchangesV4 branch 2 times, most recently from 5d3701c to c23dc6b Compare November 1, 2024 19:10
Copy link
Collaborator

@myronmarston myronmarston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @acerbusace. This is really great to see! Left some suggestions and thoughts.

@@ -102,13 +197,20 @@ def filters_on_sub_fields?(expression)
end
end

def process_empty_or_nil_expression(bool_node, field_or_op)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just call this process_empty_expression? I consider both field: nil and field: {} to be empty expressions (after all, identify_node_type returns :empty for both cases).

@@ -57,11 +59,104 @@ def to_s

private

def reduce_query(bool_node)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an invariant I expect from this: I expect it to produce behaviorally equivalent queries to what would be produced if queries were not reduced. That is: for every possible query, we should get the same results from Elasticsearch whether or not the query has been reduced.

However, if you toggle this reduction (e.g. by changing the first line to return bool_node) I found it causes some integration and acceptance specs to fail. (I expect it to break some unit specs since those assert on the query body itself, and I'm not concerned with those). Specifically, it breaks these:

rspec ./elasticgraph-graphql/spec/integration/elastic_graph/graphql/datastore_query/sub_aggregations_spec.rb:196 # ElasticGraph::GraphQL::DatastoreQuery sub-aggregations ignores empty filters
rspec ./elasticgraph-graphql/spec/acceptance/search_spec.rb:521 # ElasticGraph::GraphQL--search with a snake_case schema `list` filtering behavior supports filtering on scalar lists, nested object lists, and embedded object lists
rspec ./elasticgraph-graphql/spec/acceptance/search_spec.rb:521 # ElasticGraph::GraphQL--search with a camelCase schema, alternate derived type naming, and enum value overrides `list` filtering behavior supports filtering on scalar lists, nested

Let's start with the acceptance specs. Here they are:

results = query_teams_with(filter: {current_players_nested: {any_satisfy: {name: {equal_to_any_of: nil}}}})
expect(results).to eq [{"id" => "t1"}, {"id" => "t2"}, {"id" => "t3"}, {"id" => "t4"}]

Without the reduction, only t1, t2, and t3 are returned--so the act of reducing causes it to match t4 when it would not normally.

I dug into why and I think it points to yet another bug with how we handle empty predicates! In this case it's a bug under any_satisfy. In this case, the filter is matching teams which have any players that satisfy name: {equal_to_any_of: nil}. While the name: {equal_to_any_of: nil} criteria translates into a match_all, that's a match_all against players, not teams. The outer criteria (the any_satisfy) is against teams and can only be satisfied by teams that have at least one that player that satisfies the inner player criteria. A team which has no current players (as is the case with t4) cannot satisfy the filter and should be omitted, I think.

That said, this might make for a little bit of a usability problem for ElasticGraph--if a query with has an optional filter on a nested field, and a client does not want to filter on that field and omits, should we then still filter on whether or not documents have any nested records? Something we should think more about.

The failing integration spec is related to this case as it also involves a nested field. in this case, it changes the query in such a way that there is extra meta returned by the aggregation. I think it might be ok but we should think about it some more.

If we do want strict enforcement of the "query reduction should not impact query behavior" invariant, we may want to actually setup something in our test so that integration and acceptance specs run each query with and without the reduction and assert on getting the same results.

More generally, this code is very, very complicated and I don't yet understand it. Since it exists purely for optimization (not correctness) we need to be 100% sure it's correct; otherwise we should not do the reduction.

One option to consider: remove the reduction from this PR (so that this PR is just focused on the behavioral change we want to make) and then do a follow up PR that adds the reduction. Once nice property of that is it would make it more visible/obvious in the diff what the reduction is doing as it would show up in changes to the unit specs.

Thoughts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to the idea of having an enforcement on the "query reduction should not impact query behaviour". I guess first, we will need to figure out what to do with the any_satisfy bug you caught (as my reduction, was mostly to match the current behaviour).

Comment on lines 158 to 159
# This is an "empty" filter predicate and we can treat it as `true`.
process_empty_or_nil_expression(bool_node, field_or_op)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# This is an "empty" filter predicate and we can treat it as `true`.
process_empty_or_nil_expression(bool_node, field_or_op)
process_empty_or_nil_expression(bool_node, field_or_op)

No need for the comment. It was there before to explain why there was nothing in the when :empty branch but now there's something.

@@ -102,13 +197,20 @@ def filters_on_sub_fields?(expression)
end
end

def process_empty_or_nil_expression(bool_node, field_or_op)
if field_or_op == schema_names.not
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a code smell that the handling for not is spread out--it's handled primarily by process_not_expression but also by this method. we want each different type of filtering node to be handled in one place and not have its responsibility spread out.

I understand why you did this, though: the current FilterNodeInterpreter implementation does not allow the :empty case to just handle the empty case and allow the :not case to just handle the not case, because it returns :empty for both field: nil and not: nil. The current implementation looks like this:

def identify_node_type(field_or_op, sub_expression)
  return :empty if sub_expression.nil? || sub_expression == {}
  return :not if field_or_op == schema_names.not
  return :list_any_filter if field_or_op == schema_names.any_satisfy
  return :all_of if field_or_op == schema_names.all_of
  return :any_of if field_or_op == schema_names.any_of
  return :operator if filter_operators.key?(field_or_op)
  return :list_count if field_or_op == LIST_COUNTS_FIELD
  return :sub_field if sub_expression.is_a?(::Hash)
  :unknown
end

Notice that it detects :empty based on sub_expression before it has the chance to detect :not based on field_or_op. That robs process_not_expression from the chance to negate the inner empty expression.

I'm thinking that we should always do the detection from the "outside in". That is, field_or_op is always on the outside and we should first detect a node type based on that. Only once we've exhausted what we can know based on field_or_op should we detect the type based on the sub_expression. I think the implementation should instead be something like this:

def identify_node_type(field_or_op, sub_expression)
  identify_by_field_or_op(field_or_op) || identify_by_sub_expr(sub_expression) || :unknown
end

private

def identify_by_field_or_op(field_or_op)
  return :not if field_or_op == schema_names.not
  return :list_any_filter if field_or_op == schema_names.any_satisfy
  return :all_of if field_or_op == schema_names.all_of
  return :any_of if field_or_op == schema_names.any_of
  return :operator if filter_operators.key?(field_or_op)
  return :list_count if field_or_op == LIST_COUNTS_FIELD
end

def identify_by_sub_expr(sub_expression)
  return :empty if sub_expression.nil? || sub_expression == {}
  return :sub_field if sub_expression.is_a?(::Hash)
end

Now, that change on its own probably breaks some stuff (the various process_* methods haven't had to handle empty or nil sub-expressions before because the :empty branch took precedence) but it could be worthy prepatory refactoring in its own right to first fix identify_node_type in its own PR, and then leverage it here so that this can just return match_all and process_not_expression can simply negate it.

Thoughts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. Originally, I thought of changing the order of when :not and :empty is returned, so this can be handled in the corresponding method. However, I think this is a better approach, where each method is handling this individually.

sub_filter = build_bool_hash do |inner_node|
process_filter_hash(inner_node, expression, field_path)
end

return unless sub_filter
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my suggestion about :not and :empty precedence will solve this in a more robust way.

@acerbusace acerbusace force-pushed the block/alexp/anyofchangesV4 branch from 99b37dd to 6b280b7 Compare November 4, 2024 15:26
@myronmarston myronmarston deleted the block/alexp/anyofchangesV4 branch November 12, 2024 07:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants