Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Additional aggregation in search request is changing results #14000

Closed
dennisoelkers opened this issue Jun 5, 2024 · 12 comments · Fixed by #14015
Closed

[BUG] Additional aggregation in search request is changing results #14000

dennisoelkers opened this issue Jun 5, 2024 · 12 comments · Fixed by #14015
Assignees
Labels
bug Something isn't working Search:Aggregations v2.15.0 Issues and PRs related to version 2.15.0

Comments

@dennisoelkers
Copy link

dennisoelkers commented Jun 5, 2024

Describe the bug

We have discovered erratic behavior for aggregation results introduced in v2.14.0 (double-checked against v2.13.0, results are okay there). We have the following query (the actual queries we are using make more sense, this is the minimal query we have found to trigger the issue):

{
	"size": 0,
	"aggs": {
		"http_method": {
			"terms": {
				"script": {
					"source": "doc['http_method']",
					"lang": "painless"
				}
			},
			"aggs": {
				"filter_aggregation": {
					"filters": {
						"filters": [
							{
								"bool": {}
							}
						]
					}
				}
			}
		},
		"timestamp-min": {
			"min": {
				"field": "timestamp"
			}
		}
	}
}

The underlying data in the index are test fixtures from out integration tests (check here) - format is that each document in $.documents[*].document has a index metadata object containing the index name/document id/document type and data containing the actual document.

Starting with v2.14.0, the result looks like this:

{
	"took": 102,
	"timed_out": false,
	"_shards": {
		"total": 1,
		"successful": 1,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": {
			"value": 1000,
			"relation": "eq"
		},
		"max_score": null,
		"hits": []
	},
	"aggregations": {
		"timestamp-min": {
			"value": 1.664201539998E12,
			"value_as_string": "2022-09-26 14:12:19.998"
		},
		"http_method": {
			"meta": {},
			"doc_count_error_upper_bound": 0,
			"sum_other_doc_count": 0,
			"buckets": [
				{
					"key": "GET",
					"doc_count": 860,
					"filter_aggregation": {
						"meta": {},
						"buckets": [
							{
								"doc_count": 1720
							}
						]
					}
				},
				{
					"key": "DELETE",
					"doc_count": 52,
					"filter_aggregation": {
						"meta": {},
						"buckets": [
							{
								"doc_count": 104
							}
						]
					}
				},
				{
					"key": "POST",
					"doc_count": 45,
					"filter_aggregation": {
						"meta": {},
						"buckets": [
							{
								"doc_count": 90
							}
						]
					}
				},
				{
					"key": "PUT",
					"doc_count": 43,
					"filter_aggregation": {
						"meta": {},
						"buckets": [
							{
								"doc_count": 86
							}
						]
					}
				}
			]
		}
	}
}

What is noteworthy is that for each of the buckets generated by the filter aggregations, the document count of the bucket that goes into the filter aggregation is half of the document count that comes out of the filter aggregation:

[...]
			"buckets": [
				{
					"key": "GET",
					"doc_count": 860,
					"filter_aggregation": {
						"meta": {},
						"buckets": [
							{
								"doc_count": 1720 <--- more documents in leaf bucket than in parent bucket
							}
						]
					}
				},
[...]

The initial document count (e.g. 860 for the GET bucket) is correct, the one in the bucket produced by the filter aggregation is not. When adding e.g. a sum metric aggregation as a leaf, the value returned is double of what is expected, so it looks like for some reason the actual documents are duplicated during the aggregation.

What is very strange is:

  • the exact same query returns correct results in all versions before v2.14.0
  • when removing the timestamp-min metric, the results are okay
  • when using a terms aggregation instead of a scripted terms aggregation (i.e. replacing "script": {...} with "field": "http_method", the results are okay

Related component

Search:Aggregations

To Reproduce

  1. Use this query:
{
	"size": 0,
	"aggs": {
		"http_method": {
			"terms": {
				"script": {
					"source": "doc['http_method']",
					"lang": "painless"
				}
			},
			"aggs": {
				"filter_aggregation": {
					"filters": {
						"filters": [
							{
								"bool": {}
							}
						]
					}
				}
			}
		},
		"timestamp-min": {
			"min": {
				"field": "timestamp"
			}
		}
	}
}
  1. Execute query against sample data

Expected behavior

Expecting a result like this:

{
	"took": 4,
	"timed_out": false,
	"_shards": {
		"total": 1,
		"successful": 1,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": {
			"value": 1000,
			"relation": "eq"
		},
		"max_score": null,
		"hits": []
	},
	"aggregations": {
                "timestamp-min": {
			"value": 1.664201539998E12,
			"value_as_string": "2022-09-26 14:12:19.998"
		},
		"http_method": {
			"meta": {},
			"doc_count_error_upper_bound": 0,
			"sum_other_doc_count": 0,
			"buckets": [
				{
					"key": "GET",
					"doc_count": 860,
					"filter_aggregation": {
						"meta": {},
						"buckets": [
							{
								"doc_count": 860
							}
						]
					}
				},
				{
					"key": "DELETE",
					"doc_count": 52,
					"filter_aggregation": {
						"meta": {},
						"buckets": [
							{
								"doc_count": 52
							}
						]
					}
				},
				{
					"key": "POST",
					"doc_count": 45,
					"filter_aggregation": {
						"meta": {},
						"buckets": [
							{
								"doc_count": 45
							}
						]
					}
				},
				{
					"key": "PUT",
					"doc_count": 43,
					"filter_aggregation": {
						"meta": {},
						"buckets": [
							{
								"doc_count": 43
							}
						]
					}
				}
			]
		}
	}
}

Additional Details

Plugins
No plugins installed

Screenshots
-- Not applicable --

Host/Environment (please complete the following information):

  • OS: [e.g. iOS]
  • Version [e.g. 22]

Additional context
Add any other context about the problem here.

@dennisoelkers
Copy link
Author

A git bisect turned up this commit to be the first introducing the issue: 795d868

@dennisoelkers
Copy link
Author

@jed326: Do you have an idea why that change produces these results?

@jed326
Copy link
Collaborator

jed326 commented Jun 5, 2024

Thanks for reporting this @dennisoelkers. Are you seeing the issue with the filter aggregation only with concurrent segment search enabled or also with it disabled as well?

@peternied
Copy link
Member

[Triage - attendees 1 2 3 4 5 6 7]
@dennisoelkers Thanks for creating this issue

@dennisoelkers
Copy link
Author

@jed326: I am using default options. It looks like concurrent segment search is disabled by default?

From querying /<index>/_settings?include_defaults=true:

        "search": {
          "concurrent_segment_search": {
            "enabled": "false"
          },

@jed326
Copy link
Collaborator

jed326 commented Jun 5, 2024

@dennisoelkers yes concurrent segment search is disabled by default. Could you quickly check if the same issue persists with concurrent search enabled?

These are both very interesting:

when removing the timestamp-min metric, the results are okay
when using a terms aggregation instead of a scripted terms aggregation (i.e. replacing "script": {...} with "field": "http_method", the results are okay

At a glance it looks like there might be something going wrong with agg scripting (there's a known issue for composite aggs w/ scripting #12947).

What's strange is the commit 795d868 ideally shouldn't affect the non-concurrent search path at all but perhaps that is not the case.

Let me take a deeper look at this today. If you have any pointers on how to set up the backing data for reproduction that would help a lot (for example if you have a java integration test that is indexing all this data).

@dennisoelkers
Copy link
Author

@jed326: Will try with CSS enabled, just want to drop you some aid for setting up a reproducible scenario. For bisecting, I was using the fixtures from one of our tests. In order to ingest it into an instance running locally, I wrote this script. It is assuming that OS is running on port 9200, with no TLS and no authentication (I just used ./gradlew run to start OS). The index it writes to is graylog_0.

@dennisoelkers
Copy link
Author

@jed326: Enabling CSS on the index and rerunning the query also returns results which are off.

@jed326
Copy link
Collaborator

jed326 commented Jun 5, 2024

@dennisoelkers Thanks for checking and thanks for the reproduction aids. I will try to get to the bottom of this today!

@jed326 jed326 self-assigned this Jun 5, 2024
@jed326
Copy link
Collaborator

jed326 commented Jun 5, 2024

Just confirming I checked out the parent commit 42f00ba and don't see the problem with either concurrent search enabled.
On commit 795d868 the same query from above shows this on both concurrent search enabled and disabled.

{
                    "key": "GET",
                    "doc_count": 860,
                    "filter_aggregation": {
                        "meta": {},
                        "buckets": [
                            {
                                "doc_count": 927
                            }
                        ]
                    }
                },
                {
                    "key": "DELETE",
                    "doc_count": 52,
                    "filter_aggregation": {
                        "meta": {},
                        "buckets": [
                            {
                                "doc_count": 60
                            }
                        ]
                    }
                },
                {
                    "key": "POST",
                    "doc_count": 45,
                    "filter_aggregation": {
                        "meta": {},
                        "buckets": [
                            {
                                "doc_count": 51
                            }
                        ]
                    }
                },
                {
                    "key": "PUT",
                    "doc_count": 43,
                    "filter_aggregation": {
                        "meta": {},
                        "buckets": [
                            {
                                "doc_count": 45
                            }
                        ]
                    }
                }

Tried re-indexing and see slightly different results so it's probably segment layout dependent.

@sohami sohami added the v2.15.0 Issues and PRs related to version 2.15.0 label Jun 5, 2024
@jed326
Copy link
Collaborator

jed326 commented Jun 5, 2024

@dennisoelkers I was able to get to the bottom of this today. In 618782d we added some logic to unwrap the MultiBucketCollector to get the saved InternalAggregation objects here:

} else if (currentCollector instanceof BucketCollector) {
((BucketCollector) currentCollector).postCollection();
// Perform build aggregation during post collection
if (currentCollector instanceof Aggregator) {
((Aggregator) currentCollector).buildTopLevel();
} else if (currentCollector instanceof MultiBucketCollector) {
for (Collector innerCollector : ((MultiBucketCollector) currentCollector).getCollectors()) {
collectors.offer(innerCollector);
}
}
}

However there's a bug here where if a MultiBucketCollector is present then postCollection is going to get called twice for the collectors in the MultiBucketCollector -- once as a part of MultiBucketCollector::postCollection and then again when it's unwrapped to the individual collector.

This manifests as a problem in the BestBucketsDeferringCollector used for deferred collections as finishLeaf() will subsequently get called twice and we will get 2 deferred entries for the last leaf.

private void finishLeaf() {
if (context != null) {
assert docDeltasBuilder != null && bucketsBuilder != null;
entries.add(new Entry(context, docDeltasBuilder.build(), bucketsBuilder.build()));
}
}


To specifically address the "strange" points you shared above:

when removing the timestamp-min metric, the results are okay

The issue at hand is not specific to the min aggregation, if any additional agg is at the same level you will see this issue as the MultiBucketCollector is used when there are multiple aggregations at the same level. For example you will see the same problem in the following query:

{
	"size": 0,
	"aggs": {
		"http_method": {
			"terms": {
				"script": {
					"source": "doc['http_method']",
					"lang": "painless"
				}
			},
			"aggs": {
				"filter_aggregation": {
					"filters": {
						"filters": [
							{
								"bool": {}
							}
						]
					}
				}
			}
		},
		"other-terms": {
			"terms": {
				"field": "timestamp"
			}
		}
	}
}

when using a terms aggregation instead of a scripted terms aggregation (i.e. replacing "script": {...} with "field": "http_method", the results are okay

The issue with the double counted last leaf is specific to when the deferring collector is used. By default the collect mode when a script is present is breadth_first, which uses the deferring collector, while the default for the regular terms agg on the http_method is depth_first. If you manually set the collect mode you can still get the correct results with the painless script like so:

{
	"size": 0,
	"aggs": {
		"http_method": {
			"terms": {
				"script": {
					"source": "doc['http_method']",
					"lang": "painless"
				},
                "collect_mode" : "depth_first"
			},
			"aggs": {
				"filter_aggregation": {
					"filters": {
						"filters": [
							{
								"bool": {}
							}
						]
					}
				}
			}
		},
		"timestamp-min": {
			"min": {
				"field": "timestamp"
			}
		}
	}
}

@jed326
Copy link
Collaborator

jed326 commented Jun 5, 2024

Working on getting a PR out to address this bug, I think we should be able to get it into the 2.15 release. On 2.14 you can manually set the collect_mode to depth_first for now as a workaround (this might have some performance regressions on fields with very high cardinality though).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Search:Aggregations v2.15.0 Issues and PRs related to version 2.15.0
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants