
[Feature Request] [RFC] Grouping similar Top N Queries by Latency and Resource Usage #13357

Closed
deshsidd opened this issue Apr 23, 2024 · 17 comments
Labels: enhancement, Performance, RFC, Roadmap:Stability/Availability/Resiliency, Search:Performance, Search, Telemetry:Performance

Comments

@deshsidd
Contributor

deshsidd commented Apr 23, 2024

Is your feature request related to a problem? Please describe

Background

As part of the search query categorization initiative, we added support to compute the query shape of incoming search queries and logged the basic query shape at debug level. We also added instrumentation on the search path to collect the types of queries, types of aggregations, sort order, etc.

As part of Top N queries by latency, we implemented a priority queue-based in-memory data store, with configurable window size, on the coordinator node, designed to efficiently store the top N queries.

Problem

For Top N queries by latency, we can encounter scenarios where some (or most) of the Top N queries are duplicates of each other. Say the same dashboard query is triggered continuously and happens to be the most expensive query in terms of latency; in this scenario, the Top N queries by latency will likely all be occupied by that same query. To overcome such scenarios and to get a more detailed view of the Top N query patterns, we are proposing to implement Top N query shapes by resource usage (latency, memory, disk, etc.).

Describe the solution you'd like

Design

Query Shape

For every incoming query, we will compute the query shape. The query shape is essentially a lossy transformation of the actual query that helps de-duplicate the top N queries. Consider the sample query:
{
  "query": {
    "bool": {
      "must": [
        { "term": { "product_category": "Electronics" } },
        { "term": { "product_description": "Smartphone" } },
        { "range": { "purchase_date": { "gte": "2022-01-01", "lte": "2022-12-31" } } }
      ],
      "must_not": [
        { "exists": { "field": "delivery_date" } }
      ],
      "filter": {
        "terms": { "customer_type": ["New Customers", "Returning Customers"] }
      },
      "should": [
        { "match": { "customer_feedback": "Great product" } }
      ]
    }
  },
  "sort": [{ "product_rating": {"order": "asc"}}],
  "aggs": {
    "terms_agg": {
      "terms": { "field": "product_category" },
      "aggs": {
        "date_histogram_agg": { "date_histogram": { "field": "delivery_date", "calendar_interval": "month" } },
        "avg_agg": { "avg": { "field": "product_rating" } },
        "sum_agg": { "sum": { "field": "product_rating" } },
        "min_agg": { "min": { "field": "product_rating" } },
        "max_agg": { "max": { "field": "product_rating" } },
        "cardinality_agg": { "cardinality": { "field": "product_category" } },
        "percentile_ranks_agg": { "percentile_ranks": { "field": "product_rating", "values": [25, 50, 75] } }
      }
    }
  }
}

The proposed query shape for the above query will look something like:

bool
  must:
    range [field:purchase_date, width:12, unit: months] 
    term [product_category:"Electronics"]
    term [product_description:"Smartphone"]
  must_not:
    exists [field:delivery_date]
  should:
    match [customer_feedback:"Great product"]
  filter:
    terms [customer_type:["New Customers","Returning Customers"]]
sort:
  asc [product_rating:asc]
aggregation:
  terms [field:product_category]
    aggregation:
      avg [field:product_rating]
      cardinality [field:product_category]
      date_histogram [field:delivery_date, calendar_interval:"month"]
      max [field:product_rating]
      min [field:product_rating]
      percentile_ranks [field:product_rating]
      sum [field:product_rating]
pipeline_aggregation:

We are capturing the shape of the query, sort, aggregations and pipeline aggregations. We capture the sort order, field names, field values and range width (the difference between the upper bound and lower bound of the range). The range width is captured to help us de-duplicate queries with the same range size but different windows. For example, the same dashboard query that is run continuously but for different time ranges.
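To make the range-width idea concrete, here is a minimal illustrative sketch (a hypothetical helper, not the actual OpenSearch implementation) of reducing a date range clause to its shape token, assuming an inclusive upper bound:

import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical helper: keep the field name and the window width (in months)
// instead of the absolute bounds, so the same dashboard query issued over
// different time windows collapses to a single shape token.
class RangeShape {
    static String shape(String field, LocalDate gte, LocalDate lte) {
        long widthMonths = ChronoUnit.MONTHS.between(gte, lte.plusDays(1));
        return "range [field:" + field + ", width:" + widthMonths + ", unit: months]";
    }

    public static void main(String[] args) {
        // Prints: range [field:purchase_date, width:12, unit: months]
        System.out.println(shape("purchase_date",
                LocalDate.parse("2022-01-01"), LocalDate.parse("2022-12-31")));
    }
}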

Normalize Query Shape

We need to normalize the query shape to make sure that two identical queries always map to exactly the same query shape, regardless of the ordering of the clauses, the ordering of the queries within the clauses, the ordering of the aggregations, etc. The following normalizations will be required (a small sketch follows the list):

  1. Ordering of Query Shape Major Components: The query shape as shown above is divided into 4 distinct parts. The ordering of these parts will be as shown above: QueryBuilder tree, Sort, Aggregation tree followed by Pipeline aggregations.
  2. Ordering of Bool Clauses: The order of the four boolean clauses will follow a lexicographic sequence as follows: filter, must, must_not followed by should.
  3. Order of Sub-Queries within a Clause: The order of sub-queries within a clause will follow a lexicographic sequence.
  4. Order of Aggregation types: The order of Aggregation types will follow a lexicographic sequence.
  5. Order of Pipeline aggregation types: The order of Pipeline aggregation types will follow a lexicographic sequence.
  6. Order of Sorts: Sorts will be lexicographically ordered based on sort order.
  7. Order of same query type, aggregation type, sort order: If multiple queries, aggregation types, pipeline aggregations or sorts have the same type we will sort based on the real field name (not obfuscated field name).
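A minimal sketch of rules 2 and 3 (illustrative names and plain string tokens; the real implementation would work on the QueryBuilder tree rather than strings), with bool clauses emitted in the fixed order filter, must, must_not, should, and the sub-query shape tokens inside each clause sorted lexicographically:

import java.util.*;

// Illustrative normalization of a bool query's shape: fixed clause order and
// lexicographically sorted sub-query tokens, so two equivalent queries written
// with different clause/sub-query ordering produce the same shape string.
class BoolShapeNormalizer {
    private static final List<String> CLAUSE_ORDER = List.of("filter", "must", "must_not", "should");

    static String normalize(Map<String, List<String>> clauseToSubShapes) {
        StringBuilder sb = new StringBuilder("bool\n");
        for (String clause : CLAUSE_ORDER) {
            List<String> subShapes = clauseToSubShapes.getOrDefault(clause, List.of());
            if (subShapes.isEmpty()) continue;
            sb.append("  ").append(clause).append(":\n");
            List<String> sorted = new ArrayList<>(subShapes);
            Collections.sort(sorted);   // lexicographic order of sub-query shape tokens
            for (String token : sorted) {
                sb.append("    ").append(token).append("\n");
            }
        }
        return sb.toString();
    }
}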

Top N Query Shapes

We will create a new processor in the Query Insights framework to keep track of the Top N query shapes in the current window, similar to what we do for Top N Queries by latency.

For every window, we will keep track of a mapping from the query shape hash to (count_of_queries, total_latency_across_all_queries). This will enable us to calculate the average latency for the queries. We will also have a priority queue to keep track of the Top N queries for the window.
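As a rough illustration (hypothetical names, not the actual processor code), the per-window bookkeeping described above could look like:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-window aggregate: query-shape hash mapped to a running
// (count, total latency) pair; the average latency is derived on read instead
// of being stored and rescaled on every update.
class WindowAggregates {
    record Agg(long count, long totalLatencyNanos) {
        double avgLatencyNanos() {
            return count == 0 ? 0 : (double) totalLatencyNanos / count;
        }
    }

    private final Map<String, Agg> byShapeHash = new ConcurrentHashMap<>();

    void record(String shapeHash, long latencyNanos) {
        byShapeHash.merge(shapeHash, new Agg(1, latencyNanos),
                (a, b) -> new Agg(a.count() + b.count(), a.totalLatencyNanos() + b.totalLatencyNanos()));
    }

    void clear() {
        byShapeHash.clear();   // reset at the end of every window
    }
}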

The algorithm will be as follows:

(algorithm flow diagram "query_shape_insights" in the original issue)

After every window we will clear the hash to (count, latency) mapping and Priority Queue. We will export the Top N query shape data for the previous window to a local index similar to what we are aiming to do for the Top N queries by latency.

APIs

We will be extending the Top N queries by resource usage APIs as part of query shape insights.
Existing APIs:
# get top n queries by latency in the last window
GET /insights/top_queries?type=latency
# get top n queries by cpu in the last window
GET /insights/top_queries?type=cpu
Additions:
# get top n query shapes by latency in the last window
GET /insights/top_queries?type=latency&group_by=shape
# get top n query shapes by cpu in the last window
GET /insights/top_queries?type=cpu&group_by=shape

Related component

Search

Describe alternatives you've considered

  1. We can use the actual query to de-duplicate the Top N queries by latency. However, this will not group queries that are identical except for the range bounds (the same range width but a different time window), for example the same dashboard query that is run continuously but for different time ranges. It will also not group otherwise identical queries whose clauses are ordered differently.
  2. Use a more generic query shape that does not contain the field names, field values and range width information. Similar to:
bool
  must:
    term
    terms
  filter:
    constant_score
      filter:
        range
  should:
    bool
      must:
        match
      must_not:
        regex

This approach might end up grouping queries with varying latency characteristics into the same group. An example is the same query on two different fields exhibiting different latency due to the cardinality difference between the two fields (one high cardinality, the other low cardinality).

Please let me know your thoughts on the above proposal!

@deshsidd deshsidd added the enhancement and untriaged labels Apr 23, 2024
@deshsidd deshsidd self-assigned this Apr 23, 2024
@github-actions github-actions bot added the Search label Apr 23, 2024
@deshsidd deshsidd removed their assignment Apr 23, 2024
@peternied peternied added the RFC, Performance, Search:Performance and Telemetry:Performance labels and removed the untriaged label Apr 24, 2024
@peternied
Member

[Triage - attendees 1 2 3 4 5]
@deshsidd Thanks for creating this RFC, we are looking forward to seeing how this progresses.

@ansjcy
Member

ansjcy commented Apr 24, 2024

@deshsidd Thanks for creating the RFC! This feature will be a super useful extension to the current top n queries feature!

Title: Top N Query Shapes by Resource Usage

How would you feel about renaming the title/scope of the RFC to something like “Grouping similar top queries together in the top n queries feature”? Because based on the RFC, that is what we are really trying to do. Also, the grouping will apply to more than just “resource usage”; it can cover more metric types as well (like latency and volume). Arguably, “latency” shouldn’t be considered a type of “resource”.

The proposed query shape will look something similar to..

It would be great if you can also provide the original query for this example shape, so that we can compare them side by side.

Normalize Query Shape ..

As part of the normalization and shape calculation, how would you feel about using hashing instead of the magic field_x: value_x string? Because field_x, value_x doesn’t actually mean anything, and we will need extra effort to maintain them to avoid collisions. Would hashing be a more generic solution for our use cases?

We will create a new processor in the Query Insights framework to keep track of the Top N query shapes in the current window, similar to what we do for Top N Queries by latency..
… We will also have a priority queue to keep track of the Top N queries for the window.

Seems like you are suggesting we create a separate Priority Queue to keep track of the shapes along with the existing top n queries? IMO that would be a duplicated effort and will potentially have a negative impact on memory usage. What do you think about “extending the current processor and top n features to add a grouping mechanism” instead of making a whole new processor? In that case the user can either use the “vanilla” top n queries, or they can opt in to do grouping by certain attributes - and shape can simply be one of the attributes.

APIs: We will be extending the Top N queries by resource usage APIs as part of query shape insights.

We don’t need to propose new APIs IMO. The user just needs to make certain configuration changes. For example, if they want to use “top n queries with grouping”, they can do something like

PUT _cluster/settings -H 'Content-Type: application/json' -d'
{
    "persistent" : {
        "search.insights.top_queries.latency.enabled" : "true",
        "search.insights.top_queries.latency.window_size" : "1m",
        "search.insights.top_queries.latency.top_n_size" : 20,
        "search.insights.top_queries.latency.group_similar_queries" : true,
        // We only have "group by shape" now
        "search.insights.top_queries.latency.group_by" : "shape",
        // OR this if we don't want to expose the internal "shape" concept
        // "search.insights.top_queries.latency.group_granularity" : "low",
    }
}'

@jainankitk
Collaborator

Thanks @deshsidd for writing this proposal. Most of it looks good, and storing hashes in memory is really nifty. Should prevent unnecessary memory consumption. Some considerations below:

Order of Sorts: Sorts will be lexicographically ordered based on sort order.

We need to be careful while doing the normalization, as the execution/results might differ depending on the sort order. Perhaps we can compare the execution latency/resource consumption of the current query with the average metrics for the shape so far (count_of_queries, total_latency_across_all_queries). Ideally it should not be too far off for it to be a variation of the same shape?

GET /insights/top_queries?type=latency?group_queries=true

I would probably call these parameters dimension for latency, cpu, memory, and have an enumeration for grouping. level (values being query, similar, shape) is something used for aggregation within node stats, although there might be a better name as well. Also, for better readability: type=latency ? &group_queries=true breaks my flow a bit! ;)

We don’t need to propose new APIs IMO.

Definitely +1 to that

The user just needs to make certain configuration changes. For example, if they want to use “top n queries with grouping”, they can do something like

When it comes to settings/configurations, I like the principle of less is more. We should cut the number of settings at least in half (from 6 -> 3) in my opinion.

@msfroh
Collaborator

msfroh commented Apr 25, 2024

I'd like to recommend that grouping similar queries should be the default behavior. Otherwise the TopN feature is likely to capture "the same" query for all N, since most search requests are programmatically generated. If there's one dashboard visualization powered by an expensive query (with relative time range), some N refreshes of that visualization will dominate the top N.

Maybe we can have a setting to turn it off, but it's almost always going to be a good thing.

It's probably also worth noting that "query shape" in this context really just refers to deduplicating with a (slightly) relaxed definition of equality. If we come up with the right definition (that reasonably matches what a human would consider "equivalent queries"), we might not even need a setting.

I suppose the counterexample would be cases where a normally fast query is slow because of a stop-the-world garbage collection, which would probably only show up in the non-deduped output. As a countercounterargument, though, capturing that query doesn't really give us any insight (since it's not a slow query in general -- it's just an unlucky victim of garbage collection and timing).

@ansjcy
Member

ansjcy commented Apr 25, 2024

If we come up with the right definition (that reasonably matches what a human would consider "equivalent queries"), we might not even need a setting.

I really love this idea.

We could make deduplicating default to true in top n queries, and start with a setting to turn it off until we have a reasonable behavior defined for equivalent queries. @deshsidd what do you think?

@deshsidd deshsidd changed the title [Feature Request] [RFC] Top N Query Shapes by Resource Usage [Feature Request] [RFC] Top N Query Shapes by Latency and Resource Usage Apr 25, 2024
@deshsidd
Contributor Author

Thanks for the feedback @peternied @ansjcy @jainankitk @msfroh

@ansjcy
Renamed the RFC to Top N Query Shapes by Latency and Resource Usage.

It would be great if you can also provide the original query for this example shape, so that we can compare them side by side

Agreed. Provided the query and the subsequent query shape in the edit above.

As part of the normalization and shape calculation, how would you feel about using hashing instead of the magic field_x: value_x string? Because field_x, value_x doesn’t actually mean anything, and we will need extra effort to maintain them to avoid collisions. Would hashing be a more generic solution for our use cases?

These are not hashed/obfuscated fields but are sample field names. Changed the query and query shape to real field names above to avoid any confusion. There will be no hashing or obfuscating of field names as part of the above proposal.

@msfroh @ansjcy @jainankitk

Seems like you are suggesting we create a separate Priority Queue to keep track of the shapes along with the existing top n queries?

I'd like to recommend that grouping similar queries should be the default behavior. Otherwise the TopN feature is likely to capture "the same" query for all N, since most search requests are programmatically generated. If there's one dashboard visualization powered by an expensive query (with relative time range), some N refreshes of that visualization will dominate the top N.

When it comes to settings/configurations, I like the principle of less is more. We should cut the number of settings at least in half (from 6 -> 3) in my opinion.

Looks like there are 2 different approaches we could potentially take here based on the above discussions.

1. Compute either Top N queries or Top N query groups but not both.

Have a configuration such as:
"search.insights.top_queries.latency.group_similar_queries" : true, which will be enabled by default.
When this configuration is enabled we only compute the query groups and not the individual top N queries. Note that each group will contain a sample query as an example query representing the queries in the group.

Pros:

  • Saves memory and resource usage, since we will compute either Top N query groups or Top N queries, but not both.
  • Possibly more intuitive to use

Cons:

  • Users don't have the option to see both the real Top N queries and the Top N query groups; only one sample query is provided per group.

2. Compute both the Top N queries and Top N query groups and provide the option in the API to view both.

# get top n query shapes by latency in the last window
GET /insights/top_queries?type=latency&group_by=shape

# get top n queries by latency in the last window
GET /insights/top_queries?type=latency

Pros:

  • Users have the option to see the real Top N queries and Top N query groups.

Cons:

  • Higher memory and resource usage since we will compute both Top N query groups and Top N queries.

Based on the above discussions, it looks like we are leaning towards option 1. That said, if the memory and resource usage overhead of option 2 is not much higher, it would provide more flexibility to the user (we might need to perform a POC to verify).

@deshsidd
Contributor Author

@jainankitk

We need to be careful while doing the normalization, as the execution/results might differ depending on the sort order. Perhaps we can compare the execution latency/resource consumption of the current query with the average metrics for the shape so far (count_of_queries, total_latency_across_all_queries). Ideally it should not be too far off for it to be a variation of the same shape?

In that case it might be beneficial to skip the normalization for multiple sorts and leave them as they are in the original query, since the order of execution matters.

@msfroh
Collaborator

msfroh commented Apr 29, 2024

I want to call out that we should not use the word shape anywhere in any APIs.

It's a word that I used when trying to address the problem of collecting anonymous statistics about queries (where we don't care about the specific fields/values being queried, but do care about the types of queries and the complex query structure).

In this case, we have the queries themselves -- we're just trying to avoid the situation where the Top N queries are all the "same" query.

@deshsidd
Contributor Author

deshsidd commented Apr 29, 2024

Agreed. Keeping the above in mind I am proposing the following.

We can have a dynamic configuration for the grouping of Top N queries:

"search.insights.top_queries.latency.group_by" : "similarity"

The setting type is an enum, and the values could be none, similarity and, in the future, user_id (#13341). similarity could be the default option.

We do not need any changes in the Top N queries APIs:

# get top n queries by latency in the last window
GET /insights/top_queries?type=latency
# get top n queries by cpu in the last window
GET /insights/top_queries?type=cpu

The output of these APIs will vary based on the search.insights.top_queries.latency.group_by setting configured.

Let me know your thoughts!

@deshsidd deshsidd changed the title [Feature Request] [RFC] Top N Query Shapes by Latency and Resource Usage [Feature Request] [RFC] Grouping similar Top N Queries by Latency and Resource Usage Apr 30, 2024
@andrross andrross added the Roadmap:Stability/Availability/Resiliency label May 14, 2024
@github-project-automation github-project-automation bot moved this to Planned work items in OpenSearch Roadmap May 31, 2024
@deshsidd
Contributor Author

deshsidd commented Jun 5, 2024

Made good progress implementing the above feature. However, during the implementation I came across some issues that need to be thought through further.

1. Algorithm to group Top queries does not work as expected

In the algorithm we used to obtain the Top N query groups, we missed the fact that we need to update the average latency for a group that has already been encountered and exists in the Priority Queue. If the average latency for such a group decreases, it may no longer belong in the Top N query groups, which can leave us with incorrect results.

Solutions:
a. Use Total latency rather than average latency
Since total latency never decreases in value, the above issue can be overcome.
Con: This might pollute the Top N query groups with frequently occurring queries rather than query groups exhibiting high average latency.

b. Get Top N groups at the end of the window from the Hashmap
Rather than continuously calculating the Top N groups, compute them at the end of the window.
Con: As the cardinality of the groups increases this approach becomes very inefficient.

c. Get Top N groups on a best effort basis (Proposed approach; a small sketch follows)
Keep the metric as average latency. If the average latency of a query group currently in the Top N (Priority Queue) decreases, check if it is lower than the lowest entry in the PQ. If so, delete the group from the Top N groups.
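A minimal sketch of option (c), assuming the Top N groups live in a min-heap keyed on average latency (GroupRecord and the method names are illustrative, not the actual implementation):

import java.util.PriorityQueue;

// Illustrative only: pull the group out before mutating its sort key, update the
// running totals, then re-insert on a best-effort basis.
class BestEffortTopN {
    static class GroupRecord {
        long count;
        long totalLatency;
        double avgLatency() { return count == 0 ? 0 : (double) totalLatency / count; }
    }

    // topN is assumed to be a min-heap on average latency, e.g.
    // new PriorityQueue<>(Comparator.comparingDouble(GroupRecord::avgLatency))
    static void update(PriorityQueue<GroupRecord> topN, int n, GroupRecord group, long latency) {
        topN.remove(group);                  // remove before mutating the sort key
        group.count++;
        group.totalLatency += latency;
        GroupRecord smallest = topN.peek();
        if (topN.size() < n || smallest == null || group.avgLatency() >= smallest.avgLatency()) {
            topN.add(group);                 // still qualifies, or there is room
        }
        // else: the group is dropped on a best-effort basis, freeing a slot that a
        // much cheaper group could later claim (the correctness concern raised later
        // in the thread).
    }
}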

2. Current Implementation of Top N does not support updating Top N Records in the Priority Queue

Currently, we have separate Priority Queues for tracking Top N requests by latency, cpu and memory.
We also have an initial LinkedBlockingQueue that buffers the records that will make it to any of the Top N priority queues (a record makes it into the LinkedBlockingQueue if the respective Priority Queue has not reached its capacity, or if the metric for the current record is larger than the smallest entry in the PQ). This LinkedBlockingQueue is drained every 5 seconds to add the records to the respective priority queues for latency, cpu and memory.

If Top N grouping is enabled, we will need to change the current approach and add ALL the records to the LinkedBlockingQueue, then compute the Top N groups while updating the respective Priority Queues. This will enable us to update the average latency for a previously encountered group that already exists in the Top N query groups priority queue.

The new approach might cause additional overhead to the system when grouping is enabled since we are now processing ALL the records for each MetricType (Latency, CPU, memory). Previously, we were filtering out most of the records and retaining only the Top records as described above.
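A rough sketch of this buffering change (hypothetical names, not the actual QueryInsightsService code): the pre-filter is bypassed whenever grouping is enabled, so that a record belonging to an already-tracked group still reaches the grouping stage and updates its running average.

import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only; not the actual query-insights code.
class RecordBuffer {
    static class SearchQueryRecord { }   // stand-in for the real record type

    private final LinkedBlockingQueue<SearchQueryRecord> buffer = new LinkedBlockingQueue<>();
    private volatile boolean groupingEnabled;

    void offer(SearchQueryRecord record) {
        // Without grouping we can drop records that cannot enter any top-N heap;
        // with grouping every record matters, because it may change the average
        // of a group that is already being tracked.
        if (groupingEnabled || mightEnterTopN(record)) {
            buffer.offer(record);        // drained periodically into the per-metric heaps
        }
    }

    private boolean mightEnterTopN(SearchQueryRecord record) {
        // placeholder for the existing "PQ not full OR metric beats the smallest entry" check
        return true;
    }
}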

3. Grouping by individual metric type might not be possible

In the RFC, we proposed the following configurations that could be used to group Top N queries:
"search.insights.top_queries.latency.group_by" : "similarity".

However, based on the new approach proposed in 2, we will have to enable group_by for all metric types with the current implementation:
"search.insights.top_queries.group_by" : "similarity"

This is because if group_by is enabled, we are proposing to add all query records to the LinkedBlockingQueue and not only the records that will make it to the Priority Queue. We need to do this for ALL metric types and cannot do this for individual metric types based on the current implementation.

Another aspect to consider here is how we will display these groups on the query insights dashboard that is currently being worked on.
We will have to either display a table with individual queries as rows and latency, cpu, memory as columns, OR a table with query groups as rows and average latency, average cpu and average memory as columns. This might not be possible if we only enable grouping for a subset of metric types (say only for latency).

Need to get consensus on the above before we continue with the implementation and carefully think about some of these aspects. @msfroh @jainankitk @getsaurabh02 @ansjcy

@getsaurabh02
Member

  1. Top N Queries by latency implementation #11904 does not support updating Top N Records in the Priority Queue

@deshsidd can you please explain what the issue here is, specific to grouping? I am assuming the comparison logic for the priority queue should follow the same logic as Top N, assuming the mapping of groups to individual queries is maintained outside in a separate map.

@ansjcy
Member

ansjcy commented Jun 10, 2024

The challenging part of grouping is that the key we use to sort items in the priority queue can change constantly, whereas in the normal top n queries the keys always remain the same. I did some thinking around this and I believe those challenges can still be solved with good time complexity.

We need to tweak the data structures a little bit to add another priority queue and a hashmap (so we have 1 min heap, 1 max heap and 1 hashmap).

  • the min heap is the same as the priority queue we use for the top n queries. It stores the top n shapes.
  • the hash map stores the group-to-similar-query mapping; it should also store a flag to say which heap this group already exists in.
  • the max heap stores the remaining groups that haven't made it into the top n heap.

Refer to the following flow diagram for when / how to update each data structure
(flow diagram images attached in the original comment)

Also, if we continue using a normal priority queue, updating/deleting can take O(n) in the worst case, which is not acceptable. But I remember an indexed priority queue can support O(log n) update and delete. We should use that for this purpose. Please refer to https://algs4.cs.princeton.edu/24pq/IndexMinPQ.java.html.

For the "Current implementation does not support updating Top N Records in the Priority Queue" issue, the above approach should work; also, the logic to bypass adding records to the buffer can be disabled when grouping is enabled. We should do some benchmarking tests to understand the storage impact and the best buffer size to set in this case. IMO this won't be a blocker for this feature.

As for "Grouping by individual metric type might not be possible", again I don't think this is a hard blocker for this feature. I am OK with enabling grouping for all metric types at the same time, and "search.insights.top_queries.group_by" : "similarity" looks good to me, because I don't see any strong use cases where users would want to enable grouping only for certain metrics.

@deshsidd
Contributor Author

deshsidd commented Jun 10, 2024

Thanks for the responses. Based on my initial comment please note the following:

  1. Algorithm to group Top queries does not work as expected (Resolved - Get Top N groups on a best effort basis)
    Keep the metric as average latency. If the average latency of a query group currently in the Top N (Priority Queue) decreases, check if it is lower than the lowest entry in the PQ. If so, delete the group from the Top N groups.

  2. Current implementation does not support updating Top N Records in the Priority Queue (In Progress)
    The main concern I had here was that the current buffer + priority queue approach will not work as-is. The logic to bypass adding records to the buffer can be disabled if grouping is enabled and this will work. However, my concern was the performance and storage impact this will have. Is there any way to continue using this optimization with grouping?

My proposal would be to go with the approach of adding all records to the buffer (disabling the bypass) when grouping is enabled, and to do some benchmarking to figure out the performance and storage impact.

  3. Grouping by individual metric type might not be possible (Resolved - Can use one setting for grouping spanning all metric types)

@ansjcy, regarding the approach you mentioned, it seems like an overly complex approach that might not be required and might still be less performant.
The "2. if it exists in the max heap" step will be O(n) time where n is the total number of groups in the window, right? The approach might be much slower as n increases. We could use a set to track this and avoid the O(n), but that would further increase the storage.
Furthermore, the "try add/update in the all groups heap" step will be O(n) with a regular heap and O(log n) with an indexed priority queue, which will add to the processing time as n increases.

What do you think of the approach mentioned above in 1. where we update the average latency of the record in the priority queue (PQ will have max 100 elements and this should be quick). If the average latency of a query group currently in the Top N (Priority Queue) decreases, check if it is lower than the lowest entry in the PQ. If so, delete the group from the Top N groups else keep it in the Priority Queue. Can also consider using an indexed PQ for better time complexity here.
Let me know your thoughts!

@ansjcy
Member

ansjcy commented Jun 10, 2024

overly complex approach that might not be required and might still be less performant

In terms of time complexity, I don't think the proposed algorithm will be less performant: since you have a hashmap anyway to keep all the group info, checking whether a group exists in the max heap only takes O(1) time. Also, if you use an indexed priority queue, delete and update will be O(log n), which is close to constant time. I agree there will be a storage impact, but again, we have to do some benchmarking tests to understand how many groups there could be.

I'm also okay with the "best effort" approach, but correctness will be a big concern. Is it possible to test how accurate this method would be? If it is 90%+ it should be good.

@deshsidd
Contributor Author

Right, it looks like the time complexity will be pretty similar in both approaches (the min/max heap approach and the best-effort approach). The storage impact will be higher in the two-heap approach. However, we are already storing all the group record objects in the hashmap; storing them in the PQ essentially just adds pointers to the original objects and does not take up much more space.

I was also thinking about the correctness of the "best effort" approach; it might be hard to measure, and we can definitely get an incorrect Top N with this approach.

Example: Consider a scenario where we have N = 5 and 5 query groups in the current priority queue (with average latencies 5, 4, 3, 2, 1). A new query record comes in that belongs to one of those 5 groups (the one with average latency 2) and drops its average to 0.9. Since the new average latency is lower than the lowest remaining entry in the PQ, we remove that group (the PQ is now 5, 4, 3, 1); the evicted value was 0.9. A new query group with an average latency of 0.001 then arrives and makes it into the PQ since a slot is available (the PQ is now 5, 4, 3, 1, 0.001).
This can leave us in a situation where query groups with very low latency, which should never make it into the PQ, do make it.

Keeping the above in mind, and given that the time and space complexity are very similar in both approaches, let's go with the two-heap approach and do some benchmarking to figure out the performance and storage impact.

@deshsidd
Contributor Author

deshsidd commented Aug 5, 2024

Inspired by the algorithm proposed above, here is the algorithm used in the PR:

Data Structures

     /**
     * Map storing groupingId to Tuple containing Aggregate search query record and boolean.
     * SearchQueryRecord: Aggregate search query record to store the aggregate of a metric type based on the dimension.
     * Example: Average latency. This query record will be used to store the average latency for multiple query records
     * in this case.
     * boolean: True if the aggregate record is in the Top N queries priority query (min heap) and False if the aggregate
     * record is in the Max Heap
     */
    private Map<String, Tuple<SearchQueryRecord, Boolean>> groupIdToAggSearchQueryRecord;
    /**
     * Min heap to keep track of the Top N query groups and is passed from TopQueriesService as the topQueriesStore
     */
    private PriorityQueue<SearchQueryRecord> minHeapTopQueriesStore;
    /**
     * The Max heap is an overflow data structure used to manage records that exceed the capacity of the Min heap.
     * It stores all records not included in the Top N query results. When the aggregate measurement for one of these
     * records is updated and it now qualifies as part of the Top N, the record is moved from the Max heap to the Min heap,
     * and the records are rearranged accordingly.
     */
    private PriorityQueue<SearchQueryRecord> maxHeapQueryStore;

Algorithm

=> New group added to the grouping service:

  1. Add to min PQ and overflow records to max PQ (if the number of records in the min PQ exceeds the configured size N)

=> Existing group being updated in the grouping service

  1. If aggregate record present in min PQ:
    • remove the record from the min PQ
    • update the aggregate record (aggregate measurement could increase or decrease)
    • If max PQ contains elements, add to max PQ and promote any records to min PQ (if agg measurement increases, record is pushed to max PQ and promoted again to min PQ | if agg measurement decreases record is pushed to max PQ and another record promoted to min PQ)
    • If max PQ is empty, add to min PQ (if agg measurement increases or decreases, record is pushed to min PQ since max PQ is empty)
  2. If aggregate record present in max PQ:
    • remove the record from the max PQ
    • update the aggregate record (aggregate measurement could increase or decrease)
    • If min PQ is full, add to min PQ and overflow any records to max PQ (if agg measurement increases, record is pushed to min PQ and a record overflows from min PQ to max PQ | if agg measurement decreases record is pushed to min PQ and same record overflows from min PQ to max PQ)
    • else, add to max PQ and promote any records to min PQ (if agg measurement increases, record is pushed to max PQ and promoted to min PQ, if agg measurement decreases, record is pushed to max PQ and another record promoted to min PQ)

We can do some benchmarking to check the performance here, and update this to an indexed PQ or put limits on the number of groups to improve the performance if required.
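For concreteness, here is a simplified, self-contained sketch of the two-heap bookkeeping described above (illustrative class names, not the actual query-insights classes; it uses java.util.PriorityQueue, so remove() is O(n), and the indexed-PQ optimization mentioned earlier is omitted):

import java.util.*;

// Simplified sketch of the min-heap / max-heap flow; illustrative names only.
class TwoHeapTopGroups {
    static class GroupRecord {
        final String groupId;
        long count;
        long totalLatency;
        GroupRecord(String groupId) { this.groupId = groupId; }
        double avgLatency() { return count == 0 ? 0 : (double) totalLatency / count; }
    }

    private final int topN;
    private final Map<String, GroupRecord> groups = new HashMap<>();
    // min heap: current top N groups, smallest average at the head
    private final PriorityQueue<GroupRecord> minHeap =
            new PriorityQueue<>(Comparator.comparingDouble(GroupRecord::avgLatency));
    // max heap: overflow store for every other group, largest average at the head
    private final PriorityQueue<GroupRecord> maxHeap =
            new PriorityQueue<>(Comparator.comparingDouble(GroupRecord::avgLatency).reversed());

    TwoHeapTopGroups(int topN) { this.topN = topN; }

    void onQueryRecord(String groupId, long latency) {
        GroupRecord rec = groups.computeIfAbsent(groupId, GroupRecord::new);
        // Take the record out of whichever heap holds it before mutating its sort key
        // (O(n) with java.util.PriorityQueue, O(log n) with an indexed PQ).
        if (!minHeap.remove(rec)) {
            maxHeap.remove(rec);
        }
        rec.count++;
        rec.totalLatency += latency;   // the average can go up or down
        minHeap.add(rec);
        rebalance();
    }

    private void rebalance() {
        // Overflow: demote the smallest top-N entry when the min heap exceeds capacity.
        while (minHeap.size() > topN) {
            maxHeap.add(minHeap.poll());
        }
        // Promote: swap while an overflow group now beats the smallest top-N group.
        while (!maxHeap.isEmpty() && !minHeap.isEmpty()
                && maxHeap.peek().avgLatency() > minHeap.peek().avgLatency()) {
            GroupRecord promoted = maxHeap.poll();
            maxHeap.add(minHeap.poll());
            minHeap.add(promoted);
        }
    }

    List<GroupRecord> topGroups() {
        return new ArrayList<>(minHeap);
    }
}

With java.util.PriorityQueue the remove() calls dominate; swapping both heaps for indexed priority queues, as suggested earlier in the thread, would bring updates down to O(log n).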

@deshsidd
Contributor Author

deshsidd commented Sep 5, 2024

Resolved in the following: opensearch-project/query-insights#66

@deshsidd deshsidd closed this as completed Sep 5, 2024
@github-project-automation github-project-automation bot moved this from Now(This Quarter) to ✅ Done in Search Project Board Sep 5, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Performance Roadmap Sep 5, 2024