[Draft] Query Insight Plugin with Top Queries feature #11506

Closed
wants to merge 7 commits into from

Conversation

ansjcy
Member

@ansjcy ansjcy commented Dec 7, 2023

Description

(parent RFC: #11186)
This PR implements the basic query insight framework and the "top N queries by latency" feature using this generic framework. More specifically, this PR includes:

  • The Top N queries service, listener, and related transport and REST endpoints.
  • Asynchronous processing and exporting in the query insight service, which handles the data for query insight features. In this first iteration, the processor handles query latency data, enqueues it to the aggregator, and asynchronously exports the aggregated data to an OpenSearch index. Other query insight features could reuse this framework in the future to avoid adding blocking logic to the core search path.
  • Unit tests for the new features and APIs.
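
The split between a cheap enqueue on the search path and a separate aggregation step can be illustrated with a minimal Python sketch. It is not the plugin's implementation: `QueryRecord`, `TopQueriesSketch`, `on_request_end`, and `drain` are hypothetical names, the bounded min-heap is only one plausible way to keep the top N entries, and window rotation and index export are left out. In a real service `drain` would run on a background worker rather than being called inline.

```python
import heapq
import queue
from dataclasses import dataclass, field


@dataclass(order=True)
class QueryRecord:
    latency: int                        # the only field used for ordering
    source: str = field(compare=False)  # query body, ignored when comparing


class TopQueriesSketch:
    """Hypothetical stand-in for the processor and aggregator described above."""

    def __init__(self, top_n_size):
        self.top_n_size = top_n_size
        self._pending = queue.Queue()   # records waiting to be aggregated
        self._heap = []                 # min-heap: root is the fastest kept query

    def on_request_end(self, record):
        # Search path: only enqueue, never aggregate or export here.
        self._pending.put_nowait(record)

    def drain(self):
        # Aggregation step: move pending records into the bounded heap.
        while True:
            try:
                record = self._pending.get_nowait()
            except queue.Empty:
                return
            if len(self._heap) < self.top_n_size:
                heapq.heappush(self._heap, record)
            elif record > self._heap[0]:
                # New record is slower than the fastest kept one: swap it in.
                heapq.heapreplace(self._heap, record)

    def top_queries(self):
        # Slowest first.
        return sorted(self._heap, reverse=True)


service = TopQueriesSketch(top_n_size=5)
for latency in (6, 45, 3, 18, 9, 20):
    service.on_request_end(QueryRecord(latency, '{"size":20}'))
service.drain()
top_latencies = [record.latency for record in service.top_queries()]
print(top_latencies)
```

With a heap bounded at `top_n_size`, each record costs at most O(log N) to aggregate, and the record with latency 3 is dropped because five slower queries are already held.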

How to use the API:

  1. First enable the top N queries insight feature
curl -X PUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d'
{
    "persistent" : {
        "search.insights.top_queries.latency.enabled" : "true",
        "search.insights.top_queries.latency.window_size" : "60s",
        "search.insights.top_queries.latency.top_n_size" : 5
    }
}'
  2. Insert documents for searching
curl -X POST "localhost:9200/my-index-0/_doc/?pretty" -H 'Content-Type: application/json' -d'
{
  "@timestamp": "2023-12-01T13:12:00",
  "message": "this is my document",
  "user": {
    "id": "ansjcy"
  }
}'
  3. Do some search operations
curl -X GET "localhost:9200/my-index-0/_search?size=20&pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "message": "document 2"
          }
        },
        {
          "match": {
            "user.id": "cyji"
          }
        }
      ]
    }
  }
}'
curl -X GET "localhost:9200/my-index-0/_search?size=20&pretty" -H 'Content-Type: application/json' -d '{}'
...
  4. Get the top N queries by latency in the last minute
curl -X GET "localhost:9200/_insights/top_queries?type=latency&pretty"

returns

{
  "top_queries" : [
    {
      "timestamp" : 1706746069075,
      "phase_latency_map" : {
        "expand" : 0,
        "query" : 36,
        "fetch" : 2
      },
      "node_id" : "PsQkEubhT9S-ePsh906t-w",
      "total_shards" : 1,
      "search_type" : "query_then_fetch",
      "source" : "{\"size\":20,\"query\":{\"bool\":{\"must\":[{\"match_phrase\":{\"message\":{\"query\":\"document 2\",\"slop\":0,\"zero_terms_query\":\"NONE\",\"boost\":1.0}}},{\"match\":{\"user.id\":{\"query\":\"cyji\",\"operator\":\"OR\",\"prefix_length\":0,\"max_expansions\":50,\"fuzzy_transpositions\":true,\"lenient\":false,\"zero_terms_query\":\"NONE\",\"auto_generate_synonyms_phrase_query\":true,\"boost\":1.0}}}],\"adjust_pure_negative\":true,\"boost\":1.0}}}",
      "indices" : [
        "my-index-0"
      ],
      "latency" : 45
    },
    {
      "timestamp" : 1706746069271,
      "total_shards" : 1,
      "search_type" : "query_then_fetch",
      "source" : "{\"size\":20}",
      "phase_latency_map" : {
        "expand" : 0,
        "query" : 19,
        "fetch" : 0
      },
      "indices" : [
        "my-index-0"
      ],
      "node_id" : "IITrLUUXROCQehphz75Jsw",
      "latency" : 20
    },
    {
      "timestamp" : 1706746069135,
      "total_shards" : 1,
      "search_type" : "query_then_fetch",
      "source" : "{\"size\":20}",
      "phase_latency_map" : {
        "expand" : 0,
        "query" : 10,
        "fetch" : 2
      },
      "indices" : [
        "my-index-0"
      ],
      "node_id" : "IITrLUUXROCQehphz75Jsw",
      "latency" : 18
    },
    {
      "timestamp" : 1706746069351,
      "total_shards" : 1,
      "search_type" : "query_then_fetch",
      "source" : "{\"size\":20}",
      "phase_latency_map" : {
        "expand" : 0,
        "query" : 2,
        "fetch" : 1
      },
      "indices" : [
        "my-index-0"
      ],
      "node_id" : "_2E2035ZQvmEM9GMADl9Bw",
      "latency" : 9
    },
    {
      "timestamp" : 1706746069380,
      "total_shards" : 1,
      "search_type" : "query_then_fetch",
      "source" : "{\"size\":20}",
      "phase_latency_map" : {
        "expand" : 0,
        "query" : 5,
        "fetch" : 0
      },
      "indices" : [
        "my-index-0"
      ],
      "node_id" : "_2E2035ZQvmEM9GMADl9Bw",
      "latency" : 6
    }
  ]
}
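
A client consuming this response might summarize it as in the sketch below. The field names (`top_queries`, `latency`, `node_id`, `phase_latency_map`, `source`) come from the response above, and the sample data is two entries trimmed from it; the `summarize` helper is hypothetical. Note that `source` is a JSON-encoded string, so it needs a second decoding step to recover the query body.

```python
import json

# Two entries trimmed from the response above, deliberately out of order.
SAMPLE_RESPONSE = {
    "top_queries": [
        {
            "timestamp": 1706746069135,
            "source": '{"size":20}',
            "phase_latency_map": {"expand": 0, "query": 10, "fetch": 2},
            "node_id": "IITrLUUXROCQehphz75Jsw",
            "latency": 18,
        },
        {
            "timestamp": 1706746069271,
            "source": '{"size":20}',
            "phase_latency_map": {"expand": 0, "query": 19, "fetch": 0},
            "node_id": "IITrLUUXROCQehphz75Jsw",
            "latency": 20,
        },
    ]
}


def summarize(response):
    """Return one row per query, slowest first (hypothetical helper)."""
    rows = []
    for entry in response["top_queries"]:
        phases = entry["phase_latency_map"]
        rows.append(
            {
                "latency": entry["latency"],
                "node_id": entry["node_id"],
                # Phase that contributed the most to the total latency.
                "slowest_phase": max(phases, key=phases.get),
                # "source" is a JSON string, so decode it a second time.
                "query": json.loads(entry["source"]),
            }
        )
    return sorted(rows, key=lambda row: row["latency"], reverse=True)


rows = summarize(SAMPLE_RESPONSE)
for row in rows:
    print(row["latency"], row["node_id"], row["slowest_phase"], row["query"])
```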

Load Tests

About 70 load tests were run using the nyc_taxis workload on different combinations of window sizes and top N values. No performance impact was identified. The detailed benchmark results follow.

Feature off (Baseline)

| Run | 50th percentile latency | 90th percentile latency | 99th percentile latency | 100th percentile latency |
|---|---|---|---|---|
| 1 (2110) | 5.68534 | 6.27264 | 6.73714 | 8.4323 |
| 2 (bc5e) | 5.36776 | 5.80834 | 6.50946 | 28.6724 |
| 3 (bcdb) | 5.18429 | 5.60393 | 9.2652 | 30.4254 |
| 4 (d74f) | 5.02313 | 5.74386 | 6.7693 | 9.38909 |
| 5 (9244) | 5.10541 | 5.47246 | 7.84308 | 8.63438 |
| 6 (b1de) | 5.14018 | 5.49457 | 6.75883 | 9.80746 |
| 7 (217e) | 5.09886 | 5.56152 | 8.21575 | 18.4278 |
| 8 (57d3) | 5.26441 | 5.83722 | 9.92809 | 15.4894 |
| 9 (78f3) | 5.30425 | 5.76678 | 9.24641 | 30.7725 |
| 10 (a2d6) | 5.30458 | 5.82973 | 8.86554 | 13.9551 |
| Median | 5.22435 | 5.75532 | 8.02942 | 14.72225 |
| Mean | 5.24782 | 5.73911 | 8.01388 | 17.40058 |
| St dev | 0.18888 | 0.23304 | 1.272 | 9.25685 |
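
The summary rows can be reproduced from the per-run values. The sketch below does so for the baseline 50th percentile column using Python's `statistics` module; the variable names are illustrative. The St dev value matches the sample standard deviation (n - 1 denominator) for this column, which suggests, without confirming, that the other columns were computed the same way.

```python
import statistics

# Baseline (feature off) 50th percentile values, runs 1 to 10.
baseline_p50 = [
    5.68534, 5.36776, 5.18429, 5.02313, 5.10541,
    5.14018, 5.09886, 5.26441, 5.30425, 5.30458,
]

median = statistics.median(baseline_p50)
mean = statistics.fmean(baseline_p50)
sample_stdev = statistics.stdev(baseline_p50)  # n - 1 denominator

print(f"median={median:.5f} mean={mean:.5f} stdev={sample_stdev:.5f}")
```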

n=10, window size = 10 minutes

| Run | 50th percentile latency | 90th percentile latency | 99th percentile latency | 100th percentile latency |
|---|---|---|---|---|
| 1 (2110) | 5.48015 | 5.93038 | 6.35181 | 8.1288 |
| 2 (bc5e) | 5.12966 | 5.52368 | 6.05926 | 7.25804 |
| 3 (bcdb) | 5.17215 | 5.66219 | 6.65964 | 7.45862 |
| 4 (d74f) | 4.90608 | 5.57437 | 6.04869 | 7.64221 |
| 5 (9244) | 5.49047 | 5.89037 | 6.41805 | 7.67218 |
| 6 (b1de) | 5.06197 | 5.42041 | 6.89302 | 16.4436 |
| 7 (217e) | 5.27588 | 5.63697 | 6.48004 | 8.57925 |
| 8 (57d3) | 4.85925 | 5.20557 | 6.20037 | 8.81262 |
| 9 (78f3) | 5.3572 | 5.81061 | 8.37155 | 14.7513 |
| 10 (a2d6) | 5.53084 | 6.26242 | 8.44923 | 17.5828 |
| Median | 5.22401 | 5.64958 | 6.44905 | 8.35403 |
| Mean | 5.22636 | 5.6917 | 6.79317 | 10.43294 |
| St dev | 0.24109 | 0.29666 | 0.89066 | 4.10425 |

n=50, window size = 10 minutes

| Run | 50th percentile latency | 90th percentile latency | 99th percentile latency | 100th percentile latency |
|---|---|---|---|---|
| 1 (2110) | 5.3265 | 5.76359 | 6.50887 | 8.87379 |
| 2 (bc5e) | 5.15597 | 5.76559 | 6.26947 | 7.86013 |
| 3 (bcdb) | 5.58544 | 6.05801 | 9.65796 | 15.5788 |
| 4 (d74f) | 5.00877 | 5.44595 | 6.14539 | 9.64201 |
| 5 (9244) | 5.39437 | 5.7242 | 6.42808 | 9.3807 |
| 6 (b1de) | 4.99536 | 5.23857 | 5.80347 | 8.70772 |
| 7 (217e) | 5.26149 | 5.76495 | 9.92437 | 18.695 |
| 8 (57d3) | 5.19225 | 5.59769 | 5.82104 | 6.70231 |
| 9 (78f3) | 5.28367 | 5.7321 | 6.18476 | 8.3956 |
| 10 (a2d6) | 5.38787 | 5.97278 | 6.90825 | 7.99288 |
| Median | 5.27258 | 5.74785 | 6.34878 | 8.79076 |
| Mean | 5.25917 | 5.70634 | 6.96517 | 10.18289 |
| St dev | 0.1807 | 0.23671 | 1.52503 | 3.82823 |

n=100, window size = 10 minutes

| Run | 50th percentile latency | 90th percentile latency | 99th percentile latency | 100th percentile latency |
|---|---|---|---|---|
| 1 (2110) | 5.42979 | 5.88745 | 8.21798 | 13.2944 |
| 2 (bc5e) | 5.05335 | 6.07895 | 7.3364 | 9.17266 |
| 3 (bcdb) | 5.41622 | 5.76504 | 6.77248 | 7.78023 |
| 4 (d74f) | 4.79577 | 5.26539 | 5.69491 | 7.45989 |
| 5 (9244) | 5.20676 | 5.57206 | 8.1644 | 18.6386 |
| 6 (b1de) | 4.43616 | 5.02934 | 16.3021 | 18.167 |
| 7 (217e) | 5.30738 | 5.75768 | 6.37761 | 8.36801 |
| 8 (57d3) | 4.93365 | 5.59796 | 7.23298 | 30.6488 |
| 9 (78f3) | 5.38238 | 5.8045 | 6.63352 | 9.21911 |
| 10 (a2d6) | 5.314 | 5.84568 | 8.70836 | 28.0179 |
| Median | 5.25707 | 5.76136 | 7.28469 | 11.25676 |
| Mean | 5.12755 | 5.66041 | 8.14407 | 15.07666 |
| St dev | 0.3239 | 0.31059 | 3.01186 | 8.56873 |

n=10, window size = 60 minutes

| Run | 50th percentile latency | 90th percentile latency | 99th percentile latency | 100th percentile latency |
|---|---|---|---|---|
| 1 (2110) | 5.48994 | 5.82642 | 9.2272 | 14.1245 |
| 2 (bc5e) | 5.10504 | 5.70799 | 7.26535 | 10.31651 |
| 3 (bcdb) | 5.20354 | 5.79872 | 8.79294 | 16.5674 |
| 4 (d74f) | 5.34996 | 5.96373 | 6.80488 | 10.5772 |
| 5 (9244) | 5.05346 | 5.62933 | 6.0651 | 8.00097 |
| 6 (b1de) | 4.92265 | 5.42617 | 5.97952 | 7.94773 |
| 7 (217e) | 5.20606 | 5.68424 | 8.25867 | 12.8365 |
| 8 (57d3) | 5.04032 | 5.7306 | 6.50287 | 9.94117 |
| 9 (78f3) | 5.22669 | 5.77208 | 6.68624 | 8.57578 |
| 10 (a2d6) | 5.3514 | 6.00356 | 6.46163 | 8.47628 |
| Median | 5.2048 | 5.75134 | 6.74556 | 10.12884 |
| Mean | 5.19491 | 5.75428 | 7.20444 | 10.7364 |
| St dev | 0.17091 | 0.16478 | 1.15473 | 2.90134 |

n=50, window size = 60 minutes

| Run | 50th percentile latency | 90th percentile latency | 99th percentile latency | 100th percentile latency |
|---|---|---|---|---|
| 1 (2110) | 5.66053 | 6.09658 | 10.03 | 16.7627 |
| 2 (bc5e) | 5.17521 | 5.65421 | 8.03381 | 35.3631 |
| 3 (bcdb) | 5.18126 | 5.70204 | 8.7235 | 31.0266 |
| 4 (d74f) | 4.8835 | 5.13923 | 5.81303 | 9.60856 |
| 5 (9244) | 5.38711 | 5.87948 | 6.43249 | 8.73293 |
| 6 (b1de) | 4.73117 | 5.16525 | 5.63382 | 6.48858 |
| 7 (217e) | 5.22232 | 5.66381 | 6.4244 | 6.79487 |
| 8 (57d3) | 4.86784 | 5.3733 | 6.11748 | 8.90142 |
| 9 (78f3) | 5.38244 | 5.89673 | 9.13197 | 28.2333 |
| 10 (a2d6) | 5.53817 | 5.93063 | 8.72938 | 31.9991 |
| Median | 5.20179 | 5.68293 | 7.23315 | 13.18563 |
| Mean | 5.20296 | 5.65013 | 7.50699 | 18.39112 |
| St dev | 0.30303 | 0.3278 | 1.5949 | 11.8744 |

n=100, window size = 60 minutes

| Run | 50th percentile latency | 90th percentile latency | 99th percentile latency | 100th percentile latency |
|---|---|---|---|---|
| 1 (2110) | 5.61482 | 6.16978 | 9.96055 | 26.959 |
| 2 (bc5e) | 4.80904 | 5.14862 | 6.18893 | 16.3675 |
| 3 (bcdb) | 5.23928 | 5.77383 | 6.16525 | 8.2151 |
| 4 (d74f) | 4.91612 | 5.31431 | 6.05033 | 9.32068 |
| 5 (9244) | 5.43572 | 5.90742 | 8.29484 | 14.3912 |
| 6 (b1de) | 4.8768 | 5.26618 | 5.81323 | 7.71826 |
| 7 (217e) | 5.31596 | 5.80771 | 8.0518 | 33.2195 |
| 8 (57d3) | 4.91712 | 5.31075 | 8.97082 | 28.1725 |
| 9 (78f3) | 5.28335 | 5.62963 | 9.44923 | 12.0019 |
| 10 (a2d6) | 5.48454 | 6.06844 | 6.63467 | 8.84661 |
| Median | 5.26132 | 5.70173 | 7.34324 | 13.19655 |
| Mean | 5.18928 | 5.63967 | 7.55797 | 16.52123 |
| St dev | 0.28816 | 0.36172 | 1.56758 | 9.4619 |
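
One way to read the "no performance impact" conclusion is to compare each configuration's median row against the baseline median row. The sketch below does this for the 50th and 90th percentile columns only; the numbers are copied from the Median rows above, and the helper and dictionary names are illustrative. The 99th and 100th percentile columns are left out because their run-to-run spread is much larger (for example, the baseline 100th percentile St dev is 9.25685).

```python
# Median rows copied from the tables above.
BASELINE = {"p50": 5.22435, "p90": 5.75532}
CONFIGS = {
    "n=10, 10 min": {"p50": 5.22401, "p90": 5.64958},
    "n=50, 10 min": {"p50": 5.27258, "p90": 5.74785},
    "n=100, 10 min": {"p50": 5.25707, "p90": 5.76136},
    "n=10, 60 min": {"p50": 5.2048, "p90": 5.75134},
    "n=50, 60 min": {"p50": 5.20179, "p90": 5.68293},
    "n=100, 60 min": {"p50": 5.26132, "p90": 5.70173},
}


def relative_change(value, baseline):
    """Fractional change of value relative to baseline."""
    return (value - baseline) / baseline


changes = {
    name: {key: relative_change(value, BASELINE[key]) for key, value in medians.items()}
    for name, medians in CONFIGS.items()
}

for name, change in changes.items():
    print(f"{name}: p50 {change['p50']:+.2%}, p90 {change['p90']:+.2%}")

# Largest absolute deviation from the baseline across both columns.
largest = max(abs(c) for change in changes.values() for c in change.values())
print(f"largest deviation: {largest:.2%}")
```

For these two columns every configuration stays within 2% of the baseline median, which is small next to the per-run standard deviations reported in the tables.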

Related Issues

Resolves #11295 #11296

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Contributor

github-actions bot commented Feb 2, 2024

❕ Gradle check result for cd805a2: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testRequestStats

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Contributor

github-actions bot commented Feb 3, 2024

❌ Gradle check result for 3b566be: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Feb 4, 2024

❌ Gradle check result for 8309af2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Feb 4, 2024

❌ Gradle check result for 9675c7a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@ansjcy ansjcy force-pushed the top-n-queries branch 2 times, most recently from bfd7e93 to 52218a6 Compare February 5, 2024 19:16
Contributor

github-actions bot commented Feb 5, 2024

✅ Gradle check result for bfd7e93: SUCCESS

Contributor

github-actions bot commented Feb 5, 2024

❕ Gradle check result for 52218a6: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationMultiSearchDuringFetchPhase

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@ansjcy ansjcy force-pushed the top-n-queries branch 2 times, most recently from 46b26b4 to f211425 Compare February 5, 2024 22:58
Contributor

github-actions bot commented Feb 5, 2024

✅ Gradle check result for 46b26b4: SUCCESS

Contributor

github-actions bot commented Feb 5, 2024

❌ Gradle check result for f211425: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Feb 6, 2024

❕ Gradle check result for fba0556: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testSnapshotAndRestore
      1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testSnapshotAndRestore
      1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.classMethod
      1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.classMethod
      1 org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@ansjcy ansjcy changed the title [Draft] Query Insight Plugin with Top Queries feature Query Insight Plugin with Top Queries feature Feb 6, 2024
@ansjcy ansjcy changed the title Query Insight Plugin with Top Queries feature [Draft] Query Insight Plugin with Top Queries feature Feb 6, 2024
@ansjcy ansjcy marked this pull request as draft February 6, 2024 06:20
@ansjcy ansjcy removed the v2.12.0 Issues and PRs related to version 2.12.0 label Feb 6, 2024
@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Mar 10, 2024
@ansjcy ansjcy closed this Mar 12, 2024
Labels
enhancement Enhancement or improvement to existing feature or request Search:Query Insights stalled Issues that have stalled
Development

Successfully merging this pull request may close these issues.

Top N queries by Latency - aggregator implementation
4 participants