[BUG] Potential Cluster Slowdown/Lags after merging #13748(#14348) #14338(#14391) in 2.15 #14442

Closed
peterzhuamazon opened this issue Jun 18, 2024 · 12 comments · Fixed by opensearch-project/opensearch-dashboards-functional-test#1421
Labels
bug Something isn't working Performance This is for any performance related enhancements or bugs v2.15.0 Issues and PRs related to version 2.15.0

Comments

@peterzhuamazon
Member

Describe the bug

During the 2.15.0 release process, Infra generated multiple Release Candidates (RCs) for testing before the launch date.

Between RC3 and RC4, the only two major changes are: opensearch-project/opensearch-build#4785

Once RC4 was generated, we observed potential slowdowns in the Dashboards-related plugin integTests.

These failures are not all the same, but they show very similar symptoms: elements are not loaded in time. It looks as if OS is slowing down, causing the OSD GUI to load much more slowly as a side effect.

  1) Home(Get Started) page
       should load Home page properly:
     AssertionError: Timed out retrying after 10000ms: Expected to find content: 'Get started' within the selector: 'h1' but never did.
      at Context.eval (http://localhost:5601/__cypress/tests?p=cypress/integration/plugins/security/get_started_spec.js:170:10)
  1) Cypress
       "before all" hook for "Visits Reporting homepage":
     AssertionError: Timed out retrying after 10000ms: Expected to find element: `div[data-test-subj="sampleDataSetCardflights"]`, but never found it.

Because this error occurred during a `before all` hook we are skipping the remaining tests in the current suite: `Cypress`
      at Context.eval (http://localhost:5601/__cypress/tests?p=cypress/integration/plugins/reports-dashboards/01-create.spec.js:167:8)
  1) Test PPL UI
       "before each" hook for "Confirm results are empty":
     AssertionError: Timed out retrying after 10000ms: Expected to find element: `.euiButton__text[title=PPL]`, but never found it.

Because this error occurred during a `before each` hook we are skipping the remaining tests in the current suite: `Test PPL UI`
      at Context.eval (http://localhost:5601/__cypress/tests?p=cypress/integration/plugins/query-workbench-dashboards/ui.spec.js:207:8)

We are able to reproduce these errors locally as well. After switching from RC4 to RC3 artifacts, the errors are gone and the tests pass.

Please take a look and let us know the results.
Thanks!

Related component

Other

Expected behavior

OSD integTests pass.

Additional Details

Plugins
All

Host/Environment (please complete the following information):
All Dist/Arch

peterzhuamazon added the bug and untriaged labels Jun 18, 2024
github-actions bot added the Other label Jun 18, 2024
peterzhuamazon changed the title from "Potential Cluster Slowdown/Lags after merging #13748(#14348) #14338(#14391) in 2.15" to "[BUG] Potential Cluster Slowdown/Lags after merging #13748(#14348) #14338(#14391) in 2.15" Jun 18, 2024
peterzhuamazon added the v2.15.0 label Jun 18, 2024
@getsaurabh02
Member

@SwethaGuptha @shwetathareja @soosinha Can we please verify and root-cause whether the changes merged between RC3 and RC4 could slow down OS, thereby delaying the loading of OSD objects and leading to assertion failures (timeouts) on multiple fronts in the Dashboards plugins?

cc: @dblock @Pallavi-AWS @rramachand21

@zelinh
Member

zelinh commented Jun 18, 2024

Reproduced the with-security test failure for securityDashboards locally.

  7 passing (36s)
  1 failing

  1) Home(Get Started) page
       should load Home page properly:
     AssertionError: Timed out retrying after 10000ms: Expected to find content: 'Get started' within the selector: 'h1' but never did.
      at Context.eval (http://localhost:5601/__cypress/tests?p=cypress/integration/plugins/security/get_started_spec.js:170:10)

@shwetathareja
Member

@zelinh: Is the local repro with RC4? Can you check whether it reproduces with RC3 as well?
Also, are you able to profile OSD to confirm whether the extra time was spent in OSD itself or was due to a delay in OS process bootstrapping?

@SwethaGuptha / @shiv0408 to help profile the 2 PRs merged in 2.15 RC4 which are the suspects:
#14348
#14391

@shwetathareja
Member

shwetathareja commented Jun 19, 2024

Remote cluster state is disabled by default, and #14391 only made a fix related to it. Likewise, the bug fix in #14348 is for fetching unassigned shard details in batch mode, which is also disabled by default.

The team can compare bootstrap times for the OS process with RC3 and RC4.

@shiv0408
Member

@peterzhuamazon I tried bootstrapping the OpenSearch process from RC3 and RC4 binaries. In fact, the time to bootstrap for RC3 (2.671s) was higher than RC4 (2.491s) on my system.
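
To put the two figures above in proportion, the comparison can be written as a relative change. This is only an illustrative TypeScript sketch: `relativeChange` is a hypothetical helper, and the inputs are the single-run timings quoted above, so the result says nothing about run-to-run variance.

```typescript
// Relative change of a candidate timing against a baseline timing.
// A negative result means the candidate was faster than the baseline.
function relativeChange(baselineSec: number, candidateSec: number): number {
  if (baselineSec <= 0) {
    throw new Error("baseline must be positive");
  }
  return (candidateSec - baselineSec) / baselineSec;
}

// Bootstrap times quoted in this thread (single runs on one machine).
const rc3BootstrapSec = 2.671;
const rc4BootstrapSec = 2.491;

const bootstrapChange = relativeChange(rc3BootstrapSec, rc4BootstrapSec);
console.log(`RC4 vs RC3 bootstrap: ${(bootstrapChange * 100).toFixed(1)}%`);
```

With these inputs RC4 comes out roughly 7% faster than RC3 at bootstrap. Repeated runs would be needed before reading anything more into the difference.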

@peternied
Member

[Triage - attendees 1 2 3 4 5]
@peterzhuamazon Thanks for creating this issue

@peterzhuamazon
Member Author

Hi,

@shwetathareja @shiv0408 We used both RC3 and RC4 for the test.
securityDashboards always succeeds on RC3 but fails on RC4.

As you can see in this PR, the only code changes between RC3 and RC4 are these two PRs:

Thanks.

@peterzhuamazon
Member Author

peterzhuamazon commented Jun 19, 2024

We suspect that these two PRs have some side effect that is causing Dashboards to slow down.
As for OpenSearch itself, I don't have enough context on the changes to know whether they actually slow down the bootstrap time.

I don't think this is only about bootstrap time. It looks more like some responses to actual queries are coming back slowly, so Dashboards cannot load elements quickly enough for the tests to pass. Thanks.

@shiv0408
Member

@peterzhuamazon These PRs should not cause any side effects, as the code in both only runs when specific settings are enabled, and those settings are disabled by default.
I agree that, from the errors, the cause looks like a slowdown in loading the UI rather than just bootstrap. But something else that we are missing is causing this slowdown; these PRs can't be the culprits.

@soosinha
Member

Found a commit in the opensearch-dashboards-functional-test repo: opensearch-project/opensearch-dashboards-functional-test@9c17a6c
This commit changed the test timeout from 60s to 10s, which caused a lot of tests to start failing.
So the PRs in the OpenSearch repo mentioned previously should be good to go ahead with.
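
Why a shorter timeout alone produces the "Timed out retrying after 10000ms" failures can be sketched with a small simulation. This is a TypeScript illustration on a virtual clock, not Cypress's actual implementation: `retryUntilTimeout` and the 15s render time are hypothetical, and only the 10s and 60s timeouts come from the commit described above. In Cypress the relevant setting is presumably `defaultCommandTimeout`, but that is an assumption about what the commit touched.

```typescript
interface RetryResult {
  passed: boolean;
  elapsedMs: number;
}

// Simulates a retry-until-timeout assertion on a virtual clock (no real waiting).
// `isReady(elapsedMs)` reports whether the awaited element exists at that time.
function retryUntilTimeout(
  isReady: (elapsedMs: number) => boolean,
  timeoutMs: number,
  intervalMs: number = 50,
): RetryResult {
  for (let elapsedMs = 0; elapsedMs <= timeoutMs; elapsedMs += intervalMs) {
    if (isReady(elapsedMs)) {
      return { passed: true, elapsedMs };
    }
  }
  return { passed: false, elapsedMs: timeoutMs };
}

// Hypothetical page whose element renders 15s after the test starts.
const HYPOTHETICAL_RENDER_MS = 15000;
const elementRendered = (elapsedMs: number): boolean =>
  elapsedMs >= HYPOTHETICAL_RENDER_MS;

const with10s = retryUntilTimeout(elementRendered, 10000);
const with60s = retryUntilTimeout(elementRendered, 60000);
console.log(`10s timeout passed: ${with10s.passed}`);
console.log(`60s timeout passed: ${with60s.passed}`);
```

The same page behavior fails under the 10s limit and passes under the 60s limit, so a run of timeouts like these does not by itself show that the server got slower.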

@peterzhuamazon
Member Author

> Found a commit in the opensearch-dashboards-functional-test repo: opensearch-project/opensearch-dashboards-functional-test@9c17a6c This commit changed the test timeout from 60s to 10s, which caused a lot of tests to start failing. So the PRs in the OpenSearch repo mentioned previously should be good to go ahead with.

Thanks @soosinha for debugging this together.

I will soon send a PR to revert this commit and then proceed with the RC5 builds. Sorry for the confusion.
