[ACTION NEEDED] Fix flaky integration tests at distribution level #1670

Closed
Tracked by #4588
gaiksaya opened this issue Apr 3, 2024 · 11 comments
Assignees
Labels
bug Something isn't working v2.14.0

Comments

@gaiksaya
Member

gaiksaya commented Apr 3, 2024

What is the bug?
In 2.13.0 and earlier releases, this component manually signed off on the release despite failing integration tests. See opensearch-project/opensearch-build#4433 (comment)
The flaky test runs cost the release team a lot of time when collecting go/no-go decisions and significantly lower confidence in the release bundles.

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Run integration testing for altering and see the failures.
  2. The failures can be reproduced by following the steps described in the AUTOCUT issues for failed integration tests.

What is the expected behavior?
Tests should be consistently passing.

Do you have any additional context?
Please note that this is a hard blocker for the 2.14.0 release, as per the discussion here.

@gaiksaya gaiksaya added bug Something isn't working untriaged v2.14.0 labels Apr 3, 2024
@Swiddis Swiddis removed the untriaged label Apr 3, 2024
RyanL1997 pushed a commit to RyanL1997/dashboards-observability that referenced this issue Apr 18, 2024
…oject#1472) (opensearch-project#1670)

Signed-off-by: Kajetan Nobel <[email protected]>
Signed-off-by: Kajetan Nobel <[email protected]>
Co-authored-by: Stephen Crawford <[email protected]>
Co-authored-by: Darshit Chanpura <[email protected]>
Co-authored-by: Peter Nied <[email protected]>
Co-authored-by: Peter Nied <[email protected]>
(cherry picked from commit cfc83dd94eea02b5738bf607dd9866308814f2fc)

Co-authored-by: jakubp-eliatra <[email protected]>
@bbarani
Member

bbarani commented Apr 23, 2024

@RyanL1997 @ps48 Can you please provide your input?

@RyanL1997 RyanL1997 self-assigned this Apr 24, 2024
@Swiddis
Collaborator

Swiddis commented Apr 25, 2024

We're working on it. A while back I asked about the failures in opensearch-project/opensearch-build#4635. As far as I can tell, the distribution failures don't come from our tests but from somewhere in the pipeline. I've marked our distribution issues with "help wanted" where applicable.

@Swiddis
Collaborator

Swiddis commented Apr 25, 2024

It also looks like many of the manifests are still showing a Not Available status, related to the discussion here. They show it even for fresh logs, so the cause doesn't seem to be stale manifests.

@bbarani
Member

bbarani commented Apr 25, 2024

Tagging @zelinh here to provide his input.

@zelinh
Member

zelinh commented Apr 25, 2024

Here are some reasons that it may show Not Available. https://github.com/opensearch-project/opensearch-build/tree/main/src/report_workflow#why-are-some-component-testing-results-missing-or-unavailable
@Swiddis Could you share one case that is showing Not Available so I can look into it in more detail?

@Swiddis
Collaborator

Swiddis commented Apr 25, 2024

Could you share one case that is showing Not Available so I can look into it in more detail?

E.g. in the 2.14 integration tests autocut: of the three most recent manifests at the time of writing, two are unavailable (most recent, second most recent (available), third most recent).

@zelinh
Member

zelinh commented Apr 25, 2024

Could you share one case that is showing Not Available so I can look into it in more detail?

E.g. in the 2.14 integration tests autocut: of the three most recent manifests at the time of writing, two are unavailable (most recent, second most recent (available), third most recent).

I saw the following in both of the unavailable runs. It seems the process was terminated because it hit the timeout limit while running the integ tests for observabilityDashboards; therefore it never got through the full test recording process.

Cancelling nested steps due to timeout
Sending interrupt signal to process

Session terminated, killing shell...Terminated
 ...killed.
script returned exit code 143

https://build.ci.opensearch.org/job/integ-test-opensearch-dashboards/5856/consoleFull
https://build.ci.opensearch.org/job/integ-test-opensearch-dashboards/5844/consoleFull
Both of these jobs ran for more than 4 hours, while the available one ran for only 1.5 hours.
Do you have any idea why these jobs run longer than usual? @rishabh6788 @gaiksaya
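
For reference, the exit code 143 in the log excerpt above matches the common shell convention of reporting 128 plus the signal number when a process is killed by a signal, so 143 would correspond to signal 15 (SIGTERM). A minimal sketch of decoding such codes follows; the helper name `describe_exit_code` is made up for illustration and is not part of the build tooling.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a shell-style exit code.

    Assumption: codes above 128 follow the usual shell convention of
    128 + signal number. This helper is hypothetical, for illustration only.
    """
    if code > 128:
        signum = code - 128
        try:
            return f"terminated by {signal.Signals(signum).name}"
        except ValueError:
            # Not a signal number known on this platform.
            return f"terminated by unknown signal {signum}"
    if code == 0:
        return "exited normally"
    return f"exited with error status {code}"


print(describe_exit_code(143))
```

Read this way, the 143 only says that the shell was terminated from outside (here, by the pipeline timeout), not that a test assertion failed.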

@Swiddis
Collaborator

Swiddis commented Apr 29, 2024

Hypothesis: The failing tests are flaky and the timeouts only happen if the tests pass (i.e. something later in the test suite is taking all the time). We only get the failure message when the earlier test fails and cuts the run short.

Based on this hypothesis I made opensearch-project/opensearch-dashboards-functional-test#1250 to fix the flakiness, but I'm still not sure what's causing the timeouts.
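
To spell out the hypothesis, here is a toy model of the three run outcomes it predicts. This is only a sketch of the reasoning: the function and its two flags are hypothetical and do not correspond to anything in the actual pipeline.

```python
def run_outcome(early_test_passes: bool, later_stage_hangs: bool) -> str:
    """Toy model of the hypothesis (hypothetical, not real pipeline logic).

    - If the flaky early test fails, the run is cut short and a failure
      message is reported.
    - If it passes, the run continues into a later stage; if that stage
      hangs, the job hits the timeout and no results are recorded.
    """
    if not early_test_passes:
        return "failure message reported"
    if later_stage_hangs:
        return "timeout, results unavailable"
    return "results recorded"


# Every combination of the two assumed conditions.
for early in (False, True):
    for hangs in (False, True):
        print(early, hangs, "->", run_outcome(early, hangs))
```

Under this model a run never shows both a test failure message and a timeout, which is consistent with what the manifests show so far.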

@Swiddis
Collaborator

Swiddis commented Apr 30, 2024

For completeness I've checked the recent pipeline logs after the flakiness fix was merged, and am not seeing any integ-test failures for observability. https://build.ci.opensearch.org/blue/rest/organizations/jenkins/pipelines/integ-test-opensearch-dashboards/runs/5899/log/?start=0

I can find the interruption exception, but no indication of what specifically is being interrupted (is some test hanging?):

org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: 5a075705-b450-4433-85c4-0b5d9991ba84
org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
		at org.jenkinsci.plugins.workflow.steps.BodyExecution.cancel(BodyExecution.java:59)
		at org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution.cancel(TimeoutStepExecution.java:197)
		at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:67)
		at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
		at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
		at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	Caused: java.lang.Exception: Error running integtest for component observabilityDashboards
		at WorkflowScript.run(WorkflowScript:317)
		at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(Docker.groovy:141)
		at ___cps.transform___(Native Method)
		at java.base/jdk.internal.reflect.GeneratedConstructorAccessor790.newInstance(Unknown Source)
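
For triaging many such logs, a rough sketch like the one below could pull out which component was interrupted and whether the interruption came from the timeout step. The marker strings are taken from the trace above; the function name and the returned dictionary shape are assumptions for illustration, not an existing tool.

```python
import re

# Marker strings copied from the trace above; everything else is hypothetical.
TIMEOUT_MARKER = "TimeoutStepExecution.cancel"
COMPONENT_RE = re.compile(r"Error running integtest for component (\S+)")


def summarize_trace(trace: str) -> dict:
    """Return the interrupted component (if any) and whether a timeout fired."""
    match = COMPONENT_RE.search(trace)
    return {
        "component": match.group(1) if match else None,
        "timed_out": TIMEOUT_MARKER in trace,
    }


sample = """\
org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
    at org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution.cancel(TimeoutStepExecution.java:197)
Caused: java.lang.Exception: Error running integtest for component observabilityDashboards
"""
print(summarize_trace(sample))
```

This only confirms that the timeout interrupted the observabilityDashboards run; it still does not say which test was running at the time.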

@bbarani
Member

bbarani commented Apr 30, 2024

Tagging @rishabh6788 to look into the above failure ^

@Swiddis
Collaborator

Swiddis commented Jun 14, 2024

Currently just held up by #1822

@ps48 ps48 closed this as completed Dec 6, 2024
6 participants