Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix updating the parent span #9450

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

Gaganjuneja
Copy link
Contributor

Description

Tracing framework after ending the current span internally set the parent as a current span in the thread context. But after this change #8428 now parent can be ended even before the span so the assumption that parent should always be closed after child doesnt hold true. Now setting the already ended parent as a current span doesn't make sense. Adding parent as a current span only if it is still recording

Related Issues

Resolves #9449

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Gagan Juneja <[email protected]>
Gagan Juneja added 2 commits August 20, 2023 22:31
Signed-off-by: Gagan Juneja <[email protected]>
Signed-off-by: Gagan Juneja <[email protected]>
@opensearch-trigger-bot
Copy link
Contributor

Compatibility status:

Checks if related components are compatible with change bf10ff7

Incompatible components

Incompatible components: [https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/security-analytics.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git]

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.snapshots.CloneSnapshotIT.testCloneShallowSnapshotIndex
      1 org.opensearch.search.SearchWeightedRoutingIT.testMultiGetWithNetworkDisruption_FailOpenEnabled

@codecov
Copy link

codecov bot commented Aug 20, 2023

Codecov Report

Attention: Patch coverage is 75.00000% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 71.08%. Comparing base (bf10ff7) to head (5d0f21e).
Report is 983 commits behind head on main.

Files Patch % Lines
...ava/org/opensearch/telemetry/tracing/OTelSpan.java 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #9450      +/-   ##
============================================
- Coverage     71.17%   71.08%   -0.10%     
- Complexity    57451    57459       +8     
============================================
  Files          4776     4776              
  Lines        270795   270798       +3     
  Branches      39580    39581       +1     
============================================
- Hits         192743   192501     -242     
- Misses        61852    62061     +209     
- Partials      16200    16236      +36     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.search.pit.DeletePitMultiNodeIT.testDeleteWhileSearch

@Gaganjuneja
Copy link
Contributor Author

@reta Please take a look.

@@ -77,7 +77,11 @@ public SpanContext getCurrentSpan() {
private void endSpan(Span span) {
if (span != null) {
span.endSpan();
setCurrentSpanInContext(span.getParentSpan());
if (span.getParentSpan() != null && !span.getParentSpan().hasEnded()) {
Copy link
Collaborator

@reta reta Aug 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Gaganjuneja the thing you're trying to fix should never happen, I think we are managing context restoration incorrectly. If the execution is synchronous, thing are clear: the parent could not be finished before children, now the asynchronous case is different:

  • the parent span and child spans do not belong to the same thread
  • when the we start new span scope, we should capture the current span associated with the current thread (which could be not the same as the parent span)
  • when the span is ended, we have to restore the span that was active on that particular thread

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @reta for taking a look.
I had run a couple of tests and deep dive to understand the issue.

As of now Our context restoration in based on the fact that

* A ThreadContext is a map of string headers and a transient map of keyed objects that are associated with
Every opensearch thread is managed by a thread pool or executor being responsible for stashing and restoring the threads context. So ideally even we are setting the wrong/ended parent but it should be removed once the thread call is over and context is restored. One specific scenario where this is not happening is scheduled cluster monitoring actions.

Screenshot 2023-08-27 at 10 08 16 AM

Here if at Thread1 the context is not stashed the state will be persisted. This exactly happens with the NodeStatsTransportAction and possibly with other actions where startSpan through TransportService::sendRequest is creating the span and persisting with the current Management thread without stashing the context after the call is over.

We can fix this by adding the ThreadContext.stashContext() here https://github.com/opensearch-project/OpenSearch/pull/9415/files#diff-518cfddb0699fc99133f7a7dd5dc2224465c903405ae4aa9a8440abd03d1a686R868

We can fix this case by case. But now the question is should we add a check at the framework level like when we are starting the span then check if the current span is already ended then ignore that and create a new span without the current parent. Capturing the currentSpan while creating the span also wont help because most of the cases the context is being propagated to a caller thread from the calling thread.

Your thoughts on this please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There could be more cases from the ClusterApplierService.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking @Gaganjuneja , so indeed we don't manage context properly. So we need to make very clear distinction between spans and scopes. The OpenTelemetry models it differently but we don't: the spans could outlive the scopes, but the scopes provide this short-living context propagation (I will share the idea and will help you to deal with API changes):

            // Creates a Span that is not associated with the thread (it is not attached to anything)
            final Span httpRequestSpan = tracer.startSpan(SpanBuilder.from(httpRequest));
            // Attaches the span to current thread (scope)
            final (final SpanScope scope = tracer.spanInScope(httpRequestSpan)) {
                HttpChannel traceableHttpChannel = new TraceableHttpChannel(httpChannel, httpRequestSpanScope);
                handleIncomingRequest(httpRequest, traceableHttpChannel, httpRequest.getInboundException());
            } 
            // When scope is closed, the previously attached span is restored, but the current span is not closed

With this model (span + scope), we will be able to cover any async and reactive context propagation. On the side note, OpenTelemetry has pretty good debugging capabilities to help validating the spans / scopes lifecycle:

-Dio.opentelemetry.context.enableStrictContext=true

We are building our MockXxx implementation but I think we should stop doing it and use OpenTelemetry in tests instead, it will save us tons of time and efforts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Scope should be closed and cleared out the current span from the current context. Span could be ended later on once the aync operation is completed.

-Dio.opentelemetry.context.enableStrictContext=true

Yes, we also took the idea of Strict check from here only.

We are building our MockXxx implementation but I think we should stop doing it and use OpenTelemetry in tests instead, it will save us tons of time and efforts.

MockXxx implementation we had written because OTEL is a plugin and we need to take the otel dependency inside the test framework and in future if multiple telemetry implementations are there then it would be difficult to do.

Let me also think about it and share the thoughts.

@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Sep 28, 2023
@ticheng-aws
Copy link
Contributor

Hi @Gaganjuneja, the PR is stalled. Is this being worked upon? Feel free to reach out to maintainers for further reviews.

@ticheng-aws ticheng-aws added the bug Something isn't working label Jan 6, 2024
@opensearch-trigger-bot opensearch-trigger-bot bot removed the stalled Issues that have stalled label Jan 10, 2024
@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added stalled Issues that have stalled and removed stalled Issues that have stalled labels Feb 17, 2024
@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added stalled Issues that have stalled and removed stalled Issues that have stalled labels Mar 24, 2024
@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added stalled Issues that have stalled and removed stalled Issues that have stalled labels Apr 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG][Tracing Framework] Wrongly setting the current span.
3 participants