-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix updating the parent span #9450
base: main
Are you sure you want to change the base?
Fix updating the parent span #9450
Conversation
Signed-off-by: Gagan Juneja <[email protected]>
Signed-off-by: Gagan Juneja <[email protected]>
Signed-off-by: Gagan Juneja <[email protected]>
Compatibility status:Checks if related components are compatible with change bf10ff7 Incompatible componentsIncompatible components: [https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/security-analytics.git] Skipped componentsCompatible componentsCompatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git] |
Gradle Check (Jenkins) Run Completed with:
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #9450 +/- ##
============================================
- Coverage 71.17% 71.08% -0.10%
- Complexity 57451 57459 +8
============================================
Files 4776 4776
Lines 270795 270798 +3
Branches 39580 39581 +1
============================================
- Hits 192743 192501 -242
- Misses 61852 62061 +209
- Partials 16200 16236 +36 ☔ View full report in Codecov by Sentry. |
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
@reta Please take a look. |
@@ -77,7 +77,11 @@ public SpanContext getCurrentSpan() { | |||
private void endSpan(Span span) { | |||
if (span != null) { | |||
span.endSpan(); | |||
setCurrentSpanInContext(span.getParentSpan()); | |||
if (span.getParentSpan() != null && !span.getParentSpan().hasEnded()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Gaganjuneja the thing you're trying to fix should never happen, I think we are managing context restoration incorrectly. If the execution is synchronous, thing are clear: the parent could not be finished before children, now the asynchronous case is different:
- the parent span and child spans do not belong to the same thread
- when the we start new span scope, we should capture the current span associated with the current thread (which could be not the same as the parent span)
- when the span is ended, we have to restore the span that was active on that particular thread
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @reta for taking a look.
I had run a couple of tests and deep dive to understand the issue.
As of now Our context restoration in based on the fact that
OpenSearch/server/src/main/java/org/opensearch/common/util/concurrent/ThreadContext.java
Line 74 in 8cfde6c
* A ThreadContext is a map of string headers and a transient map of keyed objects that are associated with |
Every opensearch thread is managed by a thread pool or executor being responsible for stashing and restoring the threads context
. So ideally even we are setting the wrong/ended parent but it should be removed once the thread call is over and context is restored. One specific scenario where this is not happening is scheduled cluster monitoring actions.
Here if at Thread1 the context is not stashed the state will be persisted. This exactly happens with the NodeStatsTransportAction and possibly with other actions where startSpan through TransportService::sendRequest is creating the span and persisting with the current Management thread without stashing the context after the call is over.
We can fix this by adding the ThreadContext.stashContext() here https://github.com/opensearch-project/OpenSearch/pull/9415/files#diff-518cfddb0699fc99133f7a7dd5dc2224465c903405ae4aa9a8440abd03d1a686R868
We can fix this case by case. But now the question is should we add a check at the framework level like when we are starting the span then check if the current span is already ended then ignore that and create a new span without the current parent. Capturing the currentSpan while creating the span also wont help because most of the cases the context is being propagated to a caller thread from the calling thread.
Your thoughts on this please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There could be more cases from the ClusterApplierService.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking @Gaganjuneja , so indeed we don't manage context properly. So we need to make very clear distinction between spans and scopes. The OpenTelemetry
models it differently but we don't: the spans could outlive the scopes, but the scopes provide this short-living context propagation (I will share the idea and will help you to deal with API changes):
// Creates a Span that is not associated with the thread (it is not attached to anything)
final Span httpRequestSpan = tracer.startSpan(SpanBuilder.from(httpRequest));
// Attaches the span to current thread (scope)
final (final SpanScope scope = tracer.spanInScope(httpRequestSpan)) {
HttpChannel traceableHttpChannel = new TraceableHttpChannel(httpChannel, httpRequestSpanScope);
handleIncomingRequest(httpRequest, traceableHttpChannel, httpRequest.getInboundException());
}
// When scope is closed, the previously attached span is restored, but the current span is not closed
With this model (span + scope), we will be able to cover any async and reactive context propagation. On the side note, OpenTelemetry has pretty good debugging capabilities to help validating the spans / scopes lifecycle:
-Dio.opentelemetry.context.enableStrictContext=true
We are building our MockXxx implementation but I think we should stop doing it and use OpenTelemetry in tests instead, it will save us tons of time and efforts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Scope should be closed and cleared out the current span from the current context. Span could be ended later on once the aync operation is completed.
-Dio.opentelemetry.context.enableStrictContext=true
Yes, we also took the idea of Strict check from here only.
We are building our MockXxx implementation but I think we should stop doing it and use OpenTelemetry in tests instead, it will save us tons of time and efforts.
MockXxx implementation we had written because OTEL is a plugin and we need to take the otel dependency inside the test framework and in future if multiple telemetry implementations are there then it would be difficult to do.
Let me also think about it and share the thoughts.
This PR is stalled because it has been open for 30 days with no activity. |
Hi @Gaganjuneja, the PR is stalled. Is this being worked upon? Feel free to reach out to maintainers for further reviews. |
This PR is stalled because it has been open for 30 days with no activity. |
This PR is stalled because it has been open for 30 days with no activity. |
This PR is stalled because it has been open for 30 days with no activity. |
Description
Tracing framework after ending the current span internally set the parent as a current span in the thread context. But after this change #8428 now parent can be ended even before the span so the assumption that parent should always be closed after child doesnt hold true. Now setting the already ended parent as a current span doesn't make sense. Adding parent as a current span only if it is still recording
Related Issues
Resolves #9449
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.