CI/CD: producing long running traces #1648
Comments
When the controller restarts, emitting spans that do not match the actual duration of the tasks they represent, and linking them with span links because they cannot share a common parent span, is suboptimal. I have also observed this issue in the Jenkins OpenTelemetry plugin. As mentioned in the CICD SIG call of 28 November 2024, this seems to be a limitation of the OpenTelemetry Go SDK and might be shared by other OpenTelemetry SDKs. The requirements necessary to solve the issue are: being able to create spans with a custom start time, being able to use stable (externally provided) trace and span IDs, and not sending still-ongoing spans when the controller restarts.
According to the specification for the SDKs (https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#span-creation) there should be an optional parameter for the start timestamp. Checking the Golang SDK, this should be possible when creating spans using the WithTimestamp option. In the implementation of the GitHub Actions event receiver, the opentelemetry-collector receives the GitHub Actions events via webhook, and the events carry the relevant timestamps. Regarding stable trace IDs, my quick search did not find how to do it in the SDK. However, given that tracing across microservices using http headers for context propagation works, there should be a way to provide the context in the SDK (e.g. reading it from a request header). For the sending of all ongoing spans when the controller restarts, I'm not sure whether this behavior can be changed to not send the ongoing spans, or whether spans would always have to be created with a custom start time just before ending them. What impact would this have on child span context propagation? In conclusion: it might be possible to implement resilient tracing for long running jobs without spec or SDK changes. However, further investigation is needed to verify this or to discover specific issues in the SDK.
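For the context-propagation part, here is a minimal sketch (not from the comment above; it assumes the Go SDK with a real TracerProvider registered and uses the W3C TraceContext propagator) of how a traceparent string could be persisted outside the process, e.g. in the workflow resource, and restored after a restart so that new spans join the original trace:

```go
package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/trace"
)

func main() {
	// Assumes a real SDK TracerProvider has already been registered via
	// otel.SetTracerProvider; otherwise the no-op tracer produces an
	// invalid span context and nothing useful gets injected.
	prop := propagation.TraceContext{}
	tracer := otel.Tracer("argo-workflows")

	// Before a restart: capture the current trace context as a traceparent
	// string and store it somewhere durable (e.g. a workflow annotation).
	ctx, span := tracer.Start(context.Background(), "workflow")
	carrier := propagation.MapCarrier{}
	prop.Inject(ctx, carrier)
	stored := carrier.Get("traceparent")
	span.End()

	// After a restart: rebuild a context carrying the same trace ID and
	// start new spans under it.
	restored := prop.Extract(context.Background(), propagation.MapCarrier{"traceparent": stored})
	_, child := tracer.Start(restored, "resumed-phase")
	defer child.End()

	fmt.Println("trace:", trace.SpanContextFromContext(restored).TraceID())
}
```

Note that the restored context acts as a parent, so this yields new child spans under the stored context rather than resuming the original span; forcing a specific span ID needs the IDGenerator approach described in the next comment.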
As long as you can store a trace ID, span ID, and start time, you can probably just generate the entire span when it ends. You can set a custom start time for spans using the https://pkg.go.dev/go.opentelemetry.io/otel/trace#WithTimestamp option on start. The only remaining thing you would need to be able to do is to set the trace and span id. The only way I can think of to do this today is by providing a custom IDGenerator via https://pkg.go.dev/go.opentelemetry.io/otel/sdk/trace#WithIDGenerator. The IDGenerator has access to the context, which you could use to pass the trace ID and span ID you want to use. Don't pass it in using trace.ContextWithSpanContext, since then it will be used as the parent context. You will want to write your own ContextWithHardCodedSpanContext (or similar) to pass it in explicitly through the context. As for prior art, I and others in the K8s instrumentation SIG explored storing the span context in object annotations as part of kubernetes/enhancements#2312, but didn't move forward with the proposal.
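To make that suggestion concrete, here is a rough sketch of the IDGenerator approach, assuming the Go SDK. ContextWithForcedIDs, forcingIDGenerator, and EmitFinishedSpan are hypothetical names invented for illustration, not SDK APIs:

```go
package tracing

import (
	"context"
	"crypto/rand"
	"time"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

// forcedIDsKey is a private context key, so the IDs are not mistaken for a
// parent span context by the SDK.
type forcedIDsKey struct{}

type forcedIDs struct {
	traceID trace.TraceID
	spanID  trace.SpanID
}

// ContextWithForcedIDs is the "ContextWithHardCodedSpanContext" idea from the
// comment above (the name is ours, not an SDK API).
func ContextWithForcedIDs(ctx context.Context, tid trace.TraceID, sid trace.SpanID) context.Context {
	return context.WithValue(ctx, forcedIDsKey{}, forcedIDs{tid, sid})
}

// forcingIDGenerator implements sdktrace.IDGenerator: it uses the IDs from
// the context when present, and random IDs otherwise.
type forcingIDGenerator struct{}

func (forcingIDGenerator) NewIDs(ctx context.Context) (trace.TraceID, trace.SpanID) {
	if ids, ok := ctx.Value(forcedIDsKey{}).(forcedIDs); ok {
		return ids.traceID, ids.spanID
	}
	var tid trace.TraceID
	var sid trace.SpanID
	_, _ = rand.Read(tid[:])
	_, _ = rand.Read(sid[:])
	return tid, sid
}

func (forcingIDGenerator) NewSpanID(ctx context.Context, _ trace.TraceID) trace.SpanID {
	if ids, ok := ctx.Value(forcedIDsKey{}).(forcedIDs); ok {
		return ids.spanID
	}
	var sid trace.SpanID
	_, _ = rand.Read(sid[:])
	return sid
}

// EmitFinishedSpan recreates a span whose identity and start time were stored
// elsewhere (for Argo, e.g. in the workflow Custom Resource) and ends it with
// the real end time, so the whole span is generated only when it finishes.
func EmitFinishedSpan(tp *sdktrace.TracerProvider, name string, tid trace.TraceID, sid trace.SpanID, start, end time.Time) {
	ctx := ContextWithForcedIDs(context.Background(), tid, sid)
	_, span := tp.Tracer("argo-workflows").Start(ctx, name, trace.WithTimestamp(start))
	span.End(trace.WithTimestamp(end))
}
```

The generator would be installed with sdktrace.NewTracerProvider(sdktrace.WithIDGenerator(forcingIDGenerator{}), ...); combined with trace.WithTimestamp on start and end, the entire span is only constructed at the moment it finishes.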
Notes from SemCon meeting 2024-12-09
There are 2 ways to think about tracing. When tracing was designed (leading to the Dapper paper?), the use cases were mostly around short running operations. Long running traces can be a problem:
There are similar concerns with people who like to mark errors on traces: if you have, say, a message passing queue in the middle and the queue errors, you'd mark the whole trace as errored. The other thing is with CICD: I don't think you have a large trace problem with this, you just have a long trace problem. Your traces are no bigger than a complex distributed system's, they just take a long time, because some builds can be long. At least if you're doing the horrible ARM emulation build thing we were doing in GitHub, a build could be like 4h for no reason other than emulating ARM, and you have to deal with that. But the trace isn't any bigger. So I think you actually have an interesting use case here with long running traces where you're not pushing against concerns like tail sampling, where we have to keep giant amounts of data in memory for a very long time. I thought that it might be possible to solve this even with current SDKs (we can start a trace or span with an earlier start time, and it should be feasible to pass a given context, like passing context via HTTP headers). You probably can, yeah, it's probably just not ergonomic. You definitely could do it from Java, and I'm pretty sure you can do it from Go as well.
Area(s)
area:cicd
Is your change request related to a problem? Please describe.
This is not strictly a semantic-convention discussion, but it's come about because of trying to produce traces for CI/CD, and the sem-conv CI/CD WG is the closest thing to a home for this discussion.
The problem I am trying to solve is producing long-duration spans from a kubernetes controller which may restart.
I'm specifically trying to solve this in argo-workflows, and I gave a talk at argocon (slides) which gives a lot of context to this. Argo Workflows can be and is used as a CI/CD tool, but also for many other batch and machine learning scenarios where traces of the flow of a workflow would be useful. I'm treating the trace as the lifespan of a workflow, which may be several hours (some run for days). There are child spans representing the phases of the workflow and the individual nodes within the DAG that is being executed.
Argo Workflows
The workflow controller is best thought of here as a kubernetes operator, in this case running something not that far removed from a kubernetes job, just that the job is a DAG rather than a single pod. In the usual kubernetes controller manner, this controller is stateless, and can therefore restart at any time. Any necessary state is stored in the workflow Custom Resource. This is how the controller currently works.
I've used the Go SDK to add OTLP tracing to Argo Workflows, which works unless the controller restarts.
Ideas
Delayed span transmission.
My initial thought
My problem seemed to be that the opentelemetry SDK does not allow me to resume a Span. I'm using the golang SDK, but this is fundamental to the operation of spans. Once a Tracer is shut down it will end all spans, and that's the end of my trace and spans. The spans will get transmitted as ended.
I therefore supposed you could put the SDK or a set of traces/spans into a mode where the end of the span didn't get transmitted at shutdown, and instead the span could be stored outside of the SDK to be transmitted later by the resumed controller once it started up. The SDK could then also facilitate "storing a span to a file".
This is possibly implementable right now with enough hacking, I haven't managed to get time for it.
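As a sketch of what that hacking might look like (assuming the Go SDK; the processor, file format, and field names below are invented for illustration, with sdktrace.SpanProcessor being the only real API involved): a span processor could persist the identity and start time of every in-flight span somewhere durable, so a restarted controller could re-emit the finished spans later, e.g. with the custom-start-time/IDGenerator approach from earlier in the thread.

```go
package tracing

import (
	"context"
	"encoding/json"
	"os"
	"time"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// persistedSpan is the minimal state needed to recreate a span later:
// identity plus start time (field names are ours).
type persistedSpan struct {
	TraceID   string    `json:"trace_id"`
	SpanID    string    `json:"span_id"`
	Name      string    `json:"name"`
	StartTime time.Time `json:"start_time"`
}

// persistingProcessor appends every started span to a file; in Argo's case
// the workflow Custom Resource would be a more natural home for this state.
type persistingProcessor struct {
	path string
}

func (p persistingProcessor) OnStart(_ context.Context, s sdktrace.ReadWriteSpan) {
	rec := persistedSpan{
		TraceID:   s.SpanContext().TraceID().String(),
		SpanID:    s.SpanContext().SpanID().String(),
		Name:      s.Name(),
		StartTime: s.StartTime(),
	}
	b, err := json.Marshal(rec)
	if err != nil {
		return
	}
	f, err := os.OpenFile(p.path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return
	}
	defer f.Close()
	_, _ = f.Write(append(b, '\n'))
}

func (persistingProcessor) OnEnd(sdktrace.ReadOnlySpan)      {}
func (persistingProcessor) Shutdown(context.Context) error   { return nil }
func (persistingProcessor) ForceFlush(context.Context) error { return nil }
```

It would be registered with sdktrace.WithSpanProcessor alongside the normal batch processor (the path is arbitrary here); the restarted controller would read the records back and generate the ended spans from them.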
Links
I could use span links and create multiple traces for a single workflow. This would crudely work, but I'd argue against it unless the presentation layer can effectively hide this from users.
The current UI for argo-workflows already has a basic view showing a timeline of a workflow. This won't display or concern the user with controller restarts.
The target audience for these spans is probably somewhat less technical than the existing audience for http microservice tracing. Having to explain why their trace is in multiple parts and that they'll just have to deal with it isn't ideal. Span metrics are a valuable tool here, and with the trace split into multiple linked traces they'd be much more complicated, or in some cases impossible, to produce.
It may be that the presentation layer can hide this - I have limited exposure to the variety of these in the market and how they deal with links.
Events
We could emit events for span start and end, and make it the "collector's" problem to correlate these and create spans. This is how GitHub Actions tracing works - thanks to @adrielp for telling me about this.
For this to work something has to retain state - either the "End Span" event contains enough to construct the whole span (e.g. start time) or the collector has to correlate a start event and an end event, so the start event needs storage. The collector storing state would be wrong - in a cloud native environment I'd not even expect the same collector to receive the end as received the start. I'm trying not to be opinionated on how you configure your collector, but maybe we have to be.
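As a rough illustration of the stateless variant (the payloads below are invented for this sketch, not GitHub Actions' actual webhook format): if the end event repeats the identity and start time carried by the start event, the collector can build the span from the end event alone and never has to store the start event.

```go
package events

import "time"

// Hypothetical start/end event payloads: the end event embeds everything the
// start event carried, so correlation in the collector becomes optional.
type SpanStartEvent struct {
	TraceID   string    `json:"trace_id"`
	SpanID    string    `json:"span_id"`
	ParentID  string    `json:"parent_span_id,omitempty"`
	Name      string    `json:"name"`
	StartTime time.Time `json:"start_time"`
}

type SpanEndEvent struct {
	SpanStartEvent
	EndTime time.Time `json:"end_time"`
	Status  string    `json:"status"`
}
```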
Protocol changes
A different approach to delaying span transmission, but with a similar goal.
We could change the protocol to allow a span end to be "ended due to shutdown", and then allow a future span end for the same span_id to end it properly. This probably just pushes the problem onwards to the collector or the storage backend to do correlation in a similar way to events, so isn't an improvement.
Describe the solution you'd like
I don't work in the telemetry business, and so I'm sure I'm missing other prior art.
I'm open to any mechanism to solve this, and would prefer we came up with a common pattern for this and other longer span/restartable binary problems. I believe these problems will also be there in some of the android/mobile and browser implementations, to which I have little visibility or understanding. Some of these proposed solutions may not work there, so coming up with a common solution which works for my use case and these would be ideal.
I hope this sparks a useful discussion.