Make otel scheduler sync #2262
Conversation
The OTel contract states that Start must return quickly, so I suppose sync would be fine. Do we know if the Collector does this async or sync? It'd be best to just do what it does, to be on the safe side.
Regarding the issue with the batch processor: does it happen because something is wrong on the OTel side, or because #2027 maybe doesn't work for some edge case?
It might be good for @thampiotr to opine here, given that this PR reverts changes in #2027.
It's done sync in the otel collector. The service starts everything sync (https://github.com/open-telemetry/opentelemetry-collector/blob/main/service/service.go#L230) and it's called sync here (https://github.com/open-telemetry/opentelemetry-collector/blob/main/otelcol/collector.go#L228)
Piotr and I had a discussion about it here: https://raintank-corp.slack.com/archives/C02GSU8SHBN/p1733846257348439
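For context on the "Start must return quickly" contract mentioned above, here is a minimal sketch of the usual pattern: Start kicks off the long-running work in a goroutine and returns immediately, which is what makes it safe for a caller (such as the collector's service) to start components one after another in a plain synchronous loop. The interface shape and names here are simplified stand-ins, not the real go.opentelemetry.io/collector types.

```go
package example

import "context"

// receiver is a toy component illustrating the contract: Start returns
// quickly and the long-running work happens in a goroutine.
type receiver struct {
	done chan struct{}
}

func (r *receiver) Start(ctx context.Context) error {
	r.done = make(chan struct{})
	go r.run() // long-running work lives here, not in Start itself
	return nil
}

func (r *receiver) run() {
	// Pretend to receive data until Shutdown is called.
	<-r.done
}

func (r *receiver) Shutdown(ctx context.Context) error {
	close(r.done)
	return nil
}
```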
Do we need to add a changelog entry? If it's a problem end users could see, then mentioning the error in the changelog could indicate to them that they shouldn't see it again.
It doesn't revert all the changes; it reuses the approach.
Looks good, I like it more. Thanks for fixing @wildum 🚀
// A message is already queued for refreshing running components so we
// don't have to do anything here.
}
level.Debug(cs.log).Log("msg", "scheduling components", "count", len(cc))
Suggested change:
- level.Debug(cs.log).Log("msg", "scheduling components", "count", len(cc))
+ level.Debug(cs.log).Log("msg", "scheduling otelcol components", "count", len(cc))
It's quite an ambiguous log line, but if you think there is value in keeping it, I'm happy to have it as a debug log.
* Don't start components until Run is called
* Update consumers after stopping the component
* Minor fixes
The fix in #2027 did not address all the race conditions that could happen with the scheduler. We noticed this when getting rid of our Otel fork brought back the flakiness of a batch processor test.
With this change, scheduling otel components is no longer an async operation. This simplifies the logic and ensures that the components are started before they can start consuming.
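To make the change concrete, here is a minimal sketch of what synchronous scheduling looks like. This is not the actual Alloy scheduler code; the Component and Scheduler types and the Schedule method are simplified, illustrative names. The point is that Start is called inline for every component, so everything is running by the time Schedule returns and consumers can safely be updated.

```go
package scheduler

import (
	"context"
	"fmt"
)

// Component mirrors the shape of an OTel collector component: Start is
// expected to return quickly, with long-running work done in goroutines
// that Start itself spawns.
type Component interface {
	Start(ctx context.Context) error
	Shutdown(ctx context.Context) error
}

type Scheduler struct {
	running []Component
}

// Schedule shuts down the previously scheduled components and starts the
// new set synchronously. Because Start is called inline, the components
// are fully started before Schedule returns, so nothing can consume data
// from a component that has not finished starting.
func (s *Scheduler) Schedule(ctx context.Context, components []Component) error {
	for _, c := range s.running {
		if err := c.Shutdown(ctx); err != nil {
			return fmt.Errorf("shutting down component: %w", err)
		}
	}
	s.running = s.running[:0]

	for _, c := range components {
		if err := c.Start(ctx); err != nil {
			return fmt.Errorf("starting component: %w", err)
		}
		s.running = append(s.running, c)
	}
	return nil
}
```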