Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify how export retry should be implemented #4138

Open
brettmc opened this issue Jul 10, 2024 · 4 comments
Open

Clarify how export retry should be implemented #4138

brettmc opened this issue Jul 10, 2024 · 4 comments
Labels
spec:trace Related to the specification/trace directory triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted

Comments

@brettmc
Copy link
Contributor

brettmc commented Jul 10, 2024

What are you trying to achieve?

I am trying to define exporter retry config in file-based configuration: open-telemetry/opentelemetry-configuration#97

The spec says that the client SHOULD implement an exponential backoff strategy between retries, but doesn't say anything further.

The consequence of this is that multiple SIGs have implemented an exponential backoff retry strategy, in numerous ways, and using a variety of inputs. Ultimately, I think it means we cannot have a common way to configure the retry strategy of an exporter, because we don't have a common set of inputs.

This is particularly problematic for automatic configuration via file-based configuration, which is trying to unify and extend SDK configuration beyond what is possible with environment variables. It's also problematic for environment variable-based configuration, although less so because SIGs are able to implement their own variables using the language-specific OTEL_<LANG>_FEATURE.

What I would like to see is that we decide whether a retry strategy should be configurable, and if so work out if it's possible to agree a common set of parameters to control the strategy, so that we can expose to users a mechanism to control the behaviour.

Alternatively, since we do have [https://opentelemetry.io/docs/languages/sdk-configuration/otlp-exporter/#timeout-configuration](timeout configuration), do we make that (ie, the total time spent) the only input variable, and allow language implementations to choose their own, non-configurable, implementation provided they observe that one deadline?

Related issues:

Additional context.

Some examples of the inputs to retry strategy from different SIGs:

Go:

Enabled bool
InitialInterval time.Duration
MaxInterval time.Duration
MaxElapsedTime time.Duration

PHP:

initial_delay
max_retries

Python:

max_value (for an exponentially increasing delay time)

Java:

maxAttempts
initialBackoff
maxBackoff
backoffMultiplier

Java mentions specifically in docs that there is "currently no way to customize [the retry policy]": https://github.com/open-telemetry/opentelemetry-java/blob/9543a3451851d82f2f9d8e1b0cd78e2cc133b1a5/sdk-extensions/autoconfigure/README.md#otlp-exporter-retry

@brettmc brettmc added the spec:trace Related to the specification/trace directory label Jul 10, 2024
@danielgblanco danielgblanco added the triage:deciding:tc-inbox Needs attention from the TC in order to move forward label Jul 15, 2024
@carlosalberto
Copy link
Contributor

IIRC we left that like that to allow SIGs to implement whatever made sense in the language itself, but we should definitely try to (at least) standardize a few params, such as maxBackoff and enabled (or strategy=exponential_retry|none, to support future strategies?)

@brettmc
Copy link
Contributor Author

brettmc commented Jul 20, 2024

I really like the idea of a retry policy being an interface, so that it can be expanded in future with either official or contrib/3rd party policies. Perhaps spec begins with both exponential_retry and none implementations of a retry policy? I feel that having a none policy implementation is a cleaner interface than each policy needing to implement a disabled?

@jack-berg
Copy link
Member

Also related to #1742

@jack-berg
Copy link
Member

We discussed in the 8/7/14 and 8/14/24 TC meetings. I wrote a document summarizing a number of somewhat overlapping OTLP retry issues, and sketching out some proposals on how to fix them.

There was consensus on a number of things:

However, on this specific issue, its not clear that the diverging implementations of the retry backoff algorithm are enough of a problem to justify the time investment needed to standardize. Most users are probably fine with the default backoff algorithm and won't touch whatever configuration options each implementation provides. Adding "community feedback" label to this issue to solicit more info about whether the current situation is problematic enough to address.

@jack-berg jack-berg added triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted and removed triage:deciding:tc-inbox Needs attention from the TC in order to move forward labels Aug 14, 2024
@github-actions github-actions bot added the triage:followup Needs follow up during triage label Sep 19, 2024
@mtwo mtwo removed the triage:followup Needs follow up during triage label Oct 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec:trace Related to the specification/trace directory triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted
Projects
None yet
Development

No branches or pull requests

5 participants