Implement batch submission for traces #469

Merged

Conversation

Collaborator

@nikita-tkachenko-datadog nikita-tkachenko-datadog commented Nov 7, 2024

Requirements for Contributing to this repository

  • Fill out the template below. Any pull request that does not include enough information to be reviewed in a timely manner may be closed at the maintainers' discretion.
  • The pull request must only fix one issue at a time.
  • The pull request must update the test suite to demonstrate the changed functionality.
  • After you create the pull request, all status checks must pass before a maintainer reviews your contribution. For more details, please see CONTRIBUTING.

What does this PR do?

Adds batch submission of traces for Agentless and Agentfull modes

Traces are submitted in batches instead of each event being submitted in a separate HTTP request.
For now, batching is disabled by default; it can be enabled by setting the DD_JENKINS_ENABLE_TRACES_BATCHING environment variable to true.

Extends diagnostic flare with health stats for async submitters

Various metrics are added, such as:

  • number of submitted logs and traces
  • number of dropped logs and traces
  • logs and traces batch size histogram
  • timing of logs and traces submit and dispatch
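
As a rough illustration of the kind of stats listed above, here is a minimal sketch that uses only JDK classes. The class name `SubmitterStats`, its methods, and the histogram bucket bounds are all assumptions made for this example; the PR itself appears to rely on `Timer`, `Meter`, `Gauge`, and `Histogram` types from a metrics library, which this sketch does not reproduce.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical, JDK-only stand-in for the health stats of one async submitter. */
final class SubmitterStats {
    // Upper bounds of the batch size buckets (assumed values); the last bucket is open-ended.
    private static final int[] BOUNDS = {1, 10, 100, 500};

    private final LongAdder submitted = new LongAdder();
    private final LongAdder dropped = new LongAdder();
    private final LongAdder[] batchSizeBuckets = new LongAdder[BOUNDS.length + 1];
    private final LongAdder dispatchNanos = new LongAdder();
    private final LongAdder dispatchCount = new LongAdder();

    SubmitterStats() {
        for (int i = 0; i < batchSizeBuckets.length; i++) {
            batchSizeBuckets[i] = new LongAdder();
        }
    }

    /** Counts one submission attempt as either accepted or dropped. */
    void recordSubmit(boolean accepted) {
        (accepted ? submitted : dropped).increment();
    }

    /** Records the size of a dispatched batch and how long the dispatch took. */
    void recordDispatch(int batchSize, long elapsedNanos) {
        int bucket = 0;
        while (bucket < BOUNDS.length && batchSize > BOUNDS[bucket]) {
            bucket++;
        }
        batchSizeBuckets[bucket].increment();
        dispatchNanos.add(elapsedNanos);
        dispatchCount.increment();
    }

    long submitted() { return submitted.sum(); }
    long dropped() { return dropped.sum(); }
    long batchesInBucket(int bucket) { return batchSizeBuckets[bucket].sum(); }

    double meanDispatchMillis() {
        long n = dispatchCount.sum();
        return n == 0 ? 0.0 : dispatchNanos.sum() / 1_000_000.0 / n;
    }

    public static void main(String[] args) {
        SubmitterStats stats = new SubmitterStats();
        stats.recordSubmit(true);
        stats.recordSubmit(false);
        stats.recordDispatch(500, 2_000_000L);
        System.out.println("submitted=" + stats.submitted()
                + " dropped=" + stats.dropped()
                + " meanDispatchMs=" + stats.meanDispatchMillis());
    }
}
```

`LongAdder` is used here because counters of this kind are written from many threads and read rarely, which is the case it is designed for.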

Increases the default batch size for logs and traces

Batch size is increased from 100 to 500, as stress testing showed that larger batch sizes increase throughput.
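
To make the effect of the batch size concrete, the sketch below drains a queue in chunks of at most 500 items, so that each chunk could be sent as one request. It is an illustration under assumptions, not the plugin's code: `BatchDrainSketch` and `drainInBatches` are hypothetical names, and the HTTP call is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class BatchDrainSketch {
    // Default batch size named in the PR description.
    static final int BATCH_SIZE = 500;

    /** Drains the queue into batches of at most maxBatch items until it is empty. */
    static <T> List<List<T>> drainInBatches(BlockingQueue<T> queue, int maxBatch) {
        List<List<T>> batches = new ArrayList<>();
        while (true) {
            List<T> batch = new ArrayList<>(maxBatch);
            queue.drainTo(batch, maxBatch); // moves up to maxBatch items without blocking
            if (batch.isEmpty()) {
                return batches;
            }
            batches.add(batch); // a real sender would issue one HTTP request per batch here
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < 1200; i++) {
            queue.add("span-" + i);
        }
        List<List<String>> batches = drainInBatches(queue, BATCH_SIZE);
        System.out.println(batches.size() + " batches for 1200 queued items");
    }
}
```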

Updates the default behaviour of the async submission of logs and traces to not block

When a log or a trace element is submitted, it is added to a queue that is processed asynchronously by a separate thread.
The queue has a limited size, so it is possible that it's full when a submission takes place.

The old logic blocked the submitting thread in such cases (5s for logs, 30s for traces).
Blocking the thread has a disadvantage: it slows down the execution of pipelines.

The new logic drops the submitted log or trace when the queue is full.

The queue is there to account for short bursts of logs/traces.
If it fills up, the consumer is likely consistently slower than the producer, and blocking on submission would then cause severe performance degradation.
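
The drop-on-full behaviour can be sketched with a bounded `ArrayBlockingQueue` and the non-blocking `offer` call. This is a minimal sketch, not the plugin's implementation: the class `NonBlockingSubmitSketch` and its method names are made up for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

final class NonBlockingSubmitSketch<T> {
    private final BlockingQueue<T> queue;
    private final AtomicLong dropped = new AtomicLong();

    NonBlockingSubmitSketch(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Never blocks the caller: returns false and counts a drop when the queue is full. */
    boolean submit(T item) {
        boolean accepted = queue.offer(item); // offer() returns immediately, unlike put()
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    long dropped() { return dropped.get(); }
    int queued() { return queue.size(); }

    public static void main(String[] args) {
        // No consumer is running, so everything beyond the capacity is dropped.
        NonBlockingSubmitSketch<String> submitter = new NonBlockingSubmitSketch<>(2);
        for (int i = 0; i < 5; i++) {
            submitter.submit("log-" + i);
        }
        System.out.println("queued=" + submitter.queued() + " dropped=" + submitter.dropped());
    }
}
```

The submitting thread pays only the cost of one `offer` call, so a slow consumer shows up as a growing drop count rather than as slower pipelines.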

Description of the Change

Alternate Designs

Possible Drawbacks

Verification Process

Additional Notes

Release Notes

Review checklist (to be filled by reviewers)

  • Feature or bug fix MUST have appropriate tests (unit, integration, etc...)
  • PR title must be written as a CHANGELOG entry (see why)
  • Files changes must correspond to the primary purpose of the PR as described in the title (small unrelated changes should have their own PR)
  • PR must have one changelog/ label attached. If applicable it should have the backward-incompatible label attached.
  • PR should not have do-not-merge/ label attached.
  • If Applicable, issue must have kind/ and severity/ labels attached at least.

@nikita-tkachenko-datadog nikita-tkachenko-datadog added the changelog/Fixed label ("Fixed features results into a bug fix version bump") Nov 7, 2024
@nikita-tkachenko-datadog nikita-tkachenko-datadog marked this pull request as ready for review November 14, 2024 19:19
@nikita-tkachenko-datadog nikita-tkachenko-datadog changed the title from "Fix logs and traces async submission to avoid blocking jobs execution when lagging behind" to "Implement batch submission for traces" Nov 15, 2024
Collaborator

@drodriguezhdez drodriguezhdez left a comment

Dropped some comments

pom.xml (conversation resolved)
import java.util.logging.Logger;
import java.util.zip.GZIPOutputStream;

public class CompressedBatchSender<T> implements JsonPayloadSender<T> {
Collaborator

Would it be possible to have unit tests for this class?

Collaborator Author

Makes sense, added unit tests
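
The body of `CompressedBatchSender` is not shown in this thread. Going only by its name and the `GZIPOutputStream` import, one plausible shape for the core step, and for a round-trip check of the kind a unit test might make, is sketched below. `GzipBatchSketch`, `compressBatch`, and `decompress` are hypothetical names, and the JSON-array payload format is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipBatchSketch {
    /** Joins already-serialized JSON objects into one JSON array and gzips the result. */
    static byte[] compressBatch(List<String> jsonObjects) {
        String payload = "[" + String.join(",", jsonObjects) + "]";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the GZIP stream writes the trailer, so it must happen before toByteArray().
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(payload.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Inverse of compressBatch, useful for checking a payload in a test. */
    static String decompress(byte[] gzipped) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] body = compressBatch(List.of("{\"id\":1}", "{\"id\":2}"));
        System.out.println(decompress(body));
    }
}
```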

@@ -176,7 +174,7 @@ private static boolean postLogs(HttpClient httpClient, String logIntakeUrl, Secr

  byte[] body = payload.getBytes(StandardCharsets.UTF_8);
  try {
- httpClient.postAsynchronously(logIntakeUrl, headers, "application/json", body);
+ httpClient.post(logIntakeUrl, headers, "application/json", body, Function.identity());
Collaborator

Is the private static postLogs method only used in the validateLogIntakeConnection method?

If so, should we move all of its logic into validateLogIntakeConnection? It was unexpected to read postLogs and see that it's only invoked in the validation.

Collaborator Author

Good catch, I think it was used by some other logic in the past, now it's only used for validation. Inlined the method.


@Override
public int order() {
return 0;
Collaborator

It would be great to keep the order of every FlareContributor in a static variable. Currently, I can't tell whether a given number is used elsewhere without checking all the FlareContributor implementations.

Collaborator Author

It's not that much of a problem if the order is the same, this only affects sorting in UI. Still a valid point, added some named constants for the orders.
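
For illustration only, named order constants could look roughly like the sketch below. Everything in it is hypothetical: the `Contributor` interface merely stands in for the plugin's own type, and the constant names and values are invented for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class FlareOrderSketch {
    /** Hypothetical stand-in for the plugin's contributor interface. */
    interface Contributor {
        String name();
        int order();
    }

    /** All orders live in one place, so a clash is visible without opening every implementation. */
    static final class Order {
        static final int PLUGIN_CONFIG = 0;
        static final int SUBMITTER_HEALTH = 1;
        static final int THREAD_DUMP = 2;

        private Order() {}
    }

    static Contributor contributor(String name, int order) {
        return new Contributor() {
            @Override public String name() { return name; }
            @Override public int order() { return order; }
        };
    }

    /** Returns contributor names sorted by their order, as a UI listing might. */
    static List<String> namesInOrder(List<Contributor> contributors) {
        List<Contributor> sorted = new ArrayList<>(contributors);
        sorted.sort(Comparator.comparingInt(Contributor::order));
        List<String> names = new ArrayList<>();
        for (Contributor c : sorted) {
            names.add(c.name());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(namesInOrder(List.of(
                contributor("threads", Order.THREAD_DUMP),
                contributor("config", Order.PLUGIN_CONFIG),
                contributor("health", Order.SUBMITTER_HEALTH))));
    }
}
```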

- private static final int DEFAULT_QUEUE_CAPACITY = 5_000;
- private static final int DEFAULT_SUBMIT_TIMEOUT_SECONDS = 5;
+ private static final int DEFAULT_QUEUE_CAPACITY = 10_000;
+ private static final int DEFAULT_SUBMIT_TIMEOUT_SECONDS = 0;
Collaborator

What's the impact of no timeout? Is it an infinite timeout?

Collaborator Author

The opposite actually, with 0 timeout any trace/log that does not fit into the queue will be dropped immediately.
The idea is to avoid blocking the client's code and the Jenkins core. If the queue fills up, it means traces/logs are being produced faster than they're being dispatched, and a timeout will not resolve this problem, just postpone it.
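
The difference between a zero and a positive timeout can be checked against the JDK's `BlockingQueue.offer(e, timeout, unit)`. Whether the plugin's queue uses exactly this call is an assumption; the sketch, with the hypothetical helper `blockedMillis`, only demonstrates the JDK semantics that a zero timeout means "do not wait" rather than "wait forever".

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

final class ZeroTimeoutSketch {
    /** Offers to a full queue and returns how long the call blocked, in milliseconds. */
    static long blockedMillis(long timeoutMillis) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1);
        queue.add("already-queued"); // the queue is now full and no consumer is running

        long start = System.nanoTime();
        boolean accepted;
        try {
            accepted = queue.offer("new-trace", timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        if (accepted) {
            throw new IllegalStateException("the queue should have been full");
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        System.out.println("timeout 0 ms blocked for " + blockedMillis(0) + " ms");
        System.out.println("timeout 200 ms blocked for " + blockedMillis(200) + " ms");
    }
}
```

With a positive timeout, the caller waits out the full timeout and the item is still rejected, which matches the point above that a timeout only postpones the drop when the consumer cannot keep up.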

Comment on lines +37 to +41
private final Timer submit;
private final Meter submitDropped;
private final Timer dispatch;
private final Gauge<Integer> queueSize;
private final Histogram batchSize;
Collaborator

This is for internal usage right? We're not reporting these metrics to Datadog.

Collaborator Author

Correct, this only shows up in diagnostic flares for now. It'd be nice to integrate this with actual telemetry, but this is too much of an effort for now.

@nikita-tkachenko-datadog nikita-tkachenko-datadog merged commit 4687cf3 into master Nov 20, 2024
19 checks passed
@nikita-tkachenko-datadog nikita-tkachenko-datadog deleted the nikita-tkachenko/fix-async-submit-blocks branch November 20, 2024 09:40
Labels
changelog/Fixed Fixed features results into a bug fix version bump
2 participants