docs(spec): Add specification for output rate limiting #15665

Closed
49 changes: 49 additions & 0 deletions docs/specs/tsd-008-output-rate-limiting.md
@@ -0,0 +1,49 @@
# Output rate limiting

## Objective

Allow users to control the rate at which outputs send metrics.

## Keywords

output plugins, rate limit, buffer

## Overview

Output plugins send metrics to their corresponding services respecting the
configured `metric_batch_size` and `flush_interval`. While this works well
in most situations, an output can occasionally send a large number of metrics
in a short time span, e.g. when one or more inputs gather a large number of
metrics in a short amount of time or when an output reconnects to its service
after a longer disconnect.
In all of those cases, a large number of batches are prepared and sent via the
output plugin to its service, potentially overwhelming the service with the
number of metrics sent.
Furthermore, use cases exist where operators want to provision limited resources
to Telegraf and, in turn, want to control the data rate to a service.

This specification intends to introduce an _optional_ rate-limiting feature,
configurable per output, to gain control of the sending rate of output plugins.
To this end, a new `metric_rate_limit` setting is proposed that sets the
maximum number of metrics sent __per second__ via an output. By default, the
metric rate must be unlimited.
Comment on lines +27 to +29
Contributor

My only concern is that in many cases a cloud or database or service will put limits in terms of "MB per second". Here is some prior art from InfluxDB: influxdata/influxdb#19660

The logical next question from a user is "how do I know how many metrics translate to N MB per second".

Member Author

I think this must be done in the output plugin as the message size depends on the serialization of the metrics and this is NOT known to the framework. The proposed solution is to limit the peak sends of metrics, not about server-side rate limiting.

Contributor

> I think this must be done in the output plugin as the message size depends on the serialization of the metrics and this is NOT known to the framework.

I have no disagreement with this and certainly wasn't implying moving this out of the output plugin.

> The proposed solution is to limit the peak sends of metrics, not about server-side rate limiting.

Correct, but one of the reasons people want this option is that the service they are sending to will have limits on how much it can accept. For example, using the influxdb output, we know some free plans are limited to 17 kb/s. How can a user take that limit and apply it to their Telegraf config?

I don't see customers or users asking for "limit to N metrics per second". Rather it is a data rate per second.

Contributor

Maybe I am getting this convoluted by trying to solve two things at once:

  1. the scenario where telegraf reconnects to an output, and then sends a DOS-like set of metric batches to "catch up"
  2. the scenario where telegraf sends too many metrics over time to an output

Your proposal is clearly tackling the first, but I see users grabbing this option to tackle 2 as well.

Can we consider 2 to avoid yet another option?


In case the specified rate limit is reached, a smaller batch satisfying the
limit is sent or, if no metrics are left within the limit, the write cycle is
skipped by the output. The user should be informed in the logs when the rate
limit applies.

## Caveats

It is important to note that setting a metric rate limit poses a severe
constraint on an output, so the feature should be used carefully. Please make
sure the configured metric rate limit exceeds the average rate at which inputs
gather metrics.
If the limit is set too low, i.e. below the average rate at which inputs
gather metrics, the output might not be able to send the metrics fast
enough. In turn, the metrics buffer will fill up and metrics will be dropped.
Telegraf might not be able to recover from this situation if the output
rate is permanently below the input rate.

## Related Issues

- [#15353](https://github.com/influxdata/telegraf/issues/15353) rate limiting processor proposal