Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent indefinite unsent metrics loop in outpus.azure_monitor #15908

Open
Hr0bar opened this issue Sep 17, 2024 · 12 comments
Open

Prevent indefinite unsent metrics loop in outpus.azure_monitor #15908

Hr0bar opened this issue Sep 17, 2024 · 12 comments
Assignees
Labels
feature request Requests for new plugin and for new features to existing plugins

Comments

@Hr0bar
Copy link
Contributor

Hr0bar commented Sep 17, 2024

Use Case

Hi, we are using outputs.azure_monitor for more than 3 years now and only now we ran into this issue which triggered this feature request. Basically a global option to drop unsent metrics by a time limit instead of buffer size limit could be a solution, or output plugin specific solution could be added?

The issue we experienced:

For more than 30 minutes the Azure endpoint was Timeouting:

Since around Sep 13 07:00:00:
Sep 13 07:47:19 HOSTNAME.com telegraf[1305]: 2024-09-13T07:47:19Z E! [agent] Error writing to outputs.azure_monitor: Post "https://northeurope.monitoring.azure.com/subscriptions/CENSORED/metrics": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Sep 13 07:48:19 HOSTNAME.com telegraf[1305]: 2024-09-13T07:48:19Z E! [agent] Error writing to outputs.azure_monitor: Post "https://northeurope.monitoring.azure.com/subscriptions/CENSORED/metrics": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Sep 13 07:49:19 HOSTNAME.com telegraf[1305]: 2024-09-13T07:49:19Z E! [agent] Error writing to outputs.azure_monitor: Post "https://northeurope.monitoring.azure.com/subscriptions/CENSORED/metrics": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Sep 13 07:50:19 HOSTNAME.com telegraf[1305]: 2024-09-13T07:50:19Z E! [agent] Error writing to outputs.azure_monitor: Post "https://northeurope.monitoring.azure.com/subscriptions/CENSORED/metrics": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Sep 13 07:51:09 HOSTNAME.com telegraf[1305]: 2024-09-13T07:51:09Z E! [agent] Error writing to outputs.azure_monitor: Post "https://northeurope.monitoring.azure.com/subscriptions/CENSORED/metrics": read tcp 172.CENSORED.CENSORED.CENSORED:43962->20.50.65.82:443: read: connection timed out

Then Azure endpoint became responsive again, but since it was down for more than 30 minutes, all outputs Telegraf has buffered are now rejected by Azure, resulting in indefinite loop of never sent metrics:

Sep 13 07:52:10 HOSTNAME.com telegraf[1305]: 2024-09-13T07:52:10Z E! [agent] Error writing to outputs.azure_monitor: failed to write batch: [400] 400 Bad Request: {"error":{"code":"BadRequest","message":"'time' should not be older than 30 minutes and not more than 4 minutes in the future\r\n"}}
Sep 13 07:53:09 HOSTNAME.com telegraf[1305]: 2024-09-13T07:53:09Z E! [agent] Error writing to outputs.azure_monitor: failed to write batch: [400] 400 Bad Request: {"error":{"code":"BadRequest","message":"'time' should not be older than 30 minutes and not more than 4 minutes in the future\r\n"}}
Sep 13 07:54:09 HOSTNAME.com telegraf[1305]: 2024-09-13T07:54:09Z E! [agent] Error writing to outputs.azure_monitor: failed to write batch: [400] 400 Bad Request: {"error":{"code":"BadRequest","message":"'time' should not be older than 30 minutes and not more than 4 minutes in the future\r\n"}}
Sep 13 07:55:09 HOSTNAME.com telegraf[1305]: 2024-09-13T07:55:09Z E! [agent] Error writing to outputs.azure_monitor: failed to write batch: [400] 400 Bad Request: {"error":{"code":"BadRequest","message":"'time' should not be older than 30 minutes and not more than 4 minutes in the future\r\n"}}

...Forever until Telegraf restarts

For now we will drastically reduce the buffer size limit to avoid this. But wee add/remove metrics to monitoring all the time, its not feasible to recalculate proper buffer limits (metric_batch_size and metric_buffer_limit) every time to achieve desired unsent metrics time limit before dropping them, so this solution doesnt seem very "clean" ?

One extra note on documentation, we had to take a look at the Telegraf source code comments to find out an important info that oldest metrics are overwritten, which is important for us to know that reducing the buffer size will help us in most cases:

	// MetricBufferLimit is the max number of metrics that each output plugin
	// will cache. The buffer is cleared when a successful write occurs. When
	// full, the oldest metrics will be overwritten. This number should be a
	// multiple of MetricBatchSize. Due to current implementation, this could
	// not be less than 2 times MetricBatchSize.

The official docs say just:

metric_buffer_limit: Maximum number of unwritten metrics per output. Increasing this value allows for longer periods of output downtime without dropping metrics at the cost of higher maximum memory usage.

Which is not enough to make an educated configuration decision, perhaps it could be made more verbose ?

Expected behavior

Reliably configure to drop unsent metrics by time instead of buffer sizes. We add/remove metrics to monitoring all the time, its not feasible to recalculate proper buffer limits (metric_batch_size and metric_buffer_limit) every time to achieve desired unsent metrics time limit before dropping them.

Alternatively, a force drop of old metrics could occur when
E! [agent] Error writing to outputs.azure_monitor: failed to write batch: [400] 400 Bad Request: {"error":{"code":"BadRequest","message":"'time' should not be older than 30 minutes and not more than 4 minutes in the future\r\n"}}

is detected in Azure output plugin, as a one plugin only solution.

Alternatively, perhaps a documentation update in Azure output plugin could be added and be enough, warning all users that aggressive buffer sizes should be used to avoid this issue. The rationale is that a global option to drop by time limit could be an overkill / not warranted enough, but that is for the Telegraf project to decide :)

Actual behavior

Azure metrics / possibly other output destinations will reject metrics older than certain time limit. During network/other issues resulting into outputs not being sent, Telegraf will buffer metrics older than that time limit indefinitely in a loop where metrics older than the time limit are scheduled to be sent (at the top of the stack) indefinitely.

Additional info

Similar to #13273

@Hr0bar Hr0bar added the feature request Requests for new plugin and for new features to existing plugins label Sep 17, 2024
@Hr0bar
Copy link
Contributor Author

Hr0bar commented Sep 17, 2024

Actually just found #14928 where it says:

Currently, when a Telegraf output metric queue fills, either due to incoming
metrics being too fast or various issues with writing to the output, new
metrics are dropped and never written to the output.

While in code comments we see this:

	// MetricBufferLimit is the max number of metrics that each output plugin
	// will cache. The buffer is cleared when a successful write occurs. When
	// full, the oldest metrics will be overwritten. This number should be a
	// multiple of MetricBatchSize. Due to current implementation, this could
	// not be less than 2 times MetricBatchSize.

Im not so sure anymore that the workaround with aggressive buffer limit will help now, since its not very clear whether old are dropped, or old kept and new dropped, in which case aggressive buffer limit wont help.

@srebhan
Copy link
Member

srebhan commented Oct 2, 2024

The code comment is correct and the spec needs correction. Are you willing to put up a PR to fix the spec wording? I can confirm that actually the oldest metrics in the buffer are dropped (overwritten by new metrics) first.

@srebhan
Copy link
Member

srebhan commented Oct 2, 2024

Regarding your issue, I'm preparing a new way of selectively dropping metrics on outputs and I want to hold this up until this is ready. Is that OK for you?

@Hr0bar
Copy link
Contributor Author

Hr0bar commented Oct 2, 2024

Thanks, yes we are fine waiting for some more comprehensive solution. Ill have a look at PR for wording in the spec (and possibly also expanding the wording in "metric_buffer_limit" description in docs, perhaps even azure output plugin docs could use some mention of the Azure limit, some Azure users may not be aware).

@srebhan
Copy link
Member

srebhan commented Oct 16, 2024

@Hr0bar would love to get your feedback on PR #16034!

@Hr0bar
Copy link
Contributor Author

Hr0bar commented Oct 18, 2024

Well thats one way to solve it - by dropping all rejected metrics only, that would certainly work for this use case as well (rather than relying on time based drop).

But... I would assume this could be difficult to generalize for many output plugins, as I imagine some external systems (in this case Azure) reject whole batch, some provide info what specifically they rejected and why, some dont etc.

No idea how Azure monitor output behaves for example... is it sent one by one metric? or full batch? What happens when only one metric in full batch is rejected ? Do we drop whole batch (dont think so)? How do we know which one caused the reject by service error if its only one out of many?

@srebhan
Copy link
Member

srebhan commented Oct 18, 2024

Did you check the spec? It extends the current batch handling to be able to do exactly this, drop specific metrics of a batch or drop the whole batch or do whatever the output requires. Currently, you can either accept a batch (removing it from the buffer) or keep the batch (requeueing the whole thing on next write)...

@Hr0bar
Copy link
Contributor Author

Hr0bar commented Oct 21, 2024

Yes, understood that, and yes that would be a working solution :)

I was just thinking about how that solution can be achieved as someone with no understanding of how it currently works ( I was under the impression that now whole batch is sent as a single payload to output destination - and theorizing whether if that would be the case if it needs to be split and sent metric by metric instead to output destination, to know exactly which metrics get rejected one by one so we can drop only specific metrics instead of whole batch)

@srebhan
Copy link
Member

srebhan commented Nov 6, 2024

@Hr0bar I'm looking into this issue currently and I wonder how we know what the filter time should be? In your error message the valid time-range is 30 min ago and 4 min in the future but I cannot find that anywhere in the Azure documentation. Should this be a config option?

@Hr0bar
Copy link
Contributor Author

Hr0bar commented Nov 7, 2024 via email

@srebhan
Copy link
Member

srebhan commented Nov 11, 2024

So I think making this a config option is the right thing to do then with 20 min in the past and 4 min in the future (i.e. the intersection of both time-sets ;-)). Parsing this from the response might be another option but I do worry that this will be a fragile foot-gun...

@Hr0bar
Copy link
Contributor Author

Hr0bar commented Nov 11, 2024

Agree, its likely not stable on Azure side (both the values & possible error msg string), so configurable with conservative intersection default makes sense!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Requests for new plugin and for new features to existing plugins
Projects
None yet
Development

No branches or pull requests

2 participants