We had an incident (it seemed to coincide with the Google outage on Dec 14) where, for about 50 minutes, ALL requests to MetricService.CreateTimeSeries were failing.
When the API eventually recovered, the stackdriver sidecar attempted to send all of the outstanding data at once, hitting the quota limit for "Time series ingestion requests per minute".
Once this quota was hit, the sidecar was never able to recover. Eventually, the stackdriver container just stopped responding (high CPU usage, statusz not responding), with the final few log messages repeating:
At this point, there was no option but to restart the whole pod (prometheus-server + stackdriver), losing all unsent metrics.
Is there anything we're missing? Is this situation recoverable other than by restarting the pod (and losing all unsent metrics)?
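
For what it's worth, here is a rough sketch of the kind of client-side pacing we'd have expected to avoid this, i.e. spreading the backlog flush out under the per-minute quota instead of bursting it all at once. This is hypothetical illustration code, not the sidecar's actual implementation; `sendBatch` and the quota figure are made up:

```go
package main

import (
	"context"
	"log"
	"time"

	"golang.org/x/time/rate"
)

// sendBatch stands in for one CreateTimeSeries call draining the backlog
// (hypothetical helper, just for illustration).
func sendBatch(ctx context.Context, batch int) error {
	log.Printf("sending batch %d", batch)
	return nil
}

func main() {
	ctx := context.Background()

	// Assume a quota of N ingestion requests per minute and leave some
	// headroom (80%) so recovery traffic never hits the hard limit.
	const quotaPerMinute = 6000
	limiter := rate.NewLimiter(rate.Limit(quotaPerMinute*0.8/60.0), 1)

	for batch := 0; batch < 100; batch++ {
		// Wait blocks until the limiter allows the next request,
		// pacing the flush instead of sending everything immediately.
		if err := limiter.Wait(ctx); err != nil {
			log.Fatal(err)
		}
		if err := sendBatch(ctx, batch); err != nil {
			// On a quota error, back off rather than retrying in a
			// tight loop (which is what appeared to wedge the sidecar).
			time.Sleep(time.Minute)
		}
	}
}
```

With pacing like this, a 50-minute backlog would drain gradually and the quota error path would never become a hot loop.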