We had an incident (it seemed to coincide with the Google outage on Dec 14) where, for about 50 minutes, ALL requests to MetricService.CreateTimeSeries were failing.
When the API eventually recovered, the stackdriver sidecar attempted to send all of the outstanding data at once, hitting the quota limit for "Time series ingestion requests per minute".
Once this quota was hit, the sidecar was never able to recover. Eventually, the stackdriver container just stopped responding (high CPU usage, statusz not responding), with the final few log messages repeating:
At this point, there was no option but to restart the whole pod (prometheus-server + stackdriver), losing all unsent metrics.
Is there anything we're missing? Is this situation recoverable other than by restarting the pod (and losing all unsent metrics)?
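
For what it's worth, here is a rough sketch of the kind of client-side pacing we'd have expected to avoid this, i.e. spreading the backlog flush out under the per-minute quota instead of bursting it all at once. This is hypothetical illustration code, not the sidecar's actual implementation; `sendBatch` and the quota figure are made up:

```go
package main

import (
	"context"
	"log"
	"time"

	"golang.org/x/time/rate"
)

// sendBatch stands in for one CreateTimeSeries call draining the backlog
// (hypothetical helper, just for illustration).
func sendBatch(ctx context.Context, batch int) error {
	log.Printf("sending batch %d", batch)
	return nil
}

func main() {
	ctx := context.Background()

	// Assume a quota of N ingestion requests per minute and leave some
	// headroom (80%) so recovery traffic never hits the hard limit.
	const quotaPerMinute = 6000
	limiter := rate.NewLimiter(rate.Limit(quotaPerMinute*0.8/60.0), 1)

	for batch := 0; batch < 100; batch++ {
		// Wait blocks until the limiter allows the next request,
		// pacing the flush instead of sending everything immediately.
		if err := limiter.Wait(ctx); err != nil {
			log.Fatal(err)
		}
		if err := sendBatch(ctx, batch); err != nil {
			// On a quota error, back off rather than retrying in a
			// tight loop (which is what appeared to wedge the sidecar).
			time.Sleep(time.Minute)
		}
	}
}
```

With pacing like this, a 50-minute backlog would drain gradually and the quota error path would never become a hot loop.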