interface to collect necessary stats for building a k8s readiness probe. #1122
Comments
Hi, indeed, we expose a Prometheus endpoint for this usage, and metrics about workers will be available in the next release: #966.
Great! Thanks for the pointers! Those additions are timely for me :) Is it possible to get:
Thanks, I read that in the PR but I don't see the above-mentioned stats in there. Am I missing something? I am essentially looking for some direct insight into the backlog (specifically the backlog of requests that will be handled by PHP workers, which may be impossible to know pre-emptively; I am unsure of the internals at play).
It's possible to know, but not possible for metrics to report in any stable way. For example, if your Prometheus scrape happens to land between the time a request is sent to the workers and the time it is picked up by a worker, the number will be greater than zero, even though that is really just bad timing. If this is fine, then we can add it. But it may flap around quite a bit and not be useful, although I guess the "average trend" could be useful.
Yeah, understood. There are a lot of factors at play that could swing that number dramatically. The average time I mentioned could help mitigate the problems you have pointed out with the raw count, especially if values were exported across different time buckets/spans. There is definitely a lot of nuance to which numbers to show, so further discussion is definitely warranted IMO. For example, maybe a better number is how many requests could not be immediately dispatched in the last 1 second, 3 seconds, or 5 seconds. Having said that, the "current number" is still very useful, and k8s provides the basic knobs to really help with that: the ability to set the frequency at which the check is run, plus the rise/fall (success/failure) thresholds. Using those 3 pieces it is possible to create a liveness probe which very closely mimics an average over a desired period of time. I also think the numbers could prove incredibly valuable for really locking in HPA logic as well; closely related to liveness, but a very distinct concept. Thanks for the consideration!
Looking at how things work, this would put metrics directly into the "hot path" of request handling, which I would like to avoid. However, we can passively detect whether workers are stalling (requests are coming in faster than workers can respond) and how bad it is. So what about a metric like:
Where the number shows a % over the last 1-5 seconds. This number is a representation of how full the worker request buffer is. In low utilization it is always 0. Once it goes above zero, latency tends to grow exponentially in my experiments. What do you think about that?
I think that number is fantastic! Great idea. Which number goes into the hot path? Is it trying to gather the amount of time each of the stalled requests has been waiting? If I understand the number correctly, I think there is one piece of missing context which would be incredibly useful (either as a separate metric or encapsulated in a different metric/name altogether): the magnitude of the problem. I suppose this would only be relevant when the proposed stalled metric is 1; if I am 100% stalled, how big is the backlog? I am completely new to frankenphp, so I am still a little foggy about how the dispatching/queue works and may be off in my thinking; please correct my understanding as necessary. For example, I don't know if the requests are simply round-robined to each worker as they come in, or if there is more intelligent scheduling in front of the workers (i.e., do I technically have 1 backlog, or N backlogs, one per worker, with least-connection style logic, etc.)?
Linking to related issue: #269.
@travisghansen it looks like this is already doable due to a 'logic bug'.
Describe your feature request
Is your feature request related to a problem? Please describe.
I am very new to frankenphp so I could certainly have missed something in the docs that allows me to do this.
I am trying to deploy frankenphp to k8s. I would like to sanely configure readiness probes to keep instances from getting overloaded with requests. I think this would predominantly come down to having 2 key metrics:
Additional metrics may be good:
Describe the solution you'd like
I would like to have some sort of interface (http, cli that can be invoked, etc) which would export the above and perhaps more. I would then write a script to be executed which would retrieve that info, allow for some threshold of queued requests waiting for a thread, and if above the threshold k8s would stop sending traffic to that instance until things settle a bit.
Ideally the metrics would exclude any requests that are not destined for PHP workers (i.e. static files, etc.).
Perhaps the caddy admin port already has some of this data?
Describe alternatives you've considered
I can have an endpoint in my app, but that seems less than ideal as each check would itself take up a thread/worker.