feat: always output config reloader file even if application is not ready #1076
base: TheSpiritXIII/config-ready
I am a little confused, given configuration reloading can happen at any time during the Prometheus runtime. We also have the initContainer ensuring an empty config is present before Prometheus starts, to prevent transient crashlooping. Can you elaborate on why this is necessary, or why the other PR is failing without this?
@pintohutch thanks, I expanded on the PR description. Please let me know if it doesn't make sense!
Thanks a lot!
I think the description is outdated. We decided to reload by restarting the internal rule manager, so this PR is not strictly blocking security hardening. HOWEVER, I fully agree we should do this for other reasons (logic and faster scrape/recording).
I disagree with the way it's done here; I proposed some alternatives (:
cc @dashpole for knowledge sharing 👍🏽 and @bernot-dev for opinions (:
Skipping this for 0.14 (if my understanding is correct, we don't need this)
tempReloader := newReloader(nil, &url.URL{
	Scheme: "http",
	// Since Prometheus is not ready, we can't hit its reload URL, so hit our own listen address instead.
	Host: listenAddr.String(),
Yeah, I would advocate that we stop hacking like this (: It's a bit too much.
All the reloader code is upstream. I technically maintain it with the Thanos team. Either we fork/vendor that code or we fix it there; this is too hacky.
On top of that, literally nothing fails (as in, stops working) if you just run the normal reloader as we do now (without creating a new one), even if Prometheus is down. The reload will retry, increment a metric, and log an error, which is expected while Prometheus is starting slowly.
To me, all we need to do is one of the following:
- Actually do what was proposed over a year ago: just remove the readiness check. The same arguments for why it's not necessary apply now.
- Keep the readiness check, but start the reloader early in the background, if we still want to crash on readiness failures (I don't think that's useful).
- Change the reloader upstream so there is a library function that only generates the config. Easy.
- Embed the config-reloader code in rule-eval.
This change causes the config-reloader to start running before Prometheus is running.
The problem is that the config-reloader never runs if the Prometheus instance starts crash-looping. This can create a small delay before Prometheus actually starts scraping metrics. Since the config-reloader won't update configs until Prometheus is ready, today it has to wait for the Prometheus instance to recover first: Prometheus starts up with the old configuration and only afterwards is updated with the new one.
When we move flags to the configuration side, this becomes critical, because the rule-evaluator will not start up without the appropriate configuration. Since it never starts up, the config-reloader never writes the configuration. The two are essentially waiting for each other, like a deadlock. Without this change, PR #1059 fails.
As mentioned earlier, the config-reloader must hit a reload URL. As a "hack", I made it hit a readiness endpoint served by the reloader itself that functionally does nothing. Once Prometheus comes up, the reloader starts hitting Prometheus' readiness endpoint instead.
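The switch described above could be sketched as follows, mirroring the `listenAddr.String()` host in the review snippet. This is not the code in the PR: `noopReadyHandler` and `reloadTarget` are hypothetical names, and the endpoint path is passed in rather than assumed.

```go
package main

import (
	"net/http"
	"net/url"
)

// noopReadyHandler is a hypothetical endpoint served on the reloader's own
// listen address. It only returns 200, so a request aimed at it "succeeds"
// while Prometheus is still down.
func noopReadyHandler(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(http.StatusOK)
}

// reloadTarget picks where the reloader sends its request: its own no-op
// endpoint until Prometheus is ready, then Prometheus itself.
func reloadTarget(prometheusReady bool, selfHost, promHost, path string) *url.URL {
	host := selfHost
	if prometheusReady {
		host = promHost
	}
	return &url.URL{Scheme: "http", Host: host, Path: path}
}
```

This is the indirection the review objects to: the self-served endpoint exists only to satisfy the reloader, which is why the alternatives in the review avoid it.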