Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: always output config reloader file even if application is not ready #1076

Open
wants to merge 1 commit into
base: TheSpiritXIII/config-ready
Choose a base branch
from

Conversation

TheSpiritXIII
Copy link
Member

@TheSpiritXIII TheSpiritXIII commented Jul 16, 2024

This change causes the config-reloader to start running before Prometheus is running.

The problem is that the config-reloader never runs if the Prometheus instance starts crash-looping. This can create a small delay before Prometheus actually starts scraping metrics. Since the config-reloader won't update configs until Prometheus is ready, the config-reloader today has to wait for the Prometheus instance to recover first: the Prometheus instance will startup with the old configuration, and only after will be updated with the new configuration.

When we move flags to the configuration-side, this becomes critical. This is because the rule-evaluator will not startup without the appropriate configuration. Since it never starts up, the config-reloader will never write the configuration. They are essentially both waiting for each other, like a deadlock. Without this, this PR fails: #1059

As mentioned earlier, the config-reloader must hit a reload URL. As a "hack", I made it hit a readiness endpoint served by itself that functionally does nothing. When Prometheus comes up, it starts hitting the Prometheus' readiness endpoint.

@TheSpiritXIII TheSpiritXIII force-pushed the TheSpiritXIII/config-ready branch 2 times, most recently from 6c4038c to 42d972c Compare July 17, 2024 14:56
@TheSpiritXIII TheSpiritXIII force-pushed the TheSpiritXIII/config-output branch from c2dac78 to 2ba5c2e Compare July 17, 2024 19:22
@TheSpiritXIII TheSpiritXIII marked this pull request as ready for review July 31, 2024 14:02
@pintohutch
Copy link
Collaborator

I am a little confused, given configuration reloading can happen at any time duration of the Prometheus runtime. We also have the initContainer ensuring an empty config is present before Prometheus starts to prevent transient crashlooping.

Can you elaborate on why this is necessary or why the other PR is failing without this?

@TheSpiritXIII TheSpiritXIII force-pushed the TheSpiritXIII/config-output branch from 2ba5c2e to de41003 Compare August 2, 2024 14:38
@TheSpiritXIII
Copy link
Member Author

@pintohutch thanks, I expanded on the PR description. Please let me know if it doesn't make sense!

Copy link
Collaborator

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!

I think the description is out-dated. We decided to reload by restarting internal rule manager, so this PR is not strictly blocking security hardening. HOWEVER I fully agree we should do this for other reasons (logic and faster scrape/recording).

I disagree with the way it's done here, I proposed some alternatives (:

cc @dashpole for knowledge sharing 👍🏽 and @bernot-dev for opinions (:

Skipping this for 0.14 (if my understanding is correct, we don't need this)

tempReloader := newReloader(nil, &url.URL{
Scheme: "http",
// Since Prometheus is not ready, we won't be able to hit the reload URL so hit itself.
Host: listenAddr.String(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I would advocate to stop hacking like that (: Bit too much.

All the reloader code is upstream. I technically maintain it with Thanos team. Either we fork/vendor that code or we fix there, this is too hacky.

On top of that literally nothing technically fails (as is - stops working) if you just run normal reloader as we have now (without creating new) even if Prometheus is down. The reloading will retry, increment metric and log error. - expected if Prometheus is slowly starting.

To me all we need to do is either:

  1. Actually do what has been proposed 1y+ ago, just remove readiness check. Same arguments why it's not necessary apply now.
  2. We could keep readiness check but start reloader early in the background if we want readiness crash (I don't think that's useful)
  3. Change reloader upstream, so there is a lib function that only generates config, easy.
  4. Embed config-reloader code to rule-eval.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants