
Reload is called unnecessarily #342

Closed
sed-i opened this issue Aug 16, 2022 · 2 comments · Fixed by #352

Comments

@sed-i
Contributor

sed-i commented Aug 16, 2022

Bug Description

Currently, on every `_configure`, the charm either reloads the configuration:

```python
if current_services == new_layer.services:
    reloaded = self._prometheus_server.reload_configuration()
```

or replans:

```python
else:
    container.add_layer(self._name, new_layer, combine=True)
```

This means the Prometheus service can be unavailable for a potentially significant duration (e.g. a big WAL replay).
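A hedged sketch of one way to avoid the redundant reload: push and reload only when the rendered config actually differs from what is on disk, and replan only when the pebble layer itself changed. The `_generate_config`, `_prometheus_layer`, and `_config_path` names here are hypothetical, not the charm's actual API; `pull`, `push`, `get_plan`, `add_layer`, and `replan` are the standard ops container methods.

```python
from ops.pebble import PathError

def _configure(self, event):
    # Sketch: only touch Prometheus when something actually changed.
    new_config = self._generate_config()  # hypothetical: renders prometheus.yml as a str
    try:
        current_config = self._container.pull(self._config_path).read()
    except PathError:
        current_config = ""  # first run: nothing on disk yet

    config_changed = new_config != current_config
    if config_changed:
        self._container.push(self._config_path, new_config, make_dirs=True)

    new_layer = self._prometheus_layer()  # hypothetical: builds the pebble layer
    if self._container.get_plan().services != new_layer.services:
        self._container.add_layer(self._name, new_layer, combine=True)
        self._container.replan()
    elif config_changed:
        self._prometheus_server.reload_configuration()
    # otherwise nothing changed: skip both the reload and the replan
```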

To Reproduce

NA

Environment

NA

Relevant log output

NA

Additional context

@sed-i
Contributor Author

sed-i commented Sep 6, 2022

Also, as part of this change, we need to consider the comments by @rbarry82 on #349:

"call _configure|_common_exit_hook hoping it will resolve BlockedStatus if an event was emitted before ingress is ready is a sledgehammer which is bad considering we already have an issue about /-/reload being called to often

Agreed, but to limit the scope of this change, and given #342 will be addressed very soon, can live with this for now?

> Whatever logic it would take for "is BlockedStatus set during _update_status? Is it still valid? If it is, clear it; if not, move on" would be more explicit, cleaner, and more legible.

This would be in direct contrast to the common exit hook pattern. I'd rather not mix both approaches in the same charm. Also, once #342 is fixed this will be much less painful to look at.

> It seems wrong that # Reload may fail if ingress ... would lead to a BlockedStatus(CORRUPT_PROMETHEUS_CONFIG_MESSAGE).

Agreed. I don't particularly like that status comparison pattern, and it should be fixed as part of #342.

> prometheus_server.Prometheus.* should provide distinctions between timeouts and ConnectionError to flag the "real" status.

Sounds presumptuous at first glance. And what will the charm do with the various exceptions? Just choose one message over another? Maybe, if we have insight, expose a retry setting the charm could use?

Originally posted by @sed-i in #349 (comment)

@rbarry82
Contributor

rbarry82 commented Sep 7, 2022

> Agreed, but to limit the scope of this change, and given #342 will be addressed very soon, can we live with this for now?

A little self-referential, since this is #342 ;) Are we addressing it here then?

> This would be in direct contrast to the common exit hook pattern. I'd rather not mix both approaches in the same charm. Also, once #342 is fixed this will be much less painful to look at.

I think that's a little bit the point -- the common exit hook cannot cleanly address this, but we actually do need to do so. There are multiple non-_configure places in the charm where we set BlockedStatus (in _generate_command, _on_k8s_...). "Un-setting" it isn't a massive divergence.
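A rough sketch of that explicit un-setting, assuming a hypothetical `_blocked_reason_still_valid` helper; the `ops` status classes are real, the helper and its check are not from the charm:

```python
from ops.model import ActiveStatus, BlockedStatus

def _on_update_status(self, event):
    # Sketch: instead of re-running the whole common exit hook, re-validate
    # a previously set BlockedStatus and clear it once the underlying
    # condition (e.g. ingress not yet ready) has resolved.
    if isinstance(self.unit.status, BlockedStatus):
        if not self._blocked_reason_still_valid(self.unit.status.message):  # hypothetical
            self.unit.status = ActiveStatus()
```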

> Agreed. I don't particularly like that status comparison pattern, and it should be fixed as part of #342.

Why would it be changed there? It also affects this. There seem to be three cases where the charm may not be accessible:

  • Bad configuration (we can check with promtool if there's an exception raised trying to connect)
  • Timeout because it's replaying/mmaping HEAD chunks which haven't been written to the TSDB
  • Not reachable due to ingress

The first two are meaningful here (for the first, see the promtool sketch below).
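For the bad-configuration case, a minimal sketch of running promtool inside the workload container via Pebble's exec API. `promtool check config` is a real command and `Container.exec`/`ExecError` are real ops APIs; the config path and `_container` attribute are assumptions.

```python
import logging

from ops.pebble import ExecError

logger = logging.getLogger(__name__)

def _config_is_valid(self) -> bool:
    # Sketch: validate the rendered config with promtool before calling
    # /-/reload, so a bad config is flagged without touching Prometheus.
    try:
        process = self._container.exec(
            ["promtool", "check", "config", "/etc/prometheus/prometheus.yml"]
        )
        process.wait_output()  # raises ExecError on a non-zero exit code
        return True
    except ExecError as e:
        logger.error("Invalid Prometheus config: %s", e.stderr)
        return False
```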

> Sounds presumptuous at first glance. And what will the charm do with the various exceptions? Just choose one message over another? Maybe, if we have insight, expose a retry setting the charm could use?

The exceptions could have a property with a message, or a lookup table in the charm, or whatever. The charm would set a status message to tell the administrator what is happening and why Prometheus may not be reachable, which is more useful than a "configuration is invalid" message that may not be true, or an ActiveStatus when Prometheus is actually not active/reachable.
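A sketch of what that distinction might look like inside prometheus_server, built on the real requests exception hierarchy; the exception classes, status messages, and `_base_url` attribute are illustrative, not an existing API.

```python
import requests


class PrometheusTimeoutError(Exception):
    """Prometheus is up but slow to answer, e.g. while replaying the WAL."""
    status_message = "Prometheus is busy (possibly replaying the WAL)"


class PrometheusConnectionError(Exception):
    """Prometheus could not be reached at all."""
    status_message = "Prometheus is not reachable"


def reload_configuration(self):
    # Sketch: translate transport-level failures into charm-meaningful
    # exceptions so the charm can surface an accurate status message.
    try:
        response = requests.post(f"{self._base_url}/-/reload", timeout=5)
        response.raise_for_status()
    except requests.exceptions.Timeout as e:
        raise PrometheusTimeoutError() from e
    except requests.exceptions.ConnectionError as e:
        raise PrometheusConnectionError() from e
```

The charm side could then catch these and map `exc.status_message` into a BlockedStatus or MaintenanceStatus as appropriate.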

A retry mechanism can only be so useful, and for /-/reload it won't be; how we address that can certainly be discussed elsewhere (we already "know" that we'll probably have to sidestep it with @manadart's work to trigger async hooks). But we are well enough aware of the "basic" failure cases, and we need a way to flag to the administrator what the actual status is.
