
Reload is called unnecessarily #342

Closed
sed-i opened this issue Aug 16, 2022 · 2 comments · Fixed by #352

Comments

@sed-i
Contributor

sed-i commented Aug 16, 2022

Bug Description

Currently, on every `_configure`, the charm either reloads the configuration:

```python
if current_services == new_layer.services:
    reloaded = self._prometheus_server.reload_configuration()
```

or replans:

```python
else:
    container.add_layer(self._name, new_layer, combine=True)
```

This means the Prometheus service can be unavailable for a potentially significant duration (e.g. a big WAL replay).
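A hedged sketch of one way to avoid the redundant reload: push and reload only when the rendered config actually differs from what is on disk, and replan only when the pebble layer itself changed. The `_generate_config`, `_prometheus_layer`, and `_config_path` names here are hypothetical, not the charm's actual API; `pull`, `push`, `get_plan`, `add_layer`, and `replan` are the standard ops container methods.

```python
from ops.pebble import PathError

def _configure(self, event):
    # Sketch: only touch Prometheus when something actually changed.
    new_config = self._generate_config()  # hypothetical: renders prometheus.yml as a str
    try:
        current_config = self._container.pull(self._config_path).read()
    except PathError:
        current_config = ""  # first run: nothing on disk yet

    config_changed = new_config != current_config
    if config_changed:
        self._container.push(self._config_path, new_config, make_dirs=True)

    new_layer = self._prometheus_layer()  # hypothetical: builds the pebble layer
    if self._container.get_plan().services != new_layer.services:
        self._container.add_layer(self._name, new_layer, combine=True)
        self._container.replan()
    elif config_changed:
        self._prometheus_server.reload_configuration()
    # otherwise nothing changed: skip both the reload and the replan
```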

To Reproduce

NA

Environment

NA

Relevant log output

NA

Additional context

@sed-i
Contributor Author

sed-i commented Sep 6, 2022

Also, as part of this change, we need to consider the comments by @rbarry82 on #349:

"call _configure|_common_exit_hook hoping it will resolve BlockedStatus if an event was emitted before ingress is ready is a sledgehammer which is bad considering we already have an issue about /-/reload being called to often

Agreed, but to limit the scope of this change, and given #342 will be addressed very soon, can live with this for now?

> Whatever logic it would take for "is BlockedStatus set during _update_status? Is it still valid? If it is, clear it; if not, move on" would be more explicit, cleaner, and more legible.

This would be in direct contrast to the common exit hook pattern. I'd rather not mix both approaches in the same charm. Also, once #342 is fixed this will be much less painful to look at.

> It seems wrong that # Reload may fail if ingress ... would lead to a BlockedStatus(CORRUPT_PROMETHEUS_CONFIG_MESSAGE).

Agreed. I don't particularly like that status comparison pattern, and it should be fixed as part of #342.

> prometheus_server.Prometheus.* should provide distinctions between timeouts and ConnectionError to flag the "real" status.

Sounds presumptuous at first glance. And what will the charm do with the various exceptions? Just choose one message over another? Maybe, if we have insight, expose a retry setting the charm could use?

Originally posted by @sed-i in #349 (comment)

@rbarry82
Contributor

rbarry82 commented Sep 7, 2022

> Agreed, but to limit the scope of this change, and given #342 will be addressed very soon, can we live with this for now?

A little self-referential, since this is #342 ;) Are we addressing it here then?

> This would be in direct contrast to the common exit hook pattern. I'd rather not mix both approaches in the same charm. Also, once #342 is fixed this will be much less painful to look at.

I think that's a little bit the point -- the common exit hook cannot cleanly address this, but we actually do need to do so. There are multiple non-_configure places in the charm where we set BlockedStatus (in _generate_command, _on_k8s_...). "Un-setting" it isn't a massive divergence.
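A rough sketch of that explicit un-setting, assuming a hypothetical `_blocked_reason_still_valid` helper; the `ops` status classes are real, the helper and its check are not from the charm:

```python
from ops.model import ActiveStatus, BlockedStatus

def _on_update_status(self, event):
    # Sketch: instead of re-running the whole common exit hook, re-validate
    # a previously set BlockedStatus and clear it once the underlying
    # condition (e.g. ingress not yet ready) has resolved.
    if isinstance(self.unit.status, BlockedStatus):
        if not self._blocked_reason_still_valid(self.unit.status.message):  # hypothetical
            self.unit.status = ActiveStatus()
```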

> Agreed. I don't particularly like that status comparison pattern, and it should be fixed as part of #342.

Why would it be changed there? It also affects this. There seem to be three cases where the charm may not be accessible:

  • Bad configuration (we can check with promtool if there's an exception raised trying to connect)
  • Timeout because it's replaying/mmaping HEAD chunks which haven't been written to the TSDB
  • Not reachable due to ingress

The first two are meaningful here (for the first, see the promtool sketch below).
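For the bad-configuration case, a minimal sketch of running promtool inside the workload container via Pebble's exec API. `promtool check config` is a real command and `Container.exec`/`ExecError` are real ops APIs; the config path and `_container` attribute are assumptions.

```python
import logging

from ops.pebble import ExecError

logger = logging.getLogger(__name__)

def _config_is_valid(self) -> bool:
    # Sketch: validate the rendered config with promtool before calling
    # /-/reload, so a bad config is flagged without touching Prometheus.
    try:
        process = self._container.exec(
            ["promtool", "check", "config", "/etc/prometheus/prometheus.yml"]
        )
        process.wait_output()  # raises ExecError on a non-zero exit code
        return True
    except ExecError as e:
        logger.error("Invalid Prometheus config: %s", e.stderr)
        return False
```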

> Sounds presumptuous at first glance. And what will the charm do with the various exceptions? Just choose one message over another? Maybe, if we have insight, expose a retry setting the charm could use?

The exceptions could have a property with a message, or a lookup table in the charm, or whatever. The charm would set a status message to tell the administrator what is happening and why Prometheus may not be reachable, which is more useful than a "configuration is invalid" message that may not be true, or an ActiveStatus when Prometheus is actually not active/reachable.
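A sketch of what that distinction might look like inside prometheus_server, built on the real requests exception hierarchy; the exception classes, status messages, and `_base_url` attribute are illustrative, not an existing API.

```python
import requests


class PrometheusTimeoutError(Exception):
    """Prometheus is up but slow to answer, e.g. while replaying the WAL."""
    status_message = "Prometheus is busy (possibly replaying the WAL)"


class PrometheusConnectionError(Exception):
    """Prometheus could not be reached at all."""
    status_message = "Prometheus is not reachable"


def reload_configuration(self):
    # Sketch: translate transport-level failures into charm-meaningful
    # exceptions so the charm can surface an accurate status message.
    try:
        response = requests.post(f"{self._base_url}/-/reload", timeout=5)
        response.raise_for_status()
    except requests.exceptions.Timeout as e:
        raise PrometheusTimeoutError() from e
    except requests.exceptions.ConnectionError as e:
        raise PrometheusConnectionError() from e
```

The charm side could then catch these and map `exc.status_message` into a BlockedStatus or MaintenanceStatus as appropriate.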

A retry mechanism can only be so useful, and for /-/reload it won't be; how we address that can certainly be discussed elsewhere (we already "know" that we'll probably have to sidestep it with @manadart's work to trigger async hooks). But we are well enough aware of the "basic" failure cases, and we need a way to flag to the administrator what the actual status is.
