Reboots done via juju ssh are causing hook errors intermittently #921

Open
dshcherb opened this issue Sep 16, 2022 · 4 comments

@dshcherb
Collaborator

dshcherb commented Sep 16, 2022

https://bugs.launchpad.net/juju/+bug/1989629 - the issue description itself
https://bugs.launchpad.net/juju/+bug/1989629/comments/4 - root cause analysis

  • a test (like enable_dpdk) sets up an environment and reboots a machine via juju ssh to apply some changes;
  • asynchronously, a juju agent decides to execute a hook like update-status before the agent is stopped by systemd;
  • the host comes up and brings up a unit agent which reports an error about the update-status execution.

This was bugging us in https://review.opendev.org/c/x/charm-ovn-chassis/+/856548/ due to its intermittent nature.

The codepath that leads to this:

self.enable_hugepages_vfio_on_hvs_in_vms(4)

zaza.utilities.machine_os.enable_hugepages(

https://github.com/openstack-charmers/zaza/blob/8f9f9c79b246ef09a632d40323c975f002fcd4cf/zaza/utilities/machine_os.py#L231

https://github.com/openstack-charmers/zaza/blob/8f9f9c79b246ef09a632d40323c975f002fcd4cf/zaza/utilities/machine_os.py#L203

https://github.com/openstack-charmers/zaza/blob/8f9f9c79b246ef09a632d40323c975f002fcd4cf/zaza/utilities/generic.py#L493-L504

@ajkavanagh
Collaborator

There's a juju-reboot available within a unit to do a reboot, that effectively does:

#!/bin/bash
sleep 15
shutdown -r now

whereas the zaza code runs this:

def reboot(unit_name):
    """Reboot unit.
    :param unit_name: Unit Name
    :type unit_name: str
    :returns: None
    :rtype: None
    """
    # NOTE: When used with series upgrade the agent will be down.
    # Even juju run will not work
    cmd = ['juju', 'ssh', unit_name, 'sudo', 'reboot', '&&', 'exit']

Both reboot and shutdown are symlinks to /sbin/systemctl, so I suspect they do exactly the same thing. I suspect a genuine bug in Juju, and I'm not sure what a workaround would be. I guess using juju run would at least not interrupt a running hook, which you allude to in the linked Juju bug.

@dshcherb
Collaborator Author

dshcherb commented Sep 19, 2022

What's interesting is that using juju-reboot is not supported while running an action from Juju's perspective:

juju run --unit ovn-dedicated-chassis/0 'juju-reboot --now' ; echo $?
ERROR juju-reboot is not supported when running an action.
1

Neither can I use juju-run via juju run:

$ juju run --unit ovn-dedicated-chassis/0 'juju-run' ; echo $?
ERROR cannot use "juju-run" as an action command (not supported)
1

However, I can do:

juju ssh ovn-dedicated-chassis/0 'sudo juju-run -u ovn-dedicated-chassis/0 "juju-reboot --now"' ; echo $?
Connection to 10.10.20.40 closed.
0
$ juju show-status-log ovn-dedicated-chassis/0
Time                        Type       Status     Message
# ...
19 Sep 2022 15:01:59+03:00  workload   active     Unit is ready
19 Sep 2022 15:02:19+03:00  juju-unit  executing  running action juju-run
19 Sep 2022 15:02:19+03:00  juju-unit  idle       
19 Sep 2022 15:05:02+03:00  juju-unit  executing  running commands
19 Sep 2022 15:05:02+03:00  juju-unit  rebooting  
19 Sep 2022 15:06:22+03:00  juju-unit  executing  running start hook
19 Sep 2022 15:06:25+03:00  workload   active     Unit is ready
19 Sep 2022 15:06:25+03:00  juju-unit  idle
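A test could check for the same sequence programmatically. The following is only a sketch: it assumes the plain-text layout of `juju show-status-log` shown above (columns separated by two or more spaces), and `parse_status_log` and `rebooted_and_idle` are hypothetical helper names, not part of zaza or Juju.

```python
import re


def parse_status_log(text):
    """Parse status-log text into (time, type, status, message) tuples.

    Assumes columns are separated by two or more spaces, as in the
    output pasted in this issue. Header and comment lines are skipped.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#') or line.startswith('Time'):
            continue
        fields = re.split(r'\s{2,}', line)
        if len(fields) < 3:
            continue
        # The message column may be empty (e.g. for "idle").
        message = '  '.join(fields[3:])
        entries.append((fields[0], fields[1], fields[2], message))
    return entries


def rebooted_and_idle(entries):
    """Return True if a juju-unit 'rebooting' entry is later followed by 'idle'."""
    seen_reboot = False
    for _, type_, status, _ in entries:
        if type_ != 'juju-unit':
            continue
        if status == 'rebooting':
            seen_reboot = True
        elif status == 'idle' and seen_reboot:
            return True
    return False
```

An "idle" entry that appears before the "rebooting" entry is deliberately ignored, since the agent also goes idle between queuing the command and executing it.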

@dshcherb
Collaborator Author

What's interesting is that using juju-reboot is not supported while running an action from Juju's perspective:

juju run --unit ovn-dedicated-chassis/0 'juju-reboot --now' ; echo $?
ERROR juju-reboot is not supported when running an action.
1

I filed a bug so that this can be addressed if the Juju team thinks it's a good idea https://bugs.launchpad.net/juju/+bug/1990140

Meanwhile I think we can use this:

juju ssh <unit-name> 'sudo juju-run -u <unit-name> "juju-reboot --now"'
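As a rough sketch of how the zaza `reboot` helper might build this command instead of `sudo reboot`: the function names `build_reboot_cmd` and `reboot_via_agent` and the injectable `runner` parameter are assumptions for illustration, not existing zaza API.

```python
import shlex
import subprocess


def build_reboot_cmd(unit_name):
    """Build a `juju ssh` command that asks the unit agent to reboot."""
    # juju-run executes the inner command in the unit's hook context,
    # so juju-reboot is serialised with hook execution by the agent.
    inner = 'juju-reboot --now'
    remote = 'sudo juju-run -u {} {}'.format(
        shlex.quote(unit_name), shlex.quote(inner))
    return ['juju', 'ssh', unit_name, remote]


def reboot_via_agent(unit_name, runner=subprocess.check_call):
    """Hypothetical replacement for reboot(): run the command built above.

    `runner` is injectable so the function can be exercised without a
    live Juju model.
    """
    runner(build_reboot_cmd(unit_name))
```

`shlex.quote` yields single quotes around the inner command rather than the double quotes used above; the remote shell treats both the same here.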

@ajkavanagh
Collaborator

juju ssh <unit-name> 'sudo juju-run -u <unit-name> "juju-reboot --now"'

That is so convoluted! i.e. "unit, please run this when you can".

One thing we will need to watch for in tests is that this is really async: the delay between issuing this command and the start of the reboot is unknown, so we probably need a mechanism to detect when the reboot has occurred. Ubuntu, helpfully, deletes a file called /var/run/reboot-required when it reboots, which can be used as a sentinel for whether a unit has completed a reboot cycle. This code in zaza/openstack/utilities/parallel_series_upgrade.py#L618

        logging.info("dist-upgrade required reboot machine: %s", machine)
        await reboot(machine)
        logging.info("Waiting for machine to come back afer reboot: %s",
                     machine)
        await model.async_block_until_file_missing_on_machine(
            machine, "/var/run/reboot-required")
        logging.info("Waiting for machine idleness on %s", machine)
        await asyncio.sleep(5.0)
        await model.async_block_until_units_on_machine_are_idle(machine)

does the right thing, so maybe it should be integrated into the reboot function? It would need a timeout, obviously, but it's perhaps something to build on. It would also need code to actually create /var/run/reboot-required, which in this case is created by the apt-upgrade command earlier in the code path above.
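A minimal sketch of the timeout part, assuming the sentinel check is supplied as an async callable. `wait_for_reboot` and `file_is_missing` are hypothetical names; in zaza the check would presumably wrap something that tests for /var/run/reboot-required on the machine.

```python
import asyncio
import time


async def wait_for_reboot(file_is_missing, timeout=600.0, interval=5.0):
    """Poll until the reboot sentinel is gone, or raise on timeout.

    :param file_is_missing: hypothetical async callable returning True
        once /var/run/reboot-required no longer exists on the machine.
    :param timeout: maximum number of seconds to wait.
    :param interval: seconds to sleep between checks.
    :raises asyncio.TimeoutError: if the sentinel is still present
        once the timeout has elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        if await file_is_missing():
            return
        if time.monotonic() >= deadline:
            raise asyncio.TimeoutError(
                'reboot sentinel still present after {}s'.format(timeout))
        await asyncio.sleep(interval)
```

This only works if the sentinel file exists before the reboot is requested; otherwise the first check succeeds immediately and the wait proves nothing.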

fnordahl pushed a commit to fnordahl/zaza that referenced this issue Jul 21, 2023
Issue #921 has the context for the change: in order to avoid triggering
reboots asynchronously to juju hook executions (when `juju ssh reboot`
is done).

openstack-charmers/zaza-openstack-tests#921
(cherry picked from commit 3b74604)