feat: restart services failed within okay delay #520

IronCore864 · 2024-11-15T04:11:23Z

Restart services failed within the okay delay period.

Fixes #240 and #495. The same code change closes both issues and also simplifies the code, because a specific restart error type is no longer needed.

See previous discussion here, approved spec here, previous PoC here, here, and here.

Per voice discussion, we decided to just get rid of the state diagram for now, and if we add this back later, we'd do a nice hand-drawn one for developer documentation (with nice arrows).

IronCore864 · 2024-11-15T05:58:38Z

It seems the test case TestStartFastExitCommand in manager_test.go has a data race.

After some testing, it looks like the restore function registered in SetUpTest (and run during TearDownTest) writes killDelayDefault while startInternal reads it. I need some help here: is this the root cause, and if so, how should it be fixed?

$ go test -race -c ./internals/overlord/servstate/
$ ./servstate.test -check.v -check.f ^S\.TestStartFastExitCommand$
2024-11-15T05:47:00.717Z [test] Service "test4" starting: echo -e 'too-fast\nsecond line'
2024-11-15T05:47:00.718Z [test] Service "test4" on-success action is "restart", waiting ~500ms before restart (backoff 1)
2024-11-15T05:47:00.718Z [test] Change 1 task (Start service "test4") failed: service start attempt: exited quickly with code 0, will retry
PASS: internals/overlord/servstate/manager_test.go:634: S.TestStartFastExitCommand	0.006s
OK: 1 passed
PASS
==================
WARNING: DATA RACE
Read at 0x000000a405f8 by goroutine 22:
  github.com/canonical/pebble/internals/overlord/servstate.(*serviceData).startInternal()
      /home/ubuntu/work/pebble3/internals/overlord/servstate/handlers.go:433 +0xe90
  github.com/canonical/pebble/internals/overlord/servstate.(*serviceData).backoffTimeElapsed()
      /home/ubuntu/work/pebble3/internals/overlord/servstate/handlers.go:725 +0xf8
  github.com/canonical/pebble/internals/overlord/servstate.(*serviceData).doBackoff.func1()
      /home/ubuntu/work/pebble3/internals/overlord/servstate/handlers.go:604 +0x2c

Previous write at 0x000000a405f8 by goroutine 21:
  github.com/canonical/pebble/internals/overlord/servstate_test.(*S).SetUpTest.FakeKillFailDelay.func3()
      /home/ubuntu/work/pebble3/internals/overlord/servstate/export_test.go:81 +0x38
  github.com/canonical/pebble/internals/testutil.(*BaseTest).TearDownTest()
      /home/ubuntu/work/pebble3/internals/testutil/base.go:38 +0xc4
  github.com/canonical/pebble/internals/overlord/servstate_test.(*S).TearDownTest()
      /home/ubuntu/work/pebble3/internals/overlord/servstate/manager_test.go:180 +0x84
  runtime.call16()
      /usr/local/go/src/runtime/asm_arm64.s:503 +0x74
  reflect.Value.Call()
      /usr/local/go/src/reflect/value.go:380 +0x90
  gopkg.in/check%2ev1.(*suiteRunner).runFixture.func1()
      /home/ubuntu/go/pkg/mod/gopkg.in/[email protected]/check.go:724 +0x100
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/ubuntu/go/pkg/mod/gopkg.in/[email protected]/check.go:669 +0xcc

IronCore864 · 2024-11-19T02:39:49Z

In my latest commit, I made two changes to eliminate the data race in TestStartFastExitCommand:

Use a mutex to protect the default kill delay.
Use a longer backoff-delay so that the backoff/startInternal does not run again while the test case is tearing down and the reaper is stopping.

The race condition arises because killDelayDefault is a global variable that the test fixtures modify (set during setup, restored during teardown) while other goroutines, likely those handling service operations, read it concurrently. The clean solution is to avoid the global variable and instead pass the kill delay as a parameter or store it in the serviceData struct.

However, the serviceData struct is used in multiple places, so refactoring it looks like a sizeable change that I'm not sure about, and it's debatable whether the delay belongs in that struct at all.

So, while refactoring is generally better for long-term maintainability, here I mitigated the race in a simpler way: a mutex protects access to the global variable killDelayDefault. IMHO this is less ideal than refactoring, but it is a workable solution.

benhoyt · 2024-11-21T02:48:08Z

Note that I've also assigned #495 to you -- I think that would be good to look at right after this is merged, while you're in this code. I think it might be quite simple.

IronCore864 · 2024-11-25T09:34:11Z

After some manual testing, a few things have changed:

The test case TestSignalsSend (where a service is started and then killed) now sets on-failure to ignore, and its expected status is now "error". A code comment explains why the expected status is "error" rather than "inactive".
The test case TestStartFastExitCommandOnFailureIgnore is updated in the same way, for the same reason.
The service is no longer removed if it exits within the okay delay, which fixes #495 ("pebble logs" does not show logs of short-running / failing processes). In this case the logs are not closed; according to discussion elsewhere and testing, leaving them open does no harm.
After testing and investigation, removeServiceInternal is not split into two functions as discussed (something like closeServiceLogs and removeService), because it is only used in two places, and in both of them doing the two steps together is fine:

In ServiceManager.Stop, it's OK to close logs and remove services at the same time, since the service manager is stopping.
In removeService, which is used when service.start() fails and when the user tries to abort the start. In both cases, it's also OK to close the logs and remove the services.

Manual tests:

Layer config:

$ cat 001-simple-layer.yaml
summary: a simple layer
services:
  test:
    override: replace
    command: ls
    startup: enabled
    on-success: ignore

Pebble run:

$ ./pebble run
2024-11-25T05:15:11.774Z [pebble] Started daemon.
2024-11-25T05:15:11.781Z [pebble] POST /v1/services 2.325878ms 202
2024-11-25T05:15:11.783Z [pebble] Service "test" starting: ls
2024-11-25T05:15:11.787Z [pebble] Service "test" stopped unexpectedly with code 0
2024-11-25T05:15:11.787Z [pebble] Service "test" on-success action is "ignore", not doing anything further
2024-11-25T05:15:11.789Z [pebble] Change 1 task (Start service "test") failed: service start attempt: exited quickly with code 0, will ignore
2024-11-25T05:15:11.793Z [pebble] GET /v1/changes/1/wait 11.047219ms 200
2024-11-25T05:15:11.793Z [pebble] Started default services with change 1.

It was ignored, not restarted.

After it exits, check the services:

$ ./pebble services
Service  Startup  Current   Since
test     enabled  inactive  today at 13:15 CST

And logs:

$ ./pebble logs
2024-11-25T05:15:11.787Z [test] COPYING
2024-11-25T05:15:11.787Z [test] HACKING.md
2024-11-25T05:15:11.787Z [test] README.md
2024-11-25T05:15:11.787Z [test] SECURITY.md
2024-11-25T05:15:11.787Z [test] client
2024-11-25T05:15:11.787Z [test] cmd
2024-11-25T05:15:11.787Z [test] daemon.test
2024-11-25T05:15:11.787Z [test] docs
2024-11-25T05:15:11.787Z [test] go.mod
2024-11-25T05:15:11.787Z [test] go.sum
2024-11-25T05:15:11.787Z [test] internals
2024-11-25T05:15:11.787Z [test] pebble
2024-11-25T05:15:11.787Z [test] snap
2024-11-25T05:15:11.787Z [test] tests

For non-zero exit code, layer config:

$ cat 001-simple-layer.yaml
summary: a simple layer
services:
  test:
    override: replace
    command: bash -c "ls; exit 1"
    startup: enabled
    on-failure: ignore

Pebble run:

$ ./pebble run
2024-11-25T05:17:59.115Z [pebble] Started daemon.
2024-11-25T05:17:59.122Z [pebble] POST /v1/services 2.561335ms 202
2024-11-25T05:17:59.126Z [pebble] Service "test" starting: bash -c "ls; exit 1"
2024-11-25T05:17:59.130Z [pebble] Service "test" stopped unexpectedly with code 1
2024-11-25T05:17:59.130Z [pebble] Service "test" on-failure action is "ignore", not doing anything further
2024-11-25T05:17:59.133Z [pebble] Change 1 task (Start service "test") failed: service start attempt: exited quickly with code 1, will ignore
2024-11-25T05:17:59.138Z [pebble] GET /v1/changes/1/wait 13.611051ms 200
2024-11-25T05:17:59.138Z [pebble] Started default services with change 1.

It was ignored, not restarted, and the exit code was non-zero.

After it exits, check the services:

$ ./pebble services
Service  Startup  Current  Since
test     enabled  error    today at 13:17 CST

And logs:

$ ./pebble logs
2024-11-25T05:17:59.130Z [test] COPYING
2024-11-25T05:17:59.130Z [test] HACKING.md
2024-11-25T05:17:59.130Z [test] README.md
2024-11-25T05:17:59.130Z [test] SECURITY.md
2024-11-25T05:17:59.130Z [test] client
2024-11-25T05:17:59.130Z [test] cmd
2024-11-25T05:17:59.130Z [test] daemon.test
2024-11-25T05:17:59.130Z [test] docs
2024-11-25T05:17:59.130Z [test] go.mod
2024-11-25T05:17:59.130Z [test] go.sum
2024-11-25T05:17:59.130Z [test] internals
2024-11-25T05:17:59.130Z [test] pebble
2024-11-25T05:17:59.130Z [test] snap
2024-11-25T05:17:59.130Z [test] tests
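The outcomes observed in the two manual runs, plus the "will retry" message from the earlier test output, can be summarised in a small sketch. It only encodes what the logs in this conversation show. The function names are hypothetical, and actions other than "restart" and "ignore" are deliberately left unhandled because they do not appear here.

```go
package main

import "fmt"

// quickExitError builds the task error text seen in the logs for a service
// that exits within the okay delay. The function name is hypothetical.
func quickExitError(exitCode int, action string) (string, error) {
	var outcome string
	switch action {
	case "restart":
		outcome = "will retry"
	case "ignore":
		outcome = "will ignore"
	default:
		return "", fmt.Errorf("action %q is not covered by this sketch", action)
	}
	return fmt.Sprintf("service start attempt: exited quickly with code %d, %s", exitCode, outcome), nil
}

// statusAfterIgnoredExit returns the status shown by "pebble services" in the
// two manual runs: "inactive" for exit code 0 and "error" otherwise.
func statusAfterIgnoredExit(exitCode int) string {
	if exitCode == 0 {
		return "inactive"
	}
	return "error"
}

func main() {
	for _, code := range []int{0, 1} {
		msg, err := quickExitError(code, "ignore")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(msg, "->", statusAfterIgnoredExit(code))
	}
}
```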

benhoyt · 2024-11-27T02:22:32Z

Per discussion, let's remove removeService / removeServiceInternal altogether -- we don't want the logs disappearing when service.start() has any error. And instead of removeService we'd do service.transition(stateStopped).

IronCore864 · 2024-11-27T07:47:27Z

Following a discussion with @benhoyt, we decided to:

Stop closing service logs, because service logs are ring buffers and it doesn't matter if we don't close them. In addition, if the service exits with an error, keeping the logs accessible helps debugging. This means we no longer close the logs in startInternal when reaper.StartCommand fails.
Remove ServiceManager.Stop, because all it essentially did was close the service logs, which no longer needs to be done.
Remove removeService and removeServiceInternal, for the same reason.
Properly handle the state when the service can't be started: if starting the service fails for some reason, transition it to the stopped state, which is the only appropriate choice. However, if the service can't be started because it's in another state (for example, terminating), do not transition the state. This part differs from our discussion, but I think it is the appropriate thing to do, because otherwise there would be deadlock issues. Imagine the service is already in the terminating state: the start fails, the service transitions to stopped, and then the stop task would never finish.
Properly handle the state when the user tries to abort the start. In this case, it makes sense to transition to the stopped state. (This feels a bit redundant: if the service is still starting and the user aborts it, we send SIGKILL, and the exited function handles it properly because we added a fallthrough. But adding a transition here doesn't change the logic, so it's OK to be a bit more explicit.)

The above changes are included in the latest commit.

benhoyt

I think this is good, and I've tested it locally. Per voice discussion, we decided not to add a "temporary" or "retrying" flag to the task (per notes from the Nov 14 Review Meeting) at this point, as charmers can just catch the ChangeError and log or ignore it. We can always add that flag later if need be.

feat: restart services failed within okay delay

ead7256

IronCore864 requested a review from benhoyt November 15, 2024 05:58

fix: use mutex for kill delay and use longer back off delay in test

a9a1ac0

IronCore864 marked this pull request as ready for review November 19, 2024 02:40

benhoyt reviewed Nov 20, 2024

IronCore864 added 2 commits November 20, 2024 20:57

chore: refactor after review and test

2668b62

chore: remove unnecessary change

64829e5

IronCore864 requested a review from benhoyt November 20, 2024 13:09

benhoyt mentioned this pull request Nov 21, 2024

"pebble logs" does not show logs of short-running / failing processes #495

Closed

benhoyt requested changes Nov 21, 2024

IronCore864 and others added 5 commits November 23, 2024 10:47

Update docs/how-to/service-dependencies.md

499c03c

Co-authored-by: Ben Hoyt <[email protected]>

Update docs/how-to/service-dependencies.md

28e7fd1

Co-authored-by: Ben Hoyt <[email protected]>

chore: refactor after review

e05f4e4

chore: remove state transfer diagram according to discussion

7a6897b

chore: refactor after review, fix short running svc logs

8d50e82

IronCore864 requested a review from benhoyt November 25, 2024 09:34

chore: remove removeService and simply code

176de99

benhoyt reviewed Nov 28, 2024

benhoyt approved these changes Nov 28, 2024

IronCore864 merged commit 4559a6a into canonical:master Nov 28, 2024
18 checks passed

IronCore864 deleted the restart-services-failed-within-okaydelay branch November 28, 2024 02:03

IronCore864 mentioned this pull request Dec 2, 2024

docs: update okay delay related docs #531

Merged
