feat: restart services failed within okay delay #520
It seems the test case After some testing, it might be the case that the setup function writes

$ go test -race -c ./internals/overlord/servstate/
$ ./servstate.test -check.v -check.f ^S\.TestStartFastExitCommand$
2024-11-15T05:47:00.717Z [test] Service "test4" starting: echo -e 'too-fast\nsecond line'
2024-11-15T05:47:00.718Z [test] Service "test4" on-success action is "restart", waiting ~500ms before restart (backoff 1)
2024-11-15T05:47:00.718Z [test] Change 1 task (Start service "test4") failed: service start attempt: exited quickly with code 0, will retry
PASS: internals/overlord/servstate/manager_test.go:634: S.TestStartFastExitCommand 0.006s
OK: 1 passed
PASS
==================
WARNING: DATA RACE
Read at 0x000000a405f8 by goroutine 22:
github.com/canonical/pebble/internals/overlord/servstate.(*serviceData).startInternal()
/home/ubuntu/work/pebble3/internals/overlord/servstate/handlers.go:433 +0xe90
github.com/canonical/pebble/internals/overlord/servstate.(*serviceData).backoffTimeElapsed()
/home/ubuntu/work/pebble3/internals/overlord/servstate/handlers.go:725 +0xf8
github.com/canonical/pebble/internals/overlord/servstate.(*serviceData).doBackoff.func1()
/home/ubuntu/work/pebble3/internals/overlord/servstate/handlers.go:604 +0x2c
Previous write at 0x000000a405f8 by goroutine 21:
github.com/canonical/pebble/internals/overlord/servstate_test.(*S).SetUpTest.FakeKillFailDelay.func3()
/home/ubuntu/work/pebble3/internals/overlord/servstate/export_test.go:81 +0x38
github.com/canonical/pebble/internals/testutil.(*BaseTest).TearDownTest()
/home/ubuntu/work/pebble3/internals/testutil/base.go:38 +0xc4
github.com/canonical/pebble/internals/overlord/servstate_test.(*S).TearDownTest()
/home/ubuntu/work/pebble3/internals/overlord/servstate/manager_test.go:180 +0x84
runtime.call16()
/usr/local/go/src/runtime/asm_arm64.s:503 +0x74
reflect.Value.Call()
/usr/local/go/src/reflect/value.go:380 +0x90
gopkg.in/check%2ev1.(*suiteRunner).runFixture.func1()
/home/ubuntu/go/pkg/mod/gopkg.in/[email protected]/check.go:724 +0x100
gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
/home/ubuntu/go/pkg/mod/gopkg.in/[email protected]/check.go:669 +0xcc
In my latest commit, I made two changes to eliminate these:
The race condition arises because However, the So, while refactoring is generally a preferred approach for long-term maintainability, here I mitigated the race condition in the tests in a simpler way: with a mutex protecting access to the global variable
Note that I've also assigned #495 to you -- I think that would be good to look at right after this is merged, while you're in this code. I think it might be quite simple.
Co-authored-by: Ben Hoyt <[email protected]>
After some manual testing, a few things have changed:
Manual tests:

Layer config:

$ cat 001-simple-layer.yaml
summary: a simple layer
services:
    test:
        override: replace
        command: ls
        startup: enabled
        on-success: ignore

Pebble run:

$ ./pebble run
2024-11-25T05:15:11.774Z [pebble] Started daemon.
2024-11-25T05:15:11.781Z [pebble] POST /v1/services 2.325878ms 202
2024-11-25T05:15:11.783Z [pebble] Service "test" starting: ls
2024-11-25T05:15:11.787Z [pebble] Service "test" stopped unexpectedly with code 0
2024-11-25T05:15:11.787Z [pebble] Service "test" on-success action is "ignore", not doing anything further
2024-11-25T05:15:11.789Z [pebble] Change 1 task (Start service "test") failed: service start attempt: exited quickly with code 0, will ignore
2024-11-25T05:15:11.793Z [pebble] GET /v1/changes/1/wait 11.047219ms 200
2024-11-25T05:15:11.793Z [pebble] Started default services with change 1.

It was ignored, not restarted. After it exits, check the services:

$ ./pebble services
Service  Startup  Current   Since
test     enabled  inactive  today at 13:15 CST

And logs:

$ ./pebble logs
2024-11-25T05:15:11.787Z [test] COPYING
2024-11-25T05:15:11.787Z [test] HACKING.md
2024-11-25T05:15:11.787Z [test] README.md
2024-11-25T05:15:11.787Z [test] SECURITY.md
2024-11-25T05:15:11.787Z [test] client
2024-11-25T05:15:11.787Z [test] cmd
2024-11-25T05:15:11.787Z [test] daemon.test
2024-11-25T05:15:11.787Z [test] docs
2024-11-25T05:15:11.787Z [test] go.mod
2024-11-25T05:15:11.787Z [test] go.sum
2024-11-25T05:15:11.787Z [test] internals
2024-11-25T05:15:11.787Z [test] pebble
2024-11-25T05:15:11.787Z [test] snap
2024-11-25T05:15:11.787Z [test] tests

For non-zero exit code, layer config:

$ cat 001-simple-layer.yaml
summary: a simple layer
services:
    test:
        override: replace
        command: bash -c "ls; exit 1"
        startup: enabled
        on-failure: ignore

Pebble run:

$ ./pebble run
2024-11-25T05:17:59.115Z [pebble] Started daemon.
2024-11-25T05:17:59.122Z [pebble] POST /v1/services 2.561335ms 202
2024-11-25T05:17:59.126Z [pebble] Service "test" starting: bash -c "ls; exit 1"
2024-11-25T05:17:59.130Z [pebble] Service "test" stopped unexpectedly with code 1
2024-11-25T05:17:59.130Z [pebble] Service "test" on-failure action is "ignore", not doing anything further
2024-11-25T05:17:59.133Z [pebble] Change 1 task (Start service "test") failed: service start attempt: exited quickly with code 1, will ignore
2024-11-25T05:17:59.138Z [pebble] GET /v1/changes/1/wait 13.611051ms 200
2024-11-25T05:17:59.138Z [pebble] Started default services with change 1.

It was ignored, not restarted, and the exit code was non-zero. After it exits, check the services:

$ ./pebble services
Service  Startup  Current  Since
test     enabled  error    today at 13:17 CST

And logs:

$ ./pebble logs
2024-11-25T05:17:59.130Z [test] COPYING
2024-11-25T05:17:59.130Z [test] HACKING.md
2024-11-25T05:17:59.130Z [test] README.md
2024-11-25T05:17:59.130Z [test] SECURITY.md
2024-11-25T05:17:59.130Z [test] client
2024-11-25T05:17:59.130Z [test] cmd
2024-11-25T05:17:59.130Z [test] daemon.test
2024-11-25T05:17:59.130Z [test] docs
2024-11-25T05:17:59.130Z [test] go.mod
2024-11-25T05:17:59.130Z [test] go.sum
2024-11-25T05:17:59.130Z [test] internals
2024-11-25T05:17:59.130Z [test] pebble
2024-11-25T05:17:59.130Z [test] snap
2024-11-25T05:17:59.130Z [test] tests
Per discussion, let's remove removeService / removeServiceInternal altogether -- we don't want the logs disappearing when service.start() has any error. And instead of removeService we'd do
According to a discussion with @benhoyt, we decided to:
The above changes are included in the latest commit.
I think this is good, and I've tested it locally. Per voice discussion, we decided not to add a "temporary" or "retrying" flag to the task at this point (per notes from the Nov 14 Review Meeting), as charmers can just catch the ChangeError and log or ignore it. We can always add that flag later if need be.
Restart services failed within the okay delay period.
Fixes #240, #495 (the same code change closes both issues and simplifies the code, because there is no need for a specific restart error type).
See previous discussion here, approved spec here, previous PoC here, here, and here.
Per voice discussion, we decided to just get rid of the state diagram for now, and if we add this back later, we'd do a nice hand-drawn one for developer documentation (with nice arrows).