Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: restart services failed within okay delay #520

Conversation

IronCore864
Copy link
Contributor

@IronCore864 IronCore864 commented Nov 15, 2024

Restart services failed within the okay delay period.

Fixes #240, #495 (the same line of code change will close two issues at the same time, and simplifies the code, because there is no need for a specific restart error type).

See previous discussion here, approved spec here, previous PoC here, here, and here.

Per voice discussion, we decided to just get rid of the state diagram for now, and if we add this back later, we'd do a nice hand-drawn one for developer documentation (with nice arrows).

@IronCore864
Copy link
Contributor Author

It seems the test case TestStartFastExitCommand in manager_test.go has some race conditions.

After some testing, it might be the case that the setup function writes killDelayDefault and startInternal reads it. Need some help here, is it the root cause, and if yes, how to fix?

$ go test -race -c ./internals/overlord/servstate/
$ ./servstate.test -check.v -check.f ^S\.TestStartFastExitCommand$
2024-11-15T05:47:00.717Z [test] Service "test4" starting: echo -e 'too-fast\nsecond line'
2024-11-15T05:47:00.718Z [test] Service "test4" on-success action is "restart", waiting ~500ms before restart (backoff 1)
2024-11-15T05:47:00.718Z [test] Change 1 task (Start service "test4") failed: service start attempt: exited quickly with code 0, will retry
PASS: internals/overlord/servstate/manager_test.go:634: S.TestStartFastExitCommand	0.006s
OK: 1 passed
PASS
==================
WARNING: DATA RACE
Read at 0x000000a405f8 by goroutine 22:
  github.com/canonical/pebble/internals/overlord/servstate.(*serviceData).startInternal()
      /home/ubuntu/work/pebble3/internals/overlord/servstate/handlers.go:433 +0xe90
  github.com/canonical/pebble/internals/overlord/servstate.(*serviceData).backoffTimeElapsed()
      /home/ubuntu/work/pebble3/internals/overlord/servstate/handlers.go:725 +0xf8
  github.com/canonical/pebble/internals/overlord/servstate.(*serviceData).doBackoff.func1()
      /home/ubuntu/work/pebble3/internals/overlord/servstate/handlers.go:604 +0x2c

Previous write at 0x000000a405f8 by goroutine 21:
  github.com/canonical/pebble/internals/overlord/servstate_test.(*S).SetUpTest.FakeKillFailDelay.func3()
      /home/ubuntu/work/pebble3/internals/overlord/servstate/export_test.go:81 +0x38
  github.com/canonical/pebble/internals/testutil.(*BaseTest).TearDownTest()
      /home/ubuntu/work/pebble3/internals/testutil/base.go:38 +0xc4
  github.com/canonical/pebble/internals/overlord/servstate_test.(*S).TearDownTest()
      /home/ubuntu/work/pebble3/internals/overlord/servstate/manager_test.go:180 +0x84
  runtime.call16()
      /usr/local/go/src/runtime/asm_arm64.s:503 +0x74
  reflect.Value.Call()
      /usr/local/go/src/reflect/value.go:380 +0x90
  gopkg.in/check%2ev1.(*suiteRunner).runFixture.func1()
      /home/ubuntu/go/pkg/mod/gopkg.in/[email protected]/check.go:724 +0x100
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/ubuntu/go/pkg/mod/gopkg.in/[email protected]/check.go:669 +0xcc

@IronCore864 IronCore864 requested a review from benhoyt November 15, 2024 05:58
@IronCore864
Copy link
Contributor Author

It seems the test case TestStartFastExitCommand in manager_test.go has some race conditions.

After some testing, it might be the case that the setup function writes killDelayDefault and startInternal reads it. Need some help here, is it the root cause, and if yes, how to fix?

$ go test -race -c ./internals/overlord/servstate/
$ ./servstate.test -check.v -check.f ^S\.TestStartFastExitCommand$
2024-11-15T05:47:00.717Z [test] Service "test4" starting: echo -e 'too-fast\nsecond line'
2024-11-15T05:47:00.718Z [test] Service "test4" on-success action is "restart", waiting ~500ms before restart (backoff 1)
2024-11-15T05:47:00.718Z [test] Change 1 task (Start service "test4") failed: service start attempt: exited quickly with code 0, will retry
PASS: internals/overlord/servstate/manager_test.go:634: S.TestStartFastExitCommand	0.006s
OK: 1 passed
PASS
==================
WARNING: DATA RACE
Read at 0x000000a405f8 by goroutine 22:
  github.com/canonical/pebble/internals/overlord/servstate.(*serviceData).startInternal()
      /home/ubuntu/work/pebble3/internals/overlord/servstate/handlers.go:433 +0xe90
  github.com/canonical/pebble/internals/overlord/servstate.(*serviceData).backoffTimeElapsed()
      /home/ubuntu/work/pebble3/internals/overlord/servstate/handlers.go:725 +0xf8
  github.com/canonical/pebble/internals/overlord/servstate.(*serviceData).doBackoff.func1()
      /home/ubuntu/work/pebble3/internals/overlord/servstate/handlers.go:604 +0x2c

Previous write at 0x000000a405f8 by goroutine 21:
  github.com/canonical/pebble/internals/overlord/servstate_test.(*S).SetUpTest.FakeKillFailDelay.func3()
      /home/ubuntu/work/pebble3/internals/overlord/servstate/export_test.go:81 +0x38
  github.com/canonical/pebble/internals/testutil.(*BaseTest).TearDownTest()
      /home/ubuntu/work/pebble3/internals/testutil/base.go:38 +0xc4
  github.com/canonical/pebble/internals/overlord/servstate_test.(*S).TearDownTest()
      /home/ubuntu/work/pebble3/internals/overlord/servstate/manager_test.go:180 +0x84
  runtime.call16()
      /usr/local/go/src/runtime/asm_arm64.s:503 +0x74
  reflect.Value.Call()
      /usr/local/go/src/reflect/value.go:380 +0x90
  gopkg.in/check%2ev1.(*suiteRunner).runFixture.func1()
      /home/ubuntu/go/pkg/mod/gopkg.in/[email protected]/check.go:724 +0x100
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/ubuntu/go/pkg/mod/gopkg.in/[email protected]/check.go:669 +0xcc

In my latest commit, I made two changes to eliminate these:

  1. use a mutex for default kill delay
  2. use a longer backoff-delay so that backoff/startInternal won't happen again when the test case/reaper is tearing down/stopping.

The race condition arises because killDelayDefault is a global variable being modified in the test setup and read concurrently by other goroutines (likely related to service operations). The solution is to avoid using a global variable and instead pass the kill delay as a parameter or store it within the serviceData struct.

However, the serviceData struct is used in multiple places and refactoring it seems to be some major work which I'm not so sure about, and it's debatable whether we should put the delay as part of it.

So, while refactoring is generally a preferred approach for long-term maintainability, here I mitigated the race condition in the tests in a simpler way: with a mutex protecting access to the global variable killDelayDefault. IMHO this is less ideal than refactoring but can be a workable solution.

@IronCore864 IronCore864 marked this pull request as ready for review November 19, 2024 02:40
docs/how-to/service-dependencies.md Outdated Show resolved Hide resolved
internals/overlord/servstate/export_test.go Outdated Show resolved Hide resolved
internals/overlord/servstate/handlers.go Outdated Show resolved Hide resolved
internals/overlord/servstate/handlers.go Show resolved Hide resolved
internals/overlord/servstate/handlers.go Outdated Show resolved Hide resolved
internals/overlord/servstate/state-diagram.dot Outdated Show resolved Hide resolved
internals/overlord/servstate/manager_test.go Outdated Show resolved Hide resolved
internals/overlord/servstate/manager_test.go Show resolved Hide resolved
docs/how-to/service-dependencies.md Outdated Show resolved Hide resolved
docs/how-to/service-dependencies.md Outdated Show resolved Hide resolved
internals/daemon/api_signals_test.go Show resolved Hide resolved
internals/overlord/servstate/handlers.go Show resolved Hide resolved
internals/overlord/servstate/state-diagram.dot Outdated Show resolved Hide resolved
@benhoyt
Copy link
Contributor

benhoyt commented Nov 21, 2024

Note that I've also assigned #495 to you -- I think that would be good to look at right after this is merged, while you're in this code. I think it might be quite simple.

@IronCore864
Copy link
Contributor Author

After some manual testing, a few things are changed:

  1. The test case TestSignalsSend (where a service is started then killed) is updated with on-failure ignore, and the expected status is updated to "error". A comment in the code is added to explain why the expected status is "error" instead of "inactive".
  2. The test case TestStartFastExitCommandOnFailureIgnore is updated the same as above for the same reason.
  3. Service is not removed if it exits within okay delay, so that "pebble logs" does not show logs of short-running / failing processes #495 can be fixed. In this case, logs are not closed, and according to discussion elsewhere and testing, it doesn't hurt if we don't close the logs.
  4. After testing and investigation, removeServiceInternal is not refactored to two functions as discussed (something like closeServiceLogs and removeService), because it's only used in two places and both don't matter:

Manual tests:

Layer config:

$ cat 001-simple-layer.yaml
summary: a simple layer
services:
  test:
    override: replace
    command: ls
    startup: enabled
    on-success: ignore

Pebble run:

$ ./pebble run
2024-11-25T05:15:11.774Z [pebble] Started daemon.
2024-11-25T05:15:11.781Z [pebble] POST /v1/services 2.325878ms 202
2024-11-25T05:15:11.783Z [pebble] Service "test" starting: ls
2024-11-25T05:15:11.787Z [pebble] Service "test" stopped unexpectedly with code 0
2024-11-25T05:15:11.787Z [pebble] Service "test" on-success action is "ignore", not doing anything further
2024-11-25T05:15:11.789Z [pebble] Change 1 task (Start service "test") failed: service start attempt: exited quickly with code 0, will ignore
2024-11-25T05:15:11.793Z [pebble] GET /v1/changes/1/wait 11.047219ms 200
2024-11-25T05:15:11.793Z [pebble] Started default services with change 1.

It was ignored, not restarted.

After it exits, check the services:

$ ./pebble services
Service  Startup  Current   Since
test     enabled  inactive  today at 13:15 CST

And logs:

$ ./pebble logs
2024-11-25T05:15:11.787Z [test] COPYING
2024-11-25T05:15:11.787Z [test] HACKING.md
2024-11-25T05:15:11.787Z [test] README.md
2024-11-25T05:15:11.787Z [test] SECURITY.md
2024-11-25T05:15:11.787Z [test] client
2024-11-25T05:15:11.787Z [test] cmd
2024-11-25T05:15:11.787Z [test] daemon.test
2024-11-25T05:15:11.787Z [test] docs
2024-11-25T05:15:11.787Z [test] go.mod
2024-11-25T05:15:11.787Z [test] go.sum
2024-11-25T05:15:11.787Z [test] internals
2024-11-25T05:15:11.787Z [test] pebble
2024-11-25T05:15:11.787Z [test] snap
2024-11-25T05:15:11.787Z [test] tests

For non-zero exit code, layer config:

$ cat 001-simple-layer.yaml
summary: a simple layer
services:
  test:
    override: replace
    command: bash -c "ls; exit 1"
    startup: enabled
    on-failure: ignore

Pebble run:

$ ./pebble run
2024-11-25T05:17:59.115Z [pebble] Started daemon.
2024-11-25T05:17:59.122Z [pebble] POST /v1/services 2.561335ms 202
2024-11-25T05:17:59.126Z [pebble] Service "test" starting: bash -c "ls; exit 1"
2024-11-25T05:17:59.130Z [pebble] Service "test" stopped unexpectedly with code 1
2024-11-25T05:17:59.130Z [pebble] Service "test" on-failure action is "ignore", not doing anything further
2024-11-25T05:17:59.133Z [pebble] Change 1 task (Start service "test") failed: service start attempt: exited quickly with code 1, will ignore
2024-11-25T05:17:59.138Z [pebble] GET /v1/changes/1/wait 13.611051ms 200
2024-11-25T05:17:59.138Z [pebble] Started default services with change 1.

It was ignored, not restarted, and the exit code was non-zero.

After it exits, check the services:

$ ./pebble services
Service  Startup  Current  Since
test     enabled  error    today at 13:17 CST

And logs:

$ ./pebble logs
2024-11-25T05:17:59.130Z [test] COPYING
2024-11-25T05:17:59.130Z [test] HACKING.md
2024-11-25T05:17:59.130Z [test] README.md
2024-11-25T05:17:59.130Z [test] SECURITY.md
2024-11-25T05:17:59.130Z [test] client
2024-11-25T05:17:59.130Z [test] cmd
2024-11-25T05:17:59.130Z [test] daemon.test
2024-11-25T05:17:59.130Z [test] docs
2024-11-25T05:17:59.130Z [test] go.mod
2024-11-25T05:17:59.130Z [test] go.sum
2024-11-25T05:17:59.130Z [test] internals
2024-11-25T05:17:59.130Z [test] pebble
2024-11-25T05:17:59.130Z [test] snap
2024-11-25T05:17:59.130Z [test] tests

@IronCore864 IronCore864 requested a review from benhoyt November 25, 2024 09:34
@benhoyt
Copy link
Contributor

benhoyt commented Nov 27, 2024

Per discussion, let's remove removeService / removeServiceInternal altogether -- we don't want the logs disappearing when service.start() has any error. And instead of removeService we'd do service.transition(stateStopped).

@IronCore864
Copy link
Contributor Author

IronCore864 commented Nov 27, 2024

According to a discussion with @benhoyt, we decided to:

  1. Do not close service logs anymore because service logs are ring buffers, and it doesn't matter if we don't close them. Plus, if the service exits with an error, having logs still accessible helps debugging. This means we don't close the logs in startInternal when reaper.StartCommand fails.
  2. servicemanager.Stop is removed because it's essentially closing service logs, which doesn't need to be done.
  3. removeService and removeServiceInternal are removed for the same reason above.
  4. Properly handle state when service can't be started: if starting the service fails for some reason, transition it to the stopped state, which is the only appropriate choice. However, if the service can't be started because it's in another state (for example, terminating), do not transition the state. This part is different from our discussion, and I think this is the appropriate thing to do. Because otherwise there will be deadlock issues: Imagine the service is already in the terminating state, start fails, transitions to stopped, then the stop task would never finish.
  5. Properly handle the state when the user tries to abort the start. In this case, it makes sense to transition into the stopped state. (This feels a bit redundant, because if the service is still starting and the user aborted it, we send SIGKILL, exited function will handle it properly because we added fallthrough. But adding a transition here doesn't change the logic, so it's OK to be a little bit more explicit.)

The above changes are included in the latest commit.

Copy link
Contributor

@benhoyt benhoyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good, and I've tested it locally. Per voice discussion, we decided not to include a "temporary" or "retrying" flag to the task (per notes from Nov 14 Review Meeting) at this point, as charmers can just catch the ChangeError and log/ignore. Can always add that flag later if need be.

@IronCore864 IronCore864 merged commit 4559a6a into canonical:master Nov 28, 2024
18 checks passed
@IronCore864 IronCore864 deleted the restart-services-failed-within-okaydelay branch November 28, 2024 02:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pebble does not try to restart service if exited too quickly for the first time
2 participants