
Async calls get stuck if controller dies #2214

Closed
matt2e opened this issue Jul 31, 2024 · 1 comment · Fixed by #2435

matt2e (Collaborator) commented Jul 31, 2024

Steps to reproduce

  • Comment out db cleanup code used in dev: https://github.com/TBD54566975/ftl/blob/main/backend/controller/sql/databasetesting/devel.go#L77
  • Start FTL and trigger an async call that takes some time. Here's what I did:
    • Make this verb basically just do time.Sleep(time.Minute); return nil (see the sketch after these steps)
    • ftl dev --log-level=DEBUG backend/controller/pubsub/testdata/go/subscriber backend/controller/pubsub/testdata/go/publisher --recreate
    • Go to the console and trigger the verb publisher.publishOneToTopic2
  • Wait a few seconds so we know the async call has started
  • Kill FTL (ctrl-C)
  • Re-run FTL but without --recreate
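For reference, the change in the second step can be as small as making the subscriber's handler sleep. A minimal sketch, assuming a typical FTL Go verb shape (the directive, type, and names here are illustrative, not the exact testdata code):

```go
package subscriber

import (
	"context"
	"time"
)

// Illustrative event type; the real testdata module defines its own.
type Event struct{}

// Sketch only: the exact directive and signature in the subscriber testdata
// module may differ. The point is that the handler runs long enough for the
// controller to be killed while the async call is still executing.
//ftl:verb
func Consume(ctx context.Context, event Event) error {
	time.Sleep(time.Minute)
	return nil
}
```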

Expected result:

  • The lease associated with the previous attempt of the async call will expire, and the async call will be attempted again or fail, depending on the retry policy.

Actual result:

  • The lease gets reaped
  • The async call row loses its reference to the lease, but remains in an executing state forever
  • This blocks the associated subscription / FSM forever

I think this happens for any case where async call leases expire, like:

  • controller losing connection to the db momentarily
  • controller dying

I think it might also happen in this case:

  • controller executes executeAsyncCalls()
  • it successfully acquires a call with `AcquireAsyncCall()`
  • it defers release of the call lease
  • CompleteAsyncCall fails for any reason (eg db connection error)
    Similar to what's outlined above, we now have a released lease and an async call row that will sit in an executing state forever (sketched below).
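A simplified sketch of that flow (function names follow the issue, but the types and signatures here are hypothetical stand-ins, not the controller's actual API):

```go
package main

import (
	"context"
	"errors"
	"log"
)

// Hypothetical stand-ins for the controller's async call machinery.
type asyncCall struct{ id int64 }

type lease struct{}

func (l *lease) Release() error { return nil }

func acquireAsyncCall(ctx context.Context) (*asyncCall, *lease, error) {
	return &asyncCall{id: 1}, &lease{}, nil
}

func completeAsyncCall(ctx context.Context, call *asyncCall) error {
	return errors.New("db connection error") // simulate a transient failure
}

func executeAsyncCalls(ctx context.Context) {
	call, lse, err := acquireAsyncCall(ctx)
	if err != nil {
		return
	}
	// The lease is released no matter what happens below.
	defer lse.Release()

	// ... verb execution happens here ...

	if err := completeAsyncCall(ctx, call); err != nil {
		// The row is still in 'executing', but the deferred Release drops the
		// lease, so nothing will ever pick this call up again.
		log.Printf("completing async call %d failed: %v", call.id, err)
	}
}

func main() {
	executeAsyncCalls(context.Background())
}
```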
@github-actions github-actions bot added the triage Issue needs triaging label Jul 31, 2024
@ftl-robot ftl-robot mentioned this issue Jul 31, 2024

matt2e (Collaborator, Author) commented Jul 31, 2024

I think in other use cases for leases we have special logic to handle the necessary knock-on effects, but it seems like we didn't do it for async calls, or it's not working.

@jonathanj-square jonathanj-square added the next Work that will be picked up next label Jul 31, 2024
@github-actions github-actions bot removed the triage Issue needs triaging label Jul 31, 2024
@matt2e matt2e self-assigned this Aug 19, 2024
@github-actions github-actions bot removed the next Work that will be be picked up next label Aug 19, 2024
github-merge-queue bot pushed a commit that referenced this issue Aug 27, 2024
fixes #2214

Changes:
- The controller now calls the verb within the lease context, so a failure to heartbeat the lease cancels the gRPC call (see the sketch below)
- A new periodic task cleans up async calls that are in an executing state but have no lease (a rough sketch of this reaper loop follows the known issues below)
    - This happens after the lease has been reaped in a separate job
    - Cleaning up involves:
        - Scheduling retries/catches according to the policy
        - Triggering the origin-specific code (eg: fsm async calls need the next event table to be cleared)
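A rough illustration of the first change, assuming a lease that exposes a context which is cancelled when heartbeating fails (the types and helpers here are hypothetical, not the controller's exact API):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical lease whose context is cancelled if heartbeating fails.
type lease struct {
	ctx    context.Context
	cancel context.CancelFunc
}

func acquireLease(parent context.Context) *lease {
	ctx, cancel := context.WithCancel(parent)
	l := &lease{ctx: ctx, cancel: cancel}
	go l.heartbeat()
	return l
}

func (l *lease) heartbeat() {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for range ticker.C {
		if err := renewLeaseInDB(); err != nil {
			// Failing to heartbeat cancels the lease context, which in turn
			// aborts the in-flight call to the verb below.
			l.cancel()
			return
		}
	}
}

// Stand-in for renewing the lease row in the database.
func renewLeaseInDB() error { return nil }

// Stand-in for the gRPC call to the verb: passing the lease context means the
// call is cancelled as soon as the lease can no longer be heartbeated.
func callVerb(ctx context.Context) error {
	select {
	case <-time.After(5 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	l := acquireLease(context.Background())
	fmt.Println(callVerb(l.ctx))
}
```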

Known issues:
- If there is a repeating error cleaning up an async call, then it will just repeatedly fail
- Java does not currently have a way to stop execution of the call that gets canceled, so it will always continue even when the async call's lease expires
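For context on the cleanup behaviour described above, the new periodic task amounts to a reaper loop roughly like this sketch (the table/column names and helper here are assumptions for illustration, not the controller's actual schema or API):

```go
package controller

import (
	"context"
	"database/sql"
	"log"
	"time"
)

// Assumed shape only: periodically find async calls still marked 'executing'
// whose lease has already been reaped, then hand them to the retry/catch
// scheduling logic plus any origin-specific cleanup.
func reapStuckAsyncCalls(ctx context.Context, db *sql.DB) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Table and column names are illustrative guesses.
			rows, err := db.QueryContext(ctx,
				`SELECT id FROM async_calls WHERE state = 'executing' AND lease_id IS NULL`)
			if err != nil {
				log.Printf("querying stuck async calls: %v", err)
				continue
			}
			for rows.Next() {
				var id int64
				if err := rows.Scan(&id); err != nil {
					break
				}
				// Hypothetical stand-in for policy-driven rescheduling and
				// origin-specific cleanup (e.g. clearing the FSM next event).
				// A persistent error here just repeats on the next tick,
				// matching the first known issue above.
				if err := scheduleRetryOrCatch(ctx, db, id); err != nil {
					log.Printf("cleaning up async call %d: %v", id, err)
				}
			}
			_ = rows.Close()
		}
	}
}

func scheduleRetryOrCatch(ctx context.Context, db *sql.DB, id int64) error {
	return nil // stand-in
}
```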

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>