
panic in pubsub lib Topic #2554

Closed
wesbillman opened this issue Aug 29, 2024 · 2 comments · Fixed by #2563
Assignees
Labels
bug Something isn't working P0 Work on this now

Comments

@wesbillman
Collaborator

wesbillman commented Aug 29, 2024

Running just e2e-frontend in CI is getting a panic here: https://github.com/TBD54566975/ftl/actions/runs/10622270938/job/29446142300#step:6:1238

This command is running ftl dev --recreate under the hood.

panic: ack timeout

goroutine 180 [running]:
github.com/alecthomas/types/pubsub.(*Topic[...]).run(0x2203100)
	/home/runner/go/pkg/mod/github.com/alecthomas/[email protected]/pubsub/pubsub.go:197 +0x6cb
created by github.com/alecthomas/types/pubsub.New[...] in goroutine 127
/home/runner/go/pkg/mod/github.com/alecthomas/[email protected]/pubsub/pubsub.go:67

Note: this is not necessarily part of FTL's pubsub feature. It is caused by something using @alecthomas's pubsub library.

@wesbillman wesbillman added bug Something isn't working P0 Work on this now labels Aug 29, 2024
@github-actions github-actions bot added the triage Issue needs triaging label Aug 29, 2024
@ftl-robot ftl-robot mentioned this issue Aug 29, 2024
@jvmakine jvmakine added next Work that will be picked up next and removed triage Issue needs triaging labels Aug 29, 2024
@matt2e matt2e changed the title panic in pubsub Topic panic in pubsub lib Topic Aug 30, 2024
@matt2e
Collaborator

matt2e commented Aug 30, 2024

This happened in production on Aug 29, 2024 at 2:10:40.517 pm AEST.
It happened 38s after the first occurrence of this issue on the same controller: #2539
The second time we saw that other issue, it was not followed by this issue. Possibly they are unrelated, or that controller's pubsub setup was already broken so subsequent issues didn't cause a similar panic.

production logs:

panic: ack timeout
goroutine 31 [running]:
github.com/alecthomas/types/pubsub.(*Topic[...]).run(0x1cf3dc0)
/root/go/pkg/mod/github.com/alecthomas/[email protected]/pubsub/pubsub.go:197 +0x6cb
created by github.com/alecthomas/types/pubsub.New[...] in goroutine 1
/root/go/pkg/mod/github.com/alecthomas/[email protected]/pubsub/pubsub.go:67 +0x10e

@matt2e
Collaborator

matt2e commented Aug 30, 2024

line in the lib: https://github.com/alecthomas/types/blob/92ffae5908acce44483cd09ed1c7918fea61f7d8/pubsub/pubsub.go#L197

We use the lib in a few places, but judging from the linked CI logs, it could be:

I don't think it can be configuration.cache, as I don't think that'd be used in the integration test.

@matt2e matt2e self-assigned this Aug 30, 2024
@github-actions github-actions bot removed the next Work that will be picked up next label Aug 30, 2024
github-merge-queue bot pushed a commit that referenced this issue Aug 30, 2024
…#2563)

fixes #2554

Current theory is this:
- cluster has a low number of active modules (let's say 0)
- a gRPC call comes into the controller to PullSchema
- this causes us to subscribe to schema changes, with a chan whose length equals the count of currently active modules (in our example, 0)
- the lib then creates an extra buffer to queue up messages while the subscriber processes them, with the buffer being the same size as the provided chan (again, 0)
- as a schema change message is received by the subscriber, we stream the message back over gRPC. This may take time.
- while this is happening, another schema update may occur. The lib will not be able to ack the message because the buffer size is too small (0), so it will wait for the message to be received.
- the lib will time out if the networking is not done fast enough.

Repro'd by doing the following:
- Added a sleep step before sending the schema update over the network.
- Started up FTL without any modules.
- Ran `ftl schema get --watch`.
- Ran `ftl deploy` with a bunch of modules.
- Hit the `ack timeout` panic.

Could not repro after making the chan length always be a decent size.