panic in pubsub lib Topic #2554

wesbillman · 2024-08-29T20:46:56Z

Running `just e2e-frontend` in CI hits a panic here: https://github.com/TBD54566975/ftl/actions/runs/10622270938/job/29446142300#step:6:1238

This command runs `ftl dev --recreate` under the hood.

panic: ack timeout

goroutine 180 [running]:
github.com/alecthomas/types/pubsub.(*Topic[...]).run(0x2203100)
	/home/runner/go/pkg/mod/github.com/alecthomas/[email protected]/pubsub/pubsub.go:197 +0x6cb
created by github.com/alecthomas/types/pubsub.New[...] in goroutine 127
/home/runner/go/pkg/mod/github.com/alecthomas/[email protected]/pubsub/pubsub.go:67

Note: this is not necessarily part of FTL's pubsub feature. It is caused by something using @alecthomas's pubsub library.

matt2e · 2024-08-30T00:07:38Z

This happened in production on Aug 29, 2024 at 2:10:40.517 pm AEST.
This happened 38s after the first occurrence of this issue on the same controller: #2539
The second time we saw that other issue, it was not followed by this issue. Possibly because they are unrelated, or because that controller's pubsub setup was broken so subsequent issues didn't cause a similar panic.

production logs:

panic: ack timeout
goroutine 31 [running]:
github.com/alecthomas/types/pubsub.(*Topic[...]).run(0x1cf3dc0)
/root/go/pkg/mod/github.com/alecthomas/[email protected]/pubsub/pubsub.go:197 +0x6cb
created by github.com/alecthomas/types/pubsub.New[...] in goroutine 1
/root/go/pkg/mod/github.com/alecthomas/[email protected]/pubsub/pubsub.go:67 +0x10e

matt2e · 2024-08-30T06:30:55Z

line in the lib: https://github.com/alecthomas/types/blob/92ffae5908acce44483cd09ed1c7918fea61f7d8/pubsub/pubsub.go#L197

We use the lib in a few places, but judging from the linked CI logs, it could be:

DAL.DeploymentChanges https://github.com/TBD54566975/ftl/blob/main/backend/controller/dal/dal.go#L244
buildengine.schemaChanges https://github.com/TBD54566975/ftl/blob/main/internal/buildengine/engine.go#L73

I don't think it can be configuration.cache, as I don't think that'd be used in the integration test.

…#2563)

fixes #2554

Current theory is this:
- The cluster has a low number of modules active (let's say 0).
- A gRPC call comes into the controller to PullSchema.
- This causes us to subscribe to schema changes, with a chan whose length is the count of the current active modules (in our example, 0).
- The lib then creates an extra buffer to queue up messages while the subscriber processes messages, with the buffer being the same size as the provided chan (again, 0).
- As a schema change message is received by the subscriber, we stream the message back over gRPC. This may take time.
- While this is happening, another schema update may occur. The lib will not be able to ack the message because the buffer size is too small (0), so it will wait for the message to be received.
- The lib will time out if the networking is not done fast enough.

Repro'd by doing the following:
- Added a sleep step before sending the schema update over the network.
- Started up FTL without any modules.
- Ran `ftl schema get --watch`.
- Ran `ftl deploy` with a bunch of modules.
- Hit the `ack timeout` panic.

Could not repro after making the chan length always be a decent size.

wesbillman added bug Something isn't working P0 Work on this now labels Aug 29, 2024

github-actions bot added the triage Issue needs triaging label Aug 29, 2024

ftl-robot mentioned this issue Aug 29, 2024

Dashboard #728

Open

jvmakine added next Work that will be picked up next and removed triage Issue needs triaging labels Aug 29, 2024

matt2e changed the title ~~panic in pubsub Topic~~ panic in pubsub lib Topic Aug 30, 2024

matt2e self-assigned this Aug 30, 2024

github-actions bot removed the next Work that will be picked up next label Aug 30, 2024

matt2e mentioned this issue Aug 30, 2024

fix: prevent tiny buffer preventing subscriber acking message quickly #2563

Merged

matt2e closed this as completed in #2563 Aug 30, 2024
