Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestHashRing+TestServiceWithMockDal fail go test -race #1368

Closed
alecthomas opened this issue Apr 30, 2024 · 4 comments · Fixed by #1369
Closed

TestHashRing+TestServiceWithMockDal fail go test -race #1368

alecthomas opened this issue Apr 30, 2024 · 4 comments · Fixed by #1369
Assignees

Comments

@alecthomas
Copy link
Collaborator

alecthomas commented Apr 30, 2024

TestHashRing and TestServiceWithMockDal fail the race detector. It's disabled for now, but needs to be fixed and reenabled.

@alecthomas alecthomas changed the title TestHashRing fails go test -race TestHashRing+TestServiceWithMockDal fail go test -race Apr 30, 2024
@matt2e matt2e linked a pull request May 1, 2024 that will close this issue
@matt2e matt2e reopened this May 1, 2024
@matt2e
Copy link
Collaborator

matt2e commented May 1, 2024

Reopening, hit this case:

panic: test timed out after 10m0s
running tests:
	TestServiceWithMockDal (10m0s)

@matt2e
Copy link
Collaborator

matt2e commented May 1, 2024

Error report is here https://github.com/TBD54566975/ftl/actions/runs/8903975228/job/24452548713?pr=1360

From what I can tell TestServiceWithMockDal is getting stuck in benbjohnson/clock somehow

The main test function is blocked trying to advance the mock clock but is getting stuck when sending a message to its channel:

goroutine 34 [chan send, 9 minutes]:
github.com/benbjohnson/clock.(*internalTimer).Tick(0xc0000e2aa0, {0xe7791f704?, 0xd57b60?, 0xd57b60?})
	/home/runner/go/pkg/mod/github.com/benbjohnson/[email protected]/clock.go:353 +0x185
github.com/benbjohnson/clock.(*Mock).runNextTimer(0xc0001b23c0, {0xe7791f705?, 0xd57b60?, 0xd57b60?})
	/home/runner/go/pkg/mod/github.com/benbjohnson/[email protected]/clock.go:179 +0x1f2
github.com/benbjohnson/clock.(*Mock).Add(0xc0001b23c0, 0x77359400)
	/home/runner/go/pkg/mod/github.com/benbjohnson/[email protected]/clock.go:101 +0x9e
github.com/TBD54566975/ftl/backend/controller/cronjobs.testServiceWithDal({0xa8e1e8, 0xc0000246e0}, 0xc0000bc9c0, {0xa8f4b0, 0xc00021e080}, {0xa90f30, 0xc0001b23c0})
	/home/runner/work/ftl/ftl/backend/controller/cronjobs/cronjobs_utils_test.go:248 +0x6de
github.com/TBD54566975/ftl/backend/controller/cronjobs.TestServiceWithMockDal(0xc0000bc9c0)
	/home/runner/work/ftl/ftl/backend/controller/cronjobs/cronjobs_test.go:38 +0x237
testing.tRunner(0xc0000bc9c0, 0xa0e470)
	/home/runner/.cache/hermit/pkg/go-1.22.2/src/testing/testing.go:1[68](https://github.com/TBD54566975/ftl/actions/runs/8903975228/job/24452548713?pr=1360#step:7:69)9 +0x21f
created by testing.(*T).Run in goroutine 1
	/home/runner/.cache/hermit/pkg/go-1.22.2/src/testing/testing.go:1742 +0x826

While a bunch of other goroutines are blocked trying to access clock.Now() but can't get a lock (internal to benbjohnson/clock)

From clock.go

// This should only be called from a single goroutine at a time.
func (m *Mock) Add(d time.Duration) {

We adhere to this fine, this is the only place we are advancing the clock in this test: cronjobs_utils_test.go#L248
And I can't see any other notes of things to be careful about with multiple go routines

I think what is happening is:
(prefixing each step with its goroutine, either :

  • A: the goroutine the test is running in)
  • B: the goroutine that cronjobs.go is watching in

Steps:

  • B: cronjobs.go waits for signals from the mock clock, among other things
  • A: We try to increment the mock clock
    • it gets a lock on the mock clock's mutex
  • B: [race condition] cronjobs.go tries to access Now(), blocks because it can't get a lock
  • A: the mock clock tries to push the new time to the channel clock.go#L353
    • This blocks the mock clock because the listener is not ready to start receiving
  • [A+B] ... deadlock

I haven't got a solid footing in goroutines/channels yet, so I may have missed something

Also FYI, the clock library has been archived for about a year now:

matt2e added a commit that referenced this issue May 1, 2024
To unblock main branch from failing CI with this issue:
#1368
@matt2e
Copy link
Collaborator

matt2e commented May 10, 2024

Closing for now. Test is skipped, and integration test tests the same thing with the real db

@matt2e matt2e closed this as completed May 10, 2024
matt2e added a commit that referenced this issue May 13, 2024
This might make the test more reliable. fixes
#1446
We may have been hitting a deadlock situation where cronjobs were
setting up their observation of channels and calling `clock.Now()` while
the test tried to increment the mock clock (exact case is detailed more
in #1368).
@gak
Copy link
Contributor

gak commented May 15, 2024

Just saw this on my PR. I restarted the job which worked. https://github.com/TBD54566975/ftl/actions/runs/9090895040/attempts/1

wesbillman pushed a commit that referenced this issue Jul 18, 2024
fixes #2092
- mock clock was causing the same race conditions as this issue:
#1368 (comment)
- instead we are now wrapping the ASM provider with a manual sync
provider
- This wrapper allows us to trigger when syncs should happen in tests,
and block until sync completes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants