Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Runner id collisions causes runner kill loop with LocalScaling deterministic runner IDs #1036

Closed
matt2e opened this issue Mar 7, 2024 · 0 comments · Fixed by #1038
Closed
Assignees

Comments

@matt2e
Copy link
Collaborator

matt2e commented Mar 7, 2024

Steps to repro

  • ftl serve --recreate --log-level=DEBUG
  • ftl deploy examples/go
  • ftl ps --verbose
  • ftl kill ___ <-- choose the deployment with a lower runner id, which is the time one
  • At this point there will be a remaining runner with an id like R00000000000000000000004000
  • ftl deploy examples/go/time
  • LocalScaling then creates a new runner, but chooses the id to be R00000000000000000000004000 (it's generated using the current number of runners)
  • The leads to the existing runner with that id to be killed to allow this one to start up. Which then causes another runner to be kicked off to redeploy the echo deployment, and the cycle starts to repeat
@github-actions github-actions bot added the triage Issue needs triaging label Mar 7, 2024
@alecthomas alecthomas mentioned this issue Mar 7, 2024
@alecthomas alecthomas changed the title Runner id collisions causes runner kill loop Runner id collisions causes runner kill loop with LocalScaling deterministic runner IDs Mar 7, 2024
@matt2e matt2e self-assigned this Mar 7, 2024
@github-actions github-actions bot removed the triage Issue needs triaging label Mar 7, 2024
matt2e added a commit that referenced this issue Mar 7, 2024
Adds `--idle-runners` arg to define how large the idle pool should be.

Fixes #1036
Fixes #1030

### Previous notes
Currently a draft because this PR makes #1036 more likely to be hit.
Before this change, killing all deployments would mean there are 0
runners, leading to no runner id collisions when you bring up more
deployments
After this change, killing all deployments means that there will still
be runners which will cause collisions if the idle runner ids aren't the
lowest possible [`R00000000000000000000002000`,
`R00000000000000000000004000` ... ]

I've been testing with a hacky fix replacing line
`bankend/controller/scaling/local_scaling.go:96` to:
```
binary.BigEndian.PutUint32(ulid[10:], rand.Uint32())
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant