feat: change how runner provisioning works #2590
Conversation
Awesome. Loving how much code this allowed us to delete!
@@ -1249,144 +1240,6 @@ func (s *Service) reapStaleRunners(ctx context.Context) (time.Duration, error) {
	return s.config.RunnerTimeout, nil
}

// Release any expired runner deployment reservations.
Glorious!
@@ -627,7 +618,8 @@ func (s *Service) RegisterRunner(ctx context.Context, stream *connect.ClientStre

 // Check if we can contact the runner.
 func (s *Service) pingRunner(ctx context.Context, endpoint *url.URL) error {
-	client := rpc.Dial(ftlv1connect.NewRunnerServiceClient, endpoint.String(), log.Error)
+	// TODO: do we really need to ping the runner first thing? We should revisit this later
+	client := rpc.Dial(ftlv1connect.NewVerbServiceClient, endpoint.String(), log.Error)
This is to ensure the controller can connect to the runner; the runner being able to connect to the controller doesn't guarantee the reverse.
I was thinking about the case we talked about, where the runner could potentially be behind a normal kube service and we let kube handle the load balancing. I'll delete the comment though; that's future work.
Yeah we definitely should revisit that if we do go down that route.
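For context, a hedged sketch of what the check might look like after this change. `rpc.Dial`, `ftlv1connect.NewVerbServiceClient`, and `log.Error` come from the diff above; the `Ping` RPC, `ftlv1.PingRequest`, and the timeout are assumptions based on typical Connect-style services, and imports are taken to match the surrounding file:

```go
// Sketch only: verify controller -> runner connectivity by pinging the
// runner's verb service. Assumes the generated Connect client exposes a
// Ping RPC taking an ftlv1.PingRequest.
func (s *Service) pingRunner(ctx context.Context, endpoint *url.URL) error {
	client := rpc.Dial(ftlv1connect.NewVerbServiceClient, endpoint.String(), log.Error)
	// Bound the check so an unreachable runner fails fast rather than hanging.
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	if _, err := client.Ping(ctx, connect.NewRequest(&ftlv1.PingRequest{})); err != nil {
		return fmt.Errorf("failed to ping runner at %s: %w", endpoint, err)
	}
	return nil
}
```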
// '{"languages": ["go", "kotlin"], "os": "linux", "arch": "amd64", "pid": 1234}' | ||
// | ||
// If no runners are available, it will return an empty slice. | ||
func (d *DAL) GetIdleRunners(ctx context.Context, limit int, labels model.Labels) ([]Runner, error) { |
So good deleting all this code 🤗
@@ -75,13 +75,9 @@ WITH deployment_rel AS (
-- otherwise we try to retrieve the deployments.id using the key. If
-- there is no corresponding deployment, then the deployment ID is -1
-- and the parent statement will fail due to a foreign key constraint.
Comment needs to be updated to reflect reality I think?
@@ -110,8 +105,7 @@ FROM matches;
-- name: DeregisterRunner :one
Now that runners are ephemeral I think we'll want to periodically reap dead runners from the database. Will create a ticket.
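For the ticket, a rough sketch of what a periodic reaper could look like, modelled on `reapStaleRunners` from the diff above. `KillDeadRunners` is a hypothetical DAL method, and `log.FromContext` is assumed to follow the project's logging style; none of this is in the PR:

```go
// Sketch only: periodically delete runner rows that have been dead longer
// than the runner timeout. s.dal.KillDeadRunners is a hypothetical method.
func (s *Service) reapDeadRunners(ctx context.Context) (time.Duration, error) {
	count, err := s.dal.KillDeadRunners(ctx, s.config.RunnerTimeout)
	if err != nil {
		return 0, fmt.Errorf("failed to reap dead runners: %w", err)
	}
	if count > 0 {
		log.FromContext(ctx).Debugf("Reaped %d dead runners", count)
	}
	// Re-run on the same cadence as the other reapers.
	return s.config.RunnerTimeout, nil
}
```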
Something I was wondering is whether the states are really relevant any more, although I left them in for now as the refactor was getting pretty big. Basically, with ephemeral runners, is there any advantage to tracking 'dead' runners? In theory a runner is either there or it is not, so maybe the runners table should just track the runners that are actually connected?
I think you're right that we can reduce them, but I think there will still need to be at least two, maybe? "NEW" while the runner is preparing to serve traffic, and "READY" once it has fully deployed the module.
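For illustration, a sketch of that two-state set; the names are placeholders, not from this PR:

```go
// Sketch only: a reduced runner state set per the discussion above.
type RunnerState int

const (
	// RunnerStateNew: the runner is preparing to serve traffic (still
	// downloading and starting its assigned deployment).
	RunnerStateNew RunnerState = iota
	// RunnerStateReady: the module is fully deployed and serving.
	RunnerStateReady
)
```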
We could pull the content and get ready before the runner registers itself, so the controller only knows about runners that are actually ready to serve.
Hmm yeah that could work!
I do vaguely recollect that the DEAD state was to keep track of runners that the controller lost, but I now can't remember why I thought that was important.
There is still work to do around kube health checks as well; this pull-the-content-before-you-register approach would probably work better for that too. We could avoid marking the pod ready until the deployment is ready to go.
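One way that could look, as a sketch: the runner gates a Kubernetes readinessProbe on the deployment being downloaded and running. Everything here (handler path, flag, port) is an assumption, not part of this PR:

```go
// Sketch only: keep the pod out of the kube Service until the module is up.
package main

import (
	"net/http"
	"sync/atomic"
)

// deploymentReady is flipped to true once the assigned deployment is serving.
var deploymentReady atomic.Bool

func main() {
	http.HandleFunc("/healthz/ready", func(w http.ResponseWriter, r *http.Request) {
		if !deploymentReady.Load() {
			// readinessProbe fails, so kube withholds traffic from this pod.
			http.Error(w, "deployment not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	// ... elsewhere: deploymentReady.Store(true) once the module is serving.
	_ = http.ListenAndServe(":8080", nil)
}
```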
This changes the way runners are provisioned, and how runners are allocated. Runners are
now spawned knowing exactly which kube deployment they are for, and will always
immediately download and run that deployment.
For Kubernetes environments, replicas are controlled by creating a kube deployment
for each FTL deployment and adjusting the number of replicas.
For local scaling we create the runners directly for deployments as required.
This also introduces an initial Kubernetes test.

fixes: #2449 #2276
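As a sketch of the replica-adjustment side, assuming client-go; the function and its names are illustrative, not the PR's actual implementation:

```go
// Sketch only: scale the per-FTL-deployment kube Deployment to the desired
// replica count via the scale subresource.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func setReplicas(ctx context.Context, client kubernetes.Interface, namespace, name string, replicas int32) error {
	deployments := client.AppsV1().Deployments(namespace)
	// Use the scale subresource rather than rewriting the whole Deployment spec.
	scale, err := deployments.GetScale(ctx, name, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("failed to get scale for %s/%s: %w", namespace, name, err)
	}
	scale.Spec.Replicas = replicas
	if _, err := deployments.UpdateScale(ctx, name, scale, metav1.UpdateOptions{}); err != nil {
		return fmt.Errorf("failed to scale %s/%s to %d: %w", namespace, name, replicas, err)
	}
	return nil
}
```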