feat: change how runner provisioning works #2590

Merged
merged 3 commits into main from stuartwdouglas/provisioning on Sep 11, 2024

Conversation

stuartwdouglas
Collaborator

@stuartwdouglas stuartwdouglas commented Sep 3, 2024

This changes how runners are provisioned and how they are allocated. Runners are
now spawned knowing exactly which kube deployment they are for, and will always
immediately download and run that deployment.

For Kubernetes environments, replicas are controlled by creating a kube deployment
for each FTL deployment and adjusting the number of replicas.

For local scaling, we create the runners directly for deployments as required.

This also introduces an initial Kubernetes test.
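
As a rough sketch of the Kubernetes mechanism described above (hypothetical names and wiring, not the PR's actual code), scaling an FTL deployment amounts to adjusting the replica count of the kube Deployment that backs it, e.g. via client-go:

```go
package scalingsketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// setReplicas adjusts the replica count of the kube Deployment backing an FTL
// deployment. The function name and error handling are illustrative only.
func setReplicas(ctx context.Context, client kubernetes.Interface, namespace, deployment string, replicas int32) error {
	// Fetch the current scale subresource for the backing Deployment.
	scale, err := client.AppsV1().Deployments(namespace).GetScale(ctx, deployment, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("get scale for %s: %w", deployment, err)
	}
	// Set the desired replica count and write it back.
	scale.Spec.Replicas = replicas
	if _, err := client.AppsV1().Deployments(namespace).UpdateScale(ctx, deployment, scale, metav1.UpdateOptions{}); err != nil {
		return fmt.Errorf("update scale for %s: %w", deployment, err)
	}
	return nil
}
```

The local scaling backend would instead spawn or stop runner processes directly to reach the same target count.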

@stuartwdouglas stuartwdouglas force-pushed the stuartwdouglas/provisioning branch from 6d53f5e to 51787d3 on September 3, 2024 22:08
@stuartwdouglas stuartwdouglas force-pushed the stuartwdouglas/provisioning branch from 11e5893 to b34608a on September 3, 2024 22:16
@stuartwdouglas stuartwdouglas added the run-all label (a PR with this label will run the full set of CI jobs in the PR rather than in the merge queue) on Sep 3, 2024
@stuartwdouglas stuartwdouglas force-pushed the stuartwdouglas/provisioning branch 2 times, most recently from f3e6b02 to ef37e1a on September 3, 2024 23:13
@stuartwdouglas stuartwdouglas added the skip-proto-breaking label (PRs with this label will skip the breaking proto check) on Sep 3, 2024
@stuartwdouglas stuartwdouglas force-pushed the stuartwdouglas/provisioning branch 4 times, most recently from 4d38721 to 63a03d5 on September 4, 2024 00:38
@stuartwdouglas stuartwdouglas force-pushed the stuartwdouglas/provisioning branch 9 times, most recently from 9a279de to 5b85ce8 on September 4, 2024 05:53
@stuartwdouglas stuartwdouglas marked this pull request as ready for review September 4, 2024 05:54
@stuartwdouglas stuartwdouglas requested review from a team and deniseli and removed request for a team September 4, 2024 05:54
@stuartwdouglas stuartwdouglas force-pushed the stuartwdouglas/provisioning branch 2 times, most recently from a81f292 to 87936e0 on September 4, 2024 06:30
Collaborator

@alecthomas alecthomas left a comment

Awesome. Loving how much code this allowed us to delete!

@@ -1249,144 +1240,6 @@ func (s *Service) reapStaleRunners(ctx context.Context) (time.Duration, error) {
return s.config.RunnerTimeout, nil
}

// Release any expired runner deployment reservations.
Collaborator

Glorious!

@@ -627,7 +618,8 @@ func (s *Service) RegisterRunner(ctx context.Context, stream *connect.ClientStre

// Check if we can contact the runner.
func (s *Service) pingRunner(ctx context.Context, endpoint *url.URL) error {
-	client := rpc.Dial(ftlv1connect.NewRunnerServiceClient, endpoint.String(), log.Error)
+	// TODO: do we really need to ping the runner first thing? We should revisit this later
+	client := rpc.Dial(ftlv1connect.NewVerbServiceClient, endpoint.String(), log.Error)
Collaborator

This is to ensure the controller can connect to the runner; the runner being able to connect to the controller doesn't guarantee the reverse.

Collaborator Author

I was thinking about the case we talked about where the runner could potentially be behind a normal kube service, with kube handling the load balancing. I will delete the comment though; that is future work.

Collaborator

Yeah we definitely should revisit that if we do go down that route.

// '{"languages": ["go", "kotlin"], "os": "linux", "arch": "amd64", "pid": 1234}'
//
// If no runners are available, it will return an empty slice.
func (d *DAL) GetIdleRunners(ctx context.Context, limit int, labels model.Labels) ([]Runner, error) {
Collaborator

So good deleting all this code 🤗

@@ -75,13 +75,9 @@ WITH deployment_rel AS (
-- otherwise we try to retrieve the deployments.id using the key. If
-- there is no corresponding deployment, then the deployment ID is -1
-- and the parent statement will fail due to a foreign key constraint.
Collaborator

Comment needs to be updated to reflect reality I think?

@@ -110,8 +105,7 @@ FROM matches;
-- name: DeregisterRunner :one
Collaborator

Now that runners are ephemeral I think we'll want to periodically reap dead runners from the database. Will create a ticket.

Collaborator Author

Something I was wondering is whether the states are really relevant any more, although I left them in for now as the refactor was getting pretty big. Basically, with ephemeral runners, is there any advantage to tracking 'dead' runners? In theory a runner is either there or it is not, so maybe the runners table should just track the runners that are actually connected?

Collaborator

I think you're right that we can reduce them, but I think there will still need to be at least two, maybe? "NEW" while the runner is preparing to serve traffic and "READY" once it has fully deployed the module.

Collaborator Author

We could pull the content and get ready before the runner registers itself, so the controller only knows about runners that are actually ready to serve.

Collaborator

Hmm yeah that could work!

I do vaguely recollect that the DEAD state was to keep track of runners that the controller lost, but I now can't remember why I thought that was important.

Collaborator Author

There is still work to do around kube health checks as well; this approach of pulling the content before registering would probably work better for that too. We could hold off on marking the pod ready until the deployment is ready to go.
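
A minimal sketch of the startup ordering discussed in this thread, assuming hypothetical hook names (this is not the runner's actual API): the runner pulls and starts its assigned deployment first and registers with the controller last, so registration itself implies readiness:

```go
package runnersketch

import (
	"context"
	"fmt"
)

// StartupHooks stands in for the real runner's startup steps; every field here
// is an assumption made for illustration.
type StartupHooks struct {
	Download func(ctx context.Context, deploymentKey string) error // fetch the deployment artefacts
	Run      func(ctx context.Context, deploymentKey string) error // start serving the deployment
	Register func(ctx context.Context, deploymentKey string) error // announce the runner to the controller
}

// Start downloads and runs the assigned deployment before registering, so the
// controller only ever learns about runners that are ready to serve traffic.
func (h StartupHooks) Start(ctx context.Context, deploymentKey string) error {
	if err := h.Download(ctx, deploymentKey); err != nil {
		return fmt.Errorf("download %s: %w", deploymentKey, err)
	}
	if err := h.Run(ctx, deploymentKey); err != nil {
		return fmt.Errorf("run %s: %w", deploymentKey, err)
	}
	// Registering last also fits the kube health-check idea: the pod can be
	// marked ready at the same point the runner registers.
	return h.Register(ctx, deploymentKey)
}
```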

backend/controller/scaling/k8sscaling/k8s_scaling.go (outdated; resolved)
backend/controller/scaling/k8sscaling/k8s_scaling.go (outdated; resolved)
backend/controller/scaling/scaling.go (resolved)
backend/runner/runner.go (outdated; resolved)
@stuartwdouglas stuartwdouglas force-pushed the stuartwdouglas/provisioning branch from 2f2bd32 to bfefcaf on September 4, 2024 11:26
@stuartwdouglas stuartwdouglas force-pushed the stuartwdouglas/provisioning branch 5 times, most recently from 00f5ec4 to d9ac655 on September 4, 2024 23:43
This changes how runners are provisioned and how they are allocated. Runners are
now spawned knowing exactly which kube deployment they are for, and will always
immediately download and run that deployment.

For Kubernetes environments, replicas are controlled by creating a kube deployment
for each FTL deployment and adjusting the number of replicas.

For local scaling, we create the runners directly for deployments as required.

This also introduces an initial Kubernetes test.

fixes: #2449 #2276
@stuartwdouglas stuartwdouglas force-pushed the stuartwdouglas/provisioning branch from 75e76ba to 502d0ef on September 11, 2024 00:50
@stuartwdouglas stuartwdouglas added this pull request to the merge queue Sep 11, 2024
Merged via the queue into main with commit f605e23 Sep 11, 2024
89 checks passed
@stuartwdouglas stuartwdouglas deleted the stuartwdouglas/provisioning branch September 11, 2024 01:29