feat: change how runner provisioning works #2590

Merged
merged 3 commits into main from stuartwdouglas/provisioning on Sep 11, 2024

Conversation

stuartwdouglas
Collaborator

@stuartwdouglas stuartwdouglas commented Sep 3, 2024

This changes how runners are provisioned and how they are allocated. Runners are
now spawned knowing exactly which kube deployment they are for, and will always
immediately download and run that deployment.

For Kubernetes environments, replicas are controlled by creating a kube deployment
for each FTL deployment and adjusting the number of replicas.

For local scaling, we create the runners directly for deployments as required.

This also introduces an initial Kubernetes test.
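
As a rough sketch of the Kubernetes mechanism described above (hypothetical names and wiring, not the PR's actual code), scaling an FTL deployment amounts to adjusting the replica count of the kube Deployment that backs it, e.g. via client-go:

```go
package scalingsketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// setReplicas adjusts the replica count of the kube Deployment backing an FTL
// deployment. The function name and error handling are illustrative only.
func setReplicas(ctx context.Context, client kubernetes.Interface, namespace, deployment string, replicas int32) error {
	// Fetch the current scale subresource for the backing Deployment.
	scale, err := client.AppsV1().Deployments(namespace).GetScale(ctx, deployment, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("get scale for %s: %w", deployment, err)
	}
	// Set the desired replica count and write it back.
	scale.Spec.Replicas = replicas
	if _, err := client.AppsV1().Deployments(namespace).UpdateScale(ctx, deployment, scale, metav1.UpdateOptions{}); err != nil {
		return fmt.Errorf("update scale for %s: %w", deployment, err)
	}
	return nil
}
```

The local scaling backend would instead spawn or stop runner processes directly to reach the same target count.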

@stuartwdouglas stuartwdouglas force-pushed the stuartwdouglas/provisioning branch from 6d53f5e to 51787d3 on September 3, 2024 22:08
@stuartwdouglas stuartwdouglas force-pushed the stuartwdouglas/provisioning branch from 11e5893 to b34608a on September 3, 2024 22:16
@stuartwdouglas stuartwdouglas added the run-all label (a PR with this label will run the full set of CI jobs in the PR rather than in the merge queue) on Sep 3, 2024
@stuartwdouglas stuartwdouglas force-pushed the stuartwdouglas/provisioning branch 2 times, most recently from f3e6b02 to ef37e1a on September 3, 2024 23:13
@stuartwdouglas stuartwdouglas added the skip-proto-breaking label (PRs with this label will skip the breaking proto check) on Sep 3, 2024
@stuartwdouglas stuartwdouglas force-pushed the stuartwdouglas/provisioning branch 4 times, most recently from 4d38721 to 63a03d5 on September 4, 2024 00:38
@stuartwdouglas stuartwdouglas force-pushed the stuartwdouglas/provisioning branch 9 times, most recently from 9a279de to 5b85ce8 on September 4, 2024 05:53
@stuartwdouglas stuartwdouglas marked this pull request as ready for review September 4, 2024 05:54
@stuartwdouglas stuartwdouglas requested review from a team and deniseli and removed request for a team September 4, 2024 05:54
@stuartwdouglas stuartwdouglas force-pushed the stuartwdouglas/provisioning branch 2 times, most recently from a81f292 to 87936e0 on September 4, 2024 06:30
Collaborator

@alecthomas alecthomas left a comment

Awesome. Loving how much code this allowed us to delete!

@@ -1249,144 +1240,6 @@ func (s *Service) reapStaleRunners(ctx context.Context) (time.Duration, error) {
return s.config.RunnerTimeout, nil
}

// Release any expired runner deployment reservations.
Collaborator

Glorious!

@@ -627,7 +618,8 @@ func (s *Service) RegisterRunner(ctx context.Context, stream *connect.ClientStre

// Check if we can contact the runner.
func (s *Service) pingRunner(ctx context.Context, endpoint *url.URL) error {
-	client := rpc.Dial(ftlv1connect.NewRunnerServiceClient, endpoint.String(), log.Error)
+	// TODO: do we really need to ping the runner first thing? We should revisit this later
+	client := rpc.Dial(ftlv1connect.NewVerbServiceClient, endpoint.String(), log.Error)
Collaborator

This is to ensure the controller can connect to the runner; the runner being able to connect to the controller doesn't guarantee the reverse.

Collaborator Author

I was thinking about the case we talked about where the runner could potentially be behind a normal kube service, with kube handling the load balancing. I will delete the comment though; that is future work.

Collaborator

Yeah we definitely should revisit that if we do go down that route.

// '{"languages": ["go", "kotlin"], "os": "linux", "arch": "amd64", "pid": 1234}'
//
// If no runners are available, it will return an empty slice.
func (d *DAL) GetIdleRunners(ctx context.Context, limit int, labels model.Labels) ([]Runner, error) {
Collaborator

So good deleting all this code 🤗

@@ -75,13 +75,9 @@ WITH deployment_rel AS (
-- otherwise we try to retrieve the deployments.id using the key. If
-- there is no corresponding deployment, then the deployment ID is -1
-- and the parent statement will fail due to a foreign key constraint.
Collaborator

Comment needs to be updated to reflect reality I think?

@@ -110,8 +105,7 @@ FROM matches;
-- name: DeregisterRunner :one
Collaborator

Now that runners are ephemeral I think we'll want to periodically reap dead runners from the database. Will create a ticket.

Collaborator Author

Something I was wondering is whether the states are really relevant any more, although I left them in for now as the refactor was getting pretty big. Basically, with ephemeral runners, is there any advantage to tracking 'dead' runners? In theory a runner is either there or it is not, so maybe the runners table should just track the runners that are actually connected?

Collaborator

I think you're right that we can reduce them, but I think there will still need to be at least two, maybe? "NEW" while the runner is preparing to serve traffic and "READY" once it has fully deployed the module.

Collaborator Author

We could pull the content and get ready before the runner registers itself, so the controller only knows about runners that are actually ready to serve.

Collaborator

Hmm yeah that could work!

I do vaguely recollect that the DEAD state was to keep track of runners that the controller lost, but I now can't remember why I thought that was important.

Collaborator Author

There is still work to do around kube health checks as well; this approach of pulling the content before registering would probably work better for that too. We could hold off on marking the pod ready until the deployment is ready to go.
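
A minimal sketch of the startup ordering discussed in this thread, assuming hypothetical hook names (this is not the runner's actual API): the runner pulls and starts its assigned deployment first and registers with the controller last, so registration itself implies readiness:

```go
package runnersketch

import (
	"context"
	"fmt"
)

// StartupHooks stands in for the real runner's startup steps; every field here
// is an assumption made for illustration.
type StartupHooks struct {
	Download func(ctx context.Context, deploymentKey string) error // fetch the deployment artefacts
	Run      func(ctx context.Context, deploymentKey string) error // start serving the deployment
	Register func(ctx context.Context, deploymentKey string) error // announce the runner to the controller
}

// Start downloads and runs the assigned deployment before registering, so the
// controller only ever learns about runners that are ready to serve traffic.
func (h StartupHooks) Start(ctx context.Context, deploymentKey string) error {
	if err := h.Download(ctx, deploymentKey); err != nil {
		return fmt.Errorf("download %s: %w", deploymentKey, err)
	}
	if err := h.Run(ctx, deploymentKey); err != nil {
		return fmt.Errorf("run %s: %w", deploymentKey, err)
	}
	// Registering last also fits the kube health-check idea: the pod can be
	// marked ready at the same point the runner registers.
	return h.Register(ctx, deploymentKey)
}
```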

backend/controller/scaling/k8sscaling/k8s_scaling.go (outdated; resolved)
backend/controller/scaling/k8sscaling/k8s_scaling.go (outdated; resolved)
backend/controller/scaling/scaling.go (resolved)
backend/runner/runner.go (outdated; resolved)
@stuartwdouglas stuartwdouglas force-pushed the stuartwdouglas/provisioning branch from 2f2bd32 to bfefcaf on September 4, 2024 11:26
@stuartwdouglas stuartwdouglas force-pushed the stuartwdouglas/provisioning branch 5 times, most recently from 00f5ec4 to d9ac655 on September 4, 2024 23:43
This changes how runners are provisioned and how they are allocated. Runners are
now spawned knowing exactly which kube deployment they are for, and will always
immediately download and run that deployment.

For Kubernetes environments, replicas are controlled by creating a kube deployment
for each FTL deployment and adjusting the number of replicas.

For local scaling, we create the runners directly for deployments as required.

This also introduces an initial Kubernetes test.

fixes: #2449 #2276
@stuartwdouglas stuartwdouglas force-pushed the stuartwdouglas/provisioning branch from 75e76ba to 502d0ef on September 11, 2024 00:50
@stuartwdouglas stuartwdouglas added this pull request to the merge queue Sep 11, 2024
Merged via the queue into main with commit f605e23 Sep 11, 2024
89 checks passed
@stuartwdouglas stuartwdouglas deleted the stuartwdouglas/provisioning branch September 11, 2024 01:29