[WIP] feat(serving): light weighted traffic control for inference #184

ccchenjiahuan · 2021-09-26T16:10:07Z

Ⅰ. Describe what this PR does

Functions currently implemented:

istio-less ingress gateway
kubedl serving reconcile logic modification

I have done some tests on the ingress gateway part, but haven't had a chance to test the changes with kubedl. I think we can review the gateway part first.

Functions currently known but not yet implemented:

The ingress gateway currently watches all ingresses; it needs to be configured to watch only the ingresses related to a specific Inference CR
Create a dedicated service account for the ingress gateway pod during serving reconciliation

II. Does this pull request fix one issue?

resolves #160

codecov-commenter · 2021-09-26T16:13:47Z

Codecov Report

Merging #184 (1bdb54e) into master (f408cc8) will increase coverage by 0.88%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master     #184      +/-   ##
==========================================
+ Coverage   21.74%   22.62%   +0.88%     
==========================================
  Files          73       75       +2     
  Lines        4374     4557     +183     
==========================================
+ Hits          951     1031      +80     
- Misses       3292     3392     +100     
- Partials      131      134       +3

Flag	Coverage Δ
unittests	`22.62% <0.00%> (+0.88%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
controllers/serving/inference_controller.go	`0.00% <0.00%> (ø)`
controllers/serving/utils.go	`0.00% <0.00%> (ø)`
controllers/tensorflow/tfjob_controller.go	`44.77% <0.00%> (-2.24%)`	⬇️
pkg/job_controller/job.go	`13.61% <0.00%> (ø)`
controllers/mars/marsjob_controller.go	`0.00% <0.00%> (ø)`
controllers/xgboost/xgboostjob_controller.go	`0.00% <0.00%> (ø)`
pkg/gang_schedule/batch_scheduler/scheduler.go	`68.88% <0.00%> (ø)`
pkg/gang_schedule/volcano_scheduler/scheduler.go	`68.88% <0.00%> (ø)`
pkg/gang_schedule/coscheduler/scheduler.go	`64.70% <0.00%> (+61.58%)`	⬆️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f408cc8...1bdb54e.

Signed-off-by: ccchenjiahuan <[email protected]> feat(serving): add kubedl serving modifications Signed-off-by: ccchenjiahuan <[email protected]>

ccchenjiahuan · 2021-09-26T16:26:53Z

/cc @SimonCqk @jian-he

SimonCqk · 2021-09-30T07:37:00Z

controllers/serving/inference_controller.go

-	// 3) If inference serves multiple model version simultaneously and canary policy has been set,
+	//// 4) If inference serves multiple model version simultaneously and canary policy has been set,
+	//// serving traffic will be distributed with different ratio and routes to backend service.
+	//if len(inference.Spec.Predictors) > 1 {


remove if it is not needed anymore?

SimonCqk · 2021-10-01T16:04:36Z

controllers/serving/inference_controller.go

+			Namespace: inf.Namespace,
+		},
+		Spec: networkingv1beta1.IngressSpec{
+			Rules: []networkingv1beta1.IngressRule{


what's the entry host name of this ingress?

I think we may not need a host name. The default is *, and we distinguish the traffic by path.
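To illustrate the reply above, here is a minimal sketch of path-based matching when the host is left at its default. The `rule` type, the `match` function, and the example paths are hypothetical and are not taken from the PR; the sketch assumes first-match-wins prefix matching, which may differ from what the gateway actually does.

```go
package main

import (
	"fmt"
	"strings"
)

// rule is a simplified, hypothetical view of one ingress rule.
type rule struct {
	host    string // "" or "*" is treated as "any host"
	path    string // path prefix
	backend string
}

// match returns the backend of the first rule whose host and path prefix
// both accept the request.
func match(rules []rule, host, path string) (string, bool) {
	for _, r := range rules {
		hostOK := r.host == "" || r.host == "*" || r.host == host
		if hostOK && strings.HasPrefix(path, r.path) {
			return r.backend, true
		}
	}
	return "", false
}

func main() {
	rules := []rule{
		{host: "*", path: "/model-a", backend: "predictor-a"},
		{host: "*", path: "/model-b", backend: "predictor-b"},
	}
	backend, ok := match(rules, "any.example.com", "/model-b/predict")
	fmt.Println(backend, ok)
}
```

With no host restriction, the path prefix is the only thing that separates predictors, so the prefixes must not shadow each other.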

SimonCqk · 2021-10-01T16:08:18Z

traffic_control/caddy_module.go

+	return nil
+}
+
+func (tc TrafficControl) ServeHTTP(w http.ResponseWriter, r *http.Request, next caddyhttp.Handler) error {


Is the receiver (tc) of the ServeHTTP method intended to be a non-pointer receiver?

Yes, I wrote this plugin following https://caddyserver.com/docs/extending-caddy
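For readers following the receiver question, a small stdlib-only sketch of what the choice means in Go. The `trafficControl` type, its fields, and the `Serve` method are hypothetical stand-ins and do not use the Caddy API: a method with a value receiver works on a copy, so state it changes is not visible afterwards, while a pointer-receiver setup method keeps its changes.

```go
package main

import "fmt"

// trafficControl is a hypothetical stand-in for the plugin's handler type.
type trafficControl struct {
	defaultBackend string
	served         int
}

// Provision uses a pointer receiver, so the configuration it sets is kept.
func (tc *trafficControl) Provision() {
	tc.defaultBackend = "predictor-a"
}

// Serve uses a value receiver: it works on a copy of tc, so the increment
// of served is lost when the call returns.
func (tc trafficControl) Serve() string {
	tc.served++
	return tc.defaultBackend
}

func main() {
	tc := &trafficControl{}
	tc.Provision()
	fmt.Println(tc.Serve(), tc.served) // prints "predictor-a 0"
}
```

A value receiver is therefore fine as long as the handler only reads its configuration during a request.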

SimonCqk · 2021-10-01T16:10:28Z

traffic_control/caddy_module.go

@@ -0,0 +1,135 @@
+package istio_less_ingress_controller


Looks good at first glance, and it would be nicer if you added more comments :)

SimonCqk · 2021-10-01T16:11:10Z

traffic_control/caddy_module.go

+}
+
+// parseCaddyfile unmarshals tokens from h into a new Middleware.
+func parseCaddyfile(h httpcaddyfile.Helper) (caddyhttp.MiddlewareHandler, error) {


This seems to actually be a no-op?

Yes, nothing is done right now, but I'd like to keep it so that parameters can be added later.

SimonCqk · 2021-10-01T16:17:16Z

traffic_control/ingress.go

+type ingressCache struct {
+	mutex sync.Mutex
+
+	hostToIngressEntry map[string][]ingressEntry


hostToIngressEntry caches all potential hosts in the cluster and maps each to a list of ingress entries (one per predictor), so the istio-less ingress controller seems to be a cluster-scoped deployment? However, each Inference service creates one. Is that expected?

I listed this point in the to-do list above: it needs to be configured to watch only the ingresses related to each predictor.
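One way the scoping discussed in this thread could look, as a sketch only. The `ingressEntry` fields, the `owner` filter, and the label key `example.com/inference-name` are assumptions made for illustration; the PR does not show how ingresses will be selected.

```go
package main

import (
	"fmt"
	"sync"
)

// ingressEntry is a hypothetical shape; the PR's real fields are not shown.
type ingressEntry struct {
	path    string
	backend string
}

// ownerLabel is a hypothetical label key, not taken from the PR.
const ownerLabel = "example.com/inference-name"

// ingressCache mirrors the reviewed struct (a mutex guarding a host-keyed
// map) and adds an assumed owner filter.
type ingressCache struct {
	mutex              sync.Mutex
	hostToIngressEntry map[string][]ingressEntry
	owner              string // only ingresses labelled with this name are cached
}

// add stores the entry under host if the ingress labels match the owner.
// It reports whether the entry was cached.
func (c *ingressCache) add(host string, labels map[string]string, e ingressEntry) bool {
	if labels[ownerLabel] != c.owner {
		return false
	}
	c.mutex.Lock()
	defer c.mutex.Unlock()
	if c.hostToIngressEntry == nil {
		c.hostToIngressEntry = make(map[string][]ingressEntry)
	}
	c.hostToIngressEntry[host] = append(c.hostToIngressEntry[host], e)
	return true
}

// entries returns a copy of the entries cached for host.
func (c *ingressCache) entries(host string) []ingressEntry {
	c.mutex.Lock()
	defer c.mutex.Unlock()
	return append([]ingressEntry(nil), c.hostToIngressEntry[host]...)
}

func main() {
	c := &ingressCache{owner: "mnist"}
	c.add("*", map[string]string{ownerLabel: "mnist"}, ingressEntry{"/a", "predictor-a"})
	c.add("*", map[string]string{ownerLabel: "other"}, ingressEntry{"/x", "predictor-x"})
	fmt.Println(len(c.entries("*"))) // prints 1
}
```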

jian-he · 2021-10-04T20:58:33Z

controllers/serving/inference_controller.go

@@ -335,6 +430,72 @@ func (ir *InferenceReconciler) syncServiceForInference(inf *servingv1alpha1.Infe
 	return nil
 }

+// syncIstioLessIngressGateway sync the istio-less ingress gateway for inference service.
+func (ir *InferenceReconciler) syncIstioLessIngressGateway(inf *servingv1alpha1.Inference) error {


It's better to rename it to caddyIngressGateway.

jian-he · 2021-10-04T20:59:11Z

controllers/serving/inference_controller.go

+		return err
+	}
+	svcExists := true
+	if err != nil && errors.IsNotFound(err) {


These two if conditions can be combined into a single if err != nil { xxx } block.

jian-he · 2021-10-04T21:03:41Z

controllers/serving/inference_controller.go

+		return ir.client.Create(context.Background(), &gateway)
+	}
+
+	if !reflect.DeepEqual(gateway.Spec, gatewayInCluster.Spec) {


I think we should not reset the spec every time, in case we need to change something manually.

I don't quite understand what you mean here. The trigger condition here is a manual change to the spec.
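One possible reading of the concern, sketched under assumptions: comparing the whole spec reverts every manual edit, whereas comparing only the fields the controller owns leaves hand-tuned fields alone. The `gatewaySpec` type and the split between managed and hand-tuned fields are hypothetical and not taken from the PR.

```go
package main

import (
	"fmt"
	"reflect"
)

// gatewaySpec is a hypothetical, simplified spec.
type gatewaySpec struct {
	Image    string // assumed to be managed by the controller
	Replicas int    // assumed to be tunable by hand
}

// needsUpdateWholeSpec matches the diff's approach: any difference between
// the desired and in-cluster spec triggers an update.
func needsUpdateWholeSpec(desired, inCluster gatewaySpec) bool {
	return !reflect.DeepEqual(desired, inCluster)
}

// needsUpdateManaged compares only the controller-managed field, so a
// manual change to Replicas is not overwritten.
func needsUpdateManaged(desired, inCluster gatewaySpec) bool {
	return desired.Image != inCluster.Image
}

func main() {
	desired := gatewaySpec{Image: "gateway:v1", Replicas: 1}
	inCluster := gatewaySpec{Image: "gateway:v1", Replicas: 3} // scaled by hand
	fmt.Println(needsUpdateWholeSpec(desired, inCluster), needsUpdateManaged(desired, inCluster))
	// prints "true false"
}
```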

jian-he · 2021-10-04T21:05:52Z

controllers/serving/inference_controller.go

+	igExists := true
+	igName := genPredictorName(inf, predictor)
+	err := ir.client.Get(context.Background(), types.NamespacedName{Namespace: inf.Namespace, Name: igName}, &igInCluster)
+	if err != nil && !errors.IsNotFound(err) {


Similarly, these two if conditions can be combined into a single if err != nil { xxx } block.

jian-he · 2021-10-04T21:15:14Z

controllers/serving/utils.go

@@ -22,6 +22,12 @@ import (
 	"github.com/alibaba/kubedl/apis/serving/v1alpha1"
 )

+const (
+	CANARY_WEIGHT  = "kubedl.kubernetes.io/canary-weight"


Use the kubedl.io prefix here to be consistent.
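A sketch of how the renamed annotation might be consumed. The key `kubedl.io/canary-weight` follows the reviewer's suggestion; the value format (an integer percentage from 0 to 100) and the helpers `canaryWeight` and `routeToCanary` are assumptions, since the PR excerpt does not show how the weight is parsed.

```go
package main

import (
	"fmt"
	"strconv"
)

// annotationCanaryWeight uses the kubedl.io prefix suggested in review.
const annotationCanaryWeight = "kubedl.io/canary-weight"

// canaryWeight reads the weight from annotations. A missing annotation is
// treated as weight 0. The 0-100 integer format is an assumption.
func canaryWeight(annotations map[string]string) (int, error) {
	raw, ok := annotations[annotationCanaryWeight]
	if !ok {
		return 0, nil
	}
	w, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid %s %q: %w", annotationCanaryWeight, raw, err)
	}
	if w < 0 || w > 100 {
		return 0, fmt.Errorf("%s must be in [0, 100], got %d", annotationCanaryWeight, w)
	}
	return w, nil
}

// routeToCanary decides for a roll in [0, 100) whether the request goes to
// the canary backend.
func routeToCanary(weight, roll int) bool {
	return roll < weight
}

func main() {
	w, err := canaryWeight(map[string]string{annotationCanaryWeight: "10"})
	fmt.Println(w, err, routeToCanary(w, 5), routeToCanary(w, 10))
	// prints "10 <nil> true false"
}
```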

jian-he · 2021-10-04T21:28:16Z

controllers/serving/inference_controller.go

+//  |---request---> VirtualService
+//                        |--- 90% ---> Deploy-Of-Model-A.1
+//                        |--- 10% ---> Deploy-Of-Model-B.1
+func (ir *InferenceReconciler) syncPredictorTrafficDistribution(inf *servingv1alpha1.Inference, index int, predictor *servingv1alpha1.PredictorSpec) error {


Can we make it pluggable, so that either istio-based or caddy-based traffic control can be configured? @SimonCqk

yes, I think that would be better, it may require a global configuration
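A rough sketch of what a pluggable design driven by a global setting could look like. The `trafficController` interface, the two implementations, and the setting values "istio" and "caddy" are all hypothetical; nothing here reflects an interface that exists in kubedl.

```go
package main

import "fmt"

// trafficController is a hypothetical plug-in point for traffic control.
type trafficController interface {
	Name() string
}

type istioController struct{}

func (istioController) Name() string { return "istio" }

type caddyController struct{}

func (caddyController) Name() string { return "caddy" }

// newTrafficController picks an implementation from a global setting.
// The accepted values are assumptions for this sketch.
func newTrafficController(kind string) (trafficController, error) {
	switch kind {
	case "istio":
		return istioController{}, nil
	case "caddy":
		return caddyController{}, nil
	default:
		return nil, fmt.Errorf("unknown traffic controller %q", kind)
	}
}

func main() {
	tc, err := newTrafficController("caddy")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(tc.Name()) // prints "caddy"
}
```

The reconciler would then call the selected implementation instead of branching on the backend in each sync function.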

jian-he · 2021-10-04T21:30:55Z

traffic_control/consts.go

+package istio_less_ingress_controller
+
+const (
+	CANARY_WEIGHT = "kubedl.kubernetes.io/canary-weight"


This const is duplicated in utils.go, so this file can be removed.

jian-he · 2021-10-04T21:33:01Z

traffic_control/go.mod

+	k8s.io/apimachinery v0.20.7
+	k8s.io/client-go v0.20.7
+	k8s.io/klog v1.0.0
+	k8s.io/utils v0.0.0-20210820185131-d34e5cb4466e // indirect


We shouldn't have another go.mod file in a sub-package, should we?

Actually, I think it would be better placed in a separate repo. But if it stays under kubedl, I think it's OK, because this service is more like a plugin and should not pollute the main go.mod.

jian-he · 2021-10-04T21:34:03Z

traffic_control/go.mod

+
+go 1.15
+
+require (


Should the traffic_control package be located under pkg/ instead of the root? @SimonCqk

jian-he · 2021-10-04T21:35:49Z

controllers/serving/inference_controller.go

@@ -110,7 +113,13 @@ func (ir *InferenceReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error)
 		return ctrl.Result{}, err
 	}

-	// 2) Sync each predictor to deploy containers mounted with specific model.
+	// 2) Sync istio-less ingress gateway
+	if err = ir.syncIstioLessIngressGateway(&inference); err != nil {


Don't call it "istio-less"; call it caddyIngress to be explicit, and do the same for the Docker image name.

ccchenjiahuan force-pushed the feat/traffic-control branch 2 times, most recently from 1f6a65f to 99e0999 Compare September 26, 2021 16:12

ccchenjiahuan force-pushed the feat/traffic-control branch from 99e0999 to 90a926b Compare September 26, 2021 16:14

feat(serving): add istio less ingress gateway

1bdb54e

Signed-off-by: ccchenjiahuan <[email protected]> feat(serving): add kubedl serving modifications Signed-off-by: ccchenjiahuan <[email protected]>

ccchenjiahuan force-pushed the feat/traffic-control branch from 90a926b to 1bdb54e Compare September 26, 2021 16:22

ccchenjiahuan changed the title ~~feat(serving): light weighted traffic control for inference~~ [WIP] feat(serving): light weighted traffic control for inference Sep 26, 2021

SimonCqk reviewed Oct 1, 2021

View reviewed changes

jian-he reviewed Oct 4, 2021

View reviewed changes
