Add Cluster API deployment method for TAS #108

criscola · 2022-09-23T12:08:21Z

This PR adds a new deployment method for the Telemetry Aware Scheduler using Cluster API.

It would be great if someone could try it out and let me know if there's any issue/suggestion to improve it. The deployment was tested on Kubernetes v1.25 with GCP provider.

criscola · 2022-11-04T13:27:31Z

I've seen that in the meantime you uploaded a CONTRIBUTORS.md. It asks to sign-off using our real name, but I signed-off using my pseudonym "criscola". Should I amend/resubmit the PR?

madalazar · 2022-11-04T16:29:29Z

Hi!
If it's not too much of a bother, I would prefer if you were to amend the commits to use your full name and then just update the PR.
Thank you!

criscola · 2022-11-10T13:41:48Z

FYI I stumbled upon CAPD, or CAPI in Docker. It's different because it uses the new ClusterClass resource, so requires a few different steps. Since I'm tasked to deploy it locally, I will also submit my findings here. This is nice because it doesn't require users to spin up instances in providers (and spend money), so it becomes more accessible to deploy the TAS for local testing.

Avoids failures when applying the TAS Service Account through ClusterResourceSet. Relevant error message was: "failed to create object /v1, Kind=ServiceAccount /telemetry-aware-scheduling-service-account: an empty namespace may not be set during creation". Signed-off-by: Cristiano Colangelo <[email protected]>

Signed-off-by: Cristiano Colangelo <[email protected]>

Signed-off-by: Madalina Lazar <[email protected]>

criscola · 2022-11-18T13:24:20Z

I amended my commits and signed-off with my name. Let me know if there is anything else. I managed to make CAPD work as well but discovered something that will likely require a PR on CAPD side, so I'm holding it off for now as I could be adding outdated information by then.

Signed-off-by: Madalina Lazar <[email protected]>

madalazar · 2022-11-23T09:30:51Z

Hi,
Everything looks ok, from what I could see. Just wanted to also give you an update, we started looking into this PR and between this and other small things, we plan on working and this PR this year.
I will keep you updated as we work through this.
Thank you!

criscola · 2023-01-03T14:11:24Z

Quick update: I will add also the instructions on the CAPD (Cluster API Docker) deployment for local testing/development soon.

criscola · 2023-01-17T13:54:49Z

Hi! I added the specific notes for a Docker deployment, it's very similar to a deployment on a generic provider like GCP, but there are some caveats because it uses an experimental feature called ClusterClass, plus a few steps for the local KinD cluster setup/kubeconfig export. Let me know if there is any comment on this. I'm always using the Docker provider now for local development (saves us a few bucks and it's also a nicer user experience as it's faster).

madalazar

We had issues setting up TAS with cluster API after following this wiki. There are one/two issues that might be the problem (i.e. naming of capi-quickstart.yaml / your-manifest.yaml).
The rest of these comments are regarding the readability/structure of the wiki.

telemetry-aware-scheduling/deploy/cluster-api/README.md

telemetry-aware-scheduling/deploy/cluster-api/docker/capi-docker.md

telemetry-aware-scheduling/deploy/cluster-api/docker/kubeadmcontrolplanetemplate-patch.yaml

telemetry-aware-scheduling/deploy/cluster-api/docker/capi-docker.md

madalazar · 2023-01-24T11:36:51Z

telemetry-aware-scheduling/deploy/cluster-api/docker/capi-docker.md

+ClusterResourceSets resources are already given to you in `clusterresourcesets.yaml`.
+Apply them to the management cluster with `kubectl apply -f clusterresourcesets.yaml`


Does this refer to docker/clusterresourcesets.yaml? And for generic, it also refers to it's own clusterresourcesets.yaml

The CRS resources are actually the same. Do you think it would make more sense to have only one file for both vs duplicating it, if yes any idea where to put the CRS file in the folder tree?
Edit: I created a shared folder and referenced the common resources in generic and docker guides. Should be more maintainable. If they go out of sync we can always move them out of the shared folder.

I think having the shared folder and just referencing the files is a good idea. Do we still need the clusterresourcessets.yaml file in both (docker, generic) folders?

Nope I think we can just have it in the shared folder, it's also linked in both guides so users should find it easily.

Add Calico CRS resources for simplicity. Update docs relative to Calico CRS.

Move cluster-patch.yaml to shared folder and rename docs.

Forgot to rename capi-quickstart somewhere.

…d line.

criscola · 2023-01-27T16:27:00Z

@madalazar I addressed all the comments - thanks for the detailed feedback. I also improved a few parts/solved a few problems. Feel free to continue with the review.

madalazar · 2023-01-31T17:30:14Z

telemetry-aware-scheduling/deploy/cluster-api/generic/capi.md

+  > capi-quickstart.yaml
+```
+
+If Kind was running correctly, and the Docker provider was initialized with the previous command, the command will return nothing to indicate success.


I don't think this is needed, seems like a remnant from the Docker/Kind installation

This would be something I added after this comment, should it be removed?

madalazar · 2023-01-31T17:44:49Z

I want to go through the installation steps in the Readme one last time just to check that everything is in order. I'm planning to do that this week/Tuesday the latest. I will update the PR

madalazar · 2023-03-09T15:46:11Z

@criscola HI, sorry it took so long for me to get back to you with a reply. Between then and now I've been looking at the security side of this solution (the new components that are brought in are they secure(d), if not what's missing, the infrastructure that we are creating is that secure etc.).

The biggest problem that I found (and that is because I have been able to test only the CAPI Docker provider part of the solution) is with a component that CAPD is introducing: capi-quickstart-lb. This component opens up a port to the outside world on the host that it's installed on and even though the connection is secure, the cyphers use are out-of-date.
I cut kubernetes-sigs/cluster-api#8245 to the CAPI team to see if I they would fix it or propose other solutions. This is still WIP.

There are 5 more issues, but I think they should be more manageable to address:

is there a way we could download the calico configmap (https://github.com/intel/platform-aware-scheduling/pull/108/files#diff-b69316303b6acb891b357086803125c4df86d7ce4b64715ab036136d01d4910a) at the time of install and not have to store it locally? Similar to: https://github.com/intel/platform-aware-scheduling/blob/master/.github/scripts/e2e_setup_cluster.sh#L240
4 of the configmaps generated in steps 5,6 from the capi-docker.md https://github.com/intel/platform-aware-scheduling/pull/108/files#diff-16e3273b82cf8341146211ba8f104c6effa9b8de304e9ab79310f75630a7e8d1 have issues:

ConfigMap/custom-metrics-configmap
ConfigMap/custom-metrics-tls-secret-configmap
ConfigMap/tas-configmap
ConfigMap/tas-tls-secret-configmap

With most of the issues being the presence of 'secret', 'key' in the config maps:

HIGH: ConfigMap 'custom-metrics-configmap' in 'default' namespace stores secrets in key(s) or value(s) '{"        secret"}'
HIGH: ConfigMap 'custom-metrics-tls-secret-configmap' in 'default' namespace stores sensitive contents in key(s) or value(s) '{"  tls.crt", "  tls.key"}'
HIGH: ConfigMap 'custom-metrics-tls-secret-configmap' in 'default' namespace stores secrets in key(s) or value(s) '{"custom-metrics-tls-secret.yaml"}'
HIGH: ConfigMap 'tas-configmap' in 'default' namespace stores sensitive contents in key(s) or value(s) '{"                  - key", "            - -key", "         key"}'
HIGH: ConfigMap 'tas-configmap' in 'default' namespace stores secrets in key(s) or value(s) '{"          secret"}'
HIGH: ConfigMap 'tas-tls-secret-configmap' in 'default' namespace stores sensitive contents in key(s) or value(s) '{"  tls.crt", "  tls.key"}'
HIGH: ConfigMap 'tas-tls-secret-configmap' in 'default' namespace stores secrets in key(s) or value(s) '{"tas-tls-secret.yaml"}'

For this particular issues, could we instead of config maps just install the components as they come:

helm install node-exporter deploy/charts/prometheus_node_exporter_helm_chart/
helm install prometheus deploy/charts/prometheus_helm_chart/
--- create key ----
--- create tls secret from key ----
helm install prometheus-adapter deploy/charts/prometheus_custom_metrics_helm_chart/
--- configure scheduler  using configuration script ---
--- create extender secret  from key ----
kubectl apply -f deploy/
kubectl apply -f ../shared/clusterresourcesets.yaml`
kubectl apply -f capi-quickstart.yaml

I'm in the process of releasing a new version of TAS and tried to get this PR added to it, but because of these issues it might not happen.

Let me know what you think of my suggestions and apologies again for the delay,
Madalina

criscola · 2023-03-10T08:45:57Z

Hi! Yeah I see about the docker container. Keep in mind this is not supposed to be used for any production release, only development, so I wouldn't worry about it (but good you opened an issue, let's see if anything moves there).

For the configmaps, those are necessary if you wanted to automate the deployment through CRS resources. The user could gitignore them if it wants. Not sure if I would remove them since the purpose of this is a bit to have everything in declarative configuration. Moreover I put a disclaimer on the docs to tell the user to not commit TLS secrets to their repository as it's bad security practice.

madalazar · 2023-03-21T12:29:44Z

Hi, here are a couple of updates from my side:

regarding the ciphers, I have a possible workaround( from [CAPI Docker Provider]CAPI-LB component has out-of-date ciphers kubernetes-sigs/cluster-api#8245) . this is not 100% mitigated as I still need to understand what are the correct ciphers to use and to check if it's secure to make the change to the "podSecurityStandard" property.

In the capi-quickstart.yaml, we need to make the following changes:

- pass the ciphers through the "extraArgs" parameter of the KubeadmControlPlaneTemplate component

   256         clusterConfiguration:
    257           apiServer:
    258             certSANs:
    259               - localhost
    260               - 127.0.0.1
    261               - 0.0.0.0
    262               - host.docker.internal
    263             extraArgs:
    264               tls-cipher-suites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
    265           controllerManager:

- update the podSecurityStandard property of the Cluster component to "enabled: false"

364  apiVersion: cluster.x-k8s.io/v1beta1
365  kind: Cluster
...
    381  topology:
    382     class: quick-start
    383     controlPlane:
    384       metadata: {}
    385       replicas: 1
    386     variables:
    387       - name: imageRepository
    388         value: ""
    389       - name: etcdImageTag
    390         value: ""
    391       - name: coreDNSImageTag
    392         value: ""
    393       - name: podSecurityStandard
    394         value:
    395           audit: restricted
    396           enabled: false
    397           enforce: baseline
    398           warn: restricted
...

we need to reference the calico-configmap.yaml via URL instead of committing the file into the repo. I don't think this repo, and its owners, should be responsible to patch it, keep it up-to-date etc. . If we have a URL we can just run kubectl apply on it, we did something similar here : https://github.com/intel/platform-aware-scheduling/blob/master/.github/scripts/e2e_setup_cluster.sh#L13.
I thought about this for a while and I think it's best for us to not use config maps to deploy TAS at all. We don't support it and don't have plans to do so in the near future. Instead, I would like to replace steps 5 and 6 with links to https://github.com/intel/platform-aware-scheduling/blob/master/telemetry-aware-scheduling/docs/custom-metrics.md#quick-install and https://github.com/intel/platform-aware-scheduling/tree/master/telemetry-aware-scheduling#deploy-tas.

Apologies for the delay,
Madalina

criscola · 2023-04-20T08:57:36Z

Hi Madalina, thanks for your answer. Sorry I'm a bit tied up with another task right now but as soon as I'm done with it I want to address your points/update the PR. Thanks for the patience.

criscola · 2023-05-24T07:26:03Z

Hi Madalina, one question about your last point:

I thought about this for a while and I think it's best for us to not use config maps to deploy TAS at all. We don't support it and don't have plans to do so in the near future. Instead, I would like to replace steps 5 and 6 with links to https://github.com/intel/platform-aware-scheduling/blob/master/telemetry-aware-scheduling/docs/custom-metrics.md#quick-install and https://github.com/intel/platform-aware-scheduling/tree/master/telemetry-aware-scheduling#deploy-tas.

In my opinion, this defeats a bit the purpose of the Cluster API deployment option, because it prevents a fully automatic deployment of the TAS. When we use ClusterResourceSet, we effectively store the resources to deploy in ConfigMap which will then used to deploy to workload clusters from the management cluster. Everything is then stored in git, in a true GitOps way. There are other ways to do this, but since CRS are from Kubernetes SIG, this seems a good default method to include here, in our case, we use ArgoCD to manage Prometheus/Adapter but we started from CRS.

Is the concern mainly with an increased maintenance burden on the contributors' side? As long as helm charts are used, I believe the user will always need to render the resources in the same way, so there should be no additional work to be done in the future on our side. Of course if something changes to the other configs (like the scheduler's) then it needs to be changed in the CAPI manifests but the change should be well localized.

criscola force-pushed the feature/cluster-api branch 2 times, most recently from 903efc3 to 63ea8d3 Compare November 18, 2022 11:16

criscola and others added 3 commits November 18, 2022 14:06

Add Cluster API deployment method

c890226

Signed-off-by: Cristiano Colangelo <[email protected]>

Adding code_of_conduct and contributing readme file

4d7d6df

Signed-off-by: Madalina Lazar <[email protected]>

criscola force-pushed the feature/cluster-api branch from a15fd67 to 4d7d6df Compare November 18, 2022 13:06

Merge branch 'master' into feature/cluster-api

d6f904b

madalazar added a commit to madalazar/platform-aware-scheduling that referenced this pull request Nov 22, 2022

Remove patching with null value for descheduling strategy (intel#108)

4a40e7e

Signed-off-by: Madalina Lazar <[email protected]>

criscola and others added 2 commits January 17, 2023 13:48

Merge branch 'master' into feature/cluster-api

050de7f

Add Docker CAPI deployment specific guide

19026c4

Add ClusterResourceSets for CAPD deployment

f44598a

madalazar reviewed Jan 24, 2023

View reviewed changes

criscola added 11 commits January 27, 2023 11:25

Move CRS to 'shared' folder.

6be7648

Add Calico CRS resources for simplicity. Update docs relative to Calico CRS.

Update link to Health Metric Example.

fd030d4

Rename your-manifests.yaml to capi-quickstart.yaml

3badb05

Fix numbering in markdown.

57ff014

Add yaml newlines.

1bb8999

Move cluster-patch.yaml to shared folder and rename docs.

Add testing/development notice in all markdowns.

e41d190

Move generic/docker provider links to top.

fb752e1

Add Docker and Kind versions.

eaf3e7c

Add small comment after clusterctl generate.

2fe40df

Add necessary feature flags.

1365e59

Update paths of commands referencing the Helm chart.

d3cd12c

Forgot to rename capi-quickstart somewhere.

criscola added 5 commits January 27, 2023 16:21

Add yq commands to wrangle with the various resources with the comman…

0891f21

…d line.

Reformat docs.

2d08d1e

Add a few more links to files/folders.

299571e

Add note on how to initialize Kind cluster in Docker provider.

ed3d300

More adjustments.

8f98dd6

madalazar reviewed Jan 31, 2023

View reviewed changes

madalazar mentioned this pull request Mar 9, 2023

should modify kubeadm config rather than edit the kube-scheduler static manifest #86

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Cluster API deployment method for TAS #108

Add Cluster API deployment method for TAS #108

criscola commented Sep 23, 2022

criscola commented Nov 4, 2022

madalazar commented Nov 4, 2022 •

edited

Loading

criscola commented Nov 10, 2022

criscola commented Nov 18, 2022 •

edited

Loading

madalazar commented Nov 23, 2022

criscola commented Jan 3, 2023

criscola commented Jan 17, 2023

madalazar left a comment

madalazar Jan 24, 2023

criscola Jan 26, 2023 •

edited

Loading

madalazar Jan 31, 2023 •

edited

Loading

criscola Feb 1, 2023

criscola commented Jan 27, 2023

madalazar Jan 31, 2023

criscola Feb 1, 2023

madalazar commented Jan 31, 2023

madalazar commented Mar 9, 2023

criscola commented Mar 10, 2023 •

edited

Loading

madalazar commented Mar 21, 2023

criscola commented Apr 20, 2023 •

edited

Loading

criscola commented May 24, 2023

		ClusterResourceSets resources are already given to you in `clusterresourcesets.yaml`.
		Apply them to the management cluster with `kubectl apply -f clusterresourcesets.yaml`

Add Cluster API deployment method for TAS #108

Are you sure you want to change the base?

Add Cluster API deployment method for TAS #108

Conversation

criscola commented Sep 23, 2022

criscola commented Nov 4, 2022

madalazar commented Nov 4, 2022 • edited Loading

criscola commented Nov 10, 2022

criscola commented Nov 18, 2022 • edited Loading

madalazar commented Nov 23, 2022

criscola commented Jan 3, 2023

criscola commented Jan 17, 2023

madalazar left a comment

Choose a reason for hiding this comment

madalazar Jan 24, 2023

Choose a reason for hiding this comment

criscola Jan 26, 2023 • edited Loading

Choose a reason for hiding this comment

madalazar Jan 31, 2023 • edited Loading

Choose a reason for hiding this comment

criscola Feb 1, 2023

Choose a reason for hiding this comment

criscola commented Jan 27, 2023

madalazar Jan 31, 2023

Choose a reason for hiding this comment

criscola Feb 1, 2023

Choose a reason for hiding this comment

madalazar commented Jan 31, 2023

madalazar commented Mar 9, 2023

criscola commented Mar 10, 2023 • edited Loading

madalazar commented Mar 21, 2023

criscola commented Apr 20, 2023 • edited Loading

criscola commented May 24, 2023

madalazar commented Nov 4, 2022 •

edited

Loading

criscola commented Nov 18, 2022 •

edited

Loading

criscola Jan 26, 2023 •

edited

Loading

madalazar Jan 31, 2023 •

edited

Loading

criscola commented Mar 10, 2023 •

edited

Loading

criscola commented Apr 20, 2023 •

edited

Loading