[ci.jenkins.io] Create private EKS cluster with "side" services (datadog, ACP, etc.) #4319

Open
Tracked by #4313
dduportal opened this issue Sep 28, 2024 · 17 comments

Comments

@dduportal
Contributor

dduportal commented Sep 28, 2024

We need a private EKS cluster to run ci.jenkins.io container agents.

@dduportal changed the title from Move "side" services to AWS to [ci.jenkins.io] Create private EKS cluster and Move "side" services to AWS on Sep 28, 2024
@dduportal changed the title from [ci.jenkins.io] Create private EKS cluster and Move "side" services to AWS to [ci.jenkins.io] Create private EKS cluster with "side" services (datadog, ACP, etc.) on Sep 28, 2024
@dduportal added this to the infra-team-sync-2024-10-01 milestone on Sep 28, 2024
@dduportal removed this from the infra-team-sync-2024-10-15 milestone on Oct 14, 2024
@dduportal added this to the infra-team-sync-2024-10-29 milestone on Oct 15, 2024
@dduportal removed the triage (Incoming issues that need review) label on Oct 15, 2024
@dduportal
Contributor Author

Discussed with @smerle33:

@smerle33
Contributor

The usage of the module has changed since we last used it: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/UPGRADE-20.0.md

@smerle33
Contributor

smerle33 commented Nov 4, 2024

We chose to handle all the IAM usage within the private repository (https://github.com/jenkins-infra/terraform-states/commit/cfd08c45dd4153d676c9223670f927d515585679)
instead of giving the module user too much power.

@dduportal
Contributor Author

Update: thanks to #4320 (comment), we now have an EKS cluster running!

This cluster is available through VPN (jenkins-infra/docker-openvpn#372 and jenkins-infra/jenkins-infra#3776)

It only has 1 node pool and no admin service account yet, though: these are the next steps before we start adding it to kubernetes-management.

@smerle33
Contributor

smerle33 commented Dec 17, 2024

Let's add cijenkinsio-agents-2 with the minimum deployments, then add the others one by one in subsequent PRs,
first with docker-registry-secrets.

Changed: start with only datadog.

jenkins-infra/kubernetes-management#6020

@smerle33
Contributor

smerle33 commented Dec 17, 2024

The second step will be jenkins-agents for "normal" and "bom" builds.

@smerle33
Contributor

The last step, which needs more work, is the ACP:
we need to choose the implementation between

Once the implementation is set up, we may have to stress test the different implementations to fine-tune the choice.

Of course, we also need what is stated in the issue body:

  • We also need the AWS LB Controller installed to allow creating a private LB for the Artifact Caching Proxy to ensure VM agents of ci.jenkins.io can reach the service

  • Set up ACP to use a private LB and check access from the ci.jio controller or a VM agent through the private endpoint

@dduportal
Contributor Author

Then we will add datadog, which needs the docker registry secrets

As per https://github.com/jenkins-infra/kubernetes-management/pull/6020/files#r1890521384, we'll start with datadog (changed since yesterday)

@dduportal
Contributor Author

Update:

=> cluster still has 1 node but it is up and running

Next steps:

  • Add a new "applications" node group
    • Need to choose the correct sizing
    • Drain the current "tiny linux" and remove it to use applications fully instead
  • Set up cluster-autoscaler to have anti-affinity
    • Will "break" the deployment (only 1 replica running on the unique node)
    • Good test of its HA mode: it should keep working, trigger a scale-up, and auto-heal. Fall back to a manual scale-up if it breaks
  • Set up CoreDNS to have anti-affinity
  • Export node labels and taints in the JSON export
  • Add datadog Helm release
    • Involves retrieving node labels and taints from the Terraform JSON export to specify nodeSelectors and node tolerations
  • Add EBS addon to support ACP
  • Add ACP Helm release
    • Involves retrieving node labels and taints from the Terraform JSON export to specify nodeSelectors and node tolerations
    • Decide which volume provisioning pattern to use, as ACP is a StatefulSet:
      • dynamic (PV/PVC in the helm chart) or static (Terraform defined + JSON export)
      • Topology awareness (availability zone constraint)
      • Might need to update the "applications" node group to spread across the 2 subnets (distinct AZs) provided to EKS and set up 1 replica per AZ
  • Track missing elements with updatecli (search for TODO in the Terraform project)
  • Add the 2 new node groups for agents and bom-agents + export their labels/taints
  • Set up the rest of the helm charts
  • Optional: can we use instance identity to run the cluster auth (like we do for ec2) instead of creating a svc account?
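
As a rough illustration of the "retrieving node labels from Terraform JSON export" steps above, here is a minimal Python sketch that turns a node group's exported labels and taints into `nodeSelector` and `tolerations` values for a Helm release. The JSON layout (`node_groups`, `labels`, `taints`), the sample label and taint values, and the `scheduling_values` helper are all assumptions made for illustration; the real export may be shaped differently. It also assumes taint effects may be exported in the EKS style (`NO_SCHEDULE`), which has to be mapped to the Kubernetes spelling (`NoSchedule`).

```python
import json

# EKS node group taint effects mapped to their Kubernetes toleration equivalents.
EKS_TO_K8S_EFFECT = {
    "NO_SCHEDULE": "NoSchedule",
    "NO_EXECUTE": "NoExecute",
    "PREFER_NO_SCHEDULE": "PreferNoSchedule",
}

# Hypothetical excerpt of the JSON export (layout and values are assumptions).
SAMPLE_EXPORT = """
{
  "node_groups": {
    "applications": {
      "labels": {"role": "applications"},
      "taints": [
        {"key": "example.io/applications", "value": "true", "effect": "NO_SCHEDULE"}
      ]
    }
  }
}
"""


def scheduling_values(node_group: dict) -> dict:
    """Build Helm-style scheduling values matching one node group."""
    tolerations = []
    for taint in node_group.get("taints", []):
        # Keep the effect unchanged if it is already in Kubernetes spelling.
        effect = EKS_TO_K8S_EFFECT.get(taint["effect"], taint["effect"])
        toleration = {"key": taint["key"], "effect": effect}
        if taint.get("value"):
            toleration["operator"] = "Equal"
            toleration["value"] = taint["value"]
        else:
            # A taint without a value is tolerated by matching on the key only.
            toleration["operator"] = "Exists"
        tolerations.append(toleration)
    return {
        "nodeSelector": dict(node_group.get("labels", {})),
        "tolerations": tolerations,
    }


if __name__ == "__main__":
    export = json.loads(SAMPLE_EXPORT)
    values = scheduling_values(export["node_groups"]["applications"])
    print(json.dumps(values, indent=2))
```

The same helper could feed both the datadog and ACP releases, so that the node group definition in Terraform stays the single source of truth for labels and taints.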

@dduportal
Contributor Author

Next steps:

  • Add a new "applications" node group
    • Need to choose the correct sizing
    • Drain the current "tiny linux" and remove it to use applications fully instead
      ...
  • Export node labels and taints in the JSON export

jenkins-infra/terraform-aws-sponsorship#71

@dduportal
Contributor Author

Set up cluster-autoscaler to have anti-affinity
...
Set up CoreDNS to have anti-affinity

As per the cluster-autoscaler and coredns recommendations, we should not do this, as it may constrain the cluster during upgrades. We should let the scheduler do its job instead (in EKS, as in AKS, it relaxes constraints when possible).

@dduportal
Contributor Author

Export node labels and taints in the JSON export

https://reports.jenkins.io/jenkins-infra-data-reports/aws-sponsorship.json => LGTM

@dduportal
Contributor Author

dduportal commented Dec 23, 2024

Update: had to re-create the cluster to ensure a successful bootstrap. There were many node creation attempts stuck in NotReady, due to several factors:

  • Network ACLs were blocking some requests to the Amazon ECRs hosting some of the addons (coredns) and the cluster-autoscaler image
  • Adding tolerations to the kube-proxy and CNI addons messed up their configuration (most probably because I misunderstood the addon "preserve/overwrite" system).
    • But it is REQUIRED for the CoreDNS addon and cluster-autoscaler...

Related code changes:

dduportal added a commit to jenkins-infra/kubernetes-management that referenced this issue Dec 23, 2024
…as unique release (#6020)

as per
jenkins-infra/helpdesk#4319 (comment)

starting adding the new EKS cluster to infra.ci kubernetes-management

kubeconfig added as secrets here
jenkins-infra/charts-secrets@a24b1ec
and datadog api key here
jenkins-infra/charts-secrets@c7505e8

need #6021

⚠️ BEFORE merging this PR we need to create the `datadog` namespace
using:

```
kubectl config use-context arn:aws:eks:us-east-2:326712726440:cluster/cijenkinsio-agents-2
kubectl create ns datadog
```
 

splitting in multiple PR:

this one is with the minimum release possible, so only datadog as a
start
@dduportal
Contributor Author

dduportal commented Dec 23, 2024

Annnnd datadog is installed: jenkins-infra/kubernetes-management#6020 Merry Christmas!
