
Merge pull request #168 from ruecarlo/eks-1.21-jenkins-fix
Update to fix EKS Jenkins optional section
ruecarlo authored Nov 7, 2021
2 parents 2523659 + b251efe commit ce3f49a
Showing 6 changed files with 178 additions and 86 deletions.
@@ -6,4 +6,4 @@ weight: 80

# Running Jenkins jobs - optional module

In this section, we will deploy a Jenkins master server into our cluster, and configure build jobs that will launch Jenkins agents inside Kubernetes pods. The Kubernetes pods will run on a dedicated Spot nodegroup with the optimized configuration for this type of workload, and we will demonstrate automatically restarting jobs that could potentially fail due to EC2 Spot Interruptions, that occur when EC2 needs the capacity back.
In this section, we will deploy a Jenkins server into our cluster, and configure build jobs that will launch Jenkins agents inside Kubernetes pods. The Kubernetes pods will run on a dedicated EKS managed node group with Spot capacity. We will demonstrate automatically restarting jobs that could potentially fail due to EC2 Spot Interruptions, which occur when EC2 needs the capacity back.
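The installation itself happens later in the module; as a rough sketch, a Helm-based install of the public Jenkins chart could look like the following, where the release name `cicd` matches the cleanup command at the end of this module, and the chart choice and values file name are assumptions:

```
# Add the public Jenkins chart repository (whether the workshop uses this exact
# chart and values file is an assumption; the release name "cicd" is reused in cleanup)
helm repo add jenkins https://charts.jenkins.io
helm repo update
helm install cicd jenkins/jenkins -f jenkins-values.yaml
```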
@@ -4,7 +4,7 @@ date: 2018-08-07T08:30:11-07:00
weight: 80
---

In a previous module in this workshop, we saw that we can use Kubernetes cluster-autoscaler to automatically increase the size of our nodegroups (EC2 Auto Scaling groups) when our Kubernetes deployment scaled out, and some of the pods remained in `pending` state due to lack of resources on the cluster. Let's check the same concept applies for our Jenkins worker nodes and see this in action.
In a previous module in this workshop, we saw that we can use Kubernetes cluster-autoscaler to automatically increase the size of our node groups (EC2 Auto Scaling groups) when our Kubernetes deployment scaled out and some of the pods remained in `pending` state due to lack of resources on the cluster. Let's check that the same concept applies to our Jenkins worker nodes and see it in action.

If you recall, Cluster Autoscaler was configured to auto-discover Auto Scaling groups created with the tags `k8s.io/cluster-autoscaler/enabled` and `k8s.io/cluster-autoscaler/eksworkshop-eksctl`. You can verify in the AWS Console under **EC2 -> Auto Scaling Groups** that the new Jenkins node group does indeed have the right tags defined.
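If you prefer the CLI to the console, the same check can be done with the AWS CLI; this is a generic sketch that assumes only the two tag keys mentioned above:

```
# List the Auto Scaling groups that carry the cluster-autoscaler discovery tags
aws autoscaling describe-tags \
  --filters "Name=key,Values=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/eksworkshop-eksctl" \
  --query 'Tags[].{ASG:ResourceId,Key:Key}' \
  --output table
```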

@@ -16,24 +16,27 @@ CI/CD workloads can benefit from Cluster Autoscaler's ability to scale down to 0! Ca
#### Running multiple Jenkins jobs to reach a Pending pods state
If we replicate our existing Sleep-2m job and run it 5 times, that should be enough for the EC2 Instance in the Jenkins dedicated nodegroup to run out of resources (CPU/Mem), triggering a scale-up activity from cluster-autoscaler to increase the size of the EC2 Auto Scaling group.

1\. On the Jenkins dashboard, in the left pane, click **New Item**\
2\. Under **Enter an item name**, enter `sleep-2m-2`\
3\. At the bottom of the page, in the **Copy from** field, start typing Sleep-2m until the job name is auto completed, click **OK**\
4\. In the job configuration page, click **Save**\
5\. Repeat steps 1-4 until you have 5 identical jobs with different names\
6\. In the Jenkins main dashboard page, click the "**Schedule a build for Sleep-2m-***" on all 5 jobs, to schedule all our jobs at the same time\
7\. Monitor `kubectl get pods -w` and see pods with `jenkins-agent-abcdef` name starting up, until some of them are stuck in `pending` state. You can also use the Kube-ops-view for that purpose.\
8\. Check the cluster-autoscaler log by running `kubectl logs -f deployment/cluster-autoscaler -n kube-system`\
9\. The following lines would indicate that cluster-autoscaler successfully identified the pending Jenkins agent pods, detremined that the nodegroups that we created in the previous workshop module are not suitable due to the node selectors, and finally increased the size of the Jenkins dedicated nodegroup in order to have the kube-scheduler schedule these pending pods on new EC2 Instances in our EC2 Auto Scaling group.\
1. On the Jenkins dashboard, in the left pane, click **New Item**.
2. Under **Enter an item name**, enter `sleep-2m-2`.
3. At the bottom of the page, in the **Copy from** field, start typing `Sleep-2m` until the job name is auto-completed, then click **OK**.
4. In the job configuration page, click **Save**.
5. Repeat steps 1-4 until you have 5 identical jobs with different names.
6. On the Jenkins main dashboard page, click "**Schedule a build for Sleep-2m-***" on all 5 jobs, to schedule all our jobs at the same time.
7. Monitor `kubectl get pods -w` and see pods with names like `jenkins-agent-abcdef` starting up, until some of them are stuck in `pending` state (a narrower watch is shown right after this list). You can also use Kube-ops-view for this purpose.
8. Check the cluster-autoscaler log by running `kubectl logs -f deployment/cluster-autoscaler -n kube-system`.
9. The following lines would indicate that cluster-autoscaler successfully identified the pending Jenkins agent pods, determined that the node groups we created in the previous workshop module are not suitable due to the node selectors, and finally increased the size of the Jenkins dedicated node group so that the kube-scheduler can schedule these pending pods on new EC2 Instances in our EC2 Auto Scaling group.

```
Pod default/default-5tb2v is unschedulable
Pod default-5tb2v can't be scheduled on eksctl-eksworkshop-eksctl10-nodegroup-dev-8vcpu-32gb-spot-NodeGroup-16XJ6GMZCT3XQ, predicate failed: GeneralPredicates predicate mismatch, reason: node(s) didn't match node selector
Pod default-5tb2v can't be scheduled on eksctl-eksworkshop-eksctl10-nodegroup-dev-4vcpu-16gb-spot-NodeGroup-1RBXH0I6585MX, predicate failed: GeneralPredicates predicate mismatch, reason: node(s) didn't match node selector
Best option to resize: eksctl-eksworkshop-eksctl10-nodegroup-jenkins-agents-2vcpu-8gb-spot-2-NodeGroup-7GE4LS6B34DK
Estimated 1 nodes needed in eksctl-eksworkshop-eksctl10-nodegroup-jenkins-agents-2vcpu-8gb-spot-2-NodeGroup-7GE4LS6B34DK
Final scale-up plan: [{eksctl-eksworkshop-eksctl10-nodegroup-jenkins-agents-2vcpu-8gb-spot-2-NodeGroup-7GE4LS6B34DK 1->2 (max: 5)}]
Scale-up: setting group eksctl-eksworkshop-eksctl10-nodegroup-jenkins-agents-2vcpu-8gb-spot-2-NodeGroup-7GE4LS6B34DK size to 2
I1102 14:49:02.645241 1 scale_up.go:300] Pod jenkins-agent-pk7cj can't be scheduled on eksctl-eksworkshop-eksctl-nodegroup-ng-spot-8vcpu-32gb-NodeGroup-1DRVQJ43PHZUK, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I1102 14:49:02.645257 1 scale_up.go:449] No pod can fit to eksctl-eksworkshop-eksctl-nodegroup-ng-spot-8vcpu-32gb-NodeGroup-1DRVQJ43PHZUK
I1102 14:49:02.645416 1 scale_up.go:468] Best option to resize: eks-jenkins-agents-mng-spot-2vcpu-8gb-8abe6f97-53a9-a62a-63f3-a92e6310750c
I1102 14:49:02.645424 1 scale_up.go:472] Estimated 1 nodes needed in eks-jenkins-agents-mng-spot-2vcpu-8gb-8abe6f97-53a9-a62a-63f3-a92e6310750c
I1102 14:49:02.645485 1 scale_up.go:586] Final scale-up plan: [{eks-jenkins-agents-mng-spot-2vcpu-8gb-8abe6f97-53a9-a62a-63f3-a92e6310750c 1->2 (max: 5)}]
I1102 14:49:02.645498 1 scale_up.go:675] Scale-up: setting group eks-jenkins-agents-mng-spot-2vcpu-8gb-8abe6f97-53a9-a62a-63f3-a92e6310750c size to 2
I1102 14:49:02.645519 1 auto_scaling_groups.go:219] Setting asg eks-jenkins-agents-mng-spot-2vcpu-8gb-8abe6f97-53a9-a62a-63f3-a92e6310750c size to 2
```
10\. The end result, which you can see via `kubectl get pods` or Kube-ops-view, is that all pods were eventually scheduled, and in the Jenkins dashboard, you will see that all 5 jobs have completed successfully.

10. The end result, which you can see via `kubectl get pods` or Kube-ops-view, is that all pods were eventually scheduled, and in the Jenkins dashboard, you will see that all 5 jobs have completed successfully.
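To narrow the view during steps 7 and 10, these generic kubectl commands can help (the `jenkins-agent` name prefix comes from step 7):

```
# Watch only the pods that are stuck waiting for capacity (step 7)
kubectl get pods --field-selector=status.phase=Pending -w

# After cluster-autoscaler adds a node, confirm where the agent pods landed (step 10)
kubectl get pods -o wide | grep jenkins-agent
```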

Great result! Let's move to the next step and clean up the Jenkins module.
@@ -16,5 +16,5 @@ helm delete cicd

### Removing the Jenkins nodegroup
```
eksctl delete nodegroup -f spot_nodegroup_jenkins.yml --approve
eksctl delete nodegroup -f add-mng-spot-jenkins.yml --approve
```
@@ -4,9 +4,9 @@ date: 2018-08-07T08:30:11-07:00
weight: 70
---

We now have a dedicated Spot nodegroup with the capacity-optimized allocation strategy that should decrease the chances of Spot Instances being interrupted, and we configured Jenkins to run jobs on those EC2 Spot Instances. We also installed the Naginator plugin which will allow us to retry failed jobs.
We now have a dedicated managed node group with Spot capacity. We also installed the Naginator plugin, which will allow us to retry failed jobs.
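Before creating the job, you can confirm from the terminal that the Spot-backed managed node group is in place. EKS attaches `eks.amazonaws.com/nodegroup` and `eks.amazonaws.com/capacityType` labels to nodes in managed node groups, so a quick check looks like this (a generic sketch; your node group name should match the `jenkins-agents-mng-spot-2vcpu-8gb` name referenced later on this page):

```
# Nodes in EKS managed node groups carry the node group name and capacity type as labels
kubectl get nodes -L eks.amazonaws.com/nodegroup -L eks.amazonaws.com/capacityType

# Expect the Jenkins agent nodes to show capacityType=SPOT
```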

#### Creating a Jenkins job
#### Create a Jenkins job
1. On the Jenkins dashboard, in the left pane, click **New Item**
2. Enter an item name: **Sleep-2m**, select **Freestyle project** and click **OK**
3. Scroll down to the **Build** section, and click **Add build step** -> **Execute shell**
@@ -21,39 +21,37 @@ Since this workshop module focuses on resilience and cost optimization for Jenki
{{% /notice %}}

#### Running the Jenkins job
1\. On the project page for Sleep-2m, in the left pane, click the **Build Now** button\
2\. Browse to the Kube-ops-view tool, and check that a new pod was deployed with a name that starts with `jenkins-agent-`\

1. On the project page for Sleep-2m, in the left pane, click the **Build Now** button.
2. Browse to the Kube-ops-view tool, and check that a new pod was deployed with a name that starts with `jenkins-agent-`.
{{%expand "Show me how to get kube-ops-view url" %}}
Execute the following command in the Cloud9 terminal:
```
kubectl get svc kube-ops-view | tail -n 1 | awk '{ print "Kube-ops-view URL = http://"$4 }'
```
{{% /expand %}}

3\. Check the node on which the pod is running - is the nodegroup name jenkins-agents-2vcpu-8gb-spot? If so, it means that our labeling and Node Selector were configured successfully. \
4\. Run `kubectl get pods`, and find the name of the Jenkins master pod (i.e cicd-jenkins-123456789-abcde)\
5\. Run `kubectl logs -f <pod name from last step> `\
6\. Do you see log lines that show your job is being started? for example "Started provisioning Kubernetes Pod Template from kubernetes with 1 executors. Remaining excess workload: 0"\
7\. Back on the Jenkins Dashboard, In the left pane, click **Build History** and click the console icon next to the latest build. When the job finishes, you should see the following console output:\
3. Check the node on which the pod is running: is the node group name `jenkins-agents-mng-spot-2vcpu-8gb`? If so, it means that our labeling and Node Selector were configured successfully.
4. Run `kubectl get pods`, and find the name of the Jenkins controller pod (i.e. `cicd-jenkins-*`).
5. Run `kubectl logs -f <pod name from last step> -c jenkins`.
6. Do you see log lines that show your job is being started? For example: "jenkins-agent-* provisioning successfully completed. We have now 2 computer(s)".
7. Back on the Jenkins dashboard, in the left pane, click **Build History** and click the console icon next to the latest build. When the job finishes, you should see the following console output:

```
Building remotely on jenkins-agent-bwtmp (cicd-jenkins-slave) in workspace /home/jenkins/agent/workspace/Sleep-2m
[Sleep-2m] $ /bin/sh -xe /tmp/jenkins365818066752916558.sh
Building remotely on jenkins-agent-nkz2z (cicd-jenkins-agent) in workspace /home/jenkins/agent/workspace/Sleep-2m
[Sleep-2m] $ /bin/sh -xe /tmp/jenkins7588311786413895922.sh
+ sleep 2m
+ echo Job finished successfully
Job finished successfully
Finished: SUCCESS
```

#### Job failure and automatic retry
Now that we ran our job successfully on Spot Instances, let's test the failure scenario. Since we cannot simulate an EC2 Spot Interruption on instances that are running in an EC2 Auto Scaling group, we will demonstrate a similar effect by simply terminating the instance that our job/pod is running on.
Now that we ran our job successfully on Spot Instances, let's test the failure scenario. We will demonstrate a failure by simply terminating the instance that our job/pod is running on.

1. Go back to the Sleep-2m project page in Jenkins, and click **Build Now**
2. Run `kubectl get po --selector jenkins/cicd-jenkins-slave=true -o wide` to find the Jenkins agent pod and the node on which it is running
3. Run `kubectl describe node <node name from the last command>` to find the node's EC2 Instance ID under the `alpha.eksctl.io/instance-id` label
4. Run `aws ec2 terminate-instances --instance-ids <instance ID from last command>`
5. Back in the Jenkins dashboard, under the **Build History** page, you should now see the Sleep-2m job as broken. You can click the Console button next to the failed run, to see the JNLP errors that indicate that the Jenkins agent was unable to communicate to the Master, due to the termination of the EC2 Instance.
1. Go back to the Sleep-2m project page in Jenkins, and click **Build Now**.
2. Run `kubectl get po --selector jenkins/cicd-jenkins-agent=true -o wide` to find the Jenkins agent pod and the node on which it is running.
3. Run `kubectl describe node <node name from the last command>` to find the node's EC2 Instance ID under `ProviderID: aws:///*/i-xxxxx`.
4. Run `aws ec2 terminate-instances --instance-ids <instance ID from last command>` (steps 2-4 are condensed into a single scripted sketch after this list).
5. Back in the Jenkins dashboard, under the **Build History** page, you should now see the Sleep-2m job as broken. You can click the Console button next to the failed run to see the JNLP errors that indicate that the Jenkins agent was unable to communicate with the controller, due to the termination of the EC2 Instance.
6. Within 1-3 minutes, the EC2 Auto Scaling group will launch a new replacement instance, and once it has joined the cluster, the Sleep-2m job will be retried on the new node. You should see the Sleep-2m job succeed in the Build History page or the project page.
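For reference, steps 2-4 above can be condensed into a short shell sequence. This is a sketch that assumes a single running Jenkins agent pod and relies on the label selector from step 2 and the `ProviderID` format from step 3:

```
# Find the node running the Jenkins agent pod
NODE=$(kubectl get po --selector jenkins/cicd-jenkins-agent=true \
  -o jsonpath='{.items[0].spec.nodeName}')

# Extract the EC2 instance ID from the node's ProviderID (aws:///<az>/i-xxxxxxxx)
INSTANCE_ID=$(kubectl get node "$NODE" -o jsonpath='{.spec.providerID}' | cut -d'/' -f5)

# Terminate the instance to simulate losing the node mid-build
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"
```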

Now that we successfully ran a job on a Spot Instance, and automatically restarted a job due to a simulated node failure, let's move to the next step in the workshop and autoscale our Jenkins nodes.