CHAOS-31: Support durationSeconds in disruption spec (#400)
* CHAOS-31: Add duration with default to disruptions

* interim logic changes

* Delete chaos pods if they were terminated by duration

* Stop requeueing prematurely after injection is done

* clean up the logging

* remove the magic number

* add test that fails because we can't configure the gc yet

* passing test now that we can configure

* allow for configuring outside of the code

* Add comment

* Make duration a duration

* allow for configuration

* Update controllers/disruption_controller.go

Co-authored-by: Joris Bonnefoy <[email protected]>

* add example

Co-authored-by: Joris Bonnefoy <[email protected]>
ptnapoleon and Devatoria authored Sep 29, 2021
1 parent f668122 commit 8607db3
Showing 17 changed files with 277 additions and 21 deletions.
6 changes: 6 additions & 0 deletions .circleci/config.yml
@@ -151,6 +151,12 @@ jobs:
sudo mkdir -p /go /usr/local/bin /usr/local/kubebuilder
sudo chown circleci:circleci /go /usr/local/bin /usr/local/kubebuilder
- go_restore_cache
- run:
name: Edit controller chart
<<: *working_directory
command: |
sudo snap install yq
yq e '.controller.expiredDisruptionGCDelay = "3s"' -i chart/values.yaml
- run:
name: Wait for Docker Daemon to be up and running
command: |
1 change: 1 addition & 0 deletions README.md
@@ -60,6 +60,7 @@ spec:
selector: # a label selector used to target resources
app: demo-curl
count: 1 # the number of resources to target
durationSeconds: 3600 # the amount of time before your disruption automatically terminates itself
nodeFailure:
shutdown: false # trigger a kernel panic on the target node
```
1 change: 1 addition & 0 deletions api/v1beta1/disruption_types.go
@@ -49,6 +49,7 @@ type DisruptionSpec struct {
AdvancedSelector []metav1.LabelSelectorRequirement `json:"advancedSelector,omitempty"` // advanced label selector
DryRun bool `json:"dryRun,omitempty"` // enable dry-run mode
OnInit bool `json:"onInit,omitempty"` // enable disruption on init
DurationSeconds int64 `json:"durationSeconds,omitempty"` // seconds from disruption creation until chaos pods are deleted and no more are created
// +kubebuilder:validation:Enum=pod;node;""
// +ddmark:validation:Enum=pod;node;""
Level chaostypes.DisruptionLevel `json:"level,omitempty"`
26 changes: 26 additions & 0 deletions chart/install.yaml
@@ -772,6 +772,32 @@ webhooks:
---
# Source: chaos-controller/templates/webhook.yaml
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
annotations:
cert-manager.io/inject-ca-from: chaos-engineering/chaos-controller-serving-cert
name: chaos-controller-disruption-spec-defaults
webhooks:
- clientConfig:
caBundle: Cg==
service:
name: chaos-controller-webhook-service
namespace: chaos-engineering
path: /mutate-chaos-datadoghq-com-v1beta1-disruption-spec-defaults
failurePolicy: Fail
name: chaos-controller-admission-webhook.chaos-engineering.svc
rules:
- apiGroups:
- "chaos.datadoghq.com"
apiVersions:
- v1beta1
operations:
- CREATE
resources:
- disruptions
---
# Source: chaos-controller/templates/webhook.yaml
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
annotations:
2 changes: 2 additions & 0 deletions chart/templates/configmap.yaml
@@ -16,6 +16,8 @@ data:
metricsSink: {{ .Values.controller.metricsSink | quote }}
deleteOnly: {{ .Values.controller.deleteOnly }}
imagePullSecrets: {{ .Values.images.pullSecrets }}
defaultDuration: {{ .Values.controller.defaultDuration }}
expiredDisruptionGCDelay: {{ .Values.controller.expiredDisruptionGCDelay }}
webhook:
{{- if .Values.controller.webhook.generateCert }}
certDir: /tmp/k8s-webhook-server/serving-certs
3 changes: 3 additions & 0 deletions chart/templates/crds/chaos.datadoghq.com_disruptions.yaml
@@ -129,6 +129,9 @@ spec:
type: array
dryRun:
type: boolean
durationSeconds:
format: int64
type: integer
level:
description: DisruptionLevel represents which level the disruption should
be injected at
31 changes: 31 additions & 0 deletions chart/templates/webhook.yaml
@@ -107,6 +107,37 @@ webhooks:
resources:
- disruptions
---
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
annotations:
{{- if not .Values.controller.webhook.generateCert }}
cert-manager.io/inject-ca-from: chaos-engineering/chaos-controller-serving-cert
{{- end }}
name: chaos-controller-disruption-spec-defaults
webhooks:
- clientConfig:
{{- if not .Values.controller.webhook.generateCert }}
caBundle: Cg==
{{- else }}
caBundle: {{ b64enc $ca.Cert }}
{{- end }}
service:
name: chaos-controller-webhook-service
namespace: chaos-engineering
path: /mutate-chaos-datadoghq-com-v1beta1-disruption-spec-defaults
failurePolicy: Fail
name: chaos-controller-admission-webhook.chaos-engineering.svc
rules:
- apiGroups:
- "chaos.datadoghq.com"
apiVersions:
- v1beta1
operations:
- CREATE
resources:
- disruptions
---
{{- if not .Values.controller.webhook.generateCert }}
apiVersion: cert-manager.io/v1alpha2
kind: Certificate
2 changes: 2 additions & 0 deletions chart/values.yaml
@@ -13,6 +13,8 @@ images: # images and tag to pull for each component of the stack
controller:
deleteOnly: false # enable delete-only mode
metricsSink: noop # metrics driver (noop or datadog)
defaultDuration: 1h # default duration applied to a disruption that does not set spec.durationSeconds
expiredDisruptionGCDelay: 15m # time after a disruption expires before deleting it
webhook: # admission webhook configuration
generateCert: false # if you want Helm to generate certificates (e.g. in case the cert-manager is not installed in the cluster) set this to true
certDir: "" # certificate directory (must contain tls.crt and tls.key files)
52 changes: 41 additions & 11 deletions controllers/disruption_controller.go
@@ -77,6 +77,7 @@ type DisruptionReconciler struct {
InjectorDNSDisruptionDNSServer string
InjectorDNSDisruptionKubeDNS string
InjectorNetworkDisruptionAllowedHosts []string
ExpiredDisruptionGCDelay time.Duration
}

// +kubebuilder:rbac:groups=chaos.datadoghq.com,resources=disruptions,verbs=get;list;watch;create;update;patch;delete
@@ -125,7 +126,6 @@ func (r *DisruptionReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error)
// handle any chaos pods being deleted (either by the disruption deletion or by an external event)
if err := r.handleChaosPodsTermination(instance); err != nil {
r.log.Errorw("error handling chaos pods termination", "error", err)

return ctrl.Result{}, err
}

@@ -171,10 +171,25 @@ func (r *DisruptionReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error)
// the injection is being created or modified, apply needed actions
controllerutil.AddFinalizer(instance, disruptionFinalizer)

// If the disruption's duration ended at least r.ExpiredDisruptionGCDelay ago, delete the disruption.
// calculateRemainingDurationSeconds returns the seconds until the duration's deadline (negative once the deadline has passed).
// A value at or below -ExpiredDisruptionGCDelay means the deadline was exceeded by at least ExpiredDisruptionGCDelay, so we can delete.
if calculateRemainingDurationSeconds(*instance) <= (-1 * int64(r.ExpiredDisruptionGCDelay.Seconds())) {
r.log.Infow("disruption has lived for more than its duration, it will now be deleted.", "durationSeconds", instance.Spec.DurationSeconds)
r.Recorder.Event(instance, "Normal", "DurationOver", fmt.Sprintf("The disruption has lived %s longer than its specified duration, and will now be deleted.", r.ExpiredDisruptionGCDelay))

var err error

if err = r.Client.Delete(context.Background(), instance); err != nil {
r.log.Errorw("error deleting disruption after its duration expired", "error", err)
}

return ctrl.Result{Requeue: true}, err
}

// retrieve targets from label selector
if err := r.selectTargets(instance); err != nil {
r.log.Errorw("error selecting targets", "error", err)

return ctrl.Result{}, fmt.Errorf("error selecting targets: %w", err)
}

@@ -185,7 +200,6 @@ func (r *DisruptionReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error)
// start injections
if err := r.startInjection(instance); err != nil {
r.log.Errorw("error injecting the disruption", "error", err)

return ctrl.Result{}, fmt.Errorf("error injecting the disruption: %w", err)
}

@@ -199,11 +213,18 @@ func (r *DisruptionReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error)
return ctrl.Result{}, fmt.Errorf("error updating disruption injection status: %w", err)
} else if !injected {
r.log.Infow("disruption is not fully injected yet, requeuing")

return ctrl.Result{Requeue: true}, nil
}

return ctrl.Result{}, r.Update(context.Background(), instance)
requeueDelay := time.Duration(math.Max(float64(calculateRemainingDurationSeconds(*instance)), r.ExpiredDisruptionGCDelay.Seconds())) * time.Second

r.log.Infow("requeuing disruption", "requeueDelay", requeueDelay.String())

return ctrl.Result{
Requeue: true,
RequeueAfter: requeueDelay,
},
r.Update(context.Background(), instance)
}

// stop the reconcile loop, there's nothing else to do
@@ -226,8 +247,12 @@ func (r *DisruptionReconciler) updateInjectionStatus(instance *chaosv1beta1.Disr
return false, fmt.Errorf("error getting instance chaos pods: %w", err)
}

if calculateRemainingDurationSeconds(*instance) < 0 {
status = chaostypes.DisruptionInjectionStatusPreviouslyInjected
}

// consider a disruption not injected if no chaos pods are existing
if len(chaosPods) > 0 {
if status == chaostypes.DisruptionInjectionStatusNotInjected && len(chaosPods) > 0 {
// check the chaos pods conditions looking for the ready condition
for _, chaosPod := range chaosPods {
podReady := false
@@ -266,13 +291,10 @@ func (r *DisruptionReconciler) updateInjectionStatus(instance *chaosv1beta1.Disr
return false, err
}

// requeue the request if the disruption is not fully injected so we can
// requeue the request if the disruption is not fully injected yet, so we can
// eventually catch pods that are not ready yet but will be in the future
if status != chaostypes.DisruptionInjectionStatusInjected {
return false, nil
}

return true, nil
return status == chaostypes.DisruptionInjectionStatusInjected || status == chaostypes.DisruptionInjectionStatusPreviouslyInjected, nil
}

// startInjection creates non-existing chaos pod for the given disruption
@@ -480,6 +502,11 @@ func (r *DisruptionReconciler) handleChaosPodsTermination(instance *chaosv1beta1
removeFinalizer = true
}

// if the pod died only because it exceeded its activeDeadlineSeconds, we can remove the finalizer
if chaosPod.Status.Reason == "DeadlineExceeded" {
removeFinalizer = true
}

// check if the container was able to start or not
// if not, we can safely delete the pod since the disruption was not injected
for _, cs := range chaosPod.Status.ContainerStatuses {
@@ -726,12 +753,15 @@ func (r *DisruptionReconciler) generatePod(instance *chaosv1beta1.Disruption, ta
// ensures that whether a chaos pod is deleted directly or by deleting a disruption, it will have time to finish cleaning up after itself.
terminationGracePeriod := int64(60)

activeDeadlineSeconds := calculateRemainingDurationSeconds(*instance)

podSpec := corev1.PodSpec{
HostPID: true, // enable host pid
RestartPolicy: corev1.RestartPolicyNever, // do not restart the pod on fail or completion
NodeName: targetNodeName, // specify node name to schedule the pod
ServiceAccountName: r.InjectorServiceAccount, // service account to use
TerminationGracePeriodSeconds: &terminationGracePeriod,
ActiveDeadlineSeconds: &activeDeadlineSeconds,
Containers: []corev1.Container{
{
Name: "injector", // container name
30 changes: 26 additions & 4 deletions controllers/disruption_controller_test.go
@@ -139,10 +139,11 @@ var _ = Describe("Disruption Controller", func() {
Namespace: "default",
},
Spec: chaosv1beta1.DisruptionSpec{
DryRun: true,
Count: &intstr.IntOrString{Type: intstr.Int, IntVal: 1},
Selector: map[string]string{"foo": "bar"},
Containers: []string{"ctn1"},
DryRun: true,
Count: &intstr.IntOrString{Type: intstr.Int, IntVal: 1},
Selector: map[string]string{"foo": "bar"},
Containers: []string{"ctn1"},
DurationSeconds: int64(3600),
NodeFailure: &chaosv1beta1.NodeFailureSpec{
Shutdown: false,
},
@@ -243,6 +244,27 @@ var _ = Describe("Disruption Controller", func() {
})
})

Context("disruption expires naturally", func() {
BeforeEach(func() {
disruption.Spec.Count = &intstr.IntOrString{Type: intstr.String, StrVal: "100%"}
disruption.Spec.DurationSeconds = int64(timeout.Seconds()) + 5
})

It("should target all the selected pods", func() {
By("Ensuring that the chaos pods have been created")
Eventually(func() error { return expectChaosPod(disruption, 12) }, timeout).Should(Succeed())

By("Ensuring that the chaos pods have correct number of targeted containers")
Expect(expectChaosInjectors(disruption, 12)).To(BeNil())

By("Waiting for the disruption to expire naturally")
Eventually(func() error { return expectChaosPod(disruption, 0) }, timeout*2).Should(Succeed())

By("Waiting for disruption to be removed")
Eventually(func() error { return k8sClient.Get(context.Background(), instanceKey, disruption) }, timeout).Should(MatchError("Disruption.chaos.datadoghq.com \"foo\" not found"))
})
})

Context("target one pod and one container only", func() {
It("should target all the selected pods", func() {
By("Ensuring that the inject pod has been created")
18 changes: 18 additions & 0 deletions controllers/helpers.go
@@ -24,6 +24,7 @@ import (
"fmt"
"math"
"regexp"
"time"

"github.com/DataDog/chaos-controller/api/v1beta1"
corev1 "k8s.io/api/core/v1"
@@ -95,6 +96,23 @@ func getScaledValueFromIntOrPercent(intOrPercent *intstr.IntOrString, total int,
return value, nil
}

func calculateRemainingDurationSeconds(instance v1beta1.Disruption) int64 {
return calculateDeadlineSeconds(
time.Duration(instance.Spec.DurationSeconds)*time.Second,
instance.ObjectMeta.CreationTimestamp.Time,
)
}

// calculateDeadlineSeconds returns the number of seconds until creationTime+duration;
// the returned value is negative if the deadline is in the past
func calculateDeadlineSeconds(duration time.Duration, creationTime time.Time) int64 {
// the deadline is measured from when the disruption was created, not from now
timeout := creationTime.Add(duration)
now := time.Now() // read the clock once so the value cannot change during this function

// return the number of seconds between now and the deadline
return int64(timeout.Sub(now).Seconds())
}

// assert label selector matches valid grammar, avoids CORE-414
func validateLabelSelector(selector labels.Selector) error {
labelGrammar := "([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]"
11 changes: 11 additions & 0 deletions docs/features.md
@@ -18,6 +18,17 @@ Let's imagine a node with two pods running: `foo` and `bar` and a disruption dro
* Applying this disruption at the `pod` level and with a selector targeting the `foo` pod will result with the `foo` pod not being able to send any packets, but the `bar` pod will still be able to send packets, as well as other processes on the node.
* Applying this disruption at the `node` level and with a selector targeting the node itself, both `foo` and `bar` pods won't be able to send network packets anymore, as well as all the other processes running on the node.

## Duration

The `Disruption` spec takes a `durationSeconds` field. It is the number of seconds, counted from the disruption's creation, after which
all chaos pods automatically terminate and the disruption stops injecting new ones.
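
As a rough illustration of this timing rule, the remaining time can be derived from the creation timestamp and `durationSeconds`. The sketch below is not the controller's code: `remainingSeconds` and `demo` are hypothetical names, and the current time is passed in as a parameter (an assumption made so the example is deterministic).

```go
package main

import (
	"fmt"
	"time"
)

// remainingSeconds returns the whole seconds left before the deadline
// (creation time + duration). The result is negative once the deadline
// has passed. Taking `now` as an argument is an assumption for this sketch.
func remainingSeconds(durationSeconds int64, created, now time.Time) int64 {
	deadline := created.Add(time.Duration(durationSeconds) * time.Second)
	return int64(deadline.Sub(now).Seconds())
}

// demo evaluates a 1800s disruption 10 minutes and 45 minutes after creation.
func demo() (int64, int64) {
	created := time.Date(2021, 9, 29, 12, 0, 0, 0, time.UTC)
	before := remainingSeconds(1800, created, created.Add(10*time.Minute))
	after := remainingSeconds(1800, created, created.Add(45*time.Minute))
	return before, after
}

func main() {
	before, after := demo()
	fmt.Println(before) // 1200: twenty minutes left
	fmt.Println(after)  // -900: expired fifteen minutes ago
}
```

Because the deadline is anchored to the creation time, a chaos pod created late in the disruption's life gets only the time that is left, not a fresh full duration.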

If `durationSeconds` is not specified, the disruption receives the default duration. This default is set at the controller level through
`controller.defaultDuration` in the controller's config map, and is 1 hour unless changed.

After a disruption's duration expires, the disruption resource remains in the cluster for 15 minutes by default before it is deleted. This delay can be
configured by changing `controller.expiredDisruptionGCDelay` in the controller's config map.
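
The garbage-collection rule can be sketched as a comparison between the remaining time and the configured delay. This is a minimal sketch, assuming the remaining time is expressed in seconds and goes negative after expiry; `shouldGC` and `demo` are hypothetical names, and the 15-minute delay is the documented default.

```go
package main

import (
	"fmt"
	"time"
)

// shouldGC reports whether an expired disruption has outlived its deadline
// by at least gcDelay. remaining is the number of seconds until the deadline
// and is negative once the deadline has passed.
func shouldGC(remaining int64, gcDelay time.Duration) bool {
	return remaining <= -int64(gcDelay.Seconds())
}

// demo checks three points in a disruption's life against a 15m delay.
func demo() []bool {
	gcDelay := 15 * time.Minute
	return []bool{
		shouldGC(600, gcDelay),  // still running: keep
		shouldGC(-60, gcDelay),  // expired one minute ago: keep
		shouldGC(-900, gcDelay), // expired fifteen minutes ago: delete
	}
}

func main() {
	fmt.Println(demo())
}
```

Keeping the expired resource around for a while leaves time to inspect its status and events before it disappears.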

## Targeting

The `Disruption` resource uses [label selectors](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/) to target pods and nodes. The controller will retrieve all pods or nodes matching the given label selector and will randomly select a number (defined in the `count` field) of matching targets. It's possible to specify multiple label selectors, in which case the controller will select from targets that match all of them. Once applied, you can see the targeted pods/nodes by describing the `Disruption` resource.
1 change: 1 addition & 0 deletions examples/complete.yaml
@@ -31,6 +31,7 @@ spec:
- demo
- demo2
count: 1 # number of pods to target or a percentage (1% - 100%)
durationSeconds: 2700 # the number of seconds before the disruption terminates itself
nodeFailure: # node kernel panic or shutdown
shutdown: true # optional, shutdown the host instead of triggering a stack dump (defaults to false)
containerFailure: # terminating a pod's containers gracefully or non-gracefully
18 changes: 18 additions & 0 deletions examples/timed_disruption.yaml
@@ -0,0 +1,18 @@
# Unless explicitly stated otherwise all files in this repository are licensed
# under the Apache License Version 2.0.
# This product includes software developed at Datadog (https://www.datadoghq.com/).
# Copyright 2021 Datadog, Inc.

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
name: network-drop
namespace: chaos-demo
spec:
level: pod
selector:
app: demo-curl
count: 1
durationSeconds: 1800 # Disruption will time out after 1800s (30m)
network:
delay: 300