This repository has been archived by the owner on Nov 2, 2024. It is now read-only.

feat: Add config-based SRE Recipe Implementation #826

Merged
merged 68 commits into from
Sep 10, 2021
Changes from 57 commits
Commits
ef597ae
docs: fix sre recipes documentation
Gan-Tu Jun 4, 2021
70d2131
configs: add YAML representation of SRE Recipes
Gan-Tu Jun 5, 2021
45029f4
feat: add a new recipe runner, with refactored base class
Gan-Tu Jun 5, 2021
ab110d9
configs: add selectors
Gan-Tu Jun 5, 2021
07401f6
feat: implement recipe runner, and leave shell commands in utils
Gan-Tu Jun 5, 2021
e4124c9
use multiline format for hints and descriptions
Gan-Tu Jun 5, 2021
2df1748
add informative status msg when running recipe commands
Gan-Tu Jun 5, 2021
e757f28
Use selector correctly, and move sre recipe configs to new folder
Gan-Tu Jun 5, 2021
ca255b6
Merge branch 'master' into gan-refactor-sre-recipe
Gan-Tu Jun 12, 2021
0b16538
Merge branch 'develop' into gan-refactor-sre-recipe
Gan-Tu Aug 13, 2021
635ad40
[docs]: add sandboxctl describe to hints
Gan-Tu Aug 13, 2021
828453e
[docs] add deprecation warning to original Recipe class
Gan-Tu Aug 13, 2021
a48b7c9
refactor: do not use command lib
Gan-Tu Aug 13, 2021
f16af3f
fix: update selector param to pod_selector in config
Gan-Tu Aug 20, 2021
4ff9864
fix: update new recipes
Gan-Tu Aug 21, 2021
0ad0639
feat: add recipe runner without verify implementation
Gan-Tu Aug 21, 2021
7311472
fix: rename run_steps to handle_actions
Gan-Tu Aug 21, 2021
bf710cf
feat: implement verify
Gan-Tu Aug 21, 2021
ed9cb55
feat: rename broken_service -> affected_service, broken_cause -> inci…
Gan-Tu Aug 21, 2021
f92d826
docs: add README for SRE Recipe configs
Gan-Tu Aug 21, 2021
601d4b0
feat: rename broken_service -> affected_service, broken_cause -> inci…
Gan-Tu Aug 21, 2021
3a8887a
docs: fix run_hint message, when there is no hints available
Gan-Tu Aug 21, 2021
fa4985e
refactor: improve multiple choice logic
Gan-Tu Aug 21, 2021
0ec4e76
minor fixes
Gan-Tu Aug 21, 2021
079081c
docs: improve README to suggest more than action is supported
Gan-Tu Aug 21, 2021
685625b
fix: fix bugs
Gan-Tu Aug 21, 2021
6622595
docs: add COPYRIGHT notices
Gan-Tu Aug 21, 2021
829d01b
chore: ignore srerecipe logs in .gitignore
Gan-Tu Aug 21, 2021
d7ef1e3
refactor: move common commands to utils
Gan-Tu Aug 21, 2021
8dd8aea
feat: use config based recipe runner; WIP
Gan-Tu Aug 21, 2021
55abdd0
feat: support disabling recipes
Gan-Tu Aug 21, 2021
e3a2ddf
refactor: rename sre recipe configs folder to configs_based
Gan-Tu Aug 21, 2021
c7fac01
feat: add implementation based sre recipe; deleted old ones and added…
Gan-Tu Aug 21, 2021
ab4099c
feat: support running impl based recipe
Gan-Tu Aug 21, 2021
34c2db6
feat: update sandboxctl to use new sre recipe runner
Gan-Tu Aug 21, 2021
326498a
feat: use exceptions based error handling for recipe runner
Gan-Tu Aug 21, 2021
61a88cc
chore: add disabled dir for impl_based recipes
Gan-Tu Aug 21, 2021
affda5c
feat: fix sandboxctl destroy
Gan-Tu Aug 21, 2021
9db7fe9
add todo for loadgen
Gan-Tu Aug 21, 2021
1accfad
Merge branch 'develop' into gan-refactor-sre-recipe
Gan-Tu Aug 21, 2021
a549aad
refactor: add single quote around format
Gan-Tu Aug 21, 2021
0b1158e
fix: add more error logging and fix typos
Gan-Tu Aug 21, 2021
9e16e93
fix: fix logging warn
Gan-Tu Aug 21, 2021
8435171
fix: update usage of utils version of get_xx_id()
Gan-Tu Aug 21, 2021
e2b6cf6
refactor: remove dead legacy sre recipe code from sandboxctl
Gan-Tu Aug 21, 2021
26a8303
fix: also decode error message
Gan-Tu Aug 21, 2021
db789c0
docs: fix recipe3 name
Gan-Tu Aug 27, 2021
7c3bd2f
fix: authentication and external IP command
Gan-Tu Aug 27, 2021
7a81330
Update utils.py
Gan-Tu Aug 27, 2021
1b5e978
fix: get external IP only in "cloud-ops-sandbox" cluster
Gan-Tu Aug 27, 2021
c36ae89
feat: support loadgen in config
Gan-Tu Aug 28, 2021
4c9e5da
fix: do not silently exit when cluster authentication fail
Gan-Tu Aug 28, 2021
05058c2
fix: rename sre recipe config fields
Gan-Tu Aug 28, 2021
e3d06d3
refactor: use constants for defaults
Gan-Tu Aug 28, 2021
bd6807b
docs: update readme for sre recipe configs
Gan-Tu Aug 28, 2021
9318457
fix: print error when sandboxctl describe fails
Gan-Tu Aug 28, 2021
36902b6
Stop printing SRE Recipe API response in break/restore
Gan-Tu Aug 28, 2021
5932ca8
docs: improve documentation
Gan-Tu Sep 4, 2021
943deb8
refactor: use new config format for actions
Gan-Tu Sep 4, 2021
82a82ba
feat: use action handler for multiple choice
Gan-Tu Sep 4, 2021
f81718a
refactor: extract action handlers out to its own class
Gan-Tu Sep 4, 2021
c2bb35c
docs: update README for sre recipe config
Gan-Tu Sep 4, 2021
1a38a7e
refactor: rename multiple choice quiz, and add handler docstring
Gan-Tu Sep 4, 2021
ef47ab8
fix: update missed legacy action type
Gan-Tu Sep 4, 2021
9a4ac8a
fix: use subprocess.run instead to avoid merging of stdout,stderr
Gan-Tu Sep 4, 2021
1783171
fix typos
Gan-Tu Sep 10, 2021
63911ed
refactor: disable dummy recipe
Gan-Tu Sep 10, 2021
bb28dee
Merge branch 'develop' into gan-refactor-sre-recipe
Gan-Tu Sep 10, 2021
3 changes: 2 additions & 1 deletion .gitignore
@@ -17,4 +17,5 @@ terraform/istio/istioctl
 terraform/istio/istio-*/**
 .token
 skaffold
-website/resources*
+website/resources*
+srerecipes.log
163 changes: 0 additions & 163 deletions sre-recipes/recipe.py

This file was deleted.

196 changes: 196 additions & 0 deletions sre-recipes/recipe_runner.py
@@ -0,0 +1,196 @@
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# -*- coding: utf-8 -*-

import abc
import importlib
import requests
import subprocess
import yaml

from inspect import isclass
from os import path

import utils
from recipes.impl_based.base import BaseRecipeImpl


# Default Load Generation Config
DEFAULT_LOADGEN_USER_TYPE = "BasicHomePageViewingUser"
DEFAULT_LOADGEN_USER_COUNT = 20
DEFAULT_LOADGEN_SPAWN_RATE = 1
DEFAULT_LOADGEN_TIMEOUT_SECONDS = 600


class ImplBasedRecipeRunner:
    """An SRE Recipe runner for running recipes implemented as class objects.

    Given a `recipe_name`, it tries to run `recipes/impl_based/recipe_name.py`.

    This runner will propagate all exceptions to the caller, and it is the
    caller's responsibility to handle any exception and to perform any error
    logging.
    """

    def __init__(self, recipe_name):
        self.recipe = None
        module = importlib.import_module(f"recipes.impl_based.{recipe_name}")
        for attribute_name in dir(module):
            attr = getattr(module, attribute_name)
            if isclass(attr) and attr is not BaseRecipeImpl and issubclass(attr, BaseRecipeImpl):
                self.recipe = attr()
                break
        if not self.recipe:
            raise NotImplementedError(
                f"No valid implementation exists for `{recipe_name}` recipe.")

    def get_name(self):
        return self.recipe.get_name()

    def get_description(self):
        return self.recipe.get_description()

    def run_break(self):
        return self.recipe.run_break()

    def run_restore(self):
        return self.recipe.run_restore()

    def run_hint(self):
        return self.recipe.run_hint()

    def run_verify(self):
        return self.recipe.run_verify()
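
The class-discovery loop in `ImplBasedRecipeRunner.__init__` can be tried in isolation. The following is a minimal sketch, not the PR's code: `BaseRecipeImpl` here is a local stand-in for the real base class, `find_recipe_class` and `DemoRecipe` are hypothetical names, and an in-memory module replaces the `importlib.import_module` call so nothing has to exist on disk.

```python
import inspect
import types


class BaseRecipeImpl:
    """Local stand-in for recipes.impl_based.base.BaseRecipeImpl (assumption)."""

    def get_name(self):
        return "base"


def find_recipe_class(module, base=BaseRecipeImpl):
    """Return the first strict subclass of `base` exposed by `module`, or None."""
    for attribute_name in dir(module):
        attr = getattr(module, attribute_name)
        # Skip the base class itself: recipe modules import it by name.
        if inspect.isclass(attr) and attr is not base and issubclass(attr, base):
            return attr
    return None


class DemoRecipe(BaseRecipeImpl):
    """Hypothetical recipe implementation used only for this demo."""

    def get_name(self):
        return "demo"


# Build throwaway modules in memory instead of importing real recipe files.
demo_module = types.ModuleType("demo_recipe")
demo_module.BaseRecipeImpl = BaseRecipeImpl  # mimics the import in a recipe file
demo_module.DemoRecipe = DemoRecipe
empty_module = types.ModuleType("empty_recipe")

found = find_recipe_class(demo_module)
missing = find_recipe_class(empty_module)
```

Because `dir()` returns names in sorted order, a module that defines several subclasses would have the alphabetically first one picked, which may be worth keeping in mind when naming recipe classes.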


class ConfigBasedRecipeRunner:
    """An SRE Recipe runner for running recipes implemented using configs.

    Given a `recipe_name`, it tries to load `recipes/configs_based/recipe_name.yaml`.

    This runner will propagate all exceptions to the caller, and it is the
    caller's responsibility to handle any exception and to perform any error
    logging.
    """

    def __init__(self, recipe_name):
        filepath = path.join(path.dirname(
            path.abspath(__file__)), f"recipes/configs_based/{recipe_name}.yaml")

nit: we might want to be tolerant and allow yml extension too

Contributor Author

Given that .yaml is recommended by YAML and we own this repo, let's go with the consistent .yaml extension.

https://yaml.org/faq.html
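
For reference, the tolerant lookup suggested in the nit could look roughly like the sketch below. The PR itself keeps the single `.yaml` extension; `resolve_recipe_config` and its parameters are hypothetical names, not part of this change.

```python
import tempfile
from pathlib import Path


def resolve_recipe_config(config_dir, recipe_name, extensions=(".yaml", ".yml")):
    """Return the first existing `<recipe_name><ext>` under `config_dir`.

    Hypothetical helper: extensions are tried in order, so `.yaml` wins
    when both files exist.
    """
    for ext in extensions:
        candidate = Path(config_dir) / f"{recipe_name}{ext}"
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(
        f"No config found for `{recipe_name}` in {config_dir} "
        f"(tried {', '.join(extensions)}).")


# Demo: only a `.yml` file exists, so the fallback extension is used.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "recipe0.yml").write_text("name: demo\n")
    resolved_name = resolve_recipe_config(tmp, "recipe0").name
```

The trade-off is the one raised in the reply: accepting both extensions is friendlier to contributors, while a single extension keeps the recipe folder consistent.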

        with open(filepath, "r") as file:
            self.recipe = yaml.safe_load(file.read())

Is there a way to define a grammar(structure) for the yaml config?

Contributor Author

Not natively. But since the parsed YAML is JSON-compatible data (mappings, lists, and scalars), we can use a JSON Schema library to do validation checks. I am planning to do that as part of the follow-up PR for integration testing and validation.

https://stackoverflow.com/questions/3262569/validating-a-yaml-document-in-python
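
Until a schema exists, a structural check can be sketched by hand. The example below is deliberately not JSON Schema: it is a dependency-free stand-in that only checks the fields the runner in this diff reads (`config`, `break`, `restore`, `verify`, `affected_service`, `incident_cause`, `choices`, `answer`). The function name `validate_recipe` and the sample recipes are assumptions made for illustration.

```python
def validate_recipe(recipe):
    """Return a list of human-readable problems found in a parsed recipe."""
    if not isinstance(recipe, dict):
        return ["top level must be a mapping"]
    config = recipe.get("config", {})
    if not isinstance(config, dict):
        return ["`config` must be a mapping"]

    problems = []
    for phase in ("break", "restore"):
        actions = config.get(phase, [])
        if not isinstance(actions, list):
            problems.append(f"`config.{phase}` must be a list of actions")
            continue
        for i, action in enumerate(actions):
            known = isinstance(action, dict) and any(
                key in action for key in ("run", "loadgen"))
            if not known:
                problems.append(
                    f"`config.{phase}[{i}]` needs a `run` or `loadgen` key")

    verify = config.get("verify", {})
    if not isinstance(verify, dict):
        problems.append("`config.verify` must be a mapping")
        return problems
    for quiz in ("affected_service", "incident_cause"):
        quiz_config = verify.get(quiz)
        if quiz_config is None:
            continue  # each quiz is optional
        if not isinstance(quiz_config, dict):
            problems.append(f"`config.verify.{quiz}` must be a mapping")
            continue
        for field in ("choices", "answer"):
            if field not in quiz_config:
                problems.append(f"`config.verify.{quiz}` is missing `{field}`")
    return problems


# Illustrative inputs, shaped like the output of yaml.safe_load.
good_recipe = {
    "name": "demo",
    "config": {
        "break": [{"run": "echo break"}],
        "restore": [{"loadgen": "stop"}],
    },
}
bad_recipe = {
    "config": {
        "break": [{"sleep": 5}],
        "verify": {"incident_cause": {"choices": []}},
    },
}

good_problems = validate_recipe(good_recipe)
bad_problems = validate_recipe(bad_recipe)
```

Returning a list of problems instead of raising on the first one lets a config author fix everything in a single pass; a real JSON Schema validator would typically offer the same through its error iterator.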

        if not self.recipe:
            raise ValueError("Cannot parse config as YAML.")

    def get_name(self):
        return self.recipe.get("name", "No name found")

    def get_description(self):
        return self.recipe.get("description", "No description found")

    ############################ Run Recipe ###################################

    @property
    def config(self):
        return self.recipe.get("config", {})

    def run_break(self):
        print('Deploying broken service...')
        self.__handle_actions(self.config.get("break", {}))
        print('Done. Deployed broken service')

    def run_restore(self):
        print('Restoring broken service...')
        self.__handle_actions(self.config.get("restore", {}))
        print('Done. Restored broken service to working state.')

    def run_hint(self):
        hint = self.config.get("hint", None)
        if hint:
            print(f'Here is your hint!\n\n{hint}')
        else:
            print("This recipe has no hints.")

    def run_verify(self):
        verify_config = self.config.get("verify", {})
        if not verify_config:
            raise NotImplementedError("Verify is not configured")

        affected_service_config = verify_config.get("affected_service", {})
        if affected_service_config:
            if "answer" not in affected_service_config:
                raise ValueError(
                    "Correct answer is not specified for affected service quiz.")
            elif "choices" not in affected_service_config:
                raise ValueError(
                    "No answer choices configured in affected service quiz.")
            utils.run_interactive_multiple_choice(
                "Which service has an issue?",
                affected_service_config["choices"],
                affected_service_config["answer"])

        incident_cause_config = verify_config.get("incident_cause", {})
        if incident_cause_config:
            if "answer" not in incident_cause_config:
                raise ValueError(
                    "Correct answer is not specified for incident cause quiz.")
            elif "choices" not in incident_cause_config:
                raise ValueError(
                    "No answer choices configured in incident cause quiz.")
            utils.run_interactive_multiple_choice(
                "What was the cause of the issue?",
                incident_cause_config["choices"],
                incident_cause_config["answer"])

    ########################## Recipe Action Handlers ##########################

    def __handle_actions(self, actions):
        """
        Dispatch and handle a list of actions synchronously.

        Parameters
        ----------
        actions: a list of dictionaries of parameters.
            Example: [{'run': 'echo "Hello World!"'}]
        """
        loadgen_ip = None

        for action in actions:
            if "run" in action:
                output, err = utils.run_shell_command(action["run"])
                if err:
                    raise RuntimeError(f"Failed to run action {action}: {err}")
            elif "loadgen" in action:
                if not loadgen_ip:
                    loadgen_ip, err = utils.get_loadgen_ip()
                    if err:
                        raise RuntimeError(f"Failed to get loadgen IP: {err}")
                if action["loadgen"] == "stop":
                    resp = requests.post(f"http://{loadgen_ip}:81/api/stop")
                    if not resp.ok:
                        raise RuntimeError(
                            f"Failed to stop existing load generation: {resp.status_code} {resp.reason}")
                elif action["loadgen"] == "spawn":
                    user_type = action.get(
                        "user_type", DEFAULT_LOADGEN_USER_TYPE)
                    resp = requests.post(
                        f"http://{loadgen_ip}:81/api/spawn/{user_type}",
                        {
                            "user_count": int(action.get("user_count", DEFAULT_LOADGEN_USER_COUNT)),
                            "spawn_rate": int(action.get("spawn_rate", DEFAULT_LOADGEN_SPAWN_RATE)),
                            "stop_after": int(action.get("stop_after", DEFAULT_LOADGEN_TIMEOUT_SECONDS))
                        })
                    if not resp.ok:
                        raise RuntimeError(
                            f"Failed to start load generation: {resp.status_code} {resp.reason}")
            else:
                raise NotImplementedError(f"action not supported: {action}")
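
The action dispatch above keys off which field is present in each action dictionary. One way to keep that pattern open to new action types is a small handler registry. The sketch below is an illustration under stated assumptions rather than the PR's implementation: `ACTION_HANDLERS`, `handle_run`, and `handle_loadgen` are hypothetical names, and the handlers are stubs that record calls instead of running shell commands or calling the load generator.

```python
calls = []  # records what each stub handler would have done


def handle_run(action):
    # Stub: the real runner shells out via utils.run_shell_command.
    calls.append(("run", action["run"]))


def handle_loadgen(action):
    # Stub: the real runner POSTs to the load generator's HTTP API.
    # Rejecting unknown modes here is a choice made for this sketch.
    if action["loadgen"] not in ("stop", "spawn"):
        raise NotImplementedError(f"loadgen mode not supported: {action}")
    calls.append(("loadgen", action["loadgen"]))


# Hypothetical registry: action key -> handler function.
ACTION_HANDLERS = {"run": handle_run, "loadgen": handle_loadgen}


def handle_actions(actions):
    """Dispatch each action to the handler registered for its key."""
    for action in actions:
        for key, handler in ACTION_HANDLERS.items():
            if key in action:
                handler(action)
                break
        else:
            # No registered key matched this action.
            raise NotImplementedError(f"action not supported: {action}")


handle_actions([
    {"run": 'echo "Hello World!"'},
    {"loadgen": "spawn", "user_count": 20},
])
```

With a registry, supporting a new action type means adding one entry and one function, and the unsupported-action error stays in a single place.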