See README.md for more details on how to use these questions.
These operability questions help teams assess the operability of software systems. They can be used alone or in conjunction with tools like the Run Book dialogue sheet to help discover additional operability criteria or gaps.
Each question requires answers to these key prompts:
- Who? (What?): What kind of user or persona will do this? (Or what kind of system?)
- How?: Which tool (or set of tools) or process will help to do this?
- Evidence: How will you demonstrate this using evidence?
- Score: a score of 1 to 5 for how well your approach compares to industry-leading approaches (1 is poor; 5 is excellent)
Use the Poor and Excellent criteria to help guide your scores. These are examples of the bad and good extremes; your situation may demand slightly different criteria.
Print this page and record answers using pen & paper with your team
Copyright © 2018 Conflux Digital Ltd
Licensed under CC BY-SA 4.0
Permalink: OperabilityQuestions.com
Example:
We need to be able to trace a request across multiple servers/containers/nodes for runtime diagnostic purposes.
- Who? (What?): The software delivery team and the Web Operations team do this
- How?: We use Correlation IDs in HTTP headers, logging the ID at each subsystem, and then searching for the ID in Kibana
- Evidence: We test that Correlation IDs are working properly with BDD tests written in Cucumber. See the tests/correlation-tests folder, e.g. trace-id.feature
- Score: 4
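As a concrete illustration of the kind of evidence referred to above, a minimal pytest sketch of a correlation-ID check might look like the following. The header name, handler, and logger are hypothetical stand-ins for the real Cucumber suite in tests/correlation-tests.

```python
# Minimal sketch (not the actual Cucumber suite): pass a Correlation ID
# through a hypothetical request handler and assert it appears in the logs.
import logging
import uuid

logger = logging.getLogger("orders-service")  # hypothetical service logger

def handle_request(headers: dict) -> dict:
    """Hypothetical subsystem entry point that logs the Correlation ID."""
    correlation_id = headers.get("X-Correlation-ID", str(uuid.uuid4()))
    logger.info("request received correlation_id=%s", correlation_id)
    return {"correlation_id": correlation_id, "status": "ok"}

def test_correlation_id_is_logged(caplog):
    """The ID logged by the subsystem must be searchable, e.g. in Kibana."""
    correlation_id = str(uuid.uuid4())
    with caplog.at_level(logging.INFO):
        handle_request({"X-Correlation-ID": correlation_id})
    assert correlation_id in caplog.text
```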
USABILITY: Which people or groups do we collaborate with to ensure an effective operator experience?
We need a clear understanding of the people and teams that can help to make the software systems work well.
- Poor: We fix any operator problems after go-live
- Excellent: We collaborate with the live service / operations teams from the start of the project
USABILITY: How often and in what ways do we collaborate with other teams on operational aspects of the system? At which stages of the development cycle do we assess and meet operational needs?
We should have a clear approach for meeting the needs of operations people.
- Poor: We respond to operational requests after go-live when tickets are raised by the live service teams
- Excellent: We collaborate on operational aspects from the very first week of the engagement/project
VIABILITY: What proportion of product budget and team effort is spent addressing operational aspects? How do you track this?
We should be spending a good proportion of time and effort on addressing operational aspects.
- Poor: We try to spend as little time and effort as possible on operational aspects / We do not track the spend on operational aspects at all
- Excellent: We spend 30% of our time and budget addressing operational aspects
VIABILITY: What is the longest time between addressing two operational features within your team? That is, what is the longest time between deployments of operationally-focused changes?
We should be addressing operational aspects on a regular, frequent basis throughout the project, not occasionally or just at the end of the project.
- Poor: We do not track operational aspects / It can take months to address operational aspects
- Excellent: We deploy changes to address operational aspects as frequently as we deploy changes to address user-visible features
CONFIGURATION: How do we know which feature toggles (feature switches) are active for this subsystem?
We need clarity about the number and nature of feature toggles for a system. Feature toggles need careful management.
- Poor: We need to run diffs against config files to determine which feature toggles are active
- Excellent: We have a simple UI or API to report the active/inactive feature flags in an environment
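For the "Excellent" case above, a minimal sketch of a flag-reporting function (intended to be exposed on an internal-only route) might look like this; the flag names and in-memory store are illustrative assumptions.

```python
# Minimal sketch: report which feature flags are active/inactive.
# The in-memory dict stands in for whatever flag store the system uses.
FEATURE_FLAGS = {
    "new-checkout": True,   # hypothetical flag names
    "beta-search": False,
}

def feature_flag_report() -> dict:
    """Return every known flag, grouped by whether it is currently active."""
    return {
        "active": sorted(name for name, on in FEATURE_FLAGS.items() if on),
        "inactive": sorted(name for name, on in FEATURE_FLAGS.items() if not on),
    }

if __name__ == "__main__":
    print(feature_flag_report())  # expose via an internal HTTP route in practice
```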
We need to be able to change the configuration of software in an environment without redeploying the executable binaries or scripts.
- Poor: We cannot deploy a configuration change without deploying the software
- Excellent: We simply run a config deployment separately from the software
We need to ensure that only valid, tested configuration data is being used and that the configuration schema itself is controlled.
- Poor: We cannot verify the configuration in use
- Excellent: We use sha256sum hashes to verify the configuration in use
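The same idea as running sha256sum on the command line can be sketched in Python; the config path and expected hash would come from wherever the approved configuration is recorded.

```python
# Minimal sketch: verify that the deployed config file matches the hash
# recorded when that configuration was tested and approved.
import hashlib
import sys

def sha256_of(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

if __name__ == "__main__":
    config_path, expected_hash = sys.argv[1], sys.argv[2]
    actual_hash = sha256_of(config_path)
    if actual_hash != expected_hash:
        print(f"CONFIG MISMATCH: expected {expected_hash}, got {actual_hash}")
        sys.exit(1)
    print("Configuration verified")
```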
We need to define simple ways to report health of the system in ways that are meaningful for that system.
- Poor: We wait for checks made manually by another team to tell us if our software is healthy
- Excellent: We query the software using a standard HTTP healthcheck URL, returning HTTP 200/500, etc. based on logic that we write in the code
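A minimal sketch of such a healthcheck endpoint, using only the Python standard library; the dependency check is a placeholder for whatever logic the team writes.

```python
# Minimal sketch: /healthcheck returns HTTP 200 or 500 based on
# application-defined checks.
from http.server import BaseHTTPRequestHandler, HTTPServer

def dependencies_ok() -> bool:
    return True  # placeholder: e.g. check database/queue connectivity here

class HealthcheckHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthcheck":
            healthy = dependencies_ok()
            self.send_response(200 if healthy else 500)
            self.end_headers()
            self.wfile.write(b"OK" if healthy else b"UNHEALTHY")
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthcheckHandler).serve_forever()
```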
DIAGNOSIS: How do we track the main service/system Key Performance Indicators (KPIs)? What are the KPIs?
We need to define simple ways to report health of the system in ways that are meaningful for that system.
- Poor: We do not have service KPIs defined
- Excellent: We use logging and/or time series metrics to emit service KPIs that are picked up by a dashboard
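Emitting a KPI as a time series metric can be as simple as the sketch below, which sends a StatsD-format counter over UDP; the metric name and agent address are assumptions.

```python
# Minimal sketch: emit a service KPI as a StatsD-style counter so the
# metrics pipeline and dashboard can pick it up.
import socket

STATSD_ADDRESS = ("127.0.0.1", 8125)  # hypothetical local metrics agent
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def emit_counter(metric: str, value: int = 1) -> None:
    """Send a StatsD counter line, e.g. 'orders.checkout.completed:1|c'."""
    _sock.sendto(f"{metric}:{value}|c".encode("utf-8"), STATSD_ADDRESS)

# Example KPI: count completed checkouts; the dashboard shows this as a rate.
emit_counter("orders.checkout.completed")
```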
Logging is a key aspect of modern software systems and must be working correctly at all times.
- Poor: We do not test if logging is working
- Excellent: We test that logging is working using BDD feature tests that search for specific log message strings after a particular application behaviour is executed
Time series metrics are a key aspect of modern software systems and must be working correctly at all times.
- Poor: We do not test if time series metrics are working
- Excellent: We test that time series metrics are working using BDD feature tests that search for specific time series data after a particular application behaviour is executed
TESTABILITY: How do we show that the software system is easy to test? What do we provide and to whom?
Keeping software testable is a key aspect of operability.
- Poor: We do not explicitly aim to make our software easily testable
- Excellent: We run clients and external test packs against all parts of our software within our deployment pipeline
We need to have clarity about certificate renewal so we avoid systems breaking due to expired certificates.
- Poor: We do not know when our certificates are going to expire
- Excellent: We use certificate monitoring tools to keep a live check on when certs will expire so we can take remedial action ahead of time
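Dedicated certificate monitoring tools do this continuously; as a rough sketch of the underlying check, the following reads a host's certificate expiry using the Python standard library (the hostname and warning threshold are placeholders).

```python
# Minimal sketch: report how many days remain before a host's TLS
# certificate expires, and warn when renewal is due.
import socket
import ssl
import time

def days_until_expiry(hostname: str, port: int = 443) -> float:
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            not_after = tls.getpeercert()["notAfter"]
    return (ssl.cert_time_to_seconds(not_after) - time.time()) / 86400

if __name__ == "__main__":
    remaining = days_until_expiry("example.com")  # placeholder hostname
    print(f"Certificate expires in {remaining:.0f} days")
    if remaining < 30:
        print("WARNING: renew the certificate soon")
```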
We need to have clarity about certificate renewal so we avoid systems breaking due to expired certificates.
- Poor: Another team renews and installs SSL/TLS certificates manually
- Excellent: We use automated processes to renew and configure SSL/TLS certificates using Let's Encrypt
We need to have clarity about certificate renewal so we avoid systems breaking due to expired certificates.
- Poor: Another team renews and installs certificates manually
- Excellent: We use automated processes to renew and configure certificates using an API
We need to encrypt data in transit to prevent eavesdropping.
- Poor: We do not explicitly test for transport security; we assume that another team will configure security for us
- Excellent: We test for secure transport as a specific feature of our application
SECURITY: How do we ensure that sensitive data in logs is masked or hidden?
We need to mask or hide sensitive data in logs whilst still exposing the surrounding data to teams.
- Poor: We do not test for data masking in logs
- Excellent: We test that data masking is happening by using BDD feature tests that search for specific log message strings after a particular application behaviour is executed
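One way to implement the masking itself (separate from testing it) is a logging filter that redacts anything matching a sensitive pattern; the regex and logger name below are illustrative only.

```python
# Minimal sketch: a logging.Filter that masks card-number-like values
# before log records reach any handler.
import logging
import re

CARD_NUMBER = re.compile(r"\b\d{13,16}\b")

class MaskSensitiveData(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = CARD_NUMBER.sub("****MASKED****", str(record.msg))
        return True  # keep the record, just with sensitive parts masked

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("payments-service")  # hypothetical service logger
logger.addFilter(MaskSensitiveData())

logger.info("payment accepted for card 4111111111111111")
# logs: "payment accepted for card ****MASKED****"
```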
We need to apply patches to public-facing systems as quickly as possible (but still safely) when a Zero-Day vulnerability is disclosed.
- Poor: Another team is responsible for patching / We do not know if or when a Zero-Day vulnerability occurs
- Excellent: We work with the security team to test and roll out a fix, using our automated deployment pipeline to test and deploy the change
We need to demonstrate that the software can perform well.
- Poor: We rely on the Performance team to validate the performance of our service or application
- Excellent: We run a set of indicative performance tests within our deployment pipeline that are run on every check-in to version control
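An "indicative" pipeline check does not replace full performance testing; a rough sketch might time a batch of requests against a test environment and fail the build when a latency budget is exceeded. The URL, sample size, and budget below are assumptions.

```python
# Minimal sketch: fail the pipeline stage if the 95th percentile latency
# of a key endpoint exceeds an agreed budget.
import statistics
import time
import urllib.request

TARGET_URL = "http://test-env.internal/api/orders"  # hypothetical test endpoint
P95_BUDGET_SECONDS = 0.5

def measure(samples: int = 50) -> list:
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        urllib.request.urlopen(TARGET_URL, timeout=5).read()
        timings.append(time.perf_counter() - start)
    return timings

if __name__ == "__main__":
    p95 = statistics.quantiles(measure(), n=20)[18]  # 95th percentile
    print(f"p95 latency: {p95:.3f}s")
    assert p95 <= P95_BUDGET_SECONDS, "performance budget exceeded"
```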
We need to define and share a set of known failure modes or failure conditions so we better understand how the software will operate.
- Poor: We do not really know how the system might fail
- Excellent: We use a set of error identifiers to define the failure modes in our software and we use these identifiers in our log messages
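A sketch of what such error identifiers might look like in code follows; the identifiers and service name are examples, not a real catalogue.

```python
# Minimal sketch: a catalogue of known failure modes with stable
# identifiers that are embedded in log messages, making failures easy to
# search for, count, and alert on.
import enum
import logging

logger = logging.getLogger("orders-service")  # hypothetical service logger

class FailureMode(enum.Enum):
    DB_CONNECTION_LOST = "ERR-1001"
    PAYMENT_GATEWAY_TIMEOUT = "ERR-1002"
    INVALID_ORDER_STATE = "ERR-1003"

def record_failure(mode: FailureMode, detail: str) -> None:
    # The stable identifier appears in every log line for this failure mode.
    logger.error("%s %s: %s", mode.value, mode.name, detail)

record_failure(FailureMode.PAYMENT_GATEWAY_TIMEOUT, "no response after 30s")
```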
RESILIENCE: How are we sure that connection retry schemes (such as Exponential Backoff) are working?
We need to demonstrate that the system does not overload downstream systems with reconnection attempts, and uses sensible back-off schemes.
- Poor: We do not really know whether connection retry works properly
- Excellent: We test the connection retry logic as part of our automated deployment pipeline
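For reference, a minimal sketch of exponential backoff with jitter is shown below; the operation, attempt limit, and delays are placeholders. A pipeline test could stub the operation to fail a set number of times and assert both the attempt count and the spacing between attempts.

```python
# Minimal sketch: retry a flaky call with capped exponential backoff and
# jitter so downstream systems are not hammered with reconnect attempts.
import random
import time

def call_with_backoff(operation, max_attempts: int = 5, base_delay: float = 0.5):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            # Delays grow 0.5s, 1s, 2s, 4s ... plus a little random jitter.
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            time.sleep(delay)
```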
We need to be able to trace a request across multiple servers/containers/nodes for runtime diagnostic purposes.
- Poor: We do not trace calls through the system
- Excellent: We use a standard tracing library such as OpenTracing to trace calls through the system. We collaborate with other teams to ensure that the correct tracing fields are maintained across component boundaries
We need to display key information about the live operation of the system to teams focused on operations.
- Poor: Operations teams tend to discover the status indicators themselves
- Excellent: We build a dashboard in collaboration with the Operations teams so they have all the details they need in a user-friendly way with UX a key consideration
We need to run synthetic transactions against the live systems on a regular basis. How well does the synthetic monitoring detect problems?
- Poor: The service often goes down without us knowing, and users inform us before we detect a problem / We do not have synthetic monitoring in place for key scenarios
- Excellent: The synthetic monitoring alerts us quickly to problems with key journeys or scenarios
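A synthetic check for a key journey can be very small; the sketch below polls a journey endpoint and raises an alert on failure (the URL and alert hook are placeholders for real monitoring tooling).

```python
# Minimal sketch: poll a key user journey and alert when it fails.
import time
import urllib.error
import urllib.request

JOURNEY_URL = "https://shop.example.com/checkout/health"  # hypothetical journey endpoint

def key_journey_ok() -> bool:
    try:
        with urllib.request.urlopen(JOURNEY_URL, timeout=10) as response:
            return response.status == 200
    except urllib.error.URLError:
        return False

if __name__ == "__main__":
    while True:
        if not key_journey_ok():
            print("ALERT: checkout journey is failing")  # replace with a real alert hook
        time.sleep(60)
```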
We need to demonstrate that the software can recover from internal failures gracefully.
- Poor: We do not really know whether the system can recover from internal failures
- Excellent: We test many internal failure scenarios as part of our automated deployment pipeline
We need to demonstrate that the software can recover from external failures gracefully.
- Poor: We do not really know whether the system can recover from external failures
- Excellent: We test many external failure scenarios as part of our automated deployment pipeline
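A pipeline test for external failure recovery might look like the sketch below: simulate a dependency failure and assert graceful degradation. The service function, client, and fallback behaviour are hypothetical.

```python
# Minimal sketch: simulate an external dependency failure and assert the
# system degrades gracefully instead of crashing.
def fetch_recommendations(client) -> list:
    """Return recommendations, falling back to an empty list on failure."""
    try:
        return client.get_recommendations()
    except ConnectionError:
        return []  # graceful degradation: the page still renders without recommendations

class BrokenClient:
    """Test double that behaves like an unreachable recommendation service."""
    def get_recommendations(self) -> list:
        raise ConnectionError("recommendation service unreachable")

def test_recovers_from_external_failure():
    assert fetch_recommendations(BrokenClient()) == []
```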
Note: operability does not guarantee safety or ethical soundness. Discuss and follow up separately.