Incident Response
Platform outages are a fact of life. The only software that doesn't fail is software that has no users. What matters are the plans in place to:
- prevent outages in the first place
- identify issues early, with the right tooling and processes
- respond to issues decisively without causing more harm
- create confidence in the response
As is customary with large platforms, extensive testing is conducted on every 0L release.
We have added hundreds of "continuous integration" tests to the code base. This means that on every new change to the code, a battery of tests is run automatically. New code is only "merged" into the main branch when these pass.
These tests are similar to "unit tests" in other languages, in that they test the individual behavior of each function in the "smart contracts" that control the protocol policies. In the Move environment this is done by simulating transactions on the network.
More involved testing of protocol features requires purpose-built tooling to create specific scenarios (e.g. how many users and validators are on the chain). This is done through an end-to-end testing environment.
All changes to Rust code (e.g. diem-node, the Move VM, and the ol tooling) have "unit tests" which run as part of CI.
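To make this concrete, here is a minimal sketch of the shape these tests take: `cargo test` runs every function marked `#[test]` on each proposed change, and a failing test blocks the merge. The `epoch_is_complete` helper below is hypothetical and only stands in for the real protocol-policy functions under test.

```rust
// Hypothetical policy helper, standing in for the real functions under test.
fn epoch_is_complete(blocks_in_epoch: u64, epoch_length: u64) -> bool {
    blocks_in_epoch >= epoch_length
}

#[cfg(test)]
mod tests {
    use super::*;

    // CI runs every #[test] function on each proposed change; a red test
    // blocks the merge.
    #[test]
    fn incomplete_epoch_is_not_complete() {
        assert!(!epoch_is_complete(10, 100));
    }

    #[test]
    fn full_epoch_is_complete() {
        assert!(epoch_is_complete(100, 100));
    }
}
```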
We also run a number of "tooling integration tests" which simulate how a user would interact with the 0L tools and the network end-to-end.
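As a rough sketch, an integration test of this kind can shell out to the CLI exactly as a user would and assert on the result. The `ol` binary name comes from the repo, but the flag and the assertion here are placeholders for whatever a real test would check against a running network.

```rust
// Sketch of a tooling integration test: run the CLI as a user would and
// check the outcome. The `--help` flag is assumed here for illustration;
// real tests exercise actual subcommands against a live local network.
use std::process::Command;

#[test]
fn ol_cli_responds() {
    let output = Command::new("ol")
        .arg("--help")
        .output()
        .expect("could not run the `ol` binary; is it built and on PATH?");

    // A minimal check: the tool ran and printed something.
    assert!(!output.stdout.is_empty() || !output.stderr.is_empty());
}
```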
For every "release candidate" of new code to upgrade the network, we first run this code on our testnet, affectionately called "Rex". A new network is created (a new "genesis") with the previous version number, and a code upgrade is carried out on the network. A number of "pre-flight" checks are conducted.
https://github.com/OLSF/libra/blob/main/ol/documentation/network-upgrades/pre-flight-checks.md
When QA is complete, there is a "canary rollout" of the infrastructure software (diem-node and tools) to some validators who are also core engineering contributors. These are controlled tests on production machines which ensure that a) the software builds correctly for that environment and b) it can start and engage with the protocol, under monitoring by skilled engineers.
There are different types of upgrades; you can read about them here: https://github.com/OLSF/libra/blob/main/ol/documentation/network-upgrades/upgrades.md
Observability into the network is critical. This is achieved through processes that people carry out, as well as automated reporting, and it happens in a number of ways on 0L.
The first level of analysis, available to end users, is the web explorer, which displays how blocks are being produced and the state of accounts. This usually helps us identify issues in policies.
Validators can also run a dashboard on their own nodes which displays a number of key checks and statistics about their node and the network.
Each validator also serves a "metrics" endpoint on port 9101 (:9101/metrics). This is a firehose of data covering everything related to consensus and the mempool.
Throughout the Diem code there are sophisticated logging tools, including Prometheus reporting, which operators can set up to get introspection into their nodes.
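As a sketch of how an operator might tap into this, the snippet below fetches the metrics page and keeps only the consensus-related lines; the same endpoint is what a Prometheus scrape job would typically be pointed at. It assumes the `reqwest` crate (with its "blocking" feature) as a dependency and a node running locally; it is illustrative, not part of the 0L tooling.

```rust
// Illustrative only: scrape the node's metrics endpoint and keep the
// consensus-related lines. Assumes the `reqwest` crate with the "blocking"
// feature enabled, and a node serving metrics on localhost:9101.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = reqwest::blocking::get("http://localhost:9101/metrics")?.text()?;

    // The endpoint is a firehose; filter it down to the lines of interest.
    for line in body.lines() {
        if line.contains("consensus") {
            println!("{}", line);
        }
    }
    Ok(())
}
```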
Incident response in 0L has certain priorities:
- Prevent bad state from getting on chain
- Prevent loss of state
- Prevent end-user transactions from failing
- Prevent loss of history
Responding to issues in a decentralized and heterogeneous environment is very challenging. Many high-performance blockchains have a "foundation" which sponsors much of the engineering and incident-response teams. 0L has no such company or team. Every member of the engineering "working group" is independent; no two people work at the same company. This high degree of decentralization is in fact a virtue of 0L.
0L thus requires different procedures for responding to incidents:
Certain members of the engineering team are generally "on call" to alert validators to issues in the customary channel: Discord.
When a potential issue is identified, it is raised in the #validators channel. Once an issue is confirmed, it is raised in the #validators-announcements channel to be broadcast to all.
Certain information will need to be collected from operators. What is needed varies and changes depending on the incident.
Google spreadsheets are used to collect information from all nodes in the validator set. These sheets are designed and propagated ad hoc.
Often the network needs a synchronous and coordinated response from all validator operators. This means scheduling a time when all validators can be on the same voice call, which is obviously challenging in a decentralized environment, and the scheduling constraints inevitably lead to slower response times.
Upgrades that may be necessary during an incident response must go through the same procedures as any other upgrade, as described above. Cutting corners means potentially causing more harm to the network.
Not everyone can follow the technical details of an incident response. The goal is to communicate the status clearly and frequently to all stakeholders.
All users need dedicated channels to receive regular updates on:
- what the issue is
- what is being done
- when more information can be expected
- which step of the incident response strategy we are in