
Incident Response

0o-de-lally edited this page Dec 22, 2021 · 2 revisions

How 0L does incident response

Platform outages are a fact of life. The only software that doesn't fail is software that has no users. What matters is having plans in place to:

  1. prevent outages in the first place
  2. identify issues early through tooling and processes
  3. respond to issues decisively without causing more harm
  4. create confidence in the response

Prevent outages in the first place

As is customary with large platforms, extensive testing is conducted on every 0L release.

Continuous integration:

We have added hundreds of "continuous integration" tests to the code base. This means that on every new change to the code, a battery of tests is run automatically. New code is only "merged" into the main code when these tests pass.

Functional Tests in Move:

These tests are similar to "unit tests" in other languages, where you test the individual behavior of each function in the "smart contracts" that control the protocol policies. In the Move environment this is done by simulating transactions on the network.

End-to-End tests in Move:

More involved testing of protocol features requires purpose-built tooling to create specific scenarios (how many users, validators, etc. are on the chain). This is done through an end-to-end testing environment.

Rust unit tests:

All changes to Rust code (e.g. diem-node, moveVM, and ol tooling) come with "unit tests" which run as part of the CI.

Integration Tests:

We run a number of "tooling integration tests" which simulate how a user would interact with 0L tools and the network end-to-end.

Quality Assurance on Testnet:

For every "release candidate" of new code to upgrade the network, we first run this code on our testnet, affectionately called "Rex". A new network is created (a new "genesis") with the previous version number, and a code upgrade is carried out on the network. A number of "pre-flight" checks are conducted.

https://github.com/OLSF/libra/blob/main/ol/documentation/network-upgrades/pre-flight-checks.md

Canary Roll Out

When the QA is complete there is a "canary rollout" of the infrastructural software (diem-node and tools) to some Validators which are also core engineering contributors. These are controlled tests on production machines which ensure that a) the software builds correctly for that environment and b) the software can start and engage with the protocol with monitoring by skilled engineers.

Roll Out

There are different types of upgrades. You can read about them here: https://github.com/OLSF/libra/blob/main/ol/documentation/network-upgrades/upgrades.md

Tooling and processes to identify issues early

Observability into the network is critical. On 0L it comes from processes that people carry out as well as from automated reporting, in a number of ways.

Explorer

The first level of analysis available to end-users is the web Explorer, which displays how blocks are being produced and the state of accounts. This usually helps us identify issues in policies.

Validator Web Monitor

This is a dashboard that validators can run on their nodes to display a number of key checks and statistics about their node and the network.

Metrics Port

Each validator also runs a "metrics" server, reachable at :9101/metrics. This is a firehose of data on everything related to consensus and mempool.
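As a rough illustration of how an operator might read that firehose, here is a minimal sketch that parses the Prometheus text format into a dictionary. The port and path come from the text above; the metric names in `SAMPLE`, the `fetch_metrics` helper, and the use of plain HTTP are assumptions made for illustration, not names or behavior confirmed by the node.

```python
from typing import Dict
from urllib.request import urlopen

# Hypothetical sample; real metric names on a node will differ.
SAMPLE = """\
# HELP example_consensus_round Hypothetical metric, for illustration only.
# TYPE example_consensus_round gauge
example_consensus_round 1042
example_mempool_txns{state="pending"} 17
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus text-format lines into {series: value}.

    Assumes each sample line is "<series> <value>" with no trailing timestamp.
    """
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        if not series:
            continue  # not a "<series> <value>" line
        samples[series] = float(value)
    return samples


def fetch_metrics(host: str, timeout: float = 5.0) -> Dict[str, float]:
    """Fetch and parse the metrics page of one validator (hypothetical helper)."""
    with urlopen(f"http://{host}:9101/metrics", timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Calling `parse_metrics(SAMPLE)` exercises the parser without a running node. `fetch_metrics("localhost")` would only work on a machine where the metrics server is actually listening.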

Prometheus Logging

Throughout the Diem code there are sophisticated logging tools, including Prometheus reports, which operators can set up to get introspection into their nodes.

Respond to issues decisively without causing more harm

Incident response strategy

Incident response in 0L has certain priorities.

  1. Prevent bad state from getting on chain
  2. Prevent loss of state
  3. Prevent end-user transactions from failing
  4. Prevent loss of history
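One way to make this ordering explicit during triage is to encode it as an ordered enum and sort open risks by it. This is only a sketch: the `Priority` names and the `triage` helper are hypothetical and not part of any 0L tooling.

```python
from enum import IntEnum
from typing import List, Tuple


class Priority(IntEnum):
    """Incident response priorities; a lower value is handled first."""
    PREVENT_BAD_STATE = 1      # bad state getting on chain
    PREVENT_STATE_LOSS = 2     # loss of state
    PREVENT_TX_FAILURE = 3     # end-user transactions failing
    PREVENT_HISTORY_LOSS = 4   # loss of history


def triage(risks: List[Tuple[Priority, str]]) -> List[Tuple[Priority, str]]:
    """Order (priority, description) pairs so the most critical come first."""
    return sorted(risks, key=lambda risk: risk[0])


open_risks = [
    (Priority.PREVENT_TX_FAILURE, "transactions timing out"),
    (Priority.PREVENT_BAD_STATE, "suspect policy change pending"),
]
ordered = triage(open_risks)
```

Here `ordered` places the bad-state risk ahead of the failing transactions, mirroring the list above.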

Decentralization

Responding to issues in a decentralized and heterogeneous environment is very challenging. In many high-performance blockchains there is a "foundation" which sponsors much of the engineering and the incident response teams. 0L has no such company or team. Every member of the engineering "working group" is independent, and no two members work at the same company. This high degree of decentralization is in fact a virtue of 0L.

It does, however, call for different procedures for responding to incidents:

Have dedicated responders on call

Certain members of the engineering team are generally "on call" to alert validators to issues through the customary channel: Discord.

When a potential issue is identified, it is raised in the #validators channel. Once an issue is confirmed, it is raised in the #validators-announcements channel to be broadcast to all.
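The escalation rule above is small enough to write down directly. The sketch below only picks the channel name and formats a message; the function names and the `[POTENTIAL]`/`[CONFIRMED]` tags are hypothetical, and nothing here talks to Discord.

```python
VALIDATORS_CHANNEL = "#validators"
ANNOUNCEMENTS_CHANNEL = "#validators-announcements"


def channel_for(confirmed: bool) -> str:
    """Potential issues go to #validators; confirmed ones are broadcast."""
    return ANNOUNCEMENTS_CHANNEL if confirmed else VALIDATORS_CHANNEL


def format_alert(summary: str, confirmed: bool) -> str:
    """Build the alert text, prefixed with its target channel and status."""
    status = "CONFIRMED" if confirmed else "POTENTIAL"
    return f"{channel_for(confirmed)} [{status}] {summary}"
```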

Operator Reporting

Certain information will need to be collected from operators. What is needed varies from incident to incident.

Google spreadsheets are used to collect information from all nodes in the validator set. These sheets are designed and propagated ad hoc.

Coordinated Actions

The network may often need a synchronous, coordinated response from all validator operators. This means scheduling a time when all validators can join the same voice call, which is challenging in a decentralized environment. Scheduling constraints inevitably lead to slower response times.

Upgrades

Upgrades that may be necessary during incident response must go through the same procedures as any other upgrade, as described above. Cutting corners risks causing more harm to the network.

Creating confidence in the response

Not everyone can follow the technical details of incident response. The goal is to communicate the status clearly and frequently to all stakeholders.

All users need dedicated channels to receive regular updates on:

  1. What the issue is
  2. What is being done
  3. When more information can be expected
  4. Which step of the incident response strategy we are in
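To keep updates consistent, the four points above could be captured in a small template. This is a sketch under assumptions: the `StatusUpdate` class and its field names are hypothetical, and the example values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class StatusUpdate:
    """One stakeholder update covering the four points listed above."""
    issue: str          # what the issue is
    action: str         # what is being done
    next_update: str    # when more information can be expected
    strategy_step: str  # current step of the incident response strategy

    def render(self) -> str:
        """Return the update as plain text, one point per line."""
        return "\n".join([
            f"Issue: {self.issue}",
            f"Action: {self.action}",
            f"Next update: {self.next_update}",
            f"Strategy step: {self.strategy_step}",
        ])


update = StatusUpdate(
    issue="block production has slowed",
    action="collecting reports from operators",
    next_update="within the hour",
    strategy_step="prevent bad state from getting on chain",
)
message = update.render()
```

Requiring all four fields means an update cannot be posted with one of the points left out.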