Robert Christensen edited this page Mar 5, 2014 · 1 revision

Rationale

Since cmdr depends on many external frameworks, it is impossible to guarantee that none of those black-box components will fail at some point. We could try to anticipate error-prone situations by hard-coding responses into cmdr, but that would bloat the code, scale poorly, adapt badly to changing circumstances, and might still fail to cover every possible error.

A better solution is to take those concerns out of the main system and make a separate module responsible for them. That module is Watchman, a monitoring solution that detects failures and takes the right actions to recover from them. It samples the system's behaviors (such as the load average), monitors the system's state changes (such as the network interface going down), looks up a user-specified set of rules that dictate what actions to take in response, and performs those actions.

In order to spot performance issues, the system must be in use. But we'd like to detect and fix performance issues before they appear during a lecture, so we use tests. While the classroom is not in use, we simulate common (and uncommon) usage scenarios and monitor the performance of the system while they run. If cmdr passes those tests, we can be fairly confident that it won't fail during the real deal. Tests allow us both to fine-tune our response rules and to deal with more complicated issues that require changes to the source code, which would be very difficult to correct on the spot.

Specification

Interface

An interface describes a single component of the system (e.g., a process) by specifying its behaviors, states, and available actions. Not every behavior, state, or action is associated with an interface, but interfaces give us an object-oriented structure that makes rule specifications easier to read and write.

Behaviors

A behavior is simply a variable that changes continuously over time. It is specified by the sampling method used to get the current value of that behavior.

Behaviors are sampled lazily; that is, if no active rule cares about some behavior, that behavior will not be sampled at all. Moreover, if a rule applies only at specific points in time rather than continuously (more on that later), the behaviors it uses will be sampled only at those points.
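A minimal sketch of lazy sampling, under the assumption that rules announce which behaviors they depend on. The `BehaviorRegistry` class, its method names, and the fixed sampler values are hypothetical and only illustrate the idea; they are not part of Watchman.

```python
from typing import Callable, Dict, Set


class BehaviorRegistry:
    """Hypothetical registry mapping behavior names to sampling functions."""

    def __init__(self) -> None:
        self._samplers: Dict[str, Callable[[], float]] = {}
        self._watched: Set[str] = set()
        # Counts how often each behavior was actually sampled.
        self.sample_counts: Dict[str, int] = {}

    def register(self, name: str, sampler: Callable[[], float]) -> None:
        """Make a behavior available without sampling it."""
        self._samplers[name] = sampler
        self.sample_counts[name] = 0

    def watch(self, name: str) -> None:
        """Called when an active rule starts depending on a behavior."""
        if name not in self._samplers:
            raise KeyError(f"unknown behavior: {name}")
        self._watched.add(name)

    def sample_watched(self) -> Dict[str, float]:
        """Sample only the behaviors that some active rule cares about."""
        readings: Dict[str, float] = {}
        for name in sorted(self._watched):
            readings[name] = self._samplers[name]()
            self.sample_counts[name] += 1
        return readings


registry = BehaviorRegistry()
# Fixed values stand in for real measurements so the example is repeatable.
registry.register("load_average", lambda: 0.42)
registry.register("disk_usage", lambda: 0.90)

# Only one behavior is used by an active rule, so only it gets sampled.
registry.watch("load_average")
readings = registry.sample_watched()
print(readings)
print(registry.sample_counts)
```

A rule that applies only at scheduled times would call `sample_watched()` at those times instead of on every tick.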

States

States are ON/OFF variables that describe some aspect of the system.

Instead of being sampled, a state changes simply in response to an event occurring.
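The difference from behaviors can be sketched as follows: nothing is polled, and a state only flips when an event arrives. The `StateTable` class and the event names (`iface_up`, `iface_down`) are assumptions made for this illustration.

```python
from typing import Dict, Tuple


class StateTable:
    """Hypothetical table of ON/OFF states driven purely by events."""

    def __init__(self) -> None:
        self._states: Dict[str, bool] = {}
        # Maps an event name to the (state, new value) it causes.
        self._transitions: Dict[str, Tuple[str, bool]] = {}

    def define(self, state: str, initial: bool, on_event: str, off_event: str) -> None:
        """Declare a state and the events that switch it ON and OFF."""
        self._states[state] = initial
        self._transitions[on_event] = (state, True)
        self._transitions[off_event] = (state, False)

    def handle_event(self, event: str) -> None:
        """Update the affected state; events nobody declared are ignored."""
        if event in self._transitions:
            state, value = self._transitions[event]
            self._states[state] = value

    def is_on(self, state: str) -> bool:
        return self._states[state]


states = StateTable()
states.define("network_up", initial=True, on_event="iface_up", off_event="iface_down")

states.handle_event("iface_down")      # the network interface goes down
went_down = not states.is_on("network_up")

states.handle_event("unrelated_event")  # has no effect on any state
states.handle_event("iface_up")         # the interface comes back
print(went_down, states.is_on("network_up"))
```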

Tests

Tests simulate specific cmdr use cases (e.g., "open projector and switch to the VCR") and gather results from multiple sources (e.g., polling the projector for its state, checking whether the database has been updated) to determine whether cmdr performed the task correctly. Each test returns success or failure.

The purpose of tests is to identify whether the system works (determined by the outcome of the tests), whether the system works well (determined by measuring the performance of the system during tests), and whether the current rules are sensible (e.g., that they are not activated unreasonably often).
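As a rough sketch, a test can be modeled as a scenario to run plus a set of named checks, with the elapsed time recorded as a simple performance measure. The `run_test` helper, the `ScenarioResult` type, and the in-memory `fake_room` that stands in for the projector and the database are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ScenarioResult:
    passed: bool              # overall success/fail
    checks: Dict[str, bool]   # outcome of each individual source
    elapsed_seconds: float    # crude performance measurement


def run_test(scenario: Callable[[], None],
             checks: Dict[str, Callable[[], bool]]) -> ScenarioResult:
    """Run a scenario, then ask every source whether it saw the expected result."""
    start = time.monotonic()
    scenario()
    outcomes = {name: bool(check()) for name, check in checks.items()}
    elapsed = time.monotonic() - start
    return ScenarioResult(all(outcomes.values()), outcomes, elapsed)


# Stand-in for real devices and the real database (assumption for this sketch).
fake_room = {"projector_on": False, "input": None, "db_updated": False}


def open_projector_and_switch_to_vcr() -> None:
    fake_room["projector_on"] = True
    fake_room["input"] = "VCR"
    fake_room["db_updated"] = True


result = run_test(
    open_projector_and_switch_to_vcr,
    {
        "projector reports ON": lambda: fake_room["projector_on"],
        "projector input is VCR": lambda: fake_room["input"] == "VCR",
        "database was updated": lambda: fake_room["db_updated"],
    },
)
print(result.passed, result.checks)
```

Keeping the per-check outcomes alongside the overall verdict makes it easier to see which source disagreed when a test fails.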

Rules

Rules are formed by combining a predicate (composed of behaviors, states and tests) with a set of actions. When the predicate is true, the actions are executed sequentially.

  • Time: rules can either apply at all times, apply only at specific repeated times ("every Monday at 12:00 AM") or apply at one-time instances ("1/13/11 12:30 AM").
  • Operators: rules can be combined with conjunction and disjunction operators, and also be negated.
  • Behaviors: rules can perform checks on the value of a behavior, or the behavior's rate of change.
  • States: rules can check whether the system or some interface is in a specific state or not.
  • Tests: rules can check whether a test has been completed successfully or not.
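The predicate-plus-actions structure described above might look like the following sketch. The helper names (`behavior_above`, `state_is`, `all_of`, `any_of`, `negate`), the `Rule` class, and the snapshot dictionary are assumptions for illustration; scheduling (the "Time" item) is left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Snapshot = Dict[str, Any]                 # current behaviors, states, test results
Predicate = Callable[[Snapshot], bool]


def behavior_above(name: str, threshold: float) -> Predicate:
    return lambda snapshot: snapshot[name] > threshold


def state_is(name: str, expected: bool) -> Predicate:
    return lambda snapshot: snapshot[name] == expected


def all_of(*predicates: Predicate) -> Predicate:
    """Conjunction."""
    return lambda snapshot: all(p(snapshot) for p in predicates)


def any_of(*predicates: Predicate) -> Predicate:
    """Disjunction."""
    return lambda snapshot: any(p(snapshot) for p in predicates)


def negate(predicate: Predicate) -> Predicate:
    return lambda snapshot: not predicate(snapshot)


@dataclass
class Rule:
    name: str
    predicate: Predicate
    actions: List[Callable[[Snapshot], None]]

    def evaluate(self, snapshot: Snapshot) -> bool:
        """Run the actions in order if the predicate holds; report whether it fired."""
        if not self.predicate(snapshot):
            return False
        for action in self.actions:
            action(snapshot)
        return True


log: List[str] = []
high_load_while_online = Rule(
    name="high_load_while_online",
    predicate=all_of(
        behavior_above("load_average", 4.0),
        negate(state_is("network_down", True)),
    ),
    actions=[
        lambda snapshot: log.append("notify admin"),
        lambda snapshot: log.append("restart cmdr"),
    ],
)

fired = high_load_while_online.evaluate({"load_average": 5.2, "network_down": False})
print(fired, log)
```

Building predicates from small composable functions keeps each rule readable while still allowing arbitrary nesting of the operators.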

Notifications

A notification is the action of sending an e-mail or a text message to a person. The notification should include information on the rule that led to the user being notified, as well as a summary of Watchman's operation on that day.
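One possible shape for the e-mail variant, using Python's standard `email.message.EmailMessage`. The function name, the subject format, the address, and the contents of the daily summary are all hypothetical; the sketch only builds the message and does not send it.

```python
from email.message import EmailMessage
from typing import Dict


def build_notification(recipient: str, rule_name: str, rule_description: str,
                       daily_summary: Dict[str, int]) -> EmailMessage:
    """Compose a message naming the triggering rule plus a summary of the day."""
    message = EmailMessage()
    message["To"] = recipient
    message["Subject"] = f"[watchman] rule triggered: {rule_name}"

    summary_lines = [f"  {key}: {value}" for key, value in sorted(daily_summary.items())]
    body = "\n".join(
        [
            f"Rule: {rule_name}",
            f"Condition: {rule_description}",
            "",
            "Summary of today's operation:",
            *summary_lines,
        ]
    )
    message.set_content(body)
    return message


message = build_notification(
    recipient="admin@example.com",           # placeholder address
    rule_name="high_load",
    rule_description="load average above 4.0",
    daily_summary={"rules_fired": 3, "tests_passed": 12, "tests_failed": 1},
)
print(message["Subject"])
print(message.get_content())
```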

Operation

Shell

Watchman should never have to restart. The user should be able to change its behavior and view performance metrics through a simple command-line shell. The shell should offer simple tab-completion and be accessible through the network.

Syncing

An environment specifies the set of rules that are active during Watchman's operation. A centralized repository of environments will be kept on GitHub. At any point, one can replace the current environment with another environment from the repository. Daily backups ensure that the environment associated with a specific room isn't lost.

Available tests, behaviors, states and actions are also stored in a centralized repository, and are updated daily for every room. All of them are available to all clients; there is no reason to allow customization here.

Notes:

For testing, it is useful to set up an SSH tunnel from your local machine to a cmdr box. The tunnel makes localhost:1412 forward data to cmdr-host:1412, so you can easily test from your own machine.

ssh -NL 1412:localhost:1412 [email protected]