
On Call Engineering Duties


An On Call Engineer's guide to Habitat in production or "doing it live"

The Habitat team owns and operates our own services. We are responsible for their uptime and availability. To that end, we have a rotating, 72 hour on call period in which each Chef employed team member participates. If you cannot cover your assigned rotation, you are responsible for getting coverage.

PagerDuty alert? Stay calm and check the Known Issues section at the end of this doc.

What you need before going on call

Additional items

You may occasionally need to access these:

Responsibilities

  • Be available to respond to PagerDuty alerts (if you are going to be away from a computer for an extended period, you are responsible for finding someone to take on call).
  • Incident response does not mean you need to solve the problem yourself.
  • You are expected to diagnose, to the best of your ability, and bring in help if needed.
  • Communication while troubleshooting is important.
  • Triage incoming GitHub issues and PRs (best effort - at least tag issues if possible and bring critical issues to the forefront).
  • Monitor #general and #operations as time permits.

More about Chef's incident and learning reviews (post-mortems) can be found at https://chefio.atlassian.net/wiki/pages/viewpage.action?spaceKey=ENG&title=Incidents+and+Post+Mortems

During your on-call rotation, it is expected that you won’t have the same focus on your assigned cards and that you might be more interrupt-driven. This is a good thing because your teammates can stay focused knowing that their backs are being watched. You benefit likewise when you are off rotation.

Handing off coverage when you're on call

If you'll be unavailable during a portion of your on-call duty, hand off coverage to another member of the team by setting up a PagerDuty override. See the PagerDuty docs for instructions on how to do that.

SSH access to prod/acceptance nodes

You will need access to the 1Password shared vault (if you do not have this, ask a core team member) and to the Habitat AWS account (available through an icon in Okta; if you don't have this, ask the Chef internal help desk).

Copy the "habitat-srv-admin" key from the shared vault. I always put mine in my ~/.aws directory on my workstation.

Use that key to SSH into the prod or acceptance node you wish to access.

$ ssh -i ~/.aws/habitat-srv-admin ubuntu@node_public_ip_address

Pro Tip: There is a set of scripts in the tools/ssh_helpers directory of the habitat repo. You can use these to automatically populate all the production (and acceptance) nodes in your ssh config, and can then simply refer to a node by its friendly name, e.g. ssh production-builder-api-0.
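
The helper scripts simply populate standard entries in your ~/.ssh/config. If you ever need to add one by hand, a hypothetical entry (the IP address is a placeholder) would look roughly like this:

Host production-builder-api-0
    HostName <node_public_ip_address>
    User ubuntu
    IdentityFile ~/.aws/habitat-srv-admin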

Current state of Production

Each of the builder services runs in its own AWS instance. You can find these instances by logging in to the Habitat AWS portal. If you do not already have access, ask #helpdesk to add the Habitat AWS Portal app to your OKTA dashboard. Make sure to add the X-Environment column to the search results and search for instances in the live environment.

Current state of Acceptance

The acceptance environment closely mirrors the live environment with the exception that it runs newer (the latest 'unstable' channel) service releases. Use the AWS portal as described above to locate the acceptance environment builder service instances. The acceptance environment also runs fewer build workers than the production environment.

Troubleshooting

Historically, trouble in production has often been resolved by restarting services. However, restarting should not necessarily be the first resort: at least in production, services should rarely need to be restarted manually, and ideally a restart should only be done when there is some evidence in the logs that one is called for. Here are some generic pointers for exploring the status of production.

Sumologic logs

Sumologic currently aggregates logs from the builder API and build worker nodes. The Sumologic logs (and particularly the Live Tail capability) are invaluable for troubleshooting, and should generally be one of the first places to start a troubleshooting session.

The key Sumologic queries can be found in the Builder Searches folder:

  • Live API Errors - 15 min
  • Live API Access
  • Live Workers
  • Acceptance API Errors - 15 min
  • Acceptance API Access
  • Acceptance Workers

Make sure you can run these queries, and be familiar with starting a Live Tail session for them.

Supervisor logs (syslog)

You can read the Supervisor output on any of the service instances by SSHing into a service node and running journalctl -fu hab-sup. If you find yourself needing to read production logs, the -fu should roll quite naturally off the fingertips. If there is a specific timeframe when a problem occurred, it is sometimes useful to get the logs from that specific time (UTC) - for example, journalctl --since '2017-11-13 10:00:00' -u hab-sup | more will show logs from the specified time onward.
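
For example, to pull the Supervisor logs for a bounded window around an incident, you can combine --since with --until (the timestamps below are illustrative, in UTC):

journalctl -u hab-sup --since '2017-11-13 10:00:00' --until '2017-11-13 10:30:00' | less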

Restarting services

Most instances run just a single service, but a couple run two (or more). Running sudo systemctl restart hab-sup will restart the Supervisor itself and therefore all services it runs. You may of course run sudo hab sup stop [service ident] followed by sudo hab sup start [service ident] to restart individual services. Run sudo hab sup status to determine which services are loaded and what state each service is in.
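
For example, a minimal sequence for restarting a single service without touching the rest (the service ident shown is illustrative):

sudo hab sup status                   # list loaded services and their state
sudo hab sup stop core/builder-api    # stop just this service
sudo hab sup start core/builder-api   # start it again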

Here is a brief synopsis of the builder services:

  • api - acts as the REST API gateway
  • api-proxy - is the NGINX front end for the REST API
  • sessionsrv - manages API authentication and also stores information related to individual accounts. If website authentication is not working or if "My Origins" shows up empty, chances are you need to restart this service.
  • originsrv - manages the "depot" data. If package search is broken, restart this service.
  • router - the hub that routes requests to the appropriate service. If things appear broken and restarting individual services does not resolve site errors, try restarting this service.
  • datastore - runs the Postgres database. Typically this service does not need restarting unless the logs indicate that it is throwing errors. Be aware that sometimes stopping this service does not clean up well and you may need to clean up some lock files before starting. If that is the case, error messages should state which lock files are causing issues.
  • jobsrv - handles build jobs. If clicking the Request a Build button does nothing, or you get a popup message that the build was accepted but the build output is never displayed, you may need to restart this service. If package uploads fail, this service may also need to be restarted (since it manages a pre-check to make sure packages do not introduce circular dependencies).

Querying the database

  • SSH to the builder-datastore instance
  • su to the hab user: sudo su - hab
  • run /hab/pkgs/core/postgresql/9.6.1/20170514001355/bin/psql postgres
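
The PostgreSQL version in the path above may drift as the package is updated. A hedged alternative, assuming core/postgresql is installed via Habitat, is to let hab resolve the path:

$(hab pkg path core/postgresql)/bin/psql postgres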

Note that since Postgres is running as a Habitat service, hab pkg exec can also run psql; however, hab pkg exec runs psql with busybox bash loaded, and some of the psql navigation does not work well.

Each of the builder services occupies its own database. \l will list all databases and you can connect to one using \c <database name>.

Most of the databases are sharded (key exception: jobsrv), so you cannot simply start querying tables until you set your SEARCH_PATH to a shard. This can be potentially challenging. You need to know which shard has the data you are interested in. You navigate to a shard using SET SEARCH_PATH TO shard_<number>;. \dn will list all shards and can be helpful in determining if your database is sharded at all. If it is not, you just need to navigate to shard_0. Once set, you can run queries and also use \dt to list tables.
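
For example, a quick sanity query against the core origin (which, per the HINT further below, lives in shard_30 of builder_originsrv) might look like this when run as the hab user:

/hab/pkgs/core/postgresql/9.6.1/20170514001355/bin/psql \
    -d builder_originsrv \
    -c "SET SEARCH_PATH TO shard_30; SELECT name FROM origins LIMIT 10;"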

In the event that you need to run a query across all shards, you can run something like this:

sudo su - hab
for i in {0..127}; do
     /hab/pkgs/core/postgresql/9.6.1/20170514001355/bin/psql \
        -d builder_originsrv \
        -c "SELECT 'shard_${i}' as shard, name FROM shard_${i}.origins;"; done

HINT: If querying the builder_originsrv database, all data related to the core origin is in shard_30.

Finding the shard id for a thing

Since most of the Habitat services store their data in shards, it is sometimes helpful to know which shard a certain origin, user, or account id is stored in. For this purpose, you can use the op tool, which is part of the habitat repo. You will need to build the tool yourself; however, it's pretty straightforward (cd components/op; cargo build).

Run op --help to see the parameters.

Example (checking the shard for the 'core' origin):

$ op shard core
30

Deploying code

If you are in a position where you need to deploy a fix, the builder services (assuming they are up) make this easy. Once your fix is merged to master, a build hook will kick in and automatically rebuild the packages that need to be updated. Those 'unstable' packages will be picked up and installed automatically on the acceptance environment. Once the fix is validated on acceptance, those packages can be promoted to stable via the hab CLI (the hab pkg promote or hab bldr job promote commands).
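
For example (the package ident and job group id below are illustrative placeholders, not real values):

# Promote a single package to the stable channel
hab pkg promote core/builder-api/7000/20180101120000 stable

# Or promote all packages from a Builder job group at once
hab bldr job promote 123456789 stable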

The Sentinel bot

  • The sentinel bot lives at bots.habitat.sh
  • The SSH key and user for logging in are the same as the production instances
  • The sentinel configuration files live at /hab/svc/sentinel
  • To see what the sentinel service is doing, you can tail the logs with journalctl -f -u sentinel
  • If you suspect the sentinel bot is stuck and needs to be restarted, you can run systemctl restart sentinel

Builder issues

For Builder issues, please see: https://github.com/habitat-sh/habitat/wiki/Troubleshooting-Builder-Services

Known Issues

Unable to log in, error 503s, failure to upload packages

The most common issue (particularly in acceptance, but it also happens in prod) is a current bug where the directory ownership of some of the builder-api-proxy directories gets set to root. This causes 503 errors, login failures, build failures, and other problems (and also results in PagerDuty alerts).

If this happens, you can confirm by looking at the API logs; you will generally see something like:

2017/11/16 23:35:05 [error] 41385#0: *2262823 open() "/hab/pkgs/core/builder-api-proxy/5988/20171027084253/app/undefined/v1/depot/pkgs/core/scaffolding-ruby/latest" failed (2: No such file or directory), client: 10.0.0.206, server: ip-10-0-0-147, request: "GET /undefined/v1/depot/pkgs/core/scaffolding-ruby/latest HTTP/1.1", host: "bldr.habitat.sh", referrer: "https://bldr.habitat.sh/

In order to resolve the problem, ssh into the production-builder-api-0 node and run the ./fix-permissions.sh script that can be found in the home directory. For reference, the script does the following:

sudo chown -R hab:hab /hab/svc/builder-api-proxy/config
sudo chown -R hab:hab /hab/svc/builder-api-proxy/var
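
To confirm the fix took effect, a quick ownership check on the same directories the script touches:

ls -ld /hab/svc/builder-api-proxy/config /hab/svc/builder-api-proxy/var

Both should now show hab:hab as owner and group.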

If this does not resolve the issue, and you are seeing errors like the following in the Sumologic API Errors query:

2017/12/01 19:11:59 [error] 22802#0: *77643 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.0.0.136, server: ip-10-0-0-136, request: "HEAD /v1/status HTTP/1.1", upstream: "http://127.0.0.1:9636/v1/status", host: "10.0.0.136"

then you will need to restart the hab service on the API node (e.g., live-builder-api-0 or acceptance-builder-api-0):

sudo systemctl restart hab-sup

net:route:3 TIMEOUT errors

Occasionally, there can be issues with users not being able to log in, or other front-end functionality being degraded. Check the syslog on the builder-api node and look for TIMEOUT errors. These can indicate that the builder-api-proxy (nginx) is not receiving a response from the builder-api service.
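
A quick way to scan for these (assuming the Supervisor output lands in the hab-sup journald unit as described earlier):

journalctl -u hab-sup --since '1 hour ago' | grep TIMEOUT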

Nov 16 00:35:46 ip-10-0-0-147 hab[40751]: builder-api.live(O): ERROR:habitat_net::conn: [err: TIMEOUT, msg: net:route:3], Connection recv timeout
Nov 16 00:35:46 ip-10-0-0-147 hab[40751]: builder-api.live(O): ERROR:iron::iron: Error handling:
Nov 16 00:35:46 ip-10-0-0-147 hab[40751]: builder-api.live(O): Request {
Nov 16 00:35:46 ip-10-0-0-147 hab[40751]: builder-api.live(O):     url: Url { generic_url: "http://backend/v1/authenticate/ca363a29d90875123456" }
Nov 16 00:35:46 ip-10-0-0-147 hab[40751]: builder-api.live(O):     method: Get
Nov 16 00:35:46 ip-10-0-0-147 hab[40751]: builder-api.live(O):     remote_addr: V4(127.0.0.1:59226)
Nov 16 00:35:46 ip-10-0-0-147 hab[40751]: builder-api.live(O):     local_addr: V4(0.0.0.0:9636)
Nov 16 00:35:46 ip-10-0-0-147 hab[40751]: builder-api.live(O): }
Nov 16 00:35:46 ip-10-0-0-147 hab[40751]: builder-api.live(O): Error was: NetError(code: TIMEOUT msg: "net:route:3")

If this happens, a restart of the hab services (sudo systemctl restart hab-sup) on the API node should resolve the issue.

NO_SHARD errors

Another case where there can be service degradation (for example, builds not getting triggered) is when services are not registering with the router. In order to diagnose this, examine the syslog on the production-builder-router-0 node. If you see errors that look like the ones below, a restart of a service may be called for. It's not always clear which service should be restarted, though. If a build is not getting triggered, or an upload is failing, try restarting the hab services on the production-builder-jobsrv-0 node (see the sketch after the log excerpt below). You might also need to restart the originsrv service.

Nov 17 01:52:34 ip-10-0-0-45 hab[3176]: builder-router.live(O): ERROR:habitat_builder_router::server: [err: NO_SHARD, msg: rt:route:2]
Nov 17 01:52:34 ip-10-0-0-45 hab[3176]: builder-router.live(O): ERROR:habitat_builder_router::server: [err: NO_SHARD, msg: rt:route:2]
Nov 17 01:52:35 ip-10-0-0-45 hab[3176]: builder-router.live(O): ERROR:habitat_builder_router::server: [err: NO_SHARD, msg: rt:route:2]
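
A minimal remediation sketch, assuming your SSH config was populated with the helper scripts mentioned earlier (the node name is the friendly name those scripts generate):

ssh production-builder-jobsrv-0
sudo systemctl restart hab-sup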

Missing ens4 network interfaces on workers

Currently, the ens4 network interface on our workers appears to get lost for an as-yet-undetermined reason. The interface can legitimately "disappear" when it has been used in the network namespace of a running build, but it should be returned afterward.

The fix is being worked on, but in the meantime, if you suspect the interfaces have really been lost (e.g., there aren't any builds running on the worker, but there's no ens4 interface present), you can use the probe-workers.sh script to restart the network stack on the offending machines.

To simply probe workers in the live environment, but perform no restarts, run the following:

probe-workers.sh live

(To do so in acceptance, just use acceptance instead of live.)

To then restart the networking stacks of machines without an interface, add the -r option:

probe-workers.sh -r live

NOTE: This script assumes you've set up your SSH configuration using the scripts in the ssh_helpers directory!