Reduced api and site availability

Incident Report for Expo

Postmortem

Phase 1: 7pm PST November 26th to 4:10am PST Nov 27th

No user-facing impact

At 7:04pm PST on the 26th, one of our engineers - who I will call "me" or "I" - shut down a database which our api and site servers no longer read from or wrote to, but for which they were still running local query routers. These routers immediately started failing, and could not recover. Since our site and api servers were no longer using them, and since they had no liveness probe defined, this failure had no effect on traffic, and was not noticed by my manual testing or our automated monitoring.

Containers run in kubernetes can optionally have a “liveness probe,” which is either a command to run inside the container, or an HTTP request to run against it. If a container has a liveness probe, kubernetes will execute it on a regular basis, and mark the container unhealthy if it fails. Kubernetes runs containers in sets called "pods," and only considers a pod healthy if all its containers are healthy.
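As a rough illustration of the rule described above, the following Python sketch models a pod as a list of containers, each with an optional liveness check. It is a toy model, not kubernetes code: the `Container` class, the `pod_is_healthy` function, and the `database_up` flag are all invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Container:
    name: str
    # Hypothetical stand-in for a liveness probe: returns True when the check passes.
    liveness_probe: Optional[Callable[[], bool]] = None

    def is_healthy(self) -> bool:
        # A container with no probe is never marked unhealthy by a probe,
        # even if the process inside it is failing.
        if self.liveness_probe is None:
            return True
        return self.liveness_probe()


def pod_is_healthy(containers: List[Container]) -> bool:
    # A pod counts as healthy only if every one of its containers does.
    return all(c.is_healthy() for c in containers)


database_up = False  # assumption for the example: the old database is shut down

api = Container("api", liveness_probe=lambda: True)
router_without_probe = Container("router")
router_with_probe = Container("router", liveness_probe=lambda: database_up)

print(pod_is_healthy([api, router_without_probe]))  # True: the failure goes unnoticed
print(pod_is_healthy([api, router_with_probe]))     # False: the failure is visible
```

In this model, the router without a probe hides its failure, which is roughly the situation during Phase 1.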

If the routers had liveness probes, those probes would have started failing when I shut down the old database, which would have caused kubernetes to mark all of our site and api pods as unhealthy, which would have caused our web servers to stop sending traffic to them, which would have been immediately detectable.

Kubernetes containers can also have a readiness probe, which is defined the same way as a liveness probe, but which is used to determine when the container, and by extension its pod, is ready to do work.

The router containers did have a readiness probe, which did fail because the database service was gone. This meant that no api server pod that kubernetes started would ever be considered ready to receive requests.

Phase 2: 4:10am to 4:40am Nov 27th

Impacted users see gradually slower requests, and more frequent timeouts and 503 status responses

Our services tend to see increased traffic during the European and American afternoons, and decreased traffic between those times. Overnight, reduced traffic caused our pods to be automatically scaled down, but as European developers got to work at 4:00am PST (12 or 1pm in Europe), kubernetes could not start more: every pod it attempted to start failed its readiness probes.

The few pods which had continued to run the whole night until this point began fielding more and more requests, more and more slowly, using more and more memory, sometimes even being killed and restarted for exceeding their memory limits.
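The effect on the surviving pods can be sketched with some simple arithmetic. The Python below is a toy model only: the function names and every number in it (3 pods, 12 wanted pods, 1200 requests) are made up for illustration and are not measurements from this incident.

```python
def scale_up(ready_pods: int, wanted_pods: int, new_pods_pass_readiness: bool) -> int:
    """Return how many pods are ready after an attempted scale-up."""
    if wanted_pods <= ready_pods:
        return ready_pods  # nothing to start; scale-down is not modeled here
    # New pods only count once they pass their readiness probes.
    return wanted_pods if new_pods_pass_readiness else ready_pods


def requests_per_ready_pod(total_requests: int, ready_pods: int) -> float:
    """Average load carried by each pod that is able to serve traffic."""
    if ready_pods <= 0:
        raise ValueError("no ready pods to serve traffic")
    return total_requests / ready_pods


overnight_pods = 3       # hypothetical pod count left after the overnight scale-down
morning_requests = 1200  # hypothetical morning request volume
wanted_pods = 12         # hypothetical target for that volume

healthy = scale_up(overnight_pods, wanted_pods, new_pods_pass_readiness=True)
stuck = scale_up(overnight_pods, wanted_pods, new_pods_pass_readiness=False)

print(requests_per_ready_pod(morning_requests, healthy))  # 100.0
print(requests_per_ready_pod(morning_requests, stuck))    # 400.0
```

With these invented numbers, each surviving pod carries four times its intended load when new pods cannot become ready, which matches the shape of the slowdown described above.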

Phase 3: 4:40am to 5:09am Nov 27th

Impacted users see extremely low availability

At this point, automated alerts were sent to the engineering team, eventually including me. I woke, diagnosed the problem, and removed the unneeded routers from the definition of our pods.

Phase 4: 5:09am to 5:15am Nov 27th

Impacted users see reduced availability, and more frequent timeouts and 503 status responses

As the changes took effect, new pods were rapidly started, returning availability to normal and ending the incident.

Work done in response to incident

Our site and api pods include query routers for the new database system we have migrated to. The containers of those routers, and all similar sidecar containers across our infrastructure, now have both liveness and readiness probes defined.
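A check like the one described above could be automated. The following Python is a minimal sketch, not the tooling we actually use: it assumes a pod spec has already been loaded into a dictionary shaped like a kubernetes manifest, where each container may carry `livenessProbe` and `readinessProbe` fields. The function name, container names, probe path, and port are all hypothetical.

```python
from typing import Dict, List

# Field names as they appear on a container in a kubernetes pod spec.
REQUIRED_PROBES = ("livenessProbe", "readinessProbe")


def containers_missing_probes(pod_spec: Dict) -> Dict[str, List[str]]:
    """Map each container name to the probe fields it does not define."""
    missing: Dict[str, List[str]] = {}
    for container in pod_spec.get("containers", []):
        absent = [probe for probe in REQUIRED_PROBES if probe not in container]
        if absent:
            missing[container["name"]] = absent
    return missing


# Hypothetical pod spec: the router sidecar has a readiness probe only.
example_pod_spec = {
    "containers": [
        {
            "name": "api",
            "livenessProbe": {"httpGet": {"path": "/healthz", "port": 3000}},
            "readinessProbe": {"httpGet": {"path": "/healthz", "port": 3000}},
        },
        {
            "name": "query-router",
            "readinessProbe": {"exec": {"command": ["true"]}},
        },
    ]
}

print(containers_missing_probes(example_pod_spec))
# {'query-router': ['livenessProbe']}
```

A container that shows up in the result is one whose failure could go unnoticed in the way described in Phase 1.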

I switched my alert notification sound to a louder one, and made sending yourself a test notification - and confirming that it would be loud enough to wake you up - part of our internal checklist for going on-call.

Posted Dec 03, 2019 - 15:10 PST

Resolved

From roughly 4:40 to 5:10 AM Pacific Standard Time, a large percentage of requests to expo.io failed. A fix was deployed and traffic has returned to normal.

Posted Nov 27, 2019 - 05:22 PST

This incident affected: Website and Dev Tools API.