Degraded service on servers

Incident Report for Expo

Resolved

This incident has been resolved.
Posted Jun 24, 2020 - 06:31 PDT

Update

The website and service outage has been resolved with a more robust load balancing policy, more server replicas, and a new replication policy. The issue was primarily fixed at 6:10 AM PDT (2020-06-24).

Cause: Requests arrived faster than the servers could process them, exhausting server resources. The resulting backlog of requests slowed the servers down further.
Solutions: Requests to the API server, including those for server-side rendering of the website, go through a load balancer that sends traffic to the servers with the most capacity. We have also deployed more server replicas to add capacity, and adjusted the replication policy to scale based on memory usage.
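
As a rough illustration of the two policies described above, here is a minimal Python sketch. It is not our production configuration: the Replica type, the in-flight request counts, the 60% memory target, and both function names are hypothetical, chosen only to show "route to the server with the most capacity" and "scale the replica count on memory usage" in a few lines. The scaling rule is a simple proportional one, similar in shape to what common autoscalers use.

```python
import math
from dataclasses import dataclass


@dataclass
class Replica:
    name: str           # hypothetical replica identifier
    in_flight: int      # requests currently being handled
    max_in_flight: int  # assumed per-replica concurrency limit

    @property
    def spare_capacity(self) -> int:
        return self.max_in_flight - self.in_flight


def pick_replica(replicas: list[Replica]) -> Replica:
    """Return the replica with the most spare capacity."""
    return max(replicas, key=lambda r: r.spare_capacity)


def desired_replicas(current: int, memory_percent: int, target_percent: int = 60) -> int:
    """Proportional rule: scale the replica count with memory usage.

    memory_percent is the average memory usage across replicas (0-100);
    target_percent is an assumed target, not a value from this report.
    """
    return max(1, math.ceil(current * memory_percent / target_percent))


replicas = [
    Replica("api-a", in_flight=80, max_in_flight=100),
    Replica("api-b", in_flight=20, max_in_flight=100),
    Replica("api-c", in_flight=50, max_in_flight=100),
]
chosen = pick_replica(replicas)            # api-b has the most headroom
scaled = desired_replicas(4, 90)           # 4 replicas at 90% memory
```

In this sketch, a new request goes to api-b because it has the most headroom, and four replicas averaging 90% memory usage against a 60% target would be scaled up to six.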
Posted Jun 24, 2020 - 06:30 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jun 24, 2020 - 06:17 PDT

Update

We are continuing to investigate this issue.
Posted Jun 24, 2020 - 06:08 PDT

Investigating

We are currently investigating the issue.
Posted Jun 24, 2020 - 06:06 PDT

Update

We are continuing to work on a fix for this issue.
Posted Jun 23, 2020 - 22:07 PDT

Identified

We are investigating.
Posted Jun 23, 2020 - 22:07 PDT

This incident affected: Website, Dev Tools API, Classic Update Service (Application Serving API), and Push Notifications Broker.