For two four-hour periods, Goto was unavailable due to a series of issues in our ride management system. In this postmortem, we'll go over what happened and what we're putting in place to prevent this issue from occurring again.
Goto is designed to support areas with poor data coverage. It's expected that the driver app will be unable to reach the internet at times during a job, so the app automatically stores data locally and syncs it with our servers when it reconnects.
If there are errors syncing, the app will wait a few seconds before retrying. Subsequent failures increase the wait period to prevent devices from overloading the servers.
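This retry behaviour is a standard exponential backoff. As a rough sketch (the function names, attempt limit, and delay values here are illustrative assumptions, not the app's actual parameters):

```python
import random
import time

def sync_with_backoff(sync, max_attempts=5, base_delay=2.0):
    """Retry `sync` with an exponentially increasing wait between failures.

    `max_attempts` and `base_delay` are illustrative values; the real
    app's tuning is not described in this post.
    """
    for attempt in range(max_attempts):
        try:
            return sync()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # The wait grows with each failure (2s, 4s, 8s, ...), plus
            # random jitter so many devices don't retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
```

The jitter is worth calling out: without it, devices that failed at the same moment all retry at the same moment, recreating the original spike.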
To prevent new data from overwriting old data, before anything is modified, a request to sync the driver app with our servers must first acquire a lock. Only one request can hold the lock at any one time. If multiple requests are received, the first is granted the lock, and the following requests must wait for the lock to be released before they can be processed.
Should any request take longer than 30 seconds, it is automatically killed to prevent resources from being drained.
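The lock-plus-timeout mechanism can be sketched roughly as follows. This is a minimal single-process illustration, assuming one lock per ride; the names (`ride_lock`, `handle_sync_request`) are hypothetical, and a real server would also kill a request that runs long after acquiring the lock, which this sketch omits:

```python
import threading

SYNC_TIMEOUT = 30  # seconds; requests exceeding this are killed

# Illustrative: one lock guarding a ride's data. Only one sync
# request may modify the ride at a time.
ride_lock = threading.Lock()

def handle_sync_request(apply_changes, timeout=SYNC_TIMEOUT):
    """Wait in the queue for the ride's lock; give up at the timeout."""
    if not ride_lock.acquire(timeout=timeout):
        raise TimeoutError("sync request killed: lock not acquired in time")
    try:
        return apply_changes()
    finally:
        ride_lock.release()
```

Note the failure mode this design implies: if one request holds the lock for close to 30 seconds, every queued request behind it burns most of its own timeout just waiting, which is exactly what happened in this incident.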
The Goto API, which powers our mobile apps, was unavailable due to a surge in requests caused by a bug in the retry mechanism of the Android app.
- Emergency updates are in progress to fix the bug in our retry mechanism. This post will be updated once they're available to download.
- Requesting rides has been temporarily disabled until these updates are out.
- Billing for all affected rides has been disabled. No passengers will be billed; drivers, however, will still receive the usual fare.
- Our monitoring system has been adjusted to make it more sensitive so the team is alerted to errors faster.
Sept 9, 19:39:17
Systems are functioning normally, with several rides in progress.
Sept 9, 19:39:49
A driver app makes two simultaneous requests to sync with our servers. The first request (A) is granted the lock and is processed as usual. The second request (B) is queued to wait for the lock to be released before it can be processed.
Sept 9, 19:39:50
A third request (C) is received, which is also queued. At this point there are two requests waiting to acquire a lock (B, C), with one request being actively processed.
Sept 9, 19:40:16
Request A completes, having taken 26120ms (26 seconds, considerably longer than usual). Requests B and C are still waiting to acquire the lock, with further requests now queuing behind them.
Sept 9, 19:41:20
The first queued request hits our 30s timeout limit and is killed by the server. Instead of waiting, a bug in our Android app causes the request to be retried immediately, while also queuing it to retry again in a few seconds.
Due to this bug, each failed request gets retried twice, overloading our servers with requests.
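Reconstructed from the description above, the buggy failure handler looked something like this sketch (the names `on_request_failed` and `retry_queue` are assumptions for illustration, not the app's actual code):

```python
import queue

# Retries scheduled to be sent again after a backoff delay.
retry_queue = queue.Queue()

def on_request_failed(request, send):
    """Buggy failure handler, reconstructed from the incident description.

    Intended behaviour: queue a single delayed retry. The bug additionally
    fired an immediate retry, so every failure produced two retries and the
    request volume compounded until the servers were saturated.
    """
    send(request)             # BUG: immediate retry that shouldn't be here
    retry_queue.put(request)  # intended path: retry again after a delay
```

Since the immediate retry also fails against an overloaded server, it too triggers this handler, which is why load grew geometrically rather than linearly.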
Sept 9, 19:41:50
Each retry attempt compounds the issue, hitting a peak of three requests per second per ride. Resources are immediately maxed out, with hundreds of requests waiting for a lock.
Sept 9, 19:46:23
Our monitoring system detects the failed requests and issues an alert to our on-call team. This alert was not as quick as we'd have liked, and the monitoring has since been adjusted to be more sensitive so the team is alerted faster.
Sept 9, 20:11:10
We identify the issue causing requests to time out, and deploy a fix which should have prevented the driver app from attempting to sync the ride.
As a temporary precaution, we disable placing requests via the rider app.
Sept 9, 20:12:54
Our fix is deployed; however, it doesn't work, and the driver app continues attempting to sync the ride. This was caused by a separate bug, which has since been fixed.
Sept 9, 20:35:44
We identify and deploy a workaround, disabling the requirement for each sync request to acquire a lock. Failed requests start to process successfully, and tail off over the next hour.
Sept 9, 21:30:04
Requests suddenly spike, and start failing again (even though a lock is no longer required). To help cope with demand we add three more servers to the fleet. While requests to the sync server continue to fail, regular requests are now being served successfully.
Failed requests start to gradually fall off, as the server stabilizes.
Sept 9, 23:10:01
Failed requests return to normal.
The requirement for a request to acquire a lock is reinstated.
We believed this was the end of the incident, and continued monitoring into the early morning without further problems.
Sept 10, 07:10:43
Failed requests suddenly pick up, with the driver app from the previous night starting to send hundreds of requests again.
Sept 10, 07:24:26
Our on-call team is alerted and a company-wide conference call begins. Our iOS and Android teams both start work on an emergency update to fix the issues of the previous night.
Sept 10, 08:01:51
A firewall rule is deployed to prevent any requests from the malfunctioning driver app from hitting our servers. This mitigation is effective, but it prevents the team from analysing the errors, so it could only be used once the root cause had been determined.
Sept 10, 08:02:00
Requests return to normal. Our dev teams continue to work on an emergency update, to be released this afternoon.