Last week, we experienced an outage that left our service repeatedly unavailable during a 4-hour window. During our post-mortem investigation we identified several shortcomings in the way we handled the incident. We would like to share them with you and present the remedies we’re working on right now.
1. Insufficient Monitoring & Diagnostic Tools.
While we were able to respond as soon as our monitoring service detected the incident, we did not immediately understand the cause of the failure. We went by the book and ran through our usual procedures, but our actions turned out to be ineffective. Better monitoring and diagnostic capabilities would have helped us understand and address the issue properly, both at the time of the incident and later during the post-mortem analysis. We’re now incorporating more metrics into a system dashboard to help us monitor the health of our service and pinpoint failures more quickly.
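To give a flavor of the kind of checks we’re adding to the dashboard, here is a minimal sketch: a handful of service metrics compared against alert thresholds. The metric names and threshold values are purely illustrative, not our actual configuration.

```python
# Illustrative health-check sketch: flag any metric that exceeds its
# alert threshold. Names and thresholds here are hypothetical.

THRESHOLDS = {
    "error_rate": 0.05,        # fraction of requests returning 5xx
    "p95_latency_ms": 800,     # 95th-percentile response time
    "upload_queue_depth": 50,  # pending large-file uploads
}

def unhealthy_metrics(metrics):
    """Return the subset of metrics that exceed their alert threshold."""
    return {
        name: value
        for name, value in metrics.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    }

# A saturated server shows up immediately: here, error rate and latency
# are over threshold while the upload queue is still fine.
sample = {"error_rate": 0.12, "p95_latency_ms": 2300, "upload_queue_depth": 7}
print(unhealthy_metrics(sample))
```

Simple threshold checks like this are easy to reason about during an incident; the point is to surface the failing dimension at a glance rather than digging through logs.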
2. Underpowered Backup Database Server.
We experienced a number of very short subsequent outages in the following hours, caused by the switch to our backup database server. Some of the improvements we had recently made to our primary database had not yet been ported to the backup server (we had intentionally staged the upgrade), and as we started relying on it, old, known deficiencies resurfaced. While not critical, this kept us on our toes much more than we would have liked. We’re now moving ahead with the upgrade.
3. Inadequate Downtime Notification.
During the incident, most users got a blank page after a long timeout, while only a few saw the correct “service unavailable” error message. While we work to minimize downtime altogether, we realize it’s important to properly inform users when an incident happens. A timeout is simply not acceptable. We’re going to address this by adding more ways to serve the downtime notice, upstream in our infrastructure.
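As a sketch of what “upstream” could mean here, assuming an nginx reverse proxy sits in front of the application servers (the proxy software, paths, and upstream name below are illustrative assumptions, not a description of our actual setup), the proxy itself can serve a static notice whenever the backend is unreachable or slow:

```nginx
# Hypothetical sketch: serve a static downtime notice from the proxy,
# so users see it even when every application server is down.
upstream app_backend {
    server 10.0.0.1:8000;  # placeholder application server
}

server {
    listen 80;

    location / {
        proxy_pass http://app_backend;
        # Fail fast instead of leaving users on a blank page.
        proxy_connect_timeout 5s;
        proxy_read_timeout 30s;
        # Replace backend error responses with our own page.
        proxy_intercept_errors on;
    }

    # Map gateway failures and timeouts to the downtime notice.
    error_page 502 503 504 /downtime.html;
    location = /downtime.html {
        root /var/www/maintenance;
        internal;
    }
}
```

The key property is that the notice does not depend on the application tier being healthy: a timed-out or refused backend connection becomes a proper “service unavailable” page rather than a hang.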
Regarding the root cause of the incident, everything points to saturation of our servers under high load, compounded by several high-traffic forms uploading large files (>10 MB). There may have been another, still unidentified factor, as we’ve handled similar loads before and load testing indicated that we should have been fine. We’re counting on our improved monitoring to help us handle any recurrence properly, as well as to guide our ongoing capacity planning.
This was our worst incident in 2 years. We apologize for the downtime and the inconvenience caused, and are immensely grateful for the patience shown and the support we received. Please let us know if you have any feedback or comments regarding this issue.