Service outage post-mortem – February 5, 2016

Today, FormAssembly multi-tenant services were unavailable for approximately 36 minutes. Specifically, both form management and form availability were substantially, if not wholly, disrupted. Our initial response was to ensure that customer data was not at risk and to verify the scope of the outage. Then, we began our investigation. Once we identified the problem, we remediated it and restored FormAssembly service at 11:26 ET, with complete restoration of the affected systems at 12:28 ET.
We apologize for this prolonged, unexpected downtime. We work hard to ensure our availability for all customers, because we know how important FormAssembly is to people and businesses. We’d like to take a moment to explain what happened and let you know what we’re doing to avoid this in the future.

What Happened

At 10:54 AM ET on Friday, February 5th, 2016, PagerDuty alerted the FormAssembly staff to a response failure involving the FormAssembly multi-tenant system. We were alerted because our monitoring systems had detected that FormAssembly was returning an HTTP 503 response code. Upon inspection, we determined that service was partially disrupted: the classic version of FormAssembly was unavailable and was displaying a message regarding a missing configuration file. The issue appeared to be isolated to the resources that power classic FormAssembly, as the latest version of FormAssembly remained available for those customers who have migrated away from the classic version.
The FormAssembly service relies upon distributed systems that require accurate timekeeping. Our investigation revealed that this outage was due to one of those distributed systems, a replicated filesystem, having reduced availability. Specifically, the clock on one server participating in the replicated filesystem was skewed relative to the clocks on the other servers.
This caused two problems. First, the skew reduced the consistency of the replicated volume. Second, the inconsistency among the systems caused the overall system to eject a server from participation in the replicated volume, increasing the load on the remaining servers. With fewer resources available, the remaining servers were overloaded: monitoring showed a load spike, and the entire volume became unavailable.
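To make the second problem concrete, here is a minimal sketch of how ejecting one server can push the remaining servers past their capacity. Every figure in it (three servers, 1,000 requests per second of capacity each, 2,100 requests per second of total load) and the helper name `load_per_server` are hypothetical and chosen only for illustration; they are not measurements from this incident.

```python
def load_per_server(total_load, healthy_servers):
    """Spread the total load evenly across the servers still in the volume."""
    if healthy_servers <= 0:
        raise ValueError("at least one healthy server is required")
    return total_load / healthy_servers


# Hypothetical figures, not taken from the incident.
CAPACITY_PER_SERVER = 1000  # requests per second one server can sustain
TOTAL_LOAD = 2100           # requests per second across the whole volume

before_ejection = load_per_server(TOTAL_LOAD, 3)  # all three servers participating
after_ejection = load_per_server(TOTAL_LOAD, 2)   # one server ejected

overloaded_before = before_ejection > CAPACITY_PER_SERVER
overloaded_after = after_ejection > CAPACITY_PER_SERVER
```

Under these assumed numbers, each server carries 700 requests per second while all three participate, which is within capacity. After one is ejected, each remaining server is asked to carry 1,050, which exceeds it.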
Once we identified this cause, we were able to restore service at 11:26 ET, with a complete restoration of the replicated volume at 12:28 ET. No customer data was present on the affected volume or on the servers providing it. Because of this, no customer data in our possession was at risk from, or affected by, this issue.
Unfortunately, FormAssembly can only control data that we have in our possession. Consequently, if a form respondent attempted to submit a response during this 36-minute window, it is possible that the response was neither processed nor saved by FormAssembly. If your account has not migrated to the latest version of FormAssembly, then it is highly likely that any responses submitted during this 36-minute window were not processed.

What Will Be Done

This incident was due to our failure to have appropriate processes in place for timely notification of degraded availability and consistency in our distributed systems. Additionally, one of our systems was not appropriately monitored for clock skew.
To avoid this in the future, we are updating our internal processes to ensure that clock skew is monitored appropriately in all cases. We also have improvements to make in ensuring that our distributed systems are operating at full effectiveness.
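As an illustration of what a clock skew check might look like, the sketch below compares clock readings collected from several servers and flags any server whose clock differs from the group median by more than a threshold. It is only a sketch under stated assumptions and does not describe FormAssembly's actual monitoring. The function name `find_skewed_hosts`, the host names, the 0.5 second threshold, and the sample timestamps are all hypothetical, and the readings are assumed to have been collected at effectively the same instant.

```python
from statistics import median


def find_skewed_hosts(clock_readings, max_skew_seconds=0.5):
    """Return {host: offset_seconds} for hosts whose clock is too far from the median.

    clock_readings maps a host name to the epoch time (in seconds) that host
    reported. The median is used as the reference so that a single bad clock
    cannot drag the reference toward itself.
    """
    if not clock_readings:
        return {}
    reference = median(clock_readings.values())
    return {
        host: reading - reference
        for host, reading in clock_readings.items()
        if abs(reading - reference) > max_skew_seconds
    }


# Hypothetical readings: the third host is several seconds ahead of the others.
readings = {
    "fs-node-1": 1454687640.02,
    "fs-node-2": 1454687640.05,
    "fs-node-3": 1454687647.90,
}
skewed = find_skewed_hosts(readings)
```

With these sample readings, only `fs-node-3` is flagged. A monitoring job could run a check like this on a schedule and page the on-call staff whenever the result is non-empty.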
