The following is the incident report for the FormAssembly Enterprise outage that occurred on April 18, 2017. We understand this service issue impacted our valued Enterprise clients, and we apologize to everyone who was affected. No data was lost during this incident, though users were unable to access forms.
From 11.08 AM to 12.56 PM ET, attempts to access FormAssembly resulted in errors, and users were redirected to an error page. The issue affected a portion of Enterprise clients. The root cause was an invalid configuration change that triggered a lockout of the database and application servers.
All impacted clients have been notified via email.
Timeline (all times Eastern Time)
11.02 AM: Configuration push begins
11.08 AM: Outage begins
11.08 AM: Alerts sent to teams
11.45 AM: Problem identified
12.04 PM: Successful configuration change rollback
12.10 PM: Server restarts begin
12.30 PM: Software and configuration syncs begin
12.56 PM: All services restored and 100% back online
At 11.02 AM ET, a configuration change was inadvertently released to our production environment without first being tested. The change increased the number of database connections from the application servers to the database servers fourfold. This flooded the database servers with an unusually high number of connection requests. Because the database servers were configured to allow only a limited number of connections, they began sending “number of allowed connections exceeded” errors back to the application servers. The application servers in turn started queueing requests and eventually ran out of physical and swap memory. All servers then began to hang and lock up.
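For illustration only, the arithmetic behind this kind of failure can be sketched as follows. The server count, per-server connection counts, and database limit are hypothetical values chosen for the example, not figures from the incident, and the function names are likewise made up for this sketch.

```python
def total_connection_demand(app_servers: int, connections_per_server: int) -> int:
    """Worst-case number of connections the application tier can open at once."""
    return app_servers * connections_per_server


def fits_within_limit(app_servers: int, connections_per_server: int,
                      db_max_connections: int) -> bool:
    """True if the application tier can never exceed the database limit."""
    demand = total_connection_demand(app_servers, connections_per_server)
    return demand <= db_max_connections


# Hypothetical values, for illustration only.
APP_SERVERS = 10
CONNECTIONS_BEFORE = 20                      # per application server
CONNECTIONS_AFTER = CONNECTIONS_BEFORE * 4   # the fourfold increase
DB_MAX_CONNECTIONS = 300

before_ok = fits_within_limit(APP_SERVERS, CONNECTIONS_BEFORE, DB_MAX_CONNECTIONS)
after_ok = fits_within_limit(APP_SERVERS, CONNECTIONS_AFTER, DB_MAX_CONNECTIONS)

print("before change fits:", before_ok)
print("after change fits:", after_ok)
```

With these assumed numbers, the original setting stays under the database limit (200 of 300 connections), while the fourfold setting asks for 800 and is rejected by the database.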
Resolution and recovery
At 11.08 AM ET, our monitoring and alerting systems alerted our team, who investigated and quickly escalated the issue. At 11.45 AM, the incident response team traced the issue to a database configuration change on the application servers.
At 12.04 PM ET, we successfully rolled back the configuration change.
Some of the application servers started to recover, but we determined that overall recovery would be faster if we restarted all application and database servers. We restarted the servers gradually to avoid possible cascading failures from a wide-scale restart. By 12.56 PM, all servers had been restarted, configuration files had been synced with the repository, and all services were fully restored.
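A gradual restart of this kind can be sketched as a batched rollout that stops as soon as a batch fails to come back healthy. This is a minimal sketch under stated assumptions, not the procedure used during the incident: the batch size, the `restart` and `is_healthy` callables, and the server names are all hypothetical.

```python
from typing import Callable, Iterator, List


def batches(servers: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size servers."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(servers), batch_size):
        yield servers[start:start + batch_size]


def rolling_restart(servers: List[str], batch_size: int,
                    restart: Callable[[str], None],
                    is_healthy: Callable[[str], bool]) -> List[str]:
    """Restart servers one batch at a time; halt if a batch stays unhealthy.

    restart and is_healthy are hypothetical hooks supplied by the caller.
    """
    restarted: List[str] = []
    for batch in batches(servers, batch_size):
        for server in batch:
            restart(server)
        unhealthy = [s for s in batch if not is_healthy(s)]
        if unhealthy:
            # Stop here so a bad restart cannot spread to the remaining servers.
            raise RuntimeError(f"halting restart, unhealthy servers: {unhealthy}")
        restarted.extend(batch)
    return restarted


# Example run with stand-in hooks that only record what would happen.
restart_log: List[str] = []
completed = rolling_restart(
    ["app1", "app2", "app3", "db1", "db2"],  # hypothetical server names
    batch_size=2,
    restart=restart_log.append,
    is_healthy=lambda server: True,
)
print(completed)
```

Checking health after each batch is what limits the blast radius: only one batch is ever out of service at a time, and a failed batch leaves the untouched servers running.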
Corrective and Preventative Measures
In the last two days, we’ve conducted an internal review and analysis of the outage. The following are actions we are taking to address the underlying causes of the issue and to help prevent recurrence.
- Enhance current configuration release management to prevent accidental direct release to production. (Completed.)
- Set the database servers' allowed connection limit so that it is always higher than the maximum number of connections allowed from the application servers.
- Integrate database connection monitoring with our pager alert mechanism.
- Add a faster rollback mechanism and improve the restore process, so any future problems of this type can be corrected quickly.
- Develop a better mechanism for quickly delivering status notifications during incidents.
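As one illustration of the connection monitoring item above, a monitor might compare the current connection count against the database limit and decide whether to warn or page. This is a hedged sketch only: the function name, the 70% and 90% thresholds, and the sample numbers are assumptions made for the example and do not describe our actual alerting configuration.

```python
def connection_alert_level(current_connections: int, max_connections: int,
                           warn_ratio: float = 0.7, page_ratio: float = 0.9) -> str:
    """Classify database connection usage as "ok", "warn", or "page".

    The default thresholds are hypothetical and would need tuning per system.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    utilization = current_connections / max_connections
    if utilization >= page_ratio:
        return "page"   # close to the limit: wake someone up
    if utilization >= warn_ratio:
        return "warn"   # elevated usage: notify without paging
    return "ok"


# Hypothetical readings against an assumed limit of 300 connections.
for reading in (100, 240, 285):
    print(reading, connection_alert_level(reading, 300))
```

Paging before the limit is reached, rather than when connections start being refused, gives responders time to act while the application servers are still serving requests.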
FormAssembly is committed to continually and quickly improving our technology and operational processes to prevent outages. We appreciate your patience and again apologize for the impact to you, your users, and your organization. We thank you for your business and continued support.
– The FormAssembly Incident Response Team