Please see below a Split Service Advisory (SSA) regarding an incident that resulted in a 69-minute outage limited to the app.split.io web console on October 5, 2018. Split provides mission-critical services, and we treat any action that can cause a service degradation with the utmost sensitivity and priority. Our goal in this SSA is to explain our understanding of the cause of the disruption and to describe the corrective actions that we will be taking.
On October 5th, 2018 (UTC), a retired domain for the Split web console load balancer expired. As a result, visits to app.split.io intermittently served an error page between 06:42 and 07:51 UTC. No other Split services were impacted. The incident was resolved by updating Split’s domain name service to reinstate the retired domain.
Event Timeline and Detailed Customer Impact
All times are for 2018-10-05:
06:42 UTC - Pingdom identified a failure when attempting to reach app.split.io, and an alert was sent to the engineer on call. Site status was updated.
06:48 UTC - Engineer on call acknowledged alert.
06:56 UTC - Confirmed the outage was limited to the app.split.io web console, and that SDKs and data ingestion were unaffected.
07:07 UTC - Confirmed that the CDN service was observing a high rate of error responses when reaching the web server.
07:16 UTC - Confirmed that the web server was not generating corresponding errors.
07:18 UTC - Engineer on call escalated to a Split engineer with CDN expertise.
07:27 UTC - Observed a failure when connecting to the DNS record for the web console load balancer.
07:31 UTC - Engineer with CDN expertise came online.
07:45 UTC - Observed that the DNS record that the web console CDN pointed to was missing.
07:49 UTC - The DNS record was updated and the CDN was purged.
07:51 UTC - Pingdom auto-resolved the outage. Customer Impact: Incident resolved.
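The intervals implied by the timeline above can be worked out with a short sketch. The timestamps are copied from the timeline; the dictionary keys and the helper name `minutes_between` are illustrative labels chosen here, not terms used by Split.

```python
from datetime import datetime

# Timestamps copied from the timeline (all on 2018-10-05, UTC).
# The key names are illustrative labels, not official milestone names.
TIMELINE = {
    "alert_sent": "06:42",
    "alert_acknowledged": "06:48",
    "dns_record_updated": "07:49",
    "auto_resolved": "07:51",
}


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_acknowledge = minutes_between(
    TIMELINE["alert_sent"], TIMELINE["alert_acknowledged"]
)  # 6 minutes
time_to_fix = minutes_between(
    TIMELINE["alert_sent"], TIMELINE["dns_record_updated"]
)  # 67 minutes
time_to_resolve = minutes_between(
    TIMELINE["alert_sent"], TIMELINE["auto_resolved"]
)  # 69 minutes
```

Measured this way, the alert was acknowledged 6 minutes after it was sent, and the outage was auto-resolved 69 minutes after the first failed check.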
On August 24th, a configuration change was staged for the CDN serving static assets for the Split web console, so that it would begin retrieving those assets from a new domain. This change was not applied at that time, so the CDN continued to use the old domain. At some point between August 24th and October 5th, the DNS record for that old domain was removed. On October 5th, 2018 (UTC), the retired DNS record expired. Calls passing through the CDN then began failing because they could no longer reach the underlying record. The incident was resolved by reinstating the former DNS record so that the CDN could reach the static assets once again.
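As an illustration of the failure mode described above, the following is a minimal sketch of a pre-flight check that flags CDN origin hostnames that no longer resolve in DNS. It is not Split's tooling: the function name `unresolved_hosts`, the injectable `resolver` parameter, and any hostnames passed to it are assumptions made for this example. Only the standard-library call `socket.getaddrinfo` is used for the lookup.

```python
import socket


def unresolved_hosts(hostnames, resolver=socket.getaddrinfo):
    """Return the hostnames that fail DNS resolution.

    `resolver` defaults to socket.getaddrinfo and can be swapped for a
    stub in tests. Hypothetical helper, written for illustration only.
    """
    missing = []
    for host in hostnames:
        try:
            # Port 443 is an assumption: the origin is presumed to serve HTTPS.
            resolver(host, 443)
        except socket.gaierror:
            # Name resolution failed, e.g. the DNS record was removed.
            missing.append(host)
    return missing
```

A check like this could be run against the list of origins a CDN is configured to use before a DNS record is retired; a non-empty result would indicate that some configuration still depends on a record that does not resolve.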
The Split team would like to apologize to our customers for any impact you may have experienced as a result of today’s event. We value the trust you've placed in us, and will endeavor to improve on our processes, procedures, and systems.
For further questions, please contact firstname.lastname@example.org.