Partial outage of Split web console
Incident Report for Split
Postmortem

Dear Customer,

Please see below a Split Service Advisory (SSA) regarding an incident that resulted in a 69-minute outage limited to the app.split.io web console on October 5, 2018. Split provides mission-critical services, and we treat any action that can cause a service degradation with the utmost sensitivity and priority. Our goal in this SSA is to explain our understanding of the cause of the disruption and to describe the corrective actions that we will be taking.

Executive Background

On October 5th, 2018 (UTC), a retired domain for the Split web console load balancer expired. As a result, visits to app.split.io intermittently served an error page between 06:42 and 07:51 UTC. No other Split services were impacted. The incident was resolved by updating Split’s domain name service to reinstate the retired domain.

Event Timeline and Detailed Customer Impact

All times are for 2018-10-05:

  • 06:42 UTC - Pingdom identified a failure when attempting to reach app.split.io, and an alert was sent to the engineer on call. The site status was updated.

    • Customer Impact: The Split web console was unavailable, with visitors being served an error page.
  • 06:48 UTC - Engineer on call acknowledged alert.

  • 06:56 UTC - Confirmed the outage was limited to the app.split.io web console, and that SDKs and data ingestion were unaffected.

  • 07:07 UTC - Confirmed that the CDN service was observing a high rate of error responses when reaching the web server.

  • 07:16 UTC - Confirmed that the web server was not generating corresponding errors.

  • 07:18 UTC - Engineer on call escalated to a Split engineer with CDN expertise.

  • 07:27 UTC - Observed failure in connecting to the DNS record for the web console load balancer.

  • 07:31 UTC - Engineer with CDN expertise came online.

  • 07:45 UTC - Observed that the DNS record that the web console CDN pointed to was missing.

  • 07:49 UTC - The DNS record was updated and CDN purged.

  • 07:51 UTC - Pingdom auto-resolved the outage.

    • Customer Impact: Incident resolved.
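
The intervals implied by the timeline above can be checked with a short calculation. The following is a minimal Python sketch: the timestamps are copied from the timeline, while the event labels and the `minutes_between` helper are illustrative names, not part of any Split tooling.

```python
from datetime import datetime, timezone

# Timestamps copied from the timeline above (all 2018-10-05, UTC).
# The dictionary keys are illustrative labels, not Split terminology.
TIMELINE = {
    "detected": "06:42",
    "acknowledged": "06:48",
    "escalated": "07:18",
    "record_updated": "07:49",
    "resolved": "07:51",
}


def at(hhmm: str) -> datetime:
    """Parse an HH:MM timeline entry as a UTC datetime on 2018-10-05."""
    parsed = datetime.strptime(f"2018-10-05 {hhmm}", "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two labelled timeline events."""
    delta = at(TIMELINE[end]) - at(TIMELINE[start])
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    print("time to acknowledge:", minutes_between("detected", "acknowledged"))
    print("time to escalate:", minutes_between("detected", "escalated"))
    print("time to resolve:", minutes_between("detected", "resolved"))
```

With these timestamps, the alert was acknowledged 6 minutes after detection, escalation took place 36 minutes after detection, and Pingdom cleared the alert 69 minutes after detection.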

Technical Cause

On August 24th, a configuration change was staged for the CDN that serves static assets for the Split web console, so that the CDN would begin using a new domain to retrieve those assets. This change was not applied at that time. At some point between August 24th and October 5th, the old domain, which the CDN was still using, had its DNS record removed. On October 5th, 2018 (UTC), that retired DNS record expired. Requests passing through the CDN then began failing because the CDN could no longer reach the underlying record. The incident was resolved by reinstating the former DNS record so that the CDN could reach the static assets once again.

Remediations

  • Reinstate the DNS record to which the CDN was pointed.
  • Put in place appropriate DNS diagnostic tools and escalation procedures.
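
As an illustration of what a DNS diagnostic check might look like, the following is a minimal Python sketch, not Split's actual tooling. It assumes a list of hostnames that the CDN is configured to reach and reports any that no longer resolve. The function names, the `Resolver` type, and the example hostnames are all hypothetical; only `socket.getaddrinfo` is a real standard-library call.

```python
import socket
from typing import Callable, Dict, Iterable, List

# A resolver maps a hostname to its addresses and raises OSError on failure.
Resolver = Callable[[str], List[str]]


def system_resolver(hostname: str) -> List[str]:
    """Resolve a hostname using the operating system's resolver."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


def find_unresolvable(
    hostnames: Iterable[str], resolve: Resolver = system_resolver
) -> Dict[str, str]:
    """Return {hostname: reason} for every hostname that fails to resolve."""
    failures: Dict[str, str] = {}
    for hostname in hostnames:
        try:
            addresses = resolve(hostname)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            failures[hostname] = str(exc)
            continue
        if not addresses:
            failures[hostname] = "resolved to no addresses"
    return failures


if __name__ == "__main__":
    # Stubbed lookup table so the demo runs without network access.
    # Both hostnames are hypothetical placeholders.
    known = {"assets-new.example.com": ["192.0.2.10"]}

    def stub_resolver(hostname: str) -> List[str]:
        if hostname not in known:
            raise socket.gaierror("no such record")
        return known[hostname]

    origins = ["assets-new.example.com", "assets-old.example.com"]
    for host, reason in find_unresolvable(origins, stub_resolver).items():
        print(f"ALERT: {host} does not resolve: {reason}")
```

Run on a schedule against the hostnames in the CDN configuration, with the default `system_resolver`, a check of this kind could surface a removed record before customer traffic depends on it. Passing the resolver as a parameter keeps the check testable without network access.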

Conclusion

The Split team would like to apologize to our customers for any impact you may have experienced as a result of this event. We value the trust you’ve placed in us, and will endeavor to improve our processes, procedures, and systems.

For further questions, please contact support@split.io.

Posted Oct 19, 2018 - 14:04 PDT

Resolved
This incident has been resolved. On October 5th, 2018 (UTC) a retired domain for the Split web console load balancer expired. As a result, visits to app.split.io intermittently served an error page between 06:42 and 07:51 UTC. No other Split services were impacted. The incident was resolved by updating Split’s domain name service to reinstate the retired domain. A full post mortem to follow.
Posted Oct 05, 2018 - 00:51 PDT
Investigating
We are currently investigating this issue.
Posted Oct 04, 2018 - 23:42 PDT
This incident affected: Web Console.