Metrics Processing Failure
Incident Report for Split
Postmortem

Subject: Split Service Advisory - Metrics Processing Failure - 2018-04-05

Dear Customer,

This is a Split Service Advisory (SSA) regarding an incident that prevented Split from publishing metrics data between 2018-04-04 10:20 PST and 2018-04-04 18:53 PST. Split provides mission-critical services, and we treat any action that can cause a service degradation with the utmost sensitivity and priority. Our goal in this SSA is to explain our understanding of the cause of the disruption and to describe the corrective actions we have taken.

Incident Background

At 2018-04-04 10:20 PST, customers’ results metrics cards stopped processing event data. The job responsible for the statistical calculation stopped processing requests because of a single-character bug introduced into a log line. Split’s support and pre-sales teams flagged the issue to Split engineering at 13:30 PST, and a team of data engineers was assigned to investigate. A manual recalculation at 15:45 PST brought metrics cards up to date and resolved the immediate customer impact. A fix was pushed to our production instances by 2018-04-04 18:53 PST. There was no data loss, and all calculations have been updated to cover the outage period.

Event Timeline and Detailed Customer Impact

  • 2018-04-04 10:20 PST - Split’s statistical calculation stopped processing requests.
    • Impact - results metrics cards stopped receiving updates. Note that new experiments refresh on shorter time windows, while longer-running experiments refresh on longer ones.
  • 2018-04-04 13:00 PST - Multiple customers began inquiring through pre-sales and support channels about metrics not changing. Support and pre-sales teams began investigating.
  • 2018-04-04 13:30 PST - Initial triage complete and data engineering team assembled to begin investigation.
  • 2018-04-04 15:30 PST - Data engineering team was able to identify the root cause.
  • 2018-04-04 15:45 PST - Data engineering team triggered a manual refresh of all statistical calculations across Split’s customer base.
    • Impact - results metrics cards updated to reflect accurate data, resolving the immediate customer impact within three hours of the initial customer reports.
  • 2018-04-04 16:45 PST - Fixes were applied in our development environment. Began testing and deployment processes.
  • 2018-04-04 18:53 PST - Deployed production fixes.
  • 2018-04-04 19:00 PST - Validated fixes on production, resolving incident.
    • Impact - No customer data was lost.

Technical Cause

The job responsible for the statistical calculation stopped processing requests because of a single-character bug introduced into a log line. Our alerting is triggered from log captures. Because of a modification, however, the exception raised by this bug fired silently and escaped those log captures, so no alert was triggered and the issue went unnoticed for a prolonged period.
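As an illustration only, the following Python sketch shows how a one-character mistake in a log line can stop a job, and how recording the failure explicitly keeps it from passing unnoticed. It is not Split's code: the actual bug, language, and job structure are not described in this advisory, and the names `process_batch_buggy`, `process_batch_fixed`, `run_job`, and the logger name `calc_job` are hypothetical.

```python
import logging

logger = logging.getLogger("calc_job")  # hypothetical logger name


def process_batch_buggy(batch_id, events):
    total = sum(events)
    # Hypothetical one-character bug: "%d" is applied to a string batch id.
    # The message is formatted eagerly, so the TypeError is raised here
    # and the function never returns its result.
    logger.info("batch %d processed, total=%d" % (batch_id, total))
    return total


def process_batch_fixed(batch_id, events):
    total = sum(events)
    # "%s" accepts any batch id, and the arguments are handed to the logger
    # instead of being formatted inline.
    logger.info("batch %s processed, total=%d", batch_id, total)
    return total


def run_job(process, batches):
    """Run `process` on every batch and report failures explicitly."""
    results, failures = {}, []
    for batch_id, events in batches:
        try:
            results[batch_id] = process(batch_id, events)
        except Exception as exc:
            # Log with a traceback and keep a record that a monitor can
            # inspect, rather than letting the exception disappear.
            logger.exception("batch %s failed", batch_id)
            failures.append((batch_id, repr(exc)))
    return results, failures


if __name__ == "__main__":
    batches = [("exp-a", [1, 2, 3])]
    print(run_job(process_batch_buggy, batches))
    print(run_job(process_batch_fixed, batches))
```

In this sketch the buggy version produces no result for the batch and one recorded failure, while the fixed version produces the expected total. Returning the failure list alongside the results gives an alerting system something to check that does not depend on a log line being captured.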

Remediation

The Split team pushed a fix that re-enabled the statistical calculation. The events from the impacted period have been identified, and all metrics have been updated so that they are accurate for the entire outage period.

Our customers’ data, and their trust in our ability to analyze it, are our highest priority. Further, our customers should not notice issues before we do. We have taken three primary steps to ensure that processing issues are caught in near real-time by Split engineers, that the proper customer channels are updated, and that Split’s product more readily notifies customers of any processing delays.

  • (1) We have fixed our log captures so that they no longer silence this particular exception, and our data engineering team has added measures to catch failures in areas of our system beyond log lines, hardening our alert channels.
  • (2) We have drafted policies and procedures for our support team to follow so that Split’s status page at status.split.io is updated in near real-time once an issue impacting customers is identified.
  • (3) We have added a “last refresh date” to our near-term product roadmap, which will show customers on the page when a metric board was last updated.
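One way to catch a stalled calculation without relying on log lines is to alert on the age of the last successful refresh, which is also the value a “last refresh date” would display. The sketch below is a minimal illustration under stated assumptions, not a description of Split's implementation: the function `stale_metrics`, the one-hour threshold `MAX_REFRESH_AGE`, the metric names, and the timestamps are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: how long a metric may go without a refresh.
MAX_REFRESH_AGE = timedelta(hours=1)


def stale_metrics(last_refresh_by_metric, now, max_age=MAX_REFRESH_AGE):
    """Return the names of metrics whose last refresh is older than max_age."""
    return sorted(
        name
        for name, last_refresh in last_refresh_by_metric.items()
        if now - last_refresh > max_age
    )


# Made-up example data; the timestamps are illustrative only.
now = datetime(2018, 4, 4, 13, 0, tzinfo=timezone.utc)
last_refresh = {
    "conversion_rate": datetime(2018, 4, 4, 10, 20, tzinfo=timezone.utc),
    "page_load_time": datetime(2018, 4, 4, 12, 45, tzinfo=timezone.utc),
}

if __name__ == "__main__":
    print(stale_metrics(last_refresh, now))
```

A check of this kind fires whenever refreshes stop, regardless of why they stopped, so it complements log-based alerting rather than replacing it. In practice the threshold would need to reflect that different experiments refresh on different time windows.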

Conclusion

We apologize to our customers for any impact this event may have caused. We value the trust you've placed in us and will continue working to improve our processes, procedures, and systems.

For further questions, please contact support@split.io.

Posted Apr 05, 2018 - 11:44 PDT

Resolved
At 2018-04-04 10:20 PST, customers’ results metrics cards stopped processing event data. The job responsible for the statistical calculation stopped processing requests because of a single-character bug introduced into a log line. This issue has since been resolved.
Posted Apr 05, 2018 - 11:38 PDT