Dear Customer,
This is a Split Service Advisory (SSA) regarding an incident that prevented Split from publishing Split metrics data between 2018-04-04 10:20 PST and 2018-04-04 18:53 PST. Split provides mission-critical services, and we treat any action that can cause a service degradation with the utmost sensitivity and priority. Our goal in this SSA is to explain our understanding of the cause of the disruption and to describe the corrective actions we have taken.
Incident Background
At 2018-04-04 10:20 PST, customers’ results metrics cards stopped processing event data. The job responsible for the statistical calculation stopped processing requests because of a single-character bug introduced into a log line. Split’s support and pre-sales teams flagged the issue to Split engineering promptly at 13:30 PST, and a team of data engineers was deployed to investigate. A manual recalculation was run at 15:45 PST to update metrics cards and resolve the immediate issues. A fix was pushed to our production instances by 2018-04-04 18:53 PST. There was no data loss, and all calculations covering the outage period have been updated.
Event Timeline and Detailed Customer Impact
Technical Cause
The job responsible for the statistical calculation stopped processing requests because of a single-character bug introduced into a log line. Our alerting systems are triggered by captured log lines, but as a result of the modification the exception fired silently and escaped our log capture. Because no alert was triggered, the issue went unnoticed for a prolonged period.
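For readers who want a concrete picture of this class of failure, the following is a minimal, hypothetical sketch in Python. It is not Split's code: the logger name, function names, and log message are all assumptions made for illustration. It shows how one stray character in a log line can raise an exception before anything is logged, and how a handler that swallows the exception leaves log-based alerting with nothing to detect.

```python
import logging

# Hypothetical logger name; not taken from Split's systems.
logger = logging.getLogger("calc_job")


def run_calculation(events):
    """Hypothetical stand-in for a statistical calculation step."""
    total = sum(events)
    # Single-character bug: the stray '}' at the end of the format string
    # makes str.format raise ValueError before the logger is ever called.
    logger.info("computed total {} over {} events}".format(total, len(events)))
    return total


def worker_silent(events):
    """Failure mode: the exception is swallowed and no log record is written."""
    try:
        return run_calculation(events)
    except Exception:
        # Nothing is logged here, so alerting driven by log lines never fires.
        return None


def worker_loud(events):
    """Safer variant: the failure itself is logged at ERROR level."""
    try:
        return run_calculation(events)
    except Exception:
        logger.exception("calculation failed")
        return None
```

In this sketch, worker_silent returns without producing any log record, while worker_loud emits an ERROR record that a log-based alert could match.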
Remediation
The Split team pushed a fix that re-enabled the statistical calculation. The events from the impacted period have been identified, and all metrics have been updated so that they are accurate for the entire outage period.
Our customers’ data, and their trust in our ability to analyze it, are our highest priorities. Further, our customers should not notice issues before we do. We have taken three primary steps to uphold this promise: ensuring that any processing issues are caught in near real time by Split engineers, that the proper customer channels are updated, and that Split’s product more easily notifies customers of any processing delays.
Conclusion
We apologize for any impact you may have experienced as a result of this event. We value the trust you've placed in us and will endeavor to improve our processes, procedures, and systems.
For further questions, please contact support@split.io.