Impressions & events processing failure
Incident Report for Split
Postmortem

Subject: Split Service Advisory - Events Failure - 2018-10-05

Dear Customer,

Please see below a Split Service Advisory (SSA) regarding an incident that resulted in a 35-minute partial outage of our events API on October 5, 2018. Split provides mission-critical services, and we treat anything that can cause service degradation with the utmost sensitivity and priority. Our goal in this SSA is to explain our understanding of the cause of the disruption and to describe the corrective actions that we will be taking.

Executive Background

On October 5th, 2018, a failure in our autoscaling configuration and deployment script, combined with an increase in event traffic, led to the failure of our event service and the partial loss of event and impression data between 18:02 UTC and 18:37 UTC. During this window, metric impact reported in Split may have been incorrect; affected metrics corrected themselves as subsequent data arrived. At no point was your customers’ experience impacted.

Event Timeline and Detailed Customer Impact

2018-10-05:

  • 17:08 UTC - Significant increase in event volume.
  • 17:10 UTC - Kinesis starts pushing back on the event load (see the backoff sketch after this timeline).
  • 18:02 UTC - ‘Pingdom - Data Collection Services is Down’ alert. Event servers run out of memory and become unresponsive. Autoscaling attempts to restart the hosts. Team is alerted.
  • 18:10 UTC - Autoscaling group fails to deploy additional machines to the load balancer.
  • 18:12 UTC - Event servers are manually restarted.
  • 18:37 UTC - ‘Pingdom - Data Collection Services is up’ notification.

2018-10-06:

  • 01:21 UTC - Fix to the deployment script rolled out; servers restarted.
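
For illustration only (this is not Split's event pipeline code): "Kinesis pushing back" typically surfaces as ProvisionedThroughputExceededException errors on PutRecords calls, and a producer that retries only the rejected records with exponential backoff degrades gracefully instead of queuing unbounded data in memory. The stream name and retry limits below are hypothetical.

    import random
    import time

    import boto3
    from botocore.exceptions import ClientError

    kinesis = boto3.client("kinesis")

    def put_with_backoff(records, stream="events-stream", max_attempts=5):
        """Send a batch to Kinesis, retrying only the records the stream rejects.

        records: list of {"Data": bytes, "PartitionKey": str} dicts.
        The stream name and retry limits are illustrative, not Split's values.
        """
        pending = records
        for attempt in range(max_attempts):
            try:
                resp = kinesis.put_records(StreamName=stream, Records=pending)
                # Keep only the records Kinesis reported as failed.
                failed = [rec for rec, res in zip(pending, resp["Records"])
                          if "ErrorCode" in res]
            except ClientError as err:
                if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                    raise
                failed = pending  # whole batch throttled; retry everything
            if not failed:
                return []
            pending = failed
            # Exponential backoff with jitter before retrying the failed subset.
            time.sleep((2 ** attempt) * 0.1 + random.uniform(0, 0.1))
        return pending  # caller decides whether to buffer or drop the leftovers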

Technical Cause

Several technical causes were at play:

  • Our deployment script was pushed to production with a bug that caused it to fail.
  • Our event service autoscaling was configured to maintain the current number of instances, but not to scale out automatically when CPU or memory usage was high (see the sketch after this list).
  • Our event service lacked a defense mechanism for shedding load when incoming event volume is too high.
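
As an illustration of the autoscaling gap (this is not Split's actual infrastructure code), a target-tracking policy on the events Auto Scaling group would add instances automatically as average CPU rises. The group name and target value below are hypothetical, and scaling on memory would additionally require publishing a custom CloudWatch metric from the hosts.

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Hypothetical group name and target value, shown only to illustrate the idea.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="events-service-asg",
        PolicyName="events-cpu-target-tracking",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                # Add or remove instances to keep average CPU near the target.
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": 60.0,
        },
    )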

Remediations

  • Fix the deployment script (complete)
  • Review the process for promoting the deployment script to production to ensure sufficient testing
  • Fix the autoscaling policy to scale out appropriately when memory or CPU usage is high
  • Add rate-limiting capability to the events service (see the sketch after this list)
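
As a sketch of the idea behind the last item (not the implementation Split will ship), a token-bucket admission check at the edge of the events service rejects excess batches with an HTTP 429 instead of letting servers run out of memory. The class and the limits used below are illustrative only.

    import threading
    import time

    class TokenBucket:
        """Minimal token-bucket rate limiter; limits below are illustrative."""

        def __init__(self, rate_per_sec: float, capacity: float):
            self.rate = rate_per_sec      # sustained events per second
            self.capacity = capacity      # maximum burst size
            self.tokens = capacity
            self.updated = time.monotonic()
            self.lock = threading.Lock()

        def allow(self, cost: float = 1.0) -> bool:
            with self.lock:
                now = time.monotonic()
                # Refill for the elapsed time, capped at the bucket capacity.
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= cost:
                    self.tokens -= cost
                    return True
                return False  # caller responds with 429 / drops the batch

    # Hypothetical limits: bursts of up to 10,000 events, 2,000 events/sec sustained.
    limiter = TokenBucket(rate_per_sec=2000, capacity=10000)

Rejecting excess load early in this way keeps the event servers within their memory budget even under the kind of traffic spike seen in this incident.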

Conclusion

The Split team would like to apologize to our customers for any impact you may have experienced as a result of this incident. We value the trust you've placed in us, and we will endeavor to improve our processes, procedures, and systems.

For further questions, please contact support@split.io.

Posted Oct 11, 2018 - 18:02 PDT

Resolved
This incident has been resolved. We had a partial event and impression processing failure. As load grew, our autoscaling failed to grow the cluster adequately, which led to nodes being overwhelmed. We resolved the issue by manually bringing up additional nodes. A full postmortem will follow.
Posted Oct 05, 2018 - 11:37 PDT
Investigating
We are currently investigating this issue.
Posted Oct 05, 2018 - 11:02 PDT
This incident affected: Data Processing.