Delayed impressions webhook delivery

Incident Report for Split by Harness

Postmortem

Summary

Between 29 October 2025, 15:45 UTC and 30 October 2025, 03:45 UTC, customers using the Outgoing Webhook (Impressions) integration experienced delays in receiving impression data.
A small subset of customers hosted on Microsoft Azure may have stopped receiving impression events because their endpoints persistently failed to accept them. Those customers can still export the data from our platform.

All other services, including APIs and SDKs, remained fully operational.

Root Cause

During the Microsoft Azure outage, customers hosted on Azure experienced persistent delivery failures for their impression events. Our webhook delivery system repeatedly retried these failed deliveries, which increased its processing times and in turn delayed delivery to other customers whose destinations were not impacted.
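
To illustrate how retries against a failing destination can delay healthy ones, here is a toy model. It is a sketch under assumptions, not a description of our delivery system: the single shared FIFO worker, the cost values, and the names `Delivery` and `completion_times` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Delivery:
    destination: str
    healthy: bool


def completion_times(queue, ok_cost=1, timeout_cost=10, max_attempts=3):
    """Return (destination, finish_time) for one shared FIFO worker.

    All costs are illustrative time units, not measured values.
    """
    clock = 0
    finished = []
    for delivery in queue:
        # A healthy destination answers on the first attempt. A failing one
        # consumes a full timeout on every attempt before the worker moves on.
        if delivery.healthy:
            clock += ok_cost
        else:
            clock += timeout_cost * max_attempts
        finished.append((delivery.destination, clock))
    return finished


queue = [
    Delivery("failing", healthy=False),
    Delivery("healthy-a", healthy=True),
    Delivery("healthy-b", healthy=True),
]

# Shared queue: healthy destinations wait behind the retries.
shared = completion_times(queue)

# Failing destination paused: healthy destinations are served immediately.
isolated = completion_times([d for d in queue if d.healthy])
```

In this model, pausing the failing destination removes the retry time from the shared queue, which matches the recovery we observed once deliveries to the failing destinations were paused.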

We initially chose to wait before stopping deliveries to the affected Azure customers. Once we confirmed the issue’s scope, we temporarily paused deliveries to the failing destinations, allowing delivery for all other customers to recover quickly.

Customers whose deliveries were paused can still retrieve their impression events for the impact window by exporting them directly from our platform. We also resumed sending events to all of our customers right after the Azure outage was mitigated.

Timeline

Time (UTC)               Event Description
29/10/2025 15:45:00 UTC  Azure goes down; internal monitoring alerts on delays in delivering webhook events to customers.
29/10/2025 16:15:00 UTC  On-call paged.
29/10/2025 16:55:00 UTC  Root cause identified as an ongoing Azure outage.
29/10/2025 19:42:00 UTC  Signs of recovery in Azure system; delay starts to decrease, and response times improve.
29/10/2025 22:40:00 UTC  Blocked 100% of webhook traffic to one failing customer to speed recovery.
30/10/2025 01:15:00 UTC  Azure incident declared fully remediated.
30/10/2025 03:45:00 UTC  Resumed delivering webhook impression events to all customers; internal monitoring confirmed restoration of real-time delivery. Incident closed.

Remediation

We initially waited for the affected Azure-hosted customers to recover as the Azure outage progressed. However, to maintain timely delivery for all other customers, we temporarily paused event deliveries to a small number of endpoints that continued to experience failures. Once Azure services stabilized, we resumed normal delivery to all customers before closing the incident.

Action Items

  • We are implementing automatic noisy-neighbor protections in the Impressions Webhook integration to isolate failing destinations and prevent cross-customer impact.
  • We are reviewing similar delivery patterns across our platform to identify and proactively mitigate any comparable risks.
  • We have updated our operational runbooks to enable faster, more decisive action when customer endpoints cause widespread delays, helping ensure continued reliability for all customers.
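
One common way to isolate a failing destination is a per-destination circuit breaker. The sketch below is a minimal illustration of that general technique, not our implementation: the class name `DestinationBreaker`, its thresholds, and its methods are hypothetical.

```python
import time


class DestinationBreaker:
    """Pause deliveries to a destination after repeated consecutive failures."""

    def __init__(self, failure_threshold=5, cooldown_seconds=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value
        self.cooldown_seconds = cooldown_seconds    # assumed value
        self.clock = clock
        self.failures = {}   # destination -> consecutive failure count
        self.opened_at = {}  # destination -> time the breaker opened

    def allow(self, destination):
        """Return True if a delivery to this destination may be attempted."""
        opened = self.opened_at.get(destination)
        if opened is None:
            return True
        # After the cooldown, let probe deliveries through again.
        return self.clock() - opened >= self.cooldown_seconds

    def record_success(self, destination):
        # A success closes the breaker and clears the failure count.
        self.failures.pop(destination, None)
        self.opened_at.pop(destination, None)

    def record_failure(self, destination):
        count = self.failures.get(destination, 0) + 1
        self.failures[destination] = count
        if count >= self.failure_threshold:
            # Open (or re-open) the breaker and restart the cooldown.
            self.opened_at[destination] = self.clock()
```

A delivery worker would call `allow(destination)` before each attempt and skip or defer events for destinations whose breaker is open, so that one failing endpoint does not consume capacity shared with other customers.
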
Posted Nov 04, 2025 - 17:48 PST

Resolved

This incident has been resolved.
Posted Oct 29, 2025 - 21:48 PDT

Update

The Azure incident has been resolved.
Following the resolution, the backlog of impression events awaiting webhook delivery is shrinking significantly faster.
We’re continuing to monitor system performance to ensure full recovery and stability.
Posted Oct 29, 2025 - 19:15 PDT

Monitoring

We continue to monitor the situation closely.
We’re seeing signs of improvement in delivery delays for customers relying on impression webhook events.
Our team will keep tracking progress and provide further updates as recovery continues.
Posted Oct 29, 2025 - 15:25 PDT

Identified

There's no impact to core Harness FME SaaS capabilities.

However, due to an ongoing Azure outage, customers who rely on impression webhook events may experience longer than usual delays in delivery.

Our team is monitoring the situation closely and will continue to post updates here.
Posted Oct 29, 2025 - 11:49 PDT
This incident affected: Integrations.