Increased Page Load Times
Incident Report for Benevity
Postmortem

Summary

New cause search functionality for Spark was introduced in December 2021. This change included a non-optimized query whose performance cost was immaterial under normal system load. As Giving Tuesday (Nov 29th) volume rapidly increased around 8:26 AM MST, this query pushed the search service past its capacity, causing extensive degradation across multiple pages for 133 minutes.

Impact

All Spark clients started experiencing extremely slow response times on the Cause Search and Cause Profile pages. While these pages were not completely unresponsive, the slow response times significantly degraded Spark's usability for our clients for over two hours.

Root Cause

Spark's cause search engine capacity was exhausted by non-optimized queries, and load test coverage failed to identify the issue. Initial attempts to add capacity to the search service while it was under load were unsuccessful, which lengthened the time needed to remediate the issue.

Future Mitigation

  • Optimize the inefficient query so that it is performant under increased load
  • Increase load test coverage to capture additional use cases around searching for causes
  • Improve processes and training around search query optimization

Timeline of Events

  • 08:26 AM MT - Initial alert received
  • 08:31 AM MT - Major contributing cause identified
  • 08:34 AM MT - Initial attempt to add search capacity
  • 09:55 AM MT - Disabled donation progress bar display on Spark for all clients to reduce load
  • 10:22 AM MT - Additional Spark Cause Search capacity added to allow the system to catch up on pending requests
  • 10:40 AM MT - Major client impact resolved, system back to acceptable performance
  • 01:04 PM MT - Incident resolved, systems fully operational
Posted Dec 05, 2022 - 11:32 MST

Resolved
This incident has been resolved.
Posted Nov 29, 2022 - 12:56 MST
Monitoring
We have implemented a remediation and have recovered to normal page load times across all Spark sites.

We are continuing to monitor for any recurrence or further effects.
Posted Nov 29, 2022 - 11:16 MST
Update
We are continuing to implement a remediation for the issue.

Dashboard, Cause search, and Campaign pages are returning to normal load times.
Posted Nov 29, 2022 - 10:42 MST
Update
We are continuing to implement a remediation for the issue.
Posted Nov 29, 2022 - 10:29 MST
Update
We are continuing to implement a remediation for the issue.

We have taken action to disable the Donation Progress Bar on Giving opportunities.
Posted Nov 29, 2022 - 09:56 MST
Identified
We have identified the issue and are working to remediate it.
Posted Nov 29, 2022 - 09:34 MST
Investigating
We are currently investigating an issue with increased page load times (latency).
Posted Nov 29, 2022 - 09:16 MST
This incident affected: Benevity Spark.