Spark unavailable
Incident Report for Benevity
Postmortem

Summary

On September 11, 2024, from 10:29am MT to 11:07am MT, users may have experienced intermittent failures when accessing pages within Spark.

A recent update to Spark, intended to enhance our system monitoring, inadvertently generated a larger volume of log files than anticipated. This excessive logging caused temporary strain on our servers, leading to intermittent page access issues.

Our technical team promptly addressed the issue by replacing the affected server instances, which resolved the immediate access problems. We then changed the Spark logging configuration to reduce logging levels, so that log volume stays controlled and this issue does not recur.

Impact

Between approximately 10:29am and 11:07am MT on September 11, 2024, some users may have encountered intermittent difficulties accessing certain pages within their Spark sites. This would have resulted in error messages when trying to load those pages.

While service was not completely interrupted, a number of clients did experience a temporary degradation in service quality.

Root Cause

A recent update to Spark, intended to enhance our system monitoring, inadvertently generated a larger volume of log files than anticipated. This excessive logging caused temporary strain on our servers, leading to intermittent page access issues. The incident response team applied mitigations in stages to return Spark to normal functionality as quickly as possible.
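
As an illustration only, the kind of disk-space check that surfaces this condition (the first entry in the timeline is a report of high space utilization) might look like the Python sketch below. The function names, the path, and the 80% threshold are hypothetical and do not describe Benevity's actual monitoring.

```python
import shutil


def disk_usage_fraction(path="/"):
    """Return the fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.used / usage.total


def is_disk_pressure(path="/", threshold=0.80):
    """Return True when usage is at or above `threshold` (hypothetical 80% default)."""
    return disk_usage_fraction(path) >= threshold


if __name__ == "__main__":
    # Example: warn if the volume holding the current directory is nearly full.
    if is_disk_pressure(".", threshold=0.80):
        print("WARNING: disk usage is at or above 80%")
    else:
        print("Disk usage is below the threshold")
```

A check like this would typically run on a schedule and alert well before the disk fills, leaving time to rotate or reduce logs.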

Future Mitigation

Benevity is committed to providing our clients with uninterrupted service. To prevent similar issues from occurring in the future, we're taking proactive steps to strengthen our systems:

  • Review and update our system configurations for log rotation to prevent excessive disk usage and ensure log storage efficiency.
  • Update incident runbooks with detailed instructions for handling high log utilization, including troubleshooting steps, capacity management, and mitigation strategies.
  • Refine specific application logging flows within the codebase, ensuring that critical functions always take precedence. This will help safeguard against unexpected disruptions and maintain system stability.
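
To make the log rotation and logging-level items above concrete, here is a minimal Python sketch that caps log disk usage with size-based rotation and drops low-severity messages. It assumes an application using Python's standard `logging` module, which may not match Spark's actual stack. The logger name, size limit, and backup count are hypothetical values chosen for illustration.

```python
import logging
import logging.handlers
import os


def build_bounded_logger(log_path, max_bytes=1_000_000, backup_count=3):
    """Return a logger whose files on disk stay near max_bytes * (backup_count + 1)."""
    logger = logging.getLogger("spark_example")  # hypothetical logger name
    logger.setLevel(logging.WARNING)  # drop DEBUG/INFO to reduce log volume
    logger.propagate = False  # avoid duplicate output through the root logger

    # Remove handlers left over from an earlier call so output is not duplicated.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()

    # Roll over to log_path.1, log_path.2, ... once the active file nears max_bytes;
    # the oldest backup is deleted when backup_count is exceeded.
    handler = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=max_bytes, backupCount=backup_count
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def total_log_bytes(log_path, backup_count):
    """Sum the sizes of the active log file and its rotated backups."""
    paths = [log_path] + [f"{log_path}.{i}" for i in range(1, backup_count + 1)]
    return sum(os.path.getsize(p) for p in paths if os.path.exists(p))
```

With these settings, the worst-case disk footprint is known in advance (about `max_bytes` times `backup_count + 1`), so a burst of logging overwrites old log data instead of filling the disk.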

Timeline of Events

  • 11 Sep 2024 10:19 MT - Servers report high disk space utilization
  • 11 Sep 2024 10:29 MT - Server log files removed to clear disk space
  • 11 Sep 2024 10:29 MT - Error messages start, client impact begins
  • 11 Sep 2024 10:40 MT - Server instance replacement is initiated
  • 11 Sep 2024 11:07 MT - Error messages stop, service is fully restored
  • 11 Sep 2024 11:14 MT - Server instance refresh completed
  • 11 Sep 2024 12:11 MT - Spark logging configuration change completed to reduce logging levels
  • 11 Sep 2024 12:15 MT - Incident resolved
Posted Sep 18, 2024 - 14:02 MDT

Resolved
The fix has been successfully deployed, and this incident is now resolved.
Posted Sep 11, 2024 - 14:54 MDT
Monitoring
We experienced a configuration error that caused intermittent access issues and 500 error pages for some users. The configuration has been updated, Spark availability has been restored, and we are closely monitoring the situation while applying additional fixes. We apologize for any inconvenience and appreciate your patience. Please Subscribe to Updates for the most up-to-date information.
Posted Sep 11, 2024 - 11:57 MDT
Update
Spark availability has been restored. We are monitoring and applying additional remediations.
Posted Sep 11, 2024 - 11:20 MDT
Identified
We have identified an issue with Spark that is causing a 500 error page to be displayed. We are in the process of deploying a fix.
Posted Sep 11, 2024 - 11:01 MDT
This incident affected: Benevity Spark.