Versaic unavailable
Incident Report for Benevity
Postmortem

Summary

Following the conclusion of scheduled maintenance at 00:00 MT on July 21, 2024, Versaic by Benevity was unavailable to users for approximately 7 hours. Intermittent service degradation continued until 16:00 MT on July 22.

Impact

All Versaic users were impacted by the initial outage and were unable to access Versaic for its duration of approximately 7 hours.

From July 21 07:00 MT to July 22 16:00 MT, users may have intermittently experienced slower performance and been unable to complete payments.

Root Cause

The root cause is in two parts:

  1. During scheduled maintenance of a portion of the systems supporting the Versaic application, that component was put into a state of failure by an incompatibility between the code supplied by the vendor and the configuration that Benevity has on that component.
  2. Benevity teams implemented a recovery process to remediate that failure. The recovery process resolved the failure of the system. However, during the execution of the recovery process, the configuration of that system was changed such that a portion of the application (primarily the payments integration to Benevity’s platform services) could no longer interact with that infrastructure.

Note: This was not related in any way to the global CrowdStrike outage the previous week. Benevity was not impacted by the CrowdStrike outage.

Future Mitigation

  1. We are adjusting our scheduled maintenance for this infrastructure to include additional validation that would allow us to detect and remediate similar failure conditions before the system enters a failure state.
  2. The configuration change made to fix the post-failure system interaction is more resilient to failure than the previous configuration. This should reduce the impact if a similar failure occurs in the future.
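
As an illustration only, the additional validation described in the first mitigation could take the form of a gate that runs a health check against every server before the maintenance window is closed. The sketch below is not Benevity's implementation: the names `validate_servers` and `ValidationResult`, the server names, and the idea of one health-check callable per server are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ValidationResult:
    """Outcome of a post-maintenance validation pass (hypothetical type)."""
    healthy: List[str]
    failing: List[str]

    @property
    def safe_to_close_window(self) -> bool:
        # The maintenance window should only be closed when no server fails.
        return not self.failing


def validate_servers(checks: Dict[str, Callable[[], bool]]) -> ValidationResult:
    """Run each server's health check and sort servers into healthy/failing.

    `checks` maps a server name to a zero-argument callable that returns
    True when the server is healthy. Both are assumptions for this sketch.
    """
    healthy: List[str] = []
    failing: List[str] = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A check that raises counts as a failure rather than being skipped.
            ok = False
        (healthy if ok else failing).append(name)
    return ValidationResult(healthy=healthy, failing=failing)


if __name__ == "__main__":
    # Stand-in checks; a real check would probe the server itself.
    result = validate_servers({
        "app-1": lambda: True,
        "app-2": lambda: False,
    })
    print("healthy:", result.healthy)
    print("failing:", result.failing)
    print("safe to close window:", result.safe_to_close_window)
```

Reporting failures per server, rather than as a single pass/fail flag, would also make it possible to take one failing server out of rotation while the others keep serving requests.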

Timeline of Events

July 20, 2024

  • 18:00 MT - Start of Benevity maintenance window
  • 21:40 MT - Versaic health checks fail after system maintenance completes; troubleshooting begins

July 21, 2024

  • 00:00 MT - Benevity maintenance window ends; start of SEV1 incident
  • 06:41 MT - Service is no longer failing; maintenance page removed; repair replication started
  • 10:08 MT - Reports of intermittent payment failures received and confirmed
  • 14:00 MT - Fix applied for some of the payment failures; pending confirmation

July 22, 2024

  • 05:22 MT - Determined that not all payment processes were fixed; further troubleshooting begins
  • 12:04 MT - Applied further remediation to resolve payment issues; received confirmation of successful payments
  • 13:38 MT - Payment failures detected after extended period of no failures
  • 14:57 MT - Determined that a single client-facing server was failing; others were successful. Server was removed from serving new requests.
  • 15:59 MT - No failures detected after 14:57. Declared the incident remediated, and entered extended monitoring state.

July 23, 2024

  • 14:37 MT - No failures detected since 14:57 MT the previous day. Declared the incident resolved.
Posted Aug 06, 2024 - 20:28 MDT

Resolved
This incident has been resolved.
Posted Jul 30, 2024 - 14:39 MDT
Monitoring
We believe that we have remediated the outage and Versaic services are working as expected.
Posted Jul 29, 2024 - 16:19 MDT
Update
We are continuing to work through remediation of issues related to the outage this weekend. We are seeing occasional issues preventing users from creating new payments.
Posted Jul 29, 2024 - 15:06 MDT
Update
We are continuing to work through remediation of issues related to the outage this weekend.
Posted Jul 29, 2024 - 13:54 MDT
Identified
We have identified further issues related to the outage from this weekend. We are working through remediation of those.
- We have addressed an issue related to failing payments in the Versaic application.
- We are seeing issues with a subset of scheduled report functionality; a resolution for this is pending.
Posted Jul 29, 2024 - 12:28 MDT
Monitoring
We have restored user access to Versaic. Users may see less-than-optimal performance when using Versaic as recovery processes continue in the background. We expect that to continue through the remainder of the weekend. We are also seeing issues with a subset of scheduled report functionality, which will persist through Monday.
Posted Jul 28, 2024 - 06:56 MDT
Update
We have restored user access to Versaic. We are still seeing issues with the CDN and some scheduled report functionality. Users may see less-than-optimal performance when using Versaic as recovery processes continue in the background.
Posted Jul 28, 2024 - 06:46 MDT
Update
We are continuing to remediate the issue.
Posted Jul 28, 2024 - 04:46 MDT
Update
We are continuing to remediate the issue.
Posted Jul 28, 2024 - 04:15 MDT
Update
We are continuing to remediate the issue.
Posted Jul 28, 2024 - 03:43 MDT
Update
We are continuing to remediate the issue.
Posted Jul 28, 2024 - 03:09 MDT
Update
We are continuing to remediate the issue.
Posted Jul 28, 2024 - 02:38 MDT
Update
We are continuing to remediate the issue.
Posted Jul 28, 2024 - 02:08 MDT
Identified
We believe that we have identified the source of the issue, and are implementing a fix.
Posted Jul 28, 2024 - 01:25 MDT
Update
We are continuing to investigate this issue.
Posted Jul 28, 2024 - 00:52 MDT
Investigating
We are currently investigating an issue preventing users from accessing the Versaic service.
Posted Jul 28, 2024 - 00:13 MDT
This incident affected: Versaic by Benevity (Versaic Production Application).