On Monday morning, December 31st, 2018, Planning Center Giving was unresponsive for most customers between 8:20am and 9:20am PST.
First and foremost, we’re very sorry about this. Any amount of downtime means we’re missing our mark. As a company and as a product team, we take these incidents seriously.
A little before 8am, we started getting automated alerts that something was locking up the database servers. By the time our first few support tickets about it came in, our operations team was already investigating the issue. Once they saw the problem was specific to Giving, they pulled in our product team to help.
The culprit was a badly performing database query used when displaying past donor statements. This bit of code has been around for quite a while in Giving, so it seemed weird that it would suddenly cause a bottleneck for the entire system.
After a few unsuccessful attempts to restore service at the database level, we eventually opted to temporarily disable the “Download Past Statements” feature, which made use of that query. Sacrificing the feature for a while felt like a small price to pay to get Giving back on its feet.
Once we disabled the “Download Past Statements” feature, Giving came back online.
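For readers curious what "disabling a feature" can look like in practice, here is a minimal Python sketch of a kill switch. It is an illustration only, not Giving's actual code: the `FeatureFlags` class, the flag name, and both functions are hypothetical, and a real system would likely keep the flag in a shared store rather than in memory.

```python
class FeatureFlags:
    """Hypothetical in-memory kill switches (illustration only)."""

    def __init__(self):
        self._disabled = set()

    def disable(self, name):
        self._disabled.add(name)

    def enable(self, name):
        self._disabled.discard(name)

    def is_enabled(self, name):
        return name not in self._disabled


flags = FeatureFlags()


def run_expensive_statement_query(donor_id):
    # Hypothetical stand-in for the slow database query.
    return [f"statement-for-donor-{donor_id}"]


def download_past_statements(donor_id):
    # Check the switch before touching the database,
    # so a disabled feature never issues the query.
    if not flags.is_enabled("download_past_statements"):
        return {"status": 503, "body": "Past statements are temporarily unavailable."}
    return {"status": 200, "body": run_expensive_statement_query(donor_id)}
```

The point of the design is that flipping the switch takes the expensive query out of the request path entirely, so the rest of the application can recover while the underlying problem is investigated.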
Lessons We Learned
Whenever we have a service disruption, our team conducts a post-mortem to figure out what happened and what we can learn. Here are our main takeaways:
- Service restoration should be priority #1: Looking back, we should have been faster to do what we ultimately did: disable the problematic feature to get the application back on its feet. When we see something broken, our instinct is to try to fix it. This was a good reminder that fixing an issue should come second to restoring service.
- Better preparation before the holiday break: Every Planning Center application has specific times of the day, week, or year when there’s a spike in usage. For Giving, December 31st is a big day for obvious reasons. Next year, we’ll take extra time before the holiday break to do some performance checks to ensure we’ll be ready for your end-of-year donations.
That same day, Giving went on to process more donations in a single day than it ever has before. Millions of dollars and tens of thousands of donations were processed without incident before the ball dropped in Times Square. Talk about a rollercoaster of a day!
We’ll learn from this experience, improve the system, and keep pressing forward. 99.94% uptime means we’re 0.06% away from our goal. Thank you for sticking with us.