Thursday, November 4, 2010

The Spam section: an Anatomy of a Downtime

For those of you who weren't following along on Twitter, the last 2 days were a bit of a rocky road for the Spam section. At this time, virtually everything has been restored to its proper state, but for those of you who would like to know more, here's what happened.

On Tuesday, November 2nd, the primary server for individual users' spam databases (which is what feeds the Spam section of the website, and the emailed reports) suffered a partial disk failure. All our servers have redundant hardware, but a reboot was required to get everything running again, which caused an outage of about an hour.

To fix the problem more permanently, we scheduled a maintenance window for the early morning of November 4th. Unfortunately, the maintenance didn't fix the problem. In fact, things broke harder, with the partially failing disk becoming a completely failing one.

When dealing with a hardware problem that also has a software solution, you always end up asking the same question. Will it take less time to fix the hardware (even if that time is an unknown number of hours) than it would take to write a software solution?

In the case of the Spam section, we know there is really one critical feature: let you release any misidentified spam. There are other nice features -- see what we've picked up for you, let you mark messages reviewed, search, get an overall count of messages identified. But, if you can release a misidentified message, then the section is working, and if you can't, well, it's just broken.

So, after several hours of working on the hardware, we ended up saying, "OK, it's time to try a software solution." This led to the most recent 7 days of spam becoming available at approximately 1 PM EDT (approximately 12 hours after the outage began). We wanted to make sure that any real mail you had received this morning or late yesterday was available quickly, since we still weren't sure how long it would take to fix the hardware.

As is often the case, though, once the initial crisis is resolved, the permanent fix follows close behind. By 2:30 PM, the hardware problem had been resolved. By 4 PM, a full restore of all spam data had been completed. At this time, the Spam section should be its old self again, and, if you hadn't logged in to check your Spam in the last few days, you'd probably never realize anything had happened.

If you have been logging in, though, you may notice a few small issues as a result of this problem.

1. Messages you marked "reviewed" or deleted between 1 and 4 PM on November 4th will have their status reset to unreviewed.
2. If you received an emailed report between 1 and 4 PM on November 4th, the "View this message on the web" link and the "Mark this Report Reviewed" button will not work. (The per-message "Release" link will work, though.)
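
If you'd like to check whether a particular action or emailed report falls inside that window, here is a minimal sketch in Python. It is only an illustration, not code from our systems: the function name `in_affected_window` is made up for this example, the times are assumed to be EDT (UTC-4) as elsewhere in this post, and treating 1 PM as included and 4 PM as excluded is an assumption.

```python
from datetime import datetime, timedelta, timezone

# EDT is UTC-4; a fixed offset avoids needing a time zone database.
EDT = timezone(timedelta(hours=-4), "EDT")

# Window taken from the list above: 1 PM to 4 PM on November 4th, 2010.
# Inclusive start / exclusive end is an assumption made for this sketch.
WINDOW_START = datetime(2010, 11, 4, 13, 0, tzinfo=EDT)
WINDOW_END = datetime(2010, 11, 4, 16, 0, tzinfo=EDT)


def in_affected_window(when: datetime) -> bool:
    """Return True if a timezone-aware timestamp falls inside the window."""
    if when.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return WINDOW_START <= when < WINDOW_END
```

For example, a report sent at 2:30 PM EDT (18:30 UTC) on November 4th would be inside the window, while one sent at 12:59 PM EDT would not.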

There was even an unexpected upside to this problem. In order to make the spam from the last 7 days available, some changes had to be made to how incoming spam was processed. These changes significantly boosted the processing speed of incoming messages. So, we hope that this will reduce the time between when we catch a message and when it appears for review on the web. It's a small change, but a welcome one if the message in question is one you're itching to release.

As always, we appreciate your patience and understanding during this outage. If you see any further issues, please let us know. If keeping up to date on outages is important to you, I would encourage you to keep an eye on our Twitter feed (which we have been updating throughout the day today). When the Pobox site is up, you will see our most recent tweet at the top of the Services page, but even when the site is down, our announcements are available on Twitter's website, as texts to your mobile phone, through RSS, and, of course, through the many Twitter applications out there.