Thursday, December 13, 2012

Mailstore went down. What happened?


What is happening now? (last updated 4 January 1:45 PM EST)

At this time, all Mailstore problems are considered fixed.  New hardware has been deployed, and approximately 75% of users have been migrated.  Additional hardware is still being deployed for the remaining 25% of users.  Onsite and offsite backups are both working properly, and new, faster hardware is expected shortly for onsite backups.

If maintaining the highest degree of access to your new mail is critical, please leave your forwarding address in place for now. The new hardware is still being upgraded, and short downtimes will be scheduled for migrated users in the coming weeks.  More details will be announced as they become available.  We do not expect these downtimes to exceed 1 hour, so if an outage of that duration is acceptable to you, you may feel free to remove your backup forwarding at any time.

Due to the nature of this outage (and the cleanup efforts it requires), a credit for 6 months of service has been added to all Mailstore accounts. We are truly, deeply sorry for the inconvenience we know this is causing all of you, and we're particularly mortified that you have lost any mail as the result of a failure on our part.  We appreciate the patience (and kindness) you've shown, and hope we can re-earn your esteem as your email provider.

Overview of the Outage

Mailstore was down from 13 December 12:46 PM EST (-0500 GMT) until 14 December 8:13 AM EST, due to a hardware failure.  Much to our horror and dismay, some fraction of mail destined for Mailstore accounts also bounced at approximately 6 PM, with the errors "Relay access denied" or "mail for mailstore.pobox.com loops back to myself".  If your account was among those that had mail bounce, you will receive an email telling you who sent you the mail and when.

As of 14 December 8:45 PM EST, all backlogged mail has been delivered. A tiny fraction of accounts (24) had corrupted indexes that prevented their owners from logging in.  As of 15 December 10:00 AM EST, all 24 had their indexes rebuilt and all their mail delivered.  If you believe you are still missing mail or cannot log in, please contact us to report it.

The message showing any bounced mail went out 13 December around midnight EST.  If you had added a forwarding address by then, you have already received the message.  If you did not have a forwarding address, that message may still be in the backlog of mail to be delivered.  If you did not bounce any mail, you did not receive a message.

All the original updates are included below, for your reference.

Report from the system administration staff

Obviously, this is an incredibly horrible, extended outage, and we can only give you an explanation, not by any means an excuse.  And the explanation, in short, is, we got caught with our pants down.  We have been doing behind-the-scenes work on Mailstore for the past few months.  As noted in several of the comments below, Mailstore is both a single point of failure, and one of the harder services to fix quickly, because of the massive amount of data it involves.  So, even something as simple as "add more storage" can become challenging when that requires moving to new hardware instead of adding more drives to existing hardware.  

The ongoing projects for Mailstore over the last several months have been: switch the backend processing software (from Cyrus to Dovecot, for additional features and more stability), add a new storage device (which we did, and which has been problematic from the get-go), get a replacement storage device for the new storage device (which has not yet been delivered), and set up a hot onsite backup and a cold offsite backup.  But doing anything involving copying, moving or reorganizing Mailstore requires downtime, which we have been trying to minimize.  (I know, and see where it got us?)  So, we have been proceeding slowly, migrating accounts individually, and basically holding off on important things to avoid either slowing down your mail access or incurring extended downtimes.

As the performance of the current (quite new!) Mailstore hardware has degraded, and with the replacement not yet on site, we pushed forward to deal with the problem by planning a brief downtime to fail over some services onto less loaded parts of the system. That was going to happen tonight. This put us in a race with the hardware: we had to get to our maintenance window before it failed, because it was clear the planned means of failover ("Plan B") would not work. Mailstore's current workload would utterly overwhelm the failovers.

Unfortunately, we lost the race. This morning, a series of cascading failures, some seemingly entirely unrelated to the existing problems, including the complete corruption of our backup storage device, brought down the Mailstore service in such a catastrophic way that "Plan C" and "Plan D" for recovery were out of the question. We had to cobble together something from parts of Plans E through K, and the result was what you'd expect: a number of false starts and unforeseen problems.

We counted on things staying the way they were, at least for a little while, rather than insisting on a downtime much earlier on to prep for this kind of catastrophe.  And, for that, all we can say is how sorry we are.  We are hoping to be back online very soon, and will continue making updates until we are.


What can I do to get my mail?

We recommend adding another forwarding address right away by clicking the "Edit" button to the right of "Your Mailstore Inbox" in the Delivered To column.  Get more detailed instructions on adding a forwarding address.

If you've only used your Mailstore Inbox and aren't sure what a forwarding address is, a forwarding address is an email address at another ISP or provider.  We take mail sent to your Pobox address, and forward it there.  As long as you leave your Mailstore Inbox as one of your other addresses that we deliver to, the mail that we forwarded will also be delivered to Mailstore.

How can I get updates?

We're tweeting updates on the situation as we get them.  You can view them on the web at http://www.twitter.com/pobox (or status.pobox.com), or follow us on Twitter @pobox.

What happened?

We had been seeing slowness and errors throughout the day. We had planned an outage for late this evening. But, the problems were growing, and we thought that the problem could be resolved quickly by resetting the storage cluster. We were mistaken.

After we powered down the equipment, it did not come back up.  We have been in touch with the vendor; they currently believe either the power supply backplane or the motherboard needs to be replaced.  We are now working on bringing up the backup hardware.

Bringing up the backup hardware takes a while because there is so much data on Mailstore that needs to be kept in sync.  Unfortunately, this process is somewhat opaque, so it's hard for us to tell how long it's going to take to finish.

Why is it taking so long to bring up the backup hardware?

The backup hardware is underpowered, and we were aware of this. Replacement hardware has been on order, but hasn't yet come in. Unfortunately, this is simply a case of really crappy timing.

What is NOT affected?

Basically, anything besides accessing your mail stored on Mailstore (whether you use your email program, webmail.pobox.com or atmail.pobox.com).  Forwarding, sending mail via SMTP, spam processing (and viewing via the website), and all other website functions should be unaffected.

Update @ 6:26 PM: A number of messages bounced.

As we were bringing boxes up, there was an error.  The IP address for Mailstore was picked up by the firewall.  The firewall, not being configured to accept mail, bounced approximately 8,000 messages.  If your mail was among those messages, we will get as much information to you as possible about any lost message.

I have nothing to say about this, other than I absolutely share the sickening feeling you may be getting.  We pride ourselves on never bouncing or otherwise losing legitimate mail.  I don't even know what to say, except this is the kind of day we have nightmares about.


Update 14 December @ 2:40 AM EST - What is going on?

We are waiting on the remaining slow, clearly degraded hardware to finish making the data available.  Once it is up, it will still be slow.  This weekend, we will be doing everything we can to get off this degraded hardware, once and for all.

In answer to the question, "how did this go so badly wrong?", I can only say that the most horrific words in the English language are: "the backups are completely corrupted."  Rest assured that there will be a complete analysis done of all the myriad ways our existing solutions failed us, in addition to work on the already-planned upgrades.

Update 14 December @ 7:40 AM EST

As this failure stretches on and on, we are working on alternative plans.  Right now, we are setting up a new device, to give you access to the mail that has already come in. However, as seems to often be the case, as you start working on alternatives, the originals near completion.  We hope to have a more definitive report in about 30 minutes.

Update 14 December @ 8:20 AM EST

The storage restore that had been running for the last 12+ hours has finally completed.  Mailstore access has been restored.  It will be slow, and the password requests we were seeing yesterday will almost certainly still happen.  We will keep you apprised of future steps as we solidify the plans for them.