Friday, October 30, 2009

How and Why Downtimes are Scheduled at Pobox

Very early next Tuesday morning (or late Monday night for you night owls), Pobox has a Mailstore downtime scheduled from midnight to 4 AM (which is a really long outage for us!) I asked our Operations Head, Bryan Allen, to explain a little bit about downtimes, including the many, many changes that don't require downtimes, as well as what kinds of changes do.

Here at Pobox we try to minimize outages. In fact, we kind of have a thing about it. We keep the spice flowing, and we try to ensure you can access your mail. Periodically, we generate planned outages for software or hardware upgrades.

The vast majority of the Pobox service infrastructure is redundant and to an extent, self-healing; it requires no human interaction when badness occurs. The rest is replicated but requires manual intervention for failover. The core databases are an example of that: We have failover replicas, and we can fail over to any one of them within minutes with a minimum of service impact and no performance degradation once the failover has finished.

Our software upgrades are relatively benign these days. We push new Perl code to production several times a week, and we patch our operating systems regularly (thanks to the magic of Solaris LiveUpgrade, this would create an outage of a few minutes if most services weren't already redundant), and so on. We very, very rarely have service outages that are caused by core software.

On rare occasion (especially in the last several years), we'll hit an unplanned outage on a unique service. Those are Big Fires, and there's only one goal there: To fix the problem and restore service. Hardware outages, simply due to intrinsic orneriness, are harder to both plan for and recover from. Sometimes bugs can generate an outage: Recently a bug in the Mailstore authentication code made it so new clients could not authenticate and access their mail. That was a regression, and the fix was trivial.

Conversely, a planned outage is a declaration of intent: It says we are going to create an interruption of service some some specific, defined reason. At times, the intention is to avoid an unplanned outage (a fire) at some point in the future. Usually, it's to improve the service in some way.

Tuesday's outage is for a relatively major hardware upgrade.

The X4100 M2s running the Mailstore storage are somewhat older dual-core Opterons. We haven't quite hit the wall for their CPU running Mailstore, but we can see the dust on the horizon. Very early Tuesday morning, we're going to be swapping the X4100s out for X4150s (dual quad-core Xeons). That change alone will see us through for quite a long time. In addition to faster CPU and bus, however, the X4150s can take double the RAM (doubling the filesystem cache) and have 8 SAS bays (four free, currently, per-system). This will let us a build a Hybrid Storage Pool, redirecting the filesystem journal writes to an write-optmized SDD, and building an L2 filesystem cache on a read-optimized SSD. To put it very mildly: Zoom. In the future we'll want a way to upgrade the storage head nodes without taking the service offline, but currently the architecture doesn't allow for it.

In addition to snapshots (which we utilize for data recovery and streaming replication), ZFS comes with built-in compression. The bottleneck for disk access, is well, spinning rust. Regardless of the speed of the disks you use, and the size of the your filesystem cache (which is currently 16GB for Mailstore), you still have to retrieve bits from a platter. And that's slow. So why is compression a good thing? Won't compressing files consume CPU? Isn't CPU still a valuable resource? It is, but these days your fileservers CPUs are likely to be sitting relatively idle, while their disks are thrashing. If you compress the bits you write to disk, you have less to read and write, and get far more I/O per second, basically for free.

(Back when we first refactored the discards storage system, it was taking forever to write the user indexes to disk. Enabling compression increased performance by at least threefold.)

For this upgrade, the vast majority of work can be done before the maintenance window even starts. We use Puppet from Reductive Labs to manage our systems, and we encapsulate services in Solaris Zones. Put that together, and you have the ability to quickly provision services on new hardware without actually doing any work. So the new zones are all already running and configured, just waiting for the storage pools to be mounted. It really is as easy as it sounds.

Regardless of how easy it sounds, for every outage you want a pullback plan. If your upgrade totally fails for some reason, you want the ability to just put things back the way they were. In this case, the plan is: Plug the hardware into the old box.

We scheduled this window for four hours because I'm paranoid. You'll see this a lot when any business is updating a core piece of infrastructure. When our datacenters are updating their routers, they announce a six hour window where connectivity may flap. You may have noticed that our connectivity does not actually flap regularly, because they're pushing out updates they've tested in their labs already, and are reasonably sure everything will be full of joy. Badness does occur, though, and when it does it has a very bad habit of avalanching. So you want a defined window to try to resolve the problem and still complete your task, before you have to give up, put the old pieces back into place, and try again another day.
To recap the services that will be affected by this downtime, Mailstore customers will not be able to access webmail, their Mailstore folders or receive new mail to their Mailstore Inbox during the outage window. If you are a Mailstore customer, and also forward your mail to another address, your forwarded copy will be delivered throughout the outage without delay. You will also be able to send mail. If you are an IMAP user, and keep a local copy of your mail on your computer, you will be able to read your local copies, and any changes you make (deleting, moving messages, etc.) will be synced to Mailstore when the downtime ends.

We apologize for any inconvenience this may cause, and Bryan did ask me to tell you that, if everything goes smoothly, the downtime will be much shorter than the scheduled window. He just doesn't like to bank on that.

1 comment:

  1. Hi - for your customers on the other side of the pond, that wipes out our morning. A little warning of such a major outage to non-american customers would have been helpful. (We're snowed in here in Manchester UK, so working from home is much harder without email!!!)

    ReplyDelete