How high availability works on various storage arrays is something of an ignored topic in my experience (it definitely was during a customer-facing conversation earlier this week, which was the inspiration for this post). During the sales process, there’s usually a statement about “clustered heads” and “automatic failover” and the conversation then moves on. Let’s dive into the weeds a bit though, shall we?
Note: just for simplicity, I’m going to use the acronym SP (Service Processor) to refer to a Service Processor, filer head, etc. (although I realize SP is the EMC acronym it’s also the shortest to type 😉 ).
- Two SP’s, Synchronized Cache, Immediate Failover — cache is kept constantly synchronized between SP’s via a high speed interconnect (Infiniband, PCIe with custom cable, 10 GigE, etc.). During a failover the second SP takes over all LUNs immediately where “immediately” is defined as 2-3 seconds maximum with sub-second takeover as normal. Takeover time depends on SP load and # of LUNs so isn’t precisely deterministic. While both SP’s are active, any single LUN is only truly active on a single SP at a time although LUNs can be manually transferred between SP’s relatively easily for performance optimization (an SP failover is really nothing more than a mass transfer of LUNs between SP’s).
- This is the model for EMC’s Clariion – FLARE OS.
- Two SP’s, Synchronized Cache, Delayed Failover — same cache consistency as above; however, during a failover the remaining SP/head boots its partner’s identity from scratch. Guest/host/application timeouts become critical here as the failover can take anywhere from 15 seconds up to 2-3 minutes depending on array load and # of LUNs (even less deterministic performance-wise). While both SP’s are active, a single LUN can only be served from one SP at a time, and transferring LUNs between SP’s is usually a non-trivial operation (i.e. not something done as part of day-to-day operations/performance optimizations).
- This is the model for all NetApp arrays (given they’re using a single OS) – Data ONTAP OS.
- Update: EMC’s Celerra also uses this model more or less in DART (NAS-focused OS/platform so not precisely applicable here but including for completeness).
- Multi-way SP’s, Multi-way shared cache, Instant Failover — let’s go in reverse here. Since LUNs are actually active on multiple SP’s, failover for an SP failure is instant — it’s purely an MPIO event with the LUN already live on another SP. Cache is multi-way synchronized via high speed interconnects with the SP’s being aware of the local cache of other SP’s before hitting disk.
- This is the basic model for enterprise class arrays, i.e. Symmetrix V-Max (and vPlex as well), Hitachi USP/VSP.
Note: I fully realize this list is not exhaustive and is SAN focused (will consider NAS at a later point)…right now at least it’s based on the arrays I’ve worked with/researched extensively (with that expanding over the last several months as I learn EMC gear). I would love to hear from others out there on the failover models used by other arrays (Compellent, EQ, Lefthand, etc.).
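To make the comparison a bit more concrete, here is a minimal Python sketch of the three models above. The worst-case numbers are just the rough ranges quoted in this post (not vendor specifications), and the names `FailoverModel`, `MODELS`, and `survives_failover` are made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverModel:
    description: str
    worst_case_seconds: float  # longest I/O pause a host should expect


# Rough figures taken from the descriptions above; treat them as
# illustrative assumptions, not measured or vendor-published values.
MODELS = {
    "immediate": FailoverModel("Two SPs, synchronized cache, immediate failover", 3.0),
    "delayed": FailoverModel("Two SPs, synchronized cache, delayed failover", 180.0),
    "instant": FailoverModel("Multi-way SPs, multi-way shared cache, instant failover", 0.0),
}


def survives_failover(host_timeout_seconds: float, model: FailoverModel) -> bool:
    """Return True if the host's I/O timeout covers the model's worst-case pause."""
    return host_timeout_seconds >= model.worst_case_seconds


if __name__ == "__main__":
    # 60 seconds is a hypothetical host disk timeout chosen for the example.
    for key, model in MODELS.items():
        print(key, survives_failover(60, model))
```

With a hypothetical 60-second host timeout, only the delayed-failover model falls short, which is exactly the case where timeout tuning becomes mandatory.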
So what’s the impact? There is one thing I’d like to call out here…
You may have already picked up on this….but if your storage array can take up to 2-3 minutes for an array failover, what does that do to your ESX host, Windows OSes, application, etc? Well…bad things….crashing and/or data corruption for instance.
All array vendors with this model have recommendations/scripts/tools around setting higher ESX/Windows/etc. timeouts but there are two wildcards here.
- Applications — in a lot of cases, you can’t tune applications to wait out a 2-3 minute I/O pause. They may simply crash if the failover hits during a high-activity period (Murphy’s law….arrays fail during business hours).
- Consistency — anytime you have to make settings changes in multiple spots (even with helpful tools to do it), you’ve added complexity. And while I hate to say it, it’s inevitable in a larger environment (especially with multiple admins) that some of the settings won’t get in place correctly. You’ll find out that they’re missing someday though….and given that most arrays use array failover to handle Non-Disruptive Upgrades, it may be sooner than you’d like.
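The consistency problem is easy to picture as a simple audit. The sketch below is only an illustration in Python: the host names, the timeout values, and the `find_misconfigured_hosts` helper are all hypothetical, and in a real environment the inventory would have to come from whatever tooling you already use to collect host settings.

```python
def find_misconfigured_hosts(host_timeouts, required_seconds):
    """Return the sorted names of hosts whose I/O timeout is below the requirement."""
    return sorted(
        name
        for name, timeout in host_timeouts.items()
        if timeout < required_seconds
    )


if __name__ == "__main__":
    # Hypothetical inventory: host name -> configured disk I/O timeout in seconds.
    inventory = {
        "esx01": 190,
        "esx02": 190,
        "win-sql01": 60,  # the one somebody forgot to update
        "win-app02": 190,
    }

    # 180 seconds matches the 2-3 minute worst case discussed above.
    print(find_misconfigured_hosts(inventory, required_seconds=180))
```

One missed host out of four is all it takes, and nothing will flag it until the next failover or non-disruptive upgrade.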
If your array fails over in 2-3 seconds or immediately, how much host/OS/Guest configuration do you need to do?
How much time and uncertainty will that save you?
Comments, and even debate, are more than welcome!