How high availability works on various storage arrays is something of an ignored topic in my experience (it definitely was during a customer-facing conversation earlier this week….the inspiration for this post). During the sales process, there’s usually a statement about “clustered heads” and “automatic failover” and the conversation then moves on. Let’s dive into the weeds a bit though, shall we?
Note: just for simplicity, I’m going to use the acronym SP (Service Processor) to refer to a Service Processor, filer head, etc. (although I realize SP is the EMC acronym it’s also the shortest to type 😉 ).
Failover Models
- Two SP’s, Synchronized Cache, Immediate Failover — cache is kept constantly synchronized between SP’s via a high speed interconnect (Infiniband, PCIe with custom cable, 10 GigE, etc.). During a failover the second SP takes over all LUNs immediately where “immediately” is defined as 2-3 seconds maximum with sub-second takeover as normal. Takeover time depends on SP load and # of LUNs so isn’t precisely deterministic. While both SP’s are active, any single LUN is only truly active on a single SP at a time although LUNs can be manually transferred between SP’s relatively easily for performance optimization (an SP failover is really nothing more than a mass transfer of LUNs between SP’s).
- This is the model for EMC’s Clariion – FLARE OS.
- Two SP’s, Synchronized Cache, Delayed Failover — same cache consistency however during a failover the remaining SP/Head boots its partner’s identity from scratch. Guest/host/application timeouts become critical here as the failover can take anywhere from 15 seconds up to 2-3 minutes depending on array load and # of LUNs (even less deterministic performance-wise). While both SP’s are active, a single LUN can only be served from one SP at a time and transferring LUNs between SP’s is usually a non-trivial operation (i.e. not something done as part of day to day operations/performance optimizations).
- This is the model for all NetApp arrays (given they’re using a single OS): Data ONTAP.
- Update: EMC’s Celerra also uses this model more or less in DART (NAS-focused OS/platform so not precisely applicable here but including for completeness).
- Multi-way SP’s, Multi-way shared cache, Instant Failover — let’s go in reverse here. Since LUNs are actually active on multiple SP’s, failover for an SP failure is instant — it’s purely an MPIO event with the LUN already live on another SP. Cache is multi-way synchronized via high speed interconnects with the SP’s being aware of the local cache of other SP’s before hitting disk.
- This is the basic model for enterprise class arrays, i.e. Symmetrix V-Max (and vPlex as well), Hitachi USP/VSP.
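To make the three models easier to compare, here is a minimal Python sketch that records the takeover windows quoted above and checks whether a given host I/O timeout would ride through the worst case. The class, field, and function names are made up for illustration, the numbers are only the rough ranges from this post (not vendor specifications), and the 0.0 used for the multi-way model is a placeholder for "instant" since real-world time there depends on MPIO path switching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverModel:
    name: str
    best_s: float   # fastest takeover quoted in the post, in seconds
    worst_s: float  # slowest takeover quoted in the post, in seconds


# Rough figures from the post; illustrative only, not vendor specifications.
MODELS = {
    "immediate": FailoverModel("Two SPs, immediate failover", best_s=1.0, worst_s=3.0),
    "delayed": FailoverModel("Two SPs, delayed failover", best_s=15.0, worst_s=180.0),
    # "Instant" is modeled as 0.0 here; actual time depends on host MPIO.
    "multi-way": FailoverModel("Multi-way SPs, instant failover", best_s=0.0, worst_s=0.0),
}


def tolerates(model: FailoverModel, host_io_timeout_s: float) -> bool:
    """True if the host would keep waiting longer than the worst-case takeover."""
    return host_io_timeout_s > model.worst_s


if __name__ == "__main__":
    # 60 seconds is a hypothetical host I/O timeout chosen for the example.
    for key, model in MODELS.items():
        verdict = "ok" if tolerates(model, 60.0) else "at risk"
        print(f"{key}: worst case {model.worst_s}s with a 60s host timeout -> {verdict}")
```

The point of the comparison is simply that the delayed model is the only one where the host-side timeout has to be raised deliberately to cover the takeover window.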
Note: I fully realize this list is not exhaustive and is SAN focused (will consider NAS at a later point)…right now at least it’s based on the arrays I’ve worked with/researched extensively (with that expanding over the last several months as I learn EMC gear). I would love to hear from others out there on the failover models used by other arrays (Compellent, EQ, Lefthand, etc.).
So what’s the impact? There is one thing I’d like to call out here…
Host/OS/Application Configuration
You may have already picked up on this….but if your storage array can take up to 2-3 minutes for an array failover, what does that do to your ESX host, Windows OSes, application, etc? Well…bad things….crashing and/or data corruption for instance.
All array vendors with this model have recommendations/scripts/tools around setting higher ESX/Windows/etc. timeouts but there are two wildcards here.
- Applications: in a lot of cases, you can’t tune applications to wait 2-3 minutes during an I/O pause. They may simply crash if the pause hits during a high activity period (Murphy’s law….arrays fail during business hours).
- Consistency: anytime you have to make settings changes in multiple spots (even with helpful tools to do it), you’ve added complexity. And while I hate to say it, it’s inevitable in a larger environment (especially with multiple admins) that some of the settings won’t get in place correctly. You’ll find out that they’re missing someday though….and given that most arrays use array failover to handle Non-Disruptive Upgrades, it may be sooner than you’d like.
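The consistency problem is essentially an audit problem, so here is a small Python sketch of what such a check could look like. Everything in it is hypothetical: the host names, the timeout values, and the idea that you already have an inventory of each host's configured disk I/O timeout (how you collect that depends entirely on your OS and vendor tools).

```python
from typing import Dict, List, Optional


def hosts_at_risk(host_timeouts_s: Dict[str, Optional[float]],
                  worst_case_failover_s: float) -> List[str]:
    """Return hosts whose I/O timeout would expire before failover completes.

    A value of None means the timeout was never recorded or set, which is
    treated as at risk.
    """
    at_risk = []
    for host, timeout in host_timeouts_s.items():
        if timeout is None or timeout <= worst_case_failover_s:
            at_risk.append(host)
    return sorted(at_risk)


if __name__ == "__main__":
    # Hypothetical inventory: host name -> configured disk timeout in seconds.
    inventory = {
        "esx01": 190.0,
        "esx02": 190.0,
        "win-sql01": 60.0,   # never raised from whatever it shipped with
        "win-new07": None,   # new host, nobody recorded the setting
    }
    # 180 seconds is the 2-3 minute upper bound discussed in this post.
    print(hosts_at_risk(inventory, worst_case_failover_s=180.0))
    # 3 seconds is the upper bound quoted for the immediate failover model.
    print(hosts_at_risk(inventory, worst_case_failover_s=3.0))
```

With a 2-3 minute window, every host that was missed shows up as a problem; with a 2-3 second window, only hosts with no usable timeout at all would.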
If your array fails over in 2-3 seconds or immediately, how much host/OS/Guest configuration do you need to do?
None.
How much time and uncertainty will that save you?
~~~~~~~
Comments, and better yet debate, are more than welcome!
Great post. That really helped me to understand the failover model difference between the major vendors.
Just for clarity, an SP on a Symmetrix is a Service Processor used for config and monitoring; the storage controllers are called Directors and they are in Engines, and there are 2 Directors per Engine. Great article!
Thanks for the clarification…much appreciated. The genesis of this post was actually a discussion around timeouts for array failover (something Clariion does relatively well compared to competition in the mid-range). I knew once I broadened the post I’d be missing details (much less entire array vendors) so appreciate the detail (actually had a half paragraph just on how memory cache sync/lookups/awareness is handled with V-Max/vPlex but then realized it was more than I should put in this post).
When I worked for a NetApp VAR, we’d run into customers quite frequently with issues around timeouts (not set correctly, not aware of various layers, etc.), so as I’ve dug into failover models on other arrays, faster array failover is something that has really resonated for me.
I was a Professional Services Consultant for NetApp for 5 years, and I performed failover testing on a variety of configurations with and without load probably hundreds of times. There are a couple of points I would like to make.
First, the numbers you gave for NetApp failover times are far from typical. Back when their FCP protocol first was implemented, the failover times were pretty long. But, every mid-range vendor’s failover times were longer back then. I performed a series of tests for a large enterprise customer about 5 years ago and, on a well loaded system with LUNs and I/O, failover times ranged from 30-60 seconds. This was upper-limit testing, very atypical to a customer environment. Are there configurations where this may be longer? I’m sure you could find them. But, I bet you could find odd cases with all vendors. Over the years as new releases have emerged that number has gone down dramatically. Nearly a year ago (the last bit of testing I witnessed), failover times were in the 5-10 second range. I suspect they are even better with the current version.
The other point is that there are more than just array timeouts to be concerned with. HBA and MPIO timeout and retry settings can inject far more application wait time than an array failover as they retry on a path that isn’t coming back. It is important to ALWAYS use the vendor’s best practice for timeouts. Otherwise, on an active/active array, the app could be waiting on HBA retries long after the paths should have switched. NetApp has a nice free host utility that quickly verifies all HBA and MPIO settings (most of the current HBA settings are left default anymore, though), and the troubleshooting tools that come with it are worth the install.
@Mike Richardson Thanks for the comment! I definitely won’t dispute there can be variance in the timing. Having said that….
I can’t claim hundreds of times….but probably somewhere past 100 in 5 years as a customer and two years as a VAR engineer. For planned cluster failover/giveback on smaller systems (I primarily worked with 2000 and 30×0/31×0 series) anything faster than 10-15 seconds was very uncommon (given a planned failover array load was always low). For something unplanned, it would commonly be much longer than that for 2 reasons — array load was higher and there was an initial delay while the partner head figured out its partner was really dead (i.e. some sanity checking before taking over the identity and starting the boot process).
I actually just checked on a planned failover done recently: 30 seconds on a midrange array (a bit longer to actually have full response via iSCSI, etc.). When setting timeouts (if memory serves), NetApp does recommend timeouts be bumped to 3 minutes or the max (whichever is higher). I do very clearly recall Botkin talking about getting as close to 3 minutes as possible in his SAN class (which is a fantastic class by the way).
I’m not sure if you were generally working with mid-range to upper-end gear as a PSC with NetApp (most PSC’s I talked to barely if ever touched the 2000 range and often not the 3140 either….bottom of the mid-range)….the lower-end/mid-range stuff is often more CPU and/or disk bound during array failover frankly. At one point, a potential customer needed a guarantee that failover would be sub-60 seconds from an array perspective (guarantee never happened…neither did the deal unfortunately).
And of course no disagreement on following best practices for MPIO settings. The free utility in question (NetApp Host Utilities for Windows/Linux/AIX/etc. — part of VSC now for ESX) is very nice….but is another thing to keep track of (and on Windows annoyingly requires a reboot…when setting up a bunch of hosts, just adds to the time required). I hardly ever used the extra tools in it actually (I did on VMware (mbrscan + mbralign are nice) but hardly ever on Windows).
Having said all that, we often found people who didn’t have timeouts set correctly (was just another thing they had to do/keep track of) and getting applications to honor timeouts was sometimes (not always) impossible….if you catch certain apps during busy times, they just die rather than wait.
Now, none of this is inherent criticism of any one approach; they all work with different caveats (Celerra follows a similar model to NetApp….hmm, need to revise my post to include that…is NAS though) while Clariion/V-Max/etc. don’t and pick up some benefits IMHO (not worrying about timeouts at all is a benefit to me 🙂 ).
Great summary. Very refreshing to have you provide a comparison based on your recent experience delivering NetApp services and your journey down the EMC path. You keep things balanced and try to provide details based on your REAL experiences. Keep up the good work.
Cheers!
Skogs
— disclosure NetApp employee —
Interesting,
I’ve seen LUN trespasses and planned takeovers happen pretty (and dare I say, impressively) quickly on a Clariion in the past, but I’ve also seen some relatively lengthy takeovers on CX and other similar platforms under a variety of situations, especially under load with ungraceful shutdowns of one SP (power failures mostly) or path failures. Those relatively lengthy takeovers certainly took (much) longer than 2-3 seconds.
It seems to me that you’re suggesting that you don’t need to worry about timeout values because you’re using EMC equipment, which from my point of view would be unwise. It’s arguable that by installing PowerPath you effectively get the correct timeout values set for you, however that is a potentially expensive and disruptive process, and tends to wed your storage environment to EMC from that time on.
If you are using the O/S provider’s multi-path software (the preferred option for Microsoft, IBM, HP, Oracle, etc.), then there are multiple places where I’ve seen recommendations to set the timeout values for Clariion to three minutes (180 sec) to avoid problems. Having said that, it’s been a while since I’ve done this, so this might have changed with more recent versions of O/S multi-path software and FLARE versions, but I still remember customers talking about the importance of setting correct timeout values for Clariion with ESX 3.5 not _that_ long ago.
Regardless of whose storage you use, I still think it’s a good idea to care about your timeout settings and follow the best practice recommendations from whichever vendors (storage, OS and application) you’re working with.
Regards
John
@John Martin
Thanks for the comment….surprised to still see this post has anyone reading it. I did also want to say that I REALLY appreciated your series on VDI and storage….really, really good stuff (highly recommend to anyone else out there who’s still reading the comments on this post).
You do raise a very good point…hard to plan for the odd type failovers where things take longer. And I’m willing to admit I got a bit carried away…but will admit down here rather than editing the post. Permit me to add some background…
I came from a NetApp reseller where I’d very, very often run across customers who didn’t have timeouts set correctly. Given a failover event (especially equipment), these people were very likely to have VM’s crashing, etc. (just from my experience). We’d always of course set timeouts during installs (Host Utilities, VSC, etc.) and explain why they were needed. Regardless though, new hosts would come online, applications might not be configured, etc….and there would be periodic problems.
I won’t claim years of SP failover experience….but what I found pretty cool was how the architecture when it does work can hit 1 second or less (really just MPIO stuff)….whereas the “Delayed Failover” when it does work is multiples higher than that. Just very different designs…. (the Clariion design for instance is facilitated by it only handling SAN protocols)
So yes, I’d be quite willing to agree (and revise myself) that it’s a good best practice to set timeouts. BUT..in the inevitable circumstance when they’re not set correctly (dynamic environments, less informed admins, etc.), I still think it holds true the “immediate failover” model makes the price of not setting timeouts much less painful.
Thanks again for the comment!