Summary = high-energy walkthrough by Scott Lowe of the promise of stretched clusters: the high-level details (part 1 below) and the low-level details (part 2). Fantastic run-through of a fascinating subject (fascinating to me at least, as I’m actively having SRM + VPLEX discussions with customers).
Scott Lowe – Stretched Cluster Discussion
- Part 1: Stretched Cluster or SRM?
- vMSC = vSphere Metro Storage Cluster
- Introduces some new terms:
- Uniform access = “stretched SAN” – essentially one storage array, with hosts on one side reaching storage across to the other side.
- Non-uniform access = “distributed virtual storage” – VPLEX basically….no one else doing this.
- EMC worked with VMware to create this category…it’s the reason VPLEX is the first and still only on this list.
- Provides boundaries
- RTO vs. RPO — critical to determining solution.
- RPO of near zero = need some kind of synchronous solution
- RPO of minutes to hours = async stuff.
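The RPO-driven decision rule above can be sketched as a tiny function (my own illustration, not from the session; the one-second cutoff for "near zero" is an arbitrary assumption):

```python
def replication_mode(rpo_seconds: float) -> str:
    """Map an RPO target to a replication style, per the rule of thumb above.

    A near-zero RPO forces synchronous replication (every write is
    acknowledged at both sites before completing); an RPO measured in
    minutes to hours can be met with asynchronous replication.
    The 1-second threshold for "near zero" is an assumption.
    """
    return "synchronous" if rpo_seconds < 1 else "asynchronous"

print(replication_mode(0))        # near-zero RPO -> synchronous
print(replication_mode(15 * 60))  # 15-minute RPO -> asynchronous
```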
- DR vs. DA
- DA = Disaster Avoidance
- Seeks to protect apps before a disaster occurs.
- How often do you know before a disaster is going to occur?
- Similar to vMotion – have to have both ESX hosts (aka both sites) up for a DA solution.
- DR = Disaster Recovery
- Seeks to recover applications and data after a disaster occurs
- Think of DA as vMotion and DR as vSphere HA
- SRM Details
- Some form of storage replication
- Layer 3 Connectivity
- No minimum bandwidth requirements – purely driven by SLA/RPO/RTO
- No max latency between sites – purely driven by SLA/RPO/RTO
- At least (2) vCenter Server instances
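"No minimum bandwidth requirement" just means the floor is set by your own SLA/RPO: at steady state the replication link must at least keep up with the data change rate. A back-of-the-envelope sizing (my own illustration, not from the session; decimal GB assumed, and real sizing must add headroom for bursts and RPO catch-up):

```python
def min_replication_bandwidth_mbps(changed_gb_per_hour: float) -> float:
    """Rough floor on async replication bandwidth.

    If the link can't keep pace with the sustained change rate, the
    replication lag (and therefore the effective RPO) grows without
    bound, no matter what the SLA says.
    """
    bits_per_hour = changed_gb_per_hour * 8 * 1000 ** 3  # decimal GB -> bits
    return bits_per_hour / 3600 / 1e6                    # bits/sec -> Mbps

# e.g. 90 GB of changed data per hour needs at least ~200 Mbps sustained
print(min_replication_bandwidth_mbps(90))
```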
- Requirements for vMSC
- Some form of supported sync active/active storage architecture.
- Must be read/write on both ends — traditional replication is read/write at the source + read-only at the destination.
- Stretched Layer 2 Connectivity – vMotion needs the same IP address at the destination as at the source.
- 622 Mbps bandwidth (minimum) between sites
- Less than 5 ms latency between sites (10 ms with vSphere 5 Enterprise Plus/Metro vMotion)
- This is roundtrip time without factoring in replication traffic.
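Those latency ceilings also bound how far apart the sites can be. A rough conversion (my own, assuming light in fiber covers about 200 km per millisecond and ignoring switch/router delay, so real deployments land well under these numbers):

```python
def max_site_distance_km(rtt_budget_ms: float, km_per_ms: float = 200.0) -> float:
    """Upper bound on inter-site fiber distance for a round-trip latency budget.

    Assumes ~200 km/ms propagation in fiber and zero equipment delay;
    the RTT budget is split in half for the one-way path.
    """
    one_way_ms = rtt_budget_ms / 2
    return one_way_ms * km_per_ms

print(max_site_distance_km(5))   # 5 ms RTT  -> 500.0 km ceiling
print(max_site_distance_km(10))  # 10 ms RTT -> 1000.0 km ceiling (Metro vMotion)
```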
- A single vCenter Server instance (probably want to protect vCenter, though, with vCenter Heartbeat)
- This is because we can’t vMotion between vCenter instances.
- Advantages for SRM
- Defined startup orders (with prerequisites) – db server first, then web server, then app server, etc. etc.
- No need for stretched Layer 2 connectivity (but supported)
- The ability to simulate workload mobility without affecting production
- Supports multiple vCenter Server instances (including in Linked Mode)
- Advantages of vMSC
- Possibility of non-disruptive workload migration (disaster avoidance)
- Lots of gating factors though.
- No need to deal with issues around IP address changes.
- Potential for running active/active data centers and more easily balancing workloads between them
- Typically a near-zero RPO with RTO of minutes
- Lots and lots of caveats
- Requires only a single vCenter server instance.
- Disadvantages of SRM
- Typically higher RPO/RTO than vMSC
- Workload mobility is always disruptive (requires reboot)
- Requires at least (2) vCenter Server instances
- Operational overhead from managing protection groups and protection plans.
- Have to place VMs on datastores based on dependencies, etc.
- Disadvantages of vMSC
- Greater physical networking complexity due to stretched Layer 2 connectivity requirement
- Greater cost resulting from higher-end networking equipment, more bandwidth, active/active storage solution.
- No ability to test workload mobility – this matters – can’t test a vMotion…just try it and see what happens
- Operational overhead from management of DRS host affinity groups.
- Supports only a single vCenter server instance.
- What about a mixed architecture?
- It can be done, but it has its own design considerations.
- For any given workload, it’s an “either/or” situation.
- Varrow has done this — not sure how many other partners actually have.
- Part 2: Building Stretched Clusters – First time ever presented.
- vSphere Recommendations
- Use vSphere 5 – eliminates some HA limitations (eliminates primaries and secondaries), introduces the vMSC HCL category.
- Use vSphere DRS host affinity groups – can mimic site awareness, use PowerCLI to address manageability concerns, use “should” rules rather than “must” rules.
- Using PowerCLI with host affinity groups – add a unique property to “group” VMs, use this “grouping” to automate VM placement into groups, run the PowerCLI script regularly to ensure correct group assignment.
- Have to turn on admission control to always keep half the hosts reserved.
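In practice the grouping above would be a PowerCLI script against vCenter; the logic it automates looks roughly like this Python sketch (my own illustration of the idea, with a hypothetical per-VM "site" property, not Scott's actual script):

```python
def assign_vms_to_site_groups(vms: dict[str, str]) -> dict[str, list[str]]:
    """Bucket VMs into per-site DRS host affinity groups.

    vms maps VM name -> site tag (the unique "grouping" property
    mentioned above); the result maps one group name per site to its
    member VMs.  A real PowerCLI version would read the property from
    vCenter and update the DRS groups instead of returning a dict.
    """
    groups: dict[str, list[str]] = {}
    for vm, site in sorted(vms.items()):
        groups.setdefault(f"vm-group-{site}", []).append(vm)
    return groups

print(assign_vms_to_site_groups(
    {"web01": "site-a", "db01": "site-b", "app01": "site-a"}
))
```

Run on a schedule, a script like this keeps group membership in sync as VMs are created or re-tagged, which is the manageability concern the session called out.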
- Storage Recommendations
- Use storage from vMSC category – only VPLEX right now….what a bummer.
- Be aware of storage performance considerations – know how Reads and Writes will be impacted.
- Account for storage availability.
- Plan Storage DRS carefully.
- Use profile-driven storage – VASA.
- Examples – MetroCluster – vMotion to 2nd site and reads/writes go back across WAN until Storage vMotion
- Example – VPLEX in non-uniform mode – reads are always serviced locally although writes may still go across WAN.
- Example – VPLEX in Metro
- Storage Availability
- Know how things are impacted during any failure.
- Consider cross-connect topology.
- Ensure multiple storage controllers at each site for availability.
- Provide redundant and independent inter-site storage connections.
- With VPLEX, use the third-site cluster witness (needs to be in separate failure domain).
- Storage DRS Cautions
- Align datastore boundaries to site/array boundaries.
- Don’t combine stretched/non-stretched datastores
- Understand impact of SDRS on overall storage solution.
- Use Profile-driven storage — this is VASA
- Keep things profile-driven, can help avoid operational concerns with VM placement.
- Networking Recommendations
- Plan for different traffic patterns – we’re talking trombones here.
- Look at OTV, LISP if you haven’t already.
- Where possible, separate management traffic onto a vSwitch.
- Incorporate redundant and independent inter-site network connections.
- Minimize latency as MUCH as possible.
- Operational Recommendations
- Account for backup/restore in your design — many people overlook.
- Where do tapes sit? Running Avamar?
- If you don’t duplicate backup topologies, you really want to look at client-side dedup to reduce WAN traffic.
- Mechanism to reduce restore traffic would be nice as well.
- Might be able to use storage solution itself for restores – restore to local side, allow storage to replicate to remote side.
- Handle inter-site vMotion carefully – it’s new coolness but introduces operational concerns
- Will impact DRS host affinity rules.
- Could require storage config updates
- Reconcile DRS host affinity rules and VM locations
- Reconcile storage availability and VM locations
- Impact on other operational areas.
- Do we need to notify other people in the org as VMs move between data centers?
- Look at monitoring, backups, IT staff, etc.
- Don’t split multi-tier apps.
- From audience – UCS Express is great use case for VPLEX witness.