Got to this one a little late, as I squeezed in part of a Scott Lowe session earlier…
Summary at the top = nice walkthrough of a mostly functioning environment. Get the slide deck to see the futures stuff (most interesting part of the session).
TIAA-CREF talking about their stretch cluster design.
- Network Requirements
- Common Layer 2 Connectivity & IP Portability across sites.
- Component-level & site-level failure isolation
- Secure, consolidated multi-tier connectivity
- High density, high speed connectivity within site
- Granular QoS
- Network Solution
- Cisco Nexus 7k/5k/2k/1kv
- OTV for layer 2 between sites
- vPC, ISSU — uplinks go to two separate switches that appear as one logical switch.
- VDC (Virtual Device Contexts)
- NX-OS QoS (CoS tagging with WFQ) — can tag at the 1kv and it gets passed all the way down.
- DWDM
- Their round trip is 0.25 ms (first stated as 25 ms… just a small difference). Quick sanity check on that number after this list.
- Storage Requirements
- Component-level & site-level failure isolation — any part of the stack needs to be able to survive on its own.
- Secure, consolidated multi-tier resource sharing
- High performance I/O
- Efficient capacity utilization – storage isn’t cheap.
- Network-based storage protocol
- Storage Solution
- NetApp FAS6080
- MetroCluster (fabric-attached storage + SyncMirror) — used a lot in Europe but not a lot in the US
- Controller at each site.
- You can’t really vMotion between sites, though… sets of disks are active/passive at each site (no active/active disk architecture).
- MultiStore — lots of NFS/Windows file servers.
- Deduplication — working well so doing more of it.
- Using FlashCache to help with I/O as well.
- NFS
- TieBreaker (third-party witness; quorum logic sketched after this list).
- Split-brain is a challenge when you’re truly clustering data centers. Lots of complicated scenarios around this (I’m personally familiar with some of these from a previous MetroCluster implementation at a customer).
- This is a NetApp product but could use more refining… basically just a really nice Perl script.
- 30 TB Usable Storage per site — 60 TB Raw at each site.
- Compute Requirements
- UCS, etc.; nothing fancy here.
- vCenter Heartbeat
- Challenges
- Cultural Challenges
- “Non-traditional” application protection model (comfort zone)
- Complexity/Availability Tradeoff (Traditional and Stretch Differences)
- Single-site model inherently less available
- Non-stretch, two-site model incorrectly perceived as less complex.
- Resource Challenges
- Increased infrastructure cost in comparison to single-site
- Increased implementation time
- Operational Challenges
- Application HA Site awareness
- Where are application components in relation to each other?
- Emerging Technologies
- Technical Challenges
- Network — L2 extension and FHRP isolation/asymmetric routing control
- Storage – multisite controller config, etc.
- Compute
- Mobility of management infrastructure — vCenter and Nexus 1kV solution.
- Split-brain mitigation
- Technical Challenges and Solutions
- Network
- OTV failover time (MAC learning) — addressed in NX-OS 5.1(3)
- Deterministic Routing — LISP & HSRP Localization
- Nexus 1000v single-site HA — unresolved
- Storage
- Split-brain with automated failover — Tiebreaker
- Dedup during site failover — unresolved
- Single-Site HA — unresolved
- Compute
- VMware HA Primaries — minimum one HA primary per site (quick check sketched after this list)
- Split-brain — lack of storage availability prevents VM power-on
- HA Site Awareness — unresolved
- Priority values for DRS rule sets — unresolved
- vSphere 5.0 improvements related to stretch clusters
- New design of vSphere HA
- No more primary/secondary construct
- No more dependencies on external entities (such as DNS)
- Use of heartbeat datastores (decision logic sketched after this list)
- Partitions of the Management Network are now supported!
- Increased ability to detect problems.
- Provides another level of redundancy.
- IPv6 Support
- Improved Logging — one log file
- Futures — lots more here than I typed below.
- Add ability to mark a particular host as in a degraded mode — so HA doesn’t keep starting up VMs on a host that is likely to fail soon.
- App availability — application level heartbeat with vSphere HA in vSphere 5. App health visibility in vCenter 5.
- Automated stretch cluster configuration.
- Metro vMotion — longer distances by increasing the supported round-trip latency (distance math below).
- DRaaS — DR to the cloud. Why have a second datacenter? Better way to do DR?