VSP2122 VMware vMotion in VMware vSphere 5.0: Architecture, Performance and Best Practices

Summary at the top: it’s easy to forget how incredible vMotion is…good review, then a deep dive into vMotion, recent improvements, and best practices.

Over 5x performance improvement in some areas in vSphere 5.

  • So what is vMotion?
    • Something that we love
    • completely transparent to the guest
    • invaluable tool for admins (avoid server downtime, allow troubleshooting)
    • provide flexibility
    • Key enabler of DRS, DPM, FT
  • What needs to be Migrated?
    • Moving entire VM state
    • Uses “checkpoint” infrastructure
    • Looks at all of the VM’s virtual devices and serializes their state into a blob, transfers it, and deserializes it at the destination.
      • Serialization is around 8 MB
    • But..not quite that simple due to associations with physical resources.
    • Devices = Processor, Device State (CPU, network, SVGA, etc.)
    • Disk — shared storage required of course.
    • Network — reverse ARP of course.
    • Memory — pre-copy VM while VM is running….memory is the coolest thing here.
  • Naive Memory Copy — just suspend the VM, move the memory, unsuspend
    • Not good…a 64 GB VM would be suspended for roughly 51 seconds even on 10 GigE…much longer on slower links (quick arithmetic below).
      • Not mentioned but shades of Hyper-V original implementation.
    • Instead VM runs during the vast majority of vMotion
    • Iterative memory “pre-copy” in theory continues until no dirty memory remains outstanding.
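
The 51-second figure is easy to sanity-check. A minimal back-of-the-envelope calculation (my own, assuming an ideal 10 Gbps link with no protocol overhead and decimal gigabytes):

    # Back-of-the-envelope check of the ~51 s suspend-and-copy figure.
    # Assumes an ideal 10 Gbps link with no protocol overhead and decimal
    # gigabytes (64 * 10**9 bytes), which is what makes the number line up.
    vm_memory_bytes = 64 * 10**9      # 64 GB of guest memory
    link_bits_per_sec = 10 * 10**9    # 10 GigE line rate

    seconds = vm_memory_bytes * 8 / link_bits_per_sec
    print(f"Naive copy of 64 GB over 10 GigE: ~{seconds:.1f} s")  # ~51.2 s
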
  • Memory Iterative Pre-copy (a toy sketch of the loop follows this section)
    • First Phase, ‘Trace Phase/HEAT Phase’
      • Send the VM’s ‘cold’ pages from source to destination.
      • Trace all the VM’s memory…so the source knows when pages change afterwards.
      • Performance impact: a noticeable but brief drop in throughput while traces are installed; the impact scales with memory size.
    • Subsequent Phases
      • Keep passing over memory and tracing each page as transmitted.
      • Performance impact: minimal on guest performance
    • Switch-over phase
      • Once pre-copy has converged, very few dirty pages remain.
      • VM is momentarily quiesced for switch-over.
      • Performance impact: a brief increase in latency while the guest is stopped; the stun lasts less than a second.
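
To make the pre-copy loop concrete, here is a toy sketch in Python. It is my own illustration of the idea in the bullets above, not VMware’s implementation; the page counts, threshold, and dirty-rate model are all made up for clarity.

    import random

    PAGES = 1_000                 # toy VM: 1,000 memory pages
    SWITCHOVER_THRESHOLD = 10     # few enough dirty pages to stun and finish

    def run_guest(pages_in_flight):
        """Simulate the guest dirtying pages while a pre-copy pass transmits.

        The longer a pass takes (more pages in flight), the more pages get
        dirtied behind it; a fixed 1-in-5 ratio keeps this toy converging.
        """
        dirtied = max(1, pages_in_flight // 5)
        return set(random.sample(range(PAGES), k=dirtied))

    def transmit(pages):
        pass  # stand-in for copying pages over the vMotion network

    def iterative_precopy():
        dirty = set(range(PAGES))  # first pass: trace and send every "cold" page
        passes = 0
        while len(dirty) > SWITCHOVER_THRESHOLD:
            passes += 1
            in_flight = dirty
            dirty = run_guest(len(in_flight))   # traces catch writes made meanwhile
            transmit(in_flight)
        # Switch-over: stun the guest, send the small remainder, and resume it
        # on the destination host (the sub-second window mentioned above).
        transmit(dirty)
        return passes

    if __name__ == "__main__":
        print("pre-copy passes needed:", iterative_precopy())
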
  • vMotion in 4.1…there’s a bit more to it than that.
    • What if VM is dirtying faster than can transfer memory?
      • 4.1 used RDPI, i.e. quick resume. You’d switch over even though pre-copy hadn’t finished.
      • Memory not transferred yet would be pulled remotely from the source host even though VM was running on the destination host.
    • Not the best approach for performance, so it was rewritten for ESX 5.
  • vSphere 5 Performance Enhancements – read this section
    • Memory pre-copy
      • lower impact when installing memory traces
      • optimized to handle 10 GigE, can now fully saturate 10 GigE
      • Multi NIC enhancements to further reduce pre-copy time.
      • New feature SDPS (stun during page-send) kicks in during pathological cases
        • Better than RDPI and handles pre-copy convergence failures better than 4.1
        • Basically it forces pre-copy to converge.
        • Introduces microsecond delays into the vCPUs, just enough that the network transmit rate climbs above the VM’s rate of dirtying memory (sketch of the idea after this section).
        • Much, much better than RDPI: yes, the VM is briefly slowed, but it still guarantees a higher level of performance than RDPI could.
    • Copy remainder of memory from source to destination
      • Improvements to reduce duration and impact on guest during switch-over phase.
      • RDPI is disabled entirely in favor of SDPS.
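
A rough sketch of the SDPS idea, again my own illustration with made-up numbers: slow the vCPUs by just enough microseconds that the transmit rate stays above the dirty rate, which is what turns convergence from best-effort into a guarantee.

    def sdps_delay_us(dirty_pages_per_sec, xmit_pages_per_sec,
                      step_us=10, max_delay_us=1_000):
        """Toy SDPS controller: pick a per-vCPU delay (in microseconds) that
        keeps the effective page-dirty rate below the vMotion transmit rate.

        Assumes dirtying slows in proportion to how much the vCPU is slowed
        (each 100 us work quantum is followed by `delay` us of stun); a
        simplification, but it shows why convergence becomes guaranteed.
        """
        delay = 0
        while delay < max_delay_us:
            effective_dirty = dirty_pages_per_sec * 100 / (100 + delay)
            if effective_dirty < xmit_pages_per_sec:
                return delay
            delay += step_us
        return max_delay_us

    # A VM dirtying memory faster than the vMotion link can carry it:
    print(sdps_delay_us(dirty_pages_per_sec=400_000, xmit_pages_per_sec=300_000))
    # -> 40 (microseconds of stun per 100 us quantum in this toy model)
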
  • Test configuration with vSphere 5.
    • 2 Nehalem hosts, 2 sockets, quad-core Xeon, 96 GB memory
    • Three 10 GigE NICs: one for client traffic, two for vMotion.
  • How to measure vMotion performance
    • Resource usage, Total Duration, Switch-over Time
    • Performance impact on applications running inside the guest.
    • App latency and throughput during vMotion
    • Time to resume to the normal level of performance (a simple in-guest stall probe is sketched after this list).
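
One simple way to see the switch-over window from inside the guest is a stall probe: a loop that sleeps in short ticks and records any gap far longer than a tick. This is my own quick-and-dirty measurement sketch, not a tool from the session.

    import time

    def stall_probe(interval_s=0.01, threshold_s=0.1, duration_s=60):
        """Sleep in short ticks and report any gap well beyond the tick length.

        Run inside the guest during a vMotion; the largest gap approximates
        the switch-over stun (plus normal scheduler noise).
        """
        stalls = []
        deadline = time.monotonic() + duration_s
        last = time.monotonic()
        while last < deadline:
            time.sleep(interval_s)
            now = time.monotonic()
            gap = now - last - interval_s
            if gap > threshold_s:
                stalls.append(gap)
            last = now
        return stalls

    if __name__ == "__main__":
        print("stalls (s):", [round(s, 3) for s in stall_probe(duration_s=30)])
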
  • Testing Workloads — everything pretty much.
    • Web (SPECweb2005), Email (Exchange 2010), DB/OLTP (SQL Server 2010), VDI/Cloud-Oriented
  • There’s a vMotion Migration ID per VM — you can use that to dig through the vmkernel logs and see a ton of info (starting, how far through pre-copy, cutting over, etc.); an example follows.
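
For example, something like the following pulls every vmkernel line for a given migration. The migration ID value is hypothetical; take the real one from the vSphere Client task details or the first vMotion entry in the log.

    # Pull all vmkernel log lines for one vMotion, identified by its migration ID.
    MIGRATION_ID = "1314747437613241"     # hypothetical example value
    LOG = "/var/log/vmkernel.log"         # standard ESXi 5.x log location

    with open(LOG, errors="replace") as log:
        for line in log:
            if MIGRATION_ID in line:
                print(line.rstrip())
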
  • Test Results
    • 37% drop in vMotion duration in vSphere 5 for the web workload (roughly 30 seconds down to 18 seconds).
    • All of this with 12,000 web sessions generating 6 Gbps web traffic.
    • No network connections dropped during vMotion.
    • Minimal performance impact during memory trace install of vMotion.
  • vMotion performance on GigE vs. 10 GigE
    • Almost a 10x improvement using 10 GigE vs. GigE
    • Seriously considering switching to 10 GigE for the vMotion network.
    • GigE vMotion on vSphere 4.1 could lead to network connection drops due to memory copy convergence issues.
    • On vSphere 5 even a pathological workload does not cause network connection drops during vMotion.
  • Database workloads
    • 35% reduction in vMotion time
    • 2.3x improvement on vSphere 5 when using multiple NICs.
    • Similar performance improvement during memory trace installation.
  • VDI Workload
    • Time to evacuate 64 VMs dropped from 11 minutes to 2 minutes.
    • More super graphs.
  • Best Practices
    • Switch to a 10 GigE vMotion Network
    • Consider using multiple 10 GigE NICs for vMotion
      • Configure them all under same vSwitch.
      • Configure each vmknic to use a separate vmnic as its active vmnic (rest marked as standby).
      • vMotion will transparently fail over (this layout is sketched after the best-practices list).
    • If concerned about vMotion performance…..
      • Consider placing VM swap files on shared storage (SAN or NAS).
      • Using host-local swap or leveraging SSD for swap cache can impact vMotion performance (as it means there’s more to transfer).
    • Use ESX clusters composed of matching NUMA architectures when using vNUMA features.
      • The vNUMA topology of the VM is set at power-on based on the NUMA topology of the physical host.
      • vMotion to a host with a different NUMA topology may result in reduced performance.
    • When using CPU reservations, leave some slack….
      • 30% of a CPU unreserved at host level.
      • 10% of CPU capacity unreserved at cluster level.
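
To make the multi-NIC layout above concrete, here is a small sketch (my own illustration; the adapter names are hypothetical) of the intended pinning: one vMotion vmknic per uplink, with that uplink active and the others standby, all under the same vSwitch.

    # Sketch of the multi-NIC vMotion teaming layout described above.
    # Adapter names are hypothetical; the point is the active/standby pattern:
    # each vMotion vmknic gets exactly one active uplink and the rest as standby,
    # so vMotion can stripe traffic across all links yet fail over transparently.
    uplinks = ["vmnic2", "vmnic3"]   # 10 GigE uplinks on the same vSwitch
    vmknics = ["vmk1", "vmk2"]       # one vMotion vmknic per uplink

    teaming = {
        vmk: {"active": [nic],
              "standby": [other for other in uplinks if other != nic]}
        for vmk, nic in zip(vmknics, uplinks)
    }

    for vmk, policy in teaming.items():
        print(vmk, "-> active:", policy["active"], "standby:", policy["standby"])
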
  • Conclusions

