VSP3868 – VMware vStorage Best Practices

I was in an “Upgrading to ESXi 5.0” session, but it was mostly textbook, so I left to join Joe Kelly here.

Customer Support Session – the guys who are at the other end of not-so-fun support calls around storage issues.

  • Backups, backups, backups.
  • Know your storage upfront…or know your storage admin (get to be friends before things go to…).
  • Keep up with array updates/versions.
  • About 30% of environments don’t have current HBA firmware, a common pain point.
  • Capacity != Performance….aka TB != IOPS.
  • VAAI = good and we love it. Letting storage do what storage does best is a big win. Turn it on…and it’s on by default in ESXi 4.1.
    • Benefits = less CPU/memory on ESX hosts, faster Storage vMotion, faster deployment/cloning of VMs, reduces SCSI reservation conflicts, faster zeroing.
    • SCSI primitives = Full Copy, Block Zeroing, Atomic Test and Set (ATS) Locking
    • Without ATS, locking is all SCSI-2 reservations (take me back to the 80’s or so?)….ATS scraps all that and brings us up to 2010 or so.
    • NAS support comes in vSphere 5.
  • Pluggable Storage Architecture and the Path Selection Plugins
    • Most Recently Used (MRU) — take the same road to work each day, when road is closed take another road and keep taking that road until closed.
    • Fixed — take another road if main one is closed but always go back to the first road.
    • Round Robin
  • Multipathing
    • This takes me back to a pretty crazy iSCSI multipathing + NFS config I did a while back.
    • vmkernel port group per physical NIC.
    • Lots of varying vendor best practices (go see the iSCSI super posts).
    • Use ALUA if at all possible.
    • In ESX(i) 4.1 you have to do commandline NIC binding to set up iSCSI MPIO correctly.
    • In ESXi 5 all the iSCSI MPIO steps are in the GUI.
  • Storage I/O Control – SIOC
    • Monitors I/O latency on datastores to establish normalized baseline over time.
    • Default is 30 ms for datastore to be considered congested.
  • Performance Monitoring
    • 10 ms or less is adequate, 20 ms or higher sustained should be investigated.
    • Commands not acknowledged by SAN within 5000 ms will be aborted.
    • Having an abort every now and then is not a cause for concern. Multiple aborts or pages of aborts…well….need to investigate (often tied to backups).
  • Common Storage Issues
    • VM Snapshots
      • Again, snapshots != backups.
      • Configure alarm to monitor vCenter snapshots – KB 1018029.
      • Avoid multi-layered snapshots.
      • General rule = don’t use snapshots for more than 24-72 hours (yes yes yes).
      • Check the datastore for snapshot files if not sure — do a nifty “find” command on the commandline (I used to have this memorized)
      • If you boot up a machine with snapshots and it “accidentally” went back to a couple days/weeks/months ago, power down and call VMware immediately (if you power down right away, they can help you recover the data in the later snapshots).
      • New in ESX(i) 4.0U2 — snapshot deletion takes up less space on disk.
      • ESXi 5.0 — new functionality to monitor snapshots and provide warning if snapshots need to be consolidated.
      • KB 1025279 – snapshot best practices.
    • Misconfiguration
      • Firmware and driver issues — check the HCL and update if applicable.
      • Most common one is HBA firmware, then FC switches and NICs (especially 10 GigE) after that.
      • Make sure to have consistent path selection for each LUN across all hosts.
      • LUN detected as a snapshot — force-mount from commandline if needed to get data, must ultimately resignature to resolve.
    • Improper Device Removal
      • Physical device w/VMFS or RDM goes away improperly.
      • APD = All Paths Dead
      • Upgrade to 4.0 U2 or 4.1 U1 to get better behavior.
      • Follow KB 1015084 to help with this.
      • Always rescan after making changes in the storage environment.
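
To make the MRU vs. Fixed road analogy from the path selection notes concrete, here is a rough Python sketch of my own (not from the session). `PathSelector` and its policy names are made up for illustration and are not a VMware API. The round robin branch rotates on every I/O, which is a simplification.

```python
class PathSelector:
    """Toy model of a path selection policy; hypothetical, not VMware code."""

    def __init__(self, paths, policy, preferred=None):
        self.paths = list(paths)
        self.policy = policy  # "mru", "fixed", or "rr"
        self.preferred = preferred if preferred is not None else self.paths[0]
        self.current = self.preferred
        self._rr_index = -1

    def select(self, alive):
        """Return the path to use for the next I/O, given the set of live paths."""
        live = [p for p in self.paths if p in alive]
        if not live:
            return None  # every path is down
        if self.policy == "rr":
            # Rotate through the configured paths, skipping dead ones.
            for _ in range(len(self.paths)):
                self._rr_index = (self._rr_index + 1) % len(self.paths)
                candidate = self.paths[self._rr_index]
                if candidate in alive:
                    self.current = candidate
                    return candidate
        if self.policy == "fixed" and self.preferred in alive:
            # Fixed: always go back to the first road once it reopens.
            self.current = self.preferred
        elif self.current not in alive:
            # MRU (and Fixed while the preferred path is down): take another road.
            self.current = live[0]
        return self.current


mru = PathSelector(["A", "B"], "mru")
fixed = PathSelector(["A", "B"], "fixed")
# Path A fails, then comes back.
for alive in ({"A", "B"}, {"B"}, {"A", "B"}):
    print(mru.select(alive), fixed.select(alive))
# prints:
# A A
# B B
# B A
```

The last line of output is the whole difference: MRU stays on B after A recovers, Fixed fails back to A.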
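
The latency rule of thumb from the performance monitoring notes (10 ms or less is adequate, 20 ms or higher sustained should be investigated) can be sketched as a small check. This is my own illustration: the function name, the "watch" label for anything in between, and treating "sustained" as "every sample in the window" are all assumptions, not something from the session.

```python
def classify_latency(samples_ms, adequate_ms=10.0, investigate_ms=20.0):
    """Classify a window of datastore latency samples (milliseconds).

    Assumption: "sustained" means every sample in the window is at or
    above the investigate threshold. The "watch" label is hypothetical.
    """
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    if all(s >= investigate_ms for s in samples_ms):
        return "investigate"
    if all(s <= adequate_ms for s in samples_ms):
        return "adequate"
    return "watch"


print(classify_latency([4.2, 7.9, 9.5]))     # adequate
print(classify_latency([22.0, 31.5, 25.1]))  # investigate
print(classify_latency([8.0, 24.0, 9.0]))    # watch (one spike, not sustained)
```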
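
Since I no longer have the “find” one-liner for snapshot files memorized, here is a rough Python equivalent. It is a sketch under two assumptions of mine: the datastore is reachable as a normal directory tree (on a host that would be somewhere under /vmfs/volumes), and snapshot disks follow the usual `<name>-00000N.vmdk` / `<name>-00000N-delta.vmdk` naming. Check a known snapshot in your own environment before trusting the pattern.

```python
import os
import re

# Assumed naming: vm-000001.vmdk (descriptor) and vm-000001-delta.vmdk (data).
SNAPSHOT_RE = re.compile(r"-\d{6}(-delta)?\.vmdk$")


def find_snapshot_files(root):
    """Walk a datastore directory tree and return paths that look like snapshot disks."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if SNAPSHOT_RE.search(name):
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)
```

Calling `find_snapshot_files("/vmfs/volumes/datastore1")` (a hypothetical datastore name) returns a sorted list of matching paths, or an empty list if nothing matches.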
