I was in an “Upgrading to ESXi 5.0” session but it was mostly textbook, so I left to join Joe Kelly here.
Customer Support Session – the guys who are at the other end of not-so-fun support calls around storage issues.
- Backups, backups, backups.
- Know your storage upfront…or know your storage admin (get to be friends before things go to…).
- Keep up with array updates/versions.
- About 30% of environments don’t have current HBA firmware — a common pain point.
- Capacity != Performance….aka TB != IOPS.
- VAAI = good and we love it. Letting storage do what storage does best is a big win. Turn it on…and it’s on by default in ESXi 4.1.
- Benefits = less CPU/memory on ESX hosts, faster Storage vMotion, faster deployment/cloning of VMs, reduces SCSI reservation conflicts, faster zeroing.
- SCSI primitives = Full Copy, Block Zeroing, Atomic Test and Set (ATS) Locking
- VMFS locking historically used SCSI-2 reservations (take me back to the ’80s or so?)….ATS scraps all that and brings us up into 2010 or so.
- NAS support comes in vSphere 5.
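A quick way to confirm VAAI is actually doing its thing — a sketch using the ESXi 5 `esxcli` namespaces (the 4.1 syntax differs):

```shell
# Show which VAAI primitives each device supports/claims
esxcli storage core device vaai status get

# The advanced settings behind the three SCSI primitives (1 = enabled, the default)
esxcli system settings advanced list -o /DataMover/HardwareAcceleratedMove  # Full Copy
esxcli system settings advanced list -o /DataMover/HardwareAcceleratedInit  # Block Zeroing
esxcli system settings advanced list -o /VMFS3/HardwareAcceleratedLocking   # ATS
```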
- Pluggable Storage Architecture and the Path Selection Plugins
- Most Recently Used (MRU) — take the same road to work each day; when the road is closed, take another road and keep taking that road until it’s closed too.
- Fixed — take another road if the main one is closed, but always go back to the first road once it reopens.
- Round Robin
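For reference, checking and changing the PSP per device looks roughly like this on ESXi 5 (the `naa.` device ID below is a placeholder — check vendor guidance before switching policies):

```shell
# Show each device's current Path Selection Policy
esxcli storage nmp device list

# Switch one device to Round Robin
esxcli storage nmp device set -d naa.60060160a0b12345 -P VMW_PSP_RR

# Or set the default PSP for a SATP so newly presented LUNs pick it up
esxcli storage nmp satp set -s VMW_SATP_ALUA -P VMW_PSP_RR
```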
- Multipathing
- This takes me back to a pretty crazy iSCSI multipathing + NFS config I did a while back.
- vmkernel port group per physical NIC.
- Lots of varying vendor best practices (go see the iSCSI super posts).
- Use ALUA if at all possible.
- In ESX(i) 4.1 you have to do commandline NIC binding to set up iSCSI MPIO correctly.
- In ESXi 5 all the iSCSI MPIO steps are in the GUI.
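The 4.1 commandline binding mentioned above is roughly this (vmk1/vmk2 and vmhba33 are placeholders for your vmkernel ports and software iSCSI adapter):

```shell
# ESX(i) 4.1: bind each iSCSI vmkernel port to the software initiator by hand
esxcli swiscsi nic add -n vmk1 -d vmhba33
esxcli swiscsi nic add -n vmk2 -d vmhba33
esxcli swiscsi nic list -d vmhba33   # verify the bindings

# ESXi 5 equivalent, if you still want the CLI instead of the new GUI workflow
esxcli iscsi networkportal add -A vmhba33 -n vmk1
```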
- Storage I/O Control – SIOC
- Monitors I/O latency on datastores to establish normalized baseline over time.
- Default is 30 ms for datastore to be considered congested.
- Performance Monitoring
- 10 ms or less is adequate, 20 ms or higher sustained should be investigated.
- Commands not acknowledged by the SAN within 5000 ms will be aborted.
- Having an abort every now and then is not a cause for concern. Multiple aborts or pages of aborts…well…those need investigating (often tied to backups).
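To actually watch those latency numbers against the 10/20 ms guidance above, `esxtop` on the host is the usual tool (a sketch; `resxtop` works the same remotely):

```shell
# Interactive: press 'u' for the disk-device view and watch
# DAVG (device), KAVG (kernel), and GAVG (guest-visible) latency in ms
esxtop

# Batch mode: capture 10 samples for offline review
esxtop -b -n 10 > latency.csv
```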
- Common Storage Issues
- VM Snapshots
- Again, snapshots != backups.
- Configure alarm to monitor vCenter snapshots – KB 1018029.
- Avoid multi-layered snapshots.
- General rule = don’t use snapshots for more than 24-72 hours (yes yes yes).
- Check the datastore for snapshot files if you’re not sure — do a nifty “find” command on the commandline (I used to have this memorized).
- If you boot up a machine with snapshots and it “accidentally” went back a couple of days/weeks/months, power down and call VMware immediately (if you power down right away, they can help you recover the data in the later snapshots).
- New in ESX(i) 4.0U2 — snapshot deletion takes up less space on disk.
- ESXi 5.0 — new functionality to monitor snapshots and provide warning if snapshots need to be consolidated.
- KB 1025279 – snapshot best practices.
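The “find” trick mentioned above is roughly this — snapshot data lives in `-delta.vmdk` files on the datastore:

```shell
# Look for snapshot delta files across all datastores
find /vmfs/volumes -name "*-delta.vmdk"

# Include the numbered snapshot disks and snapshot state files too
find /vmfs/volumes -name "*-00000*.vmdk" -o -name "*.vmsn"
```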
- Misconfiguration
- Firmware and driver issues — check the HCL and update if applicable.
- Most common one is HBA firmware; FC switches and NICs (especially 10 GigE) after that.
- Make sure to have consistent path selection for each LUN across all hosts.
- LUN detected as a snapshot — force-mount from commandline if needed to get data, must ultimately resignature to resolve.
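The force-mount/resignature flow looks roughly like this (“MyDatastore” is a placeholder volume label):

```shell
# ESXi 5: list volumes detected as snapshots, then mount or resignature
esxcli storage vmfs snapshot list
esxcli storage vmfs snapshot mount -l MyDatastore        # force-mount, keeps old signature
esxcli storage vmfs snapshot resignature -l MyDatastore  # the permanent fix

# ESX(i) 4.x equivalent
esxcfg-volume -l              # list snapshot/replica volumes
esxcfg-volume -M MyDatastore  # persistent force-mount
```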
- Improper Device Removal
- Physical device w/VMFS or RDM goes away improperly.
- APD = All Paths Dead
- Upgrade to 4.0 U2 or 4.1 U1 to get better behavior.
- Follow KB 1015084 to help with this.
- Always rescan after making changes in the storage environment.
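For the record, that rescan is a one-liner (vmhba33 is a placeholder adapter name):

```shell
# ESXi 5: rescan every adapter for new devices and VMFS volumes
esxcli storage core adapter rescan --all

# ESX(i) 4.x per-adapter equivalent
esxcfg-rescan vmhba33
```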