I was in an “Upgrading to ESXi 5.0” session but it was mostly textbook, so I left to join Joe Kelly here.
Customer Support Session – the guys who are at the other end of not-so-fun support calls around storage issues.
- Backups, backups, backups.
- Know your storage upfront…or know your storage admin (get to be friends before things go to…).
- Keep up with array updates/versions.
- About 30% of environments don’t have current HBA firmware — a common pain point.
- Capacity != Performance….aka TB != IOPS.
- VAAI = good and we love it. Letting storage do what storage does best is a big win. Turn it on…and it’s on by default in ESXi 4.1.
- Benefits = less CPU/memory on ESX hosts, faster Storage vMotion, faster deployment/cloning of VMs, reduces SCSI reservation conflicts, faster zeroing.
- SCSI primitives = Full Copy, Block Zeroing, Atomic Test and Set (ATS) Locking
- VMFS locking historically used SCSI-2 reservations (take me back to the ’80s or so?)….ATS scraps all that and brings us up into 2010 or so.
- NAS support comes in vSphere 5.
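A quick way to confirm VAAI is actually doing its thing — a sketch using the ESXi 5 `esxcli` namespaces (the 4.1 syntax differs):

```shell
# Show which VAAI primitives each device supports/claims
esxcli storage core device vaai status get

# The advanced settings behind the three SCSI primitives (1 = enabled, the default)
esxcli system settings advanced list -o /DataMover/HardwareAcceleratedMove  # Full Copy
esxcli system settings advanced list -o /DataMover/HardwareAcceleratedInit  # Block Zeroing
esxcli system settings advanced list -o /VMFS3/HardwareAcceleratedLocking   # ATS
```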
- Pluggable Storage Architecture and the Path Selection Plugins
- Most Recently Used (MRU) — take the same road to work each day; when the road is closed, take another road and keep taking that road until it’s closed too.
- Fixed — take another road if the main one is closed, but always go back to the first road once it reopens.
- Round Robin
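For reference, checking and changing the PSP per device looks roughly like this on ESXi 5 (the `naa.` device ID below is a placeholder — check vendor guidance before switching policies):

```shell
# Show each device's current Path Selection Policy
esxcli storage nmp device list

# Switch one device to Round Robin
esxcli storage nmp device set -d naa.60060160a0b12345 -P VMW_PSP_RR

# Or set the default PSP for a SATP so newly presented LUNs pick it up
esxcli storage nmp satp set -s VMW_SATP_ALUA -P VMW_PSP_RR
```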
- Multipathing
- This takes me back to a pretty crazy iSCSI multipathing + NFS config I did a while back.
- vmkernel port group per physical NIC.
- Lots of varying vendor best practices (go see the iSCSI super posts).
- Use ALUA if at all possible.
- In ESX(i) 4.1 you have to do commandline NIC binding to set up iSCSI MPIO correctly.
- In ESXi 5 all the iSCSI MPIO steps are in the GUI.
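The 4.1 commandline binding mentioned above is roughly this (vmk1/vmk2 and vmhba33 are placeholders for your vmkernel ports and software iSCSI adapter):

```shell
# ESX(i) 4.1: bind each iSCSI vmkernel port to the software initiator by hand
esxcli swiscsi nic add -n vmk1 -d vmhba33
esxcli swiscsi nic add -n vmk2 -d vmhba33
esxcli swiscsi nic list -d vmhba33   # verify the bindings

# ESXi 5 equivalent, if you still want the CLI instead of the new GUI workflow
esxcli iscsi networkportal add -A vmhba33 -n vmk1
```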
- Storage I/O Control – SIOC
- Monitors I/O latency on datastores to establish normalized baseline over time.
- Default is 30 ms for datastore to be considered congested.
- Performance Monitoring
- 10 ms or less is adequate, 20 ms or higher sustained should be investigated.
- Commands not acknowledged by the SAN within 5000 ms will be aborted.
- Having an abort every now and then is not a cause for concern. Multiple aborts or pages of aborts…well…those need investigating (often tied to backups).
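To actually watch those latency numbers against the 10/20 ms guidance above, `esxtop` on the host is the usual tool (a sketch; `resxtop` works the same remotely):

```shell
# Interactive: press 'u' for the disk-device view and watch
# DAVG (device), KAVG (kernel), and GAVG (guest-visible) latency in ms
esxtop

# Batch mode: capture 10 samples for offline review
esxtop -b -n 10 > latency.csv
```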
- Common Storage Issues
- VM Snapshots
- Again, snapshots != backups.
- Configure alarm to monitor vCenter snapshots – KB 1018029.
- Avoid multi-layered snapshots.
- General rule = don’t use snapshots for more than 24-72 hours (yes yes yes).
- Check the datastore for snapshot files if you’re not sure — do a nifty “find” command on the commandline (I used to have this memorized).
- If you boot up a machine with snapshots and it “accidentally” went back a couple of days/weeks/months, power down and call VMware immediately (if you power down right away, they can help you recover the data in the later snapshots).
- New in ESX(i) 4.0U2 — snapshot deletion takes up less space on disk.
- ESXi 5.0 — new functionality to monitor snapshots and provide warning if snapshots need to be consolidated.
- KB 1025279 – snapshot best practices.
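The “find” trick mentioned above is roughly this — snapshot data lives in `-delta.vmdk` files on the datastore:

```shell
# Look for snapshot delta files across all datastores
find /vmfs/volumes -name "*-delta.vmdk"

# Include the numbered snapshot disks and snapshot state files too
find /vmfs/volumes -name "*-00000*.vmdk" -o -name "*.vmsn"
```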
- Misconfiguration
- Firmware and driver issues — check the HCL and update if applicable.
- Most common one is HBA firmware; FC switches and NICs (especially 10 GigE) after that.
- Make sure to have consistent path selection for each LUN across all hosts.
- LUN detected as a snapshot — force-mount from commandline if needed to get data, must ultimately resignature to resolve.
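The force-mount/resignature flow looks roughly like this (“MyDatastore” is a placeholder volume label):

```shell
# ESXi 5: list volumes detected as snapshots, then mount or resignature
esxcli storage vmfs snapshot list
esxcli storage vmfs snapshot mount -l MyDatastore        # force-mount, keeps old signature
esxcli storage vmfs snapshot resignature -l MyDatastore  # the permanent fix

# ESX(i) 4.x equivalent
esxcfg-volume -l              # list snapshot/replica volumes
esxcfg-volume -M MyDatastore  # persistent force-mount
```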
- Improper Device Removal
- Physical device w/VMFS or RDM goes away improperly.
- APD = All Paths Dead
- Upgrade to 4.0 U2 or 4.1 U1 to get better behavior.
- Follow KB 1015084 to help with this.
- Always rescan after making changes in the storage environment.
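For the record, that rescan is a one-liner (vmhba33 is a placeholder adapter name):

```shell
# ESXi 5: rescan every adapter for new devices and VMFS volumes
esxcli storage core adapter rescan --all

# ESX(i) 4.x per-adapter equivalent
esxcfg-rescan vmhba33
```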