The primary configuration and management point for vSAN is the vCenter client. Sometimes, though, troubleshooting has to go a little deeper and additional tools are needed. Luckily, vSAN comes with its own. For this last post of 2020, let’s have a look at the vSAN 7 troubleshooting tools and how to use them.
Skyline Health
VMware Skyline Health is the result of combining two in-product support offerings, vSAN Health and Skyline Health. It provides proactive findings based on centralized knowledge base articles and recommendations, to avoid problems before they occur. Skyline Health can work online, which requires an active support contract and CEIP enrollment, or offline, without any requirement and with hourly checks.
To top it off, the UI is fully integrated into vCenter, which provides a really great management experience. VMware Skyline Health is available starting with vSphere 6.7 P01 (or vSAN 6.7U3a).

The online checks feature sends the collected data to the VMware analytics back-end system for advanced analysis. The cluster must be configured to participate in the Customer Experience Improvement Program (CEIP) in order to use it.
Skyline Health for vSAN can also be used to remediate some faulty situations. For example, the manual re-protection of unhealthy objects after a node failure:

There’s also a CLI view for Skyline Health in addition to the HTML5 UI integration, making it possible to automate anything you want in reaction to an abnormal status:
[root@nuc-02:~] esxcli vsan health cluster list -w
Health Test Name                                                       Status
---------------------------------------------------------------------  ------
Overall health                                                         green (OK)
Cluster                                                                green
  Advanced vSAN configuration in sync (advcfgsync)                     green
  vSAN daemon liveness (clomdliveness)                                 green
  vSAN Disk Balance (diskbalance)                                      green
  Resync operations throttling (resynclimit)                           green
  Software version compatibility (upgradesoftware)                     green
  Disk format version (upgradelowerhosts)                              green
[...]
You can use the esxcli vsan health cluster get -t command to query a specific vSAN Health test. This works with any of the tests listed above, entered as the command option:
[root@nuc-02:~] esxcli vsan health cluster get -t "Disk format version"
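To illustrate the automation idea mentioned above, here is a minimal Python sketch that parses the text printed by esxcli vsan health cluster list and keeps only the tests whose status is not green. The function names are mine, and the parsing assumes that the name and status columns are separated by at least two spaces. Treat it as a starting point and check it against the output of your own build.

```python
import re
import subprocess


def parse_health_list(text):
    """Return (test name, status) pairs found in the command output."""
    results = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the dashed separator, prompts and "[...]" markers.
        if not line or line.startswith("-") or line.startswith("["):
            continue
        # Assumption: columns are separated by two or more spaces.
        parts = re.split(r"\s{2,}", line)
        if len(parts) < 2 or parts[-1] == "Status":
            continue  # header row or a line without a status column
        results.append((" ".join(parts[:-1]), parts[-1]))
    return results


def abnormal_tests(text):
    """Keep only the tests whose status does not start with 'green'."""
    return [(name, status) for name, status in parse_health_list(text)
            if not status.startswith("green")]


def read_cluster_health():
    """Run the command locally on an ESXi host and return its raw output."""
    completed = subprocess.run(
        ["esxcli", "vsan", "health", "cluster", "list", "-w"],
        stdout=subprocess.PIPE, universal_newlines=True, check=True,
    )
    return completed.stdout
```

Calling abnormal_tests(read_cluster_health()) on a host returns an empty list when everything is green. What to do with a non-empty list (send a mail, open a ticket, and so on) is up to you.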
For more information about Skyline Health, don’t hesitate to read this post and check out the VMware blog.

We’ve seen how Skyline Health can be a true ally for day-to-day management and troubleshooting. For deeper problem solving, you may need to use the CLI. Let’s see which options vSphere provides.
vSANTOP
Once you have access to the ESXi shell of a host in your vSAN-enabled cluster, either locally (DCUI) or remotely through SSH, you can use the vsantop utility. This utility is similar to esxtop but focuses on vSAN performance metrics, with full awareness of the vSAN architecture, and refreshes them at a regular interval.
Issuing a simple vsantop command will provide you with the following information:

As you can see, the command displays a default entity type, host-domclient, and refreshes the associated metrics at an interval. There are many more entities to look at. To switch to another one while vsantop is running, press the (capital) E key, which brings up the following menu:

The current entity is displayed at the top of the list and can also be identified by the * below its menu entry. Once you’ve made your selection, enter the corresponding number and press the return key:

If, for any reason, the default view of an entity doesn’t match your expectations, you can add or remove fields. Pressing the f key brings up the following menu:

As indicated on top, there are 10 fields available. In this example, if I wanted to add one, I would need to remove another one first. To add or remove a field, just enter its id and press the return key.
There’s another way to invoke vsantop: batch mode. It lets you capture the vsantop output in a file, using the following syntax:
vsantop -b -d [delay] -n [iterations] > [file location & name]

usage: vsantop [-h] [-v] [-b] [-d delay] [-n iterations]
  -h prints this help menu.
  -v prints version.
  -b enables batch mode.
  -d sets the delay between updates in seconds.
  -n runs vsantop for only n iterations. Use "-n infinity" to run forever.
Here is an example using a 5-second interval and 20 iterations, saving the result to a CSV file on my vSAN datastore:
vsantop -b -d 5 -n 20 > /vmfs/volumes/Twin-vsanDatastore/vsantop/result.csv
The resulting file contains all the metrics, sampled at the given interval for the given number of iterations.
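A capture like this is easier to digest once it is summarized. The Python sketch below averages every numeric column of a CSV file. It assumes the capture is plain CSV with a single header row followed by one row per iteration, which I have not verified against every vsantop build, and the function name is mine.

```python
import csv


def column_averages(csv_file):
    """Return {column name: mean} for the numeric cells of an open CSV file."""
    reader = csv.reader(csv_file)
    header = next(reader, None)
    if header is None:
        return {}  # empty file
    sums = {}    # column name -> running total
    counts = {}  # column name -> number of numeric samples seen
    for row in reader:
        for name, value in zip(header, row):
            try:
                number = float(value)
            except ValueError:
                continue  # non-numeric cell, for example a timestamp
            sums[name] = sums.get(name, 0.0) + number
            counts[name] = counts.get(name, 0) + 1
    return {name: sums[name] / counts[name] for name in sums}
```

To use it, open the capture with open(path, newline="") and pass the file object to column_averages. Columns that never contain a number are simply left out of the result.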
ESXCLI Commands
As you may know, esxcli is a powerful tool when it comes to troubleshooting, and not only for vSAN. It is composed of multiple namespaces, providing information about and control over different aspects of the host. For the complete esxcli documentation, see this page.
Here are some commands or namespaces you may find very handy in your vSAN troubleshooting journey:
Get the complete command & subcommand list (use | grep to filter it):
[root@nuc-02:~] esxcli esxcli command list
To display all the available shell commands, double press the Tab key.
In the following example, I’m looking for information about the 512GB disk, so I used a simple grep to find it. Another interesting way to find what you are looking for is to grep on the vmhba identifier.
[root@nuc-02:~] esxcli storage core device list | grep 512GB
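A plain grep only returns the matching line, not the rest of the device record. As a hedged alternative, this Python sketch groups the output of esxcli storage core device list per device, assuming the usual layout of an unindented device identifier followed by indented "Key: Value" lines. The function names are mine.

```python
def parse_device_list(text):
    """Return {device id: {field: value}} from the command output."""
    devices = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank line between two device records
        if not line[0].isspace():
            # Unindented line: start of a new device record.
            current = line.strip()
            devices[current] = {}
        elif current is not None and ":" in line:
            # Indented "Key: Value" line; split on the first colon only.
            key, _, value = line.strip().partition(":")
            devices[current][key.strip()] = value.strip()
    return devices


def find_devices(devices, needle):
    """Return the ids of devices whose id or any field value contains needle."""
    return [dev for dev, fields in devices.items()
            if needle in dev or any(needle in v for v in fields.values())]
```

With the parsed dictionary in hand, find_devices(devices, "512GB") gives the matching identifiers, and devices[identifier] gives the complete record for each of them.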
List the physical uplinks:
[root@nuc-02:~] esxcli network nic list
or
[root@nuc-02:~] esxcfg-nics -l
Quickly retrieve which VMkernel port is in use by vSAN:
[root@nuc-02:~] esxcli vsan network list
To query the disks and check whether they are eligible for vSAN, you can use the vdq command. The following command will display the disks currently mapped to vSAN in a “human-readable” format:
[root@nuc-02:~] vdq -i -H
ESXCLI Debug Namespace
This namespace provides debugging information on multiple vSAN components: disks, objects, resyncing objects, storage controllers, etc. Let’s have a look at this interesting tool:
[root@nuc-02:~] esxcli vsan debug
Usage: esxcli vsan debug {cmd} [cmd options]

Available Namespaces:
  disk        Debug commands for vSAN physical disks
  object      Debug commands for vSAN objects
  resync      Debug commands for vSAN resyncing objects
  advcfg      Debug commands for vSAN advanced configuration options.
  controller  Debug commands for vSAN disk controllers
  evacuation  Debug commands for simulating host, disk or disk group evacuation in various modes and their impact on objects in vSAN cluster
  limit       Debug commands for vSAN limits
  memory      Debug commands for vSAN memory consumption.
  mob         Debug commands for vSAN Managed Object Browser Service.
  vmdk        Debug commands for vSAN VMDKs
The vSAN debug command lets you, for example, get a precise view of the impact of evacuating a particular host. In the following example, I simulate the evacuation of the local host while ensuring accessibility of the vSAN objects:
[root@nuc-02:~] esxcli vsan debug evacuation precheck -e localhost -a ensureAccess
Action: Ensure Accessibility
Evacuation Outcome: Success
Entity: Host localhost
Data to Move: 1.58 GB
Number Of Objects That Would Become Inaccessible: 0
Objects That Would Become Inaccessible: None
Number Of Objects That Would Have Redundancy Reduced: 38
Objects That Would Have Redundancy Reduced: (only shown with --verbose option)
Additional Space Needed for Evacuation: N/A
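If you want to gate a maintenance script on this precheck, the report can be turned into a dictionary. The Python sketch below relies only on the "Key: Value" lines shown in the output above. The function names and the decision rule (outcome is Success and no object would become inaccessible) are my own assumptions, so adapt them to your policy.

```python
def parse_precheck(text):
    """Return {field: value} from the evacuation precheck report."""
    report = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.strip().partition(":")
        if sep:
            report[key.strip()] = value.strip()
    return report


def safe_to_evacuate(report):
    """True when the outcome is Success and no object would become inaccessible."""
    return (
        report.get("Evacuation Outcome") == "Success"
        and report.get("Number Of Objects That Would Become Inaccessible") == "0"
    )
```

Fed with the output above, safe_to_evacuate returns True. Note that it deliberately ignores the 38 objects whose redundancy would be reduced, which is what the ensureAccess mode accepts.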
Embedded Python Scripts
ESXi includes a bunch of useful scripts, which may change from one release to the next. They are located in /usr/lib/vmware/vsan/bin. Below are some examples of the provided features:
- VSANDeviceMonitor.py – Advanced monitoring of latencies and congestion with proactive unmount of slow or faulty devices from the cluster.
- reboot_helper.py – Graceful reboot of a vSAN Cluster (see: https://kb.vmware.com/s/article/60424).
- vsanDiskFaultInjection.pyc – Introduce a software hot-plug failure state into a storage device for test/validation purpose.
- killInaccessibleVms.py – Kill inaccessible “ghost” VMs.
And many more… Don’t hesitate to browse and open the scripts, and go through them for more information.
RVC
You may already be familiar with RVC, which stands for Ruby vSphere Console. It is a Linux console UI for vSphere that runs on vCenter Server, and it aims to help with vSAN management and troubleshooting.
To use RVC, first open an SSH session as root on your vCenter appliance:
Command> rvc administrator@sso.domain@localhost
[DEPRECATION] This gem has been renamed to optimist and will no longer be supported. Please switch to optimist as soon as possible.
Install the "ffi" gem for better tab completion.
WARNING: Nokogiri was built against LibXML version 2.9.8, but has dynamically loaded 2.9.10
password:
0 /
1 localhost/
>
The administrator account specified here is one that has administrator privileges on vCenter, the vSAN datacenter and the vSAN clusters. Once authenticated, navigate to the vSAN cluster. RVC includes commands such as “ls” and “cd” to browse the infrastructure hierarchy. The “tree id” on the left of the command prompt can be used to ease the navigation. My target here is the vSAN cluster named “Twin”:

Once you’ve reached the vSAN cluster, you can execute vSAN related commands to get the information you’re looking for.
To get the list of the available commands, type:
/localhost/VMHLAB 4.0/computers/Physical-Infrastructure> help vsan
From here, I’m able to trigger commands related to the cluster:
/localhost/VMHLAB 4.0/computers/Physical-Infrastructure> vsan.cluster_info 0
Another example, if you want to get information related to the disk of a particular host:

In summary, these are the key tools for getting as much information as possible about your vSAN environment. Now that we’ve seen which tools are available, the next step, in an upcoming article, will be to look at how to use them to accurately troubleshoot our infrastructure. Stay tuned!
Great Article!
Adding here the vSAN performance diagnostics:
It provides feedback on how to extract the best performance from a given vSAN cluster. It consumes the data available via the vSAN performance service in vSAN health and provides details on performance issues seen in the vSAN cluster.
To use vSAN performance diagnostics, you must join the “Customer Experience Improvement Program (CEIP)”, and enable vSAN Performance Service.
–> See KB 2148770