vSAN 7 Troubleshooting Tools

The primary configuration and management point for vSAN is the vCenter client. But sometimes troubleshooting has to dive a little deeper, and additional tools are needed. Luckily, vSAN comes with its own. For this last post of 2020, let’s have a look at the vSAN 7 troubleshooting tools and how to use them.

Skyline Health

VMware Skyline Health is the result of combining two in-product support offerings, vSAN Health and Skyline Health. It provides proactive findings based on centralized knowledge base articles and recommendations, to avoid problems before they occur. Skyline Health can work online, which requires an active support contract and CEIP enrollment, or offline, without any such requirement, with checks running hourly.

To top it off, the UI is fully integrated into vCenter, which provides a really great management experience. VMware Skyline Health is available starting with vSphere 6.7 P01 (or vSAN 6.7U3a).

[Image: Skyline Health]

The online checks feature sends the collected data to the VMware analytics back-end system for advanced analysis. The cluster must be configured to participate in the Customer Experience Improvement Program (CEIP) in order to use it.

Skyline Health for vSAN can also be used to remediate some faults, for example by manually re-protecting unhealthy objects after a node failure:

[Image: vSAN Object Health]

There’s also a CLI view for Skyline Health in addition to the HTML5 UI integration, making it possible to automate anything you want in reaction to an abnormal status:

[root@nuc-02:~] esxcli vsan health cluster list -w
Health Test Name                                                       Status
---------------------------------------------------------------------  ------
Overall health                                                         green (OK)
Cluster                                                                green
  Advanced vSAN configuration in sync (advcfgsync)                     green
  vSAN daemon liveness (clomdliveness)                                 green
  vSAN Disk Balance (diskbalance)                                      green
  Resync operations throttling (resynclimit)                           green
  Software version compatibility (upgradesoftware)                     green
  Disk format version (upgradelowerhosts)                              green
[...]

You can use the esxcli vsan health cluster get -t command to query a specific vSAN Health test. This works with any of the tests listed above, passed as the value of the -t option:

[root@nuc-02:~] esxcli vsan health cluster get -t "Disk format version"
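Since the health summary is plain text, reacting to an abnormal status can be as simple as filtering out everything that is green. The following is a minimal sketch, not an official tool: the function name non_green_checks is made up for illustration, and it assumes the two-column layout shown above, with columns separated by at least two spaces and the status as the first word of the last column. Any status other than green is reported, whatever its wording.

```shell
# Hypothetical helper: reads `esxcli vsan health cluster list -w` output on
# stdin and prints "<test name>: <status>" for every check that is not green.
# Assumes the column layout shown in this post (2+ spaces between columns).
non_green_checks() {
    awk -F '  +' '
        /^Health Test Name/ || /^-/ { next }   # skip header and separator rows
        {
            sub(/[ \t]+$/, "")                  # drop trailing whitespace
            if (NF < 2) next                    # not a "name  status" row
            split($NF, parts, " ")              # "green (OK)" -> "green"
            if (parts[1] != "green") print $(NF - 1) ": " parts[1]
        }
    '
}

# Usage on a host (left commented out so the snippet has no side effects):
# esxcli vsan health cluster list -w | non_green_checks
```

An empty result means every listed check is green, so a scheduled job could send an alert only when the function prints something.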

For more information about Skyline Health, don’t hesitate to read this post and check out the VMware blog.

Use the Shell

We’ve seen how Skyline Health can be a true ally for day-to-day management and troubleshooting. For deeper problem solving, though, you may need to use the CLI. Let’s see which options vSphere provides.

vSANTOP

Once you have access to the ESXi shell of a host in your vSAN-enabled cluster, either locally (DCUI) or remotely through SSH, you can use the vsantop utility. It is similar to esxtop but focuses on vSAN performance metrics, with full awareness of the vSAN architecture, and refreshes them at a set interval.

Issuing a simple vsantop command will provide you with the following information:

[Image: vSAN top]

As you can see, the command outputs a default entity type, host-domclient, and the associated metrics at an interval. There are many more entities to look at. To switch to another one while vsantop is running, press the (capital) E key, which brings up the following menu:

[Image: Entities list]

The current entity is displayed at the top of the list and can also be identified by the * below its menu entry. Once you’ve made your selection, enter the corresponding number and press the Return key:

[Image: Entities selection]

If the default view of an entity doesn’t match your needs, you can add or remove fields. Pressing the f key brings up the following menu:

[Image: Filter entities]

As indicated on top, there are 10 fields available. In this example, if I wanted to add one, I would need to remove another one first. To add or remove a field, just enter its id and press the return key.

There’s another way to invoke vsantop: batch mode, which lets you capture the vsantop output in a file using the following syntax:

vsantop -b -d [delay] -n [iterations] > [file location & name]

usage: vsantop [-h] [-v] [-b] [-d delay] [-n iterations]
    -h prints this help menu.
    -v prints version.
    -b enables batch mode.
    -d sets the delay between updates in seconds.
    -n runs vsantop for only n iterations. Use "-n infinity" to run forever.

Here is an example using a 5-second interval and 20 iterations, saving the result to a CSV file on my vSAN datastore:

vsantop -b -d 5 -n 20 > /vmfs/volumes/Twin-vsanDatastore/vsantop/result.csv

The resulting file contains the metrics sampled at the given interval, for the given number of iterations.
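If you capture often, a small wrapper can validate the arguments before anything is written. This is only a sketch: vsantop_capture and is_uint are hypothetical names, the wrapper uses nothing beyond the -b, -d and -n flags from the usage text above, and the announced duration (delay multiplied by iterations) is a rough estimate rather than a guarantee.

```shell
# Hypothetical wrapper around the batch syntax shown above.
# Uses only the documented -b, -d and -n flags; "-n infinity" is not handled.
is_uint() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
        *) return 0 ;;
    esac
}

vsantop_capture() {
    delay=$1
    iterations=$2
    outfile=$3
    if ! is_uint "$delay" || ! is_uint "$iterations" || [ -z "$outfile" ]; then
        echo "usage: vsantop_capture <delay-seconds> <iterations> <output-file>" >&2
        return 1
    fi
    # Rough estimate only: the delay is the pause between updates.
    echo "capturing for about $((delay * iterations)) seconds into $outfile" >&2
    vsantop -b -d "$delay" -n "$iterations" > "$outfile"
}

# Usage on a host (commented out; the path is the one used in this post):
# vsantop_capture 5 20 /vmfs/volumes/Twin-vsanDatastore/vsantop/result.csv
```

Rejecting bad arguments up front avoids leaving an empty or truncated file on the datastore when a value is mistyped.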

ESXCLI Commands

As you may know, esxcli is a powerful troubleshooting tool, and not only for vSAN. It is composed of multiple namespaces that provide information about, and control over, different aspects of the host. For complete esxcli documentation, see this page.

Here are some commands or namespaces you may find very handy in your vSAN troubleshooting journey:

Get the complete command & subcommand list (use | grep to filter it):

[root@nuc-02:~] esxcli esxcli command list

To display all the available shell commands, double press the Tab key.

In the following example, I’m looking for information about the 512GB disk, so I used a simple grep to find it. Another interesting way to find what you are looking for is to grep on the vmhba identifier.

[root@nuc-02:~] esxcli storage core device list | grep 512GB
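A plain grep returns only the matching line, which loses the device identifier and the other attributes around it. Assuming the device list prints one record per device with a blank line between records (worth checking on your own build), awk’s paragraph mode can return the whole record instead. The function name device_records_matching is made up for this sketch.

```shell
# Hypothetical helper: reads `esxcli storage core device list` output on stdin
# and prints every full device record containing the given literal string.
# Assumes records are separated by blank lines.
device_records_matching() {
    awk -v pat="$1" '
        BEGIN { RS = ""; ORS = "\n\n" }   # paragraph mode: one record per device
        index($0, pat) { print }          # literal substring match, no regex
    '
}

# Usage on a host (commented out):
# esxcli storage core device list | device_records_matching 512GB
```

The same function works with a vmhba identifier as the search string, as long as that string appears somewhere in the record.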

List the physical uplinks:

[root@nuc-02:~] esxcli network nic list

or

[root@nuc-02:~] esxcfg-nics -l

Quickly retrieve which VMkernel port is in use by vSAN:

[root@nuc-02:~] esxcli vsan network list
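When only the interface name matters, for example to feed it to another command, the listing can be reduced to that single value. This sketch assumes each interface block contains a line of the form "VmkNic Name: vmkN", which is how I remember the output but is not shown in this post, so adjust the key if yours differs. vsan_vmknics is a hypothetical name.

```shell
# Hypothetical helper: reads `esxcli vsan network list` output on stdin and
# prints the VMkernel interface names. Assumes each interface block contains
# a "VmkNic Name: <name>" line; adjust the key if your output differs.
vsan_vmknics() {
    awk -F': *' '
        { sub(/^[ \t]+/, "") }               # strip indentation
        $1 == "VmkNic Name" { print $2 }
    '
}

# Usage on a host (commented out):
# esxcli vsan network list | vsan_vmknics
```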

To query the disks and check whether they are eligible for vSAN, you can use the vdq command. The following command displays the disks currently mapped to vSAN in a “human-readable” format:

[root@nuc-02:~] vdq -i -H

ESXCLI Debug Namespace

This namespace provides debugging information on multiple vSAN components: disks, objects, resyncing objects, storage controllers, and more. Let’s have a look at this interesting tool:

[root@nuc-02:~] esxcli vsan debug
Usage: esxcli vsan debug {cmd} [cmd options]

Available Namespaces:
  disk                  Debug commands for vSAN physical disks
  object                Debug commands for vSAN objects
  resync                Debug commands for vSAN resyncing objects
  advcfg                Debug commands for vSAN advanced configuration options.
  controller            Debug commands for vSAN disk controllers
  evacuation            Debug commands for simulating host, disk or disk group evacuation in various modes and their impact on objects
                        in vSAN cluster
  limit                 Debug commands for vSAN limits
  memory                Debug commands for vSAN memory consumption.
  mob                   Debug commands for vSAN Managed Object Browser Service.
  vmdk                  Debug commands for vSAN VMDKs

The vSAN debug command lets you, for example, get a precise view of the impact of evacuating a particular host. In the following example, I simulate the evacuation of the local host while ensuring accessibility of the vSAN objects:

[root@nuc-02:~] esxcli vsan debug evacuation precheck -e localhost -a ensureAccess
Action: Ensure Accessibility
Evacuation Outcome: Success
Entity: Host localhost
Data to Move: 1.58 GB
Number Of Objects That Would Become Inaccessible: 0
Objects That Would Become Inaccessible: None
Number Of Objects That Would Have Redundancy Reduced: 38
Objects That Would Have Redundancy Reduced: (only shown with --verbose option)
Additional Space Needed for Evacuation: N/A
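Because the precheck prints simple "Key: Value" lines, its result can gate a maintenance script. The sketch below is an illustration under assumptions: evacuation_is_safe is a hypothetical name, the two keys it reads are copied from the output above, and "safe" is defined here as an outcome of Success with zero objects becoming inaccessible, which may be stricter or looser than what your environment needs.

```shell
# Hypothetical helper: reads `esxcli vsan debug evacuation precheck` output on
# stdin. Succeeds (exit 0) only if the outcome is "Success" and no object
# would become inaccessible. Key names are taken from the output above.
evacuation_is_safe() {
    awk -F': *' '
        { sub(/[ \t\r]+$/, "") }             # drop trailing whitespace
        $1 == "Evacuation Outcome" { outcome = $2 }
        $1 == "Number Of Objects That Would Become Inaccessible" { inaccessible = $2 }
        END {
            ok = (outcome == "Success" && inaccessible ~ /^0$/)
            exit !ok
        }
    '
}

# Usage on a host (commented out):
# esxcli vsan debug evacuation precheck -e localhost -a ensureAccess \
#     | evacuation_is_safe && echo "safe to proceed"
```

Missing keys count as unsafe, so an unexpected output format makes the check fail rather than pass silently.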

Embedded Python Scripts

ESXi includes a bunch of useful scripts, which may change from one release to the next. They are located in /usr/lib/vmware/vsan/bin. Below are some examples of what they provide:

  • VSANDeviceMonitor.py – Advanced monitoring of latencies and congestion with proactive unmount of slow or faulty devices from the cluster.
  • reboot_helper.py – Graceful reboot of a vSAN Cluster (see: https://kb.vmware.com/s/article/60424).
  • vsanDiskFaultInjection.pyc – Introduce a software hot-plug failure state into a storage device for test/validation purpose.
  • killInaccessibleVms.py – Kill inaccessible “ghost” VMs.

And many more… don’t hesitate to browse the directory, open the scripts and read through them for more information.

RVC

You may already be familiar with RVC, which stands for Ruby vSphere Console and is the Linux console UI for vSphere running on vCenter Server. RVC aims to help with vSAN management and troubleshooting.

To use RVC, first open an SSH session as root on your vCenter appliance:

Command> rvc administrator@sso.domain@localhost
[DEPRECATION] This gem has been renamed to optimist and will no longer be supported. Please switch to optimist as soon as possible.
Install the "ffi" gem for better tab completion.
WARNING: Nokogiri was built against LibXML version 2.9.8, but has dynamically loaded 2.9.10
password:
0 /
1 localhost/
>

The administrator account specified here must have administrator privileges on vCenter, the vSAN datacenter and the vSAN clusters. Once authenticated, navigate to the vSAN cluster: RVC includes commands such as “ls” and “cd” to browse the infrastructure hierarchy. The “tree id” shown on the left of each listed entry can be used to ease navigation. My target here is the vSAN cluster named “Twin”:

[Image: rvc navigation]

Once you’ve reached the vSAN cluster, you can execute vSAN related commands to get the information you’re looking for.

To get the list of the available commands, type:

/localhost/VMHLAB 4.0/computers/Physical-Infrastructure> help vsan

From here, I’m able to trigger commands related to the cluster:

/localhost/VMHLAB 4.0/computers/Physical-Infrastructure> vsan.cluster_info 0

Another example, if you want to get information related to the disk of a particular host:

[Image: rvc vSAN Disk info]

In summary, these are the key tools for getting as much information as possible about your vSAN environment. Now that we’ve seen which tools are available, the next step is to look at how to use them to accurately troubleshoot our infrastructure. That will be the subject of an upcoming article, stay tuned!

One Reply to “vSAN 7 Troubleshooting Tools”

  1. Great Article!
    Adding here the vSAN performance diagnostics:
    It provides feedback on how to extract the best performance from a given vSAN cluster. It consumes the data available via the vSAN performance service in vSAN health and provides details on performance issues seen in the vSAN cluster.
    To use vSAN performance diagnostics, you must join the “Customer Experience Improvement Program (CEIP)”, and enable vSAN Performance Service.
    –> See KB 2148770
