Marvin’s notes from the field: VxRail ACE Lost Connection

Note: The following is purely informative; always contact your support representative when facing a production issue.

Earlier this month, Dell EMC announced the integration of VxRail appliances into CloudIQ, the free-of-charge, cloud-based monitoring solution. While the onboarding is pretty straightforward, I recently faced a “lost connection” problem with a cluster. Let’s see what to do in such a case.

First of all, the cluster was not visible in the CloudIQ interface and was reported as “Lost Connection” in “MyVxRail” (the new name for VxRail ACE). After a short analysis, we figured out that this connection loss curiously coincided with our latest VxRail code upgrade.

The following entries in /var/log/mystic/telemetry/spv_adc.log confirmed my suspicions:

2019-08-31 16:15:27,716-ADC-ERROR-Load json: No JSON object could be decoded-[myutils.py:20]
Traceback (most recent call last):
  File "/mystic/telemetry/DCManager/ADC/../ADC/plugins/collect_cms.py", line 337, in <module>
AttributeError: 'NoneType' object has no attribute 'get'
2019-08-31 16:15:27,754-ADC-ERROR-[Process Pool] process collection error. Error message is Command '/mystic/telemetry/DCManager/ADC/../venv/bin/python /mystic/telemetry/DCManager/ADC/../ADC/plugins/collect_cms.py --output Performance/cms/708c5012-cc34-11e9-af50-000c2910e5ca' returned non-zero exit status 1-[ExecutionPool.py:32]
2019-08-31 16:15:27,767-ADC-ERROR-[Process Pool] There is an error in collection plugin! cmd is /mystic/telemetry/DCManager/ADC/../venv/bin/python /mystic/telemetry/DCManager/ADC/../ADC/plugins/collect_cms.py --output Performance/cms/708c5012-cc34-11e9-af50-000c2910e5ca-[DCPackManager.py:39]
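If you want to check quickly whether your own log carries the same signature, here is a minimal sketch. It assumes Python 3 is available (for example on a workstation, against a copy of the log), and the helper name find_adc_errors is my own; only the log path and the marker strings come from the excerpt above.

```python
from pathlib import Path

# Path taken from the post; marker strings copied from the log excerpt above.
LOG_PATH = Path("/var/log/mystic/telemetry/spv_adc.log")
MARKERS = ("No JSON object could be decoded", "collect_cms.py")


def find_adc_errors(lines):
    """Return the ADC error lines that mention one of the known markers.

    Hypothetical helper for illustration; it is not part of VxRail.
    """
    hits = []
    for line in lines:
        if "-ADC-ERROR-" in line and any(marker in line for marker in MARKERS):
            hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    if LOG_PATH.exists():
        with LOG_PATH.open(errors="replace") as handle:
            for hit in find_adc_errors(handle):
                print(hit)
    else:
        print(f"{LOG_PATH} not found; point LOG_PATH at a copy of the log")
```

An empty result does not prove the cluster is healthy; it only means this particular signature is absent from the file you scanned.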

Indeed, the ADC (Adaptive Data Collector, not to be confused with any famous band 🤘) is a service running on VxRail that manages environmental usage, performance, capacity and configuration metadata.

This service relies on a configuration file, capacity.json, which is located on the VxRail Manager at:

/mystic/telemetry/DCManager/ADC/plugins/capacity.json

If, for any reason, this file has a size of 0 or is unreadable, the service can stop running properly, and this is exactly what we faced:

capacity.json file corrupted
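The two failure conditions described above (a 0-byte file, or a file that cannot be read as JSON) can be checked with a short sketch like the one below. It assumes Python 3 and the function name capacity_file_state is hypothetical; it only inspects the file and changes nothing.

```python
import json
import os

# Path taken from the post.
CAPACITY_PATH = "/mystic/telemetry/DCManager/ADC/plugins/capacity.json"


def capacity_file_state(path):
    """Classify the file as 'missing', 'empty', 'invalid' or 'ok'.

    Hypothetical helper for illustration; it is not part of VxRail.
    """
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    try:
        with open(path) as handle:
            json.load(handle)
    except (OSError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError.
        return "invalid"
    return "ok"


if __name__ == "__main__":
    print(CAPACITY_PATH, "->", capacity_file_state(CAPACITY_PATH))
```

A result of "empty" or "invalid", together with the log entries shown earlier, would match the situation described in this post.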

If your file looks like mine did, reporting a size of 0 bytes, and you can see the same log entries, it’s very likely you’re facing the same issue. Here are the steps needed to regenerate the file:

  1. Download the following file and upload it to your VxRail Manager VM, under /tmp
  2. SSH to the VxRail Manager appliance using the “mystic” account, then switch to “root”
  3. Move the file from /tmp to the /mystic/telemetry/DCManager/update folder:
vxm:/home/mystic # mv /tmp/DCManager_replace_capacity_json.tgz /mystic/telemetry/DCManager/update/
  4. Run the following to apply the patch:
vxm:/home/mystic # cd /mystic/telemetry/DCManager/update/
vxm:/mystic/telemetry/DCManager/update # ls -l
total 128 
-rw-r--r-- 1 tcserver pivotal  5817 Apr 22 13:03 DCManager_replace_capacity_json_csp.tgz
-rw-r--r-- 1 tcserver pivotal  4536 Feb 26 07:08 README.md
-rwxr-xr-x 1 tcserver pivotal     0 Feb 26 07:08 __init__.py
-rw-r--r-- 1 tcserver pivotal   108 Feb 26 07:09 __init__.pyc
vxm:/mystic/telemetry/DCManager/update # chown tcserver:pivotal DCManager_replace_capacity_json.tgz 
vxm:/mystic/telemetry/DCManager/update # su tcserver
tcserver@vxm:/mystic/telemetry/DCManager/update> tar zxvf DCManager_replace_capacity_json.tgz
tcserver@vxm:/mystic/telemetry/DCManager/update> ../venv/bin/python DCManager_replace_capacity_json/update.py

Once the script has finished, check capacity.json again: its size should no longer be 0 and should keep increasing over time.

After a couple of minutes:

{"capacity": {"5983d58e44-a98-caf6-8d58e-3d58e469054b":{"2021-02-0508:25:00": 387079229, "2021-02-0508:45:00": 769932956, "2021-02-0508:40:00": 769932946, "2021-02-0510:35:00": 769525186, "2021-02-0510:30:00": 769525186, "2021-02-0509:15:00": 769832747, "2021-02-0510:10:00": 769620394, "2021-02-0509:10:00": 769832747, "2021-02-0510:40:00": 769525186, "2021-02-0510:50:00": 769525197, "2021-02-0508:15:00": 387079229, "2021-02-0510:15:00": 769620394, "2021-02-0509:45:00": 769783634, "2021-02-0508:10:00": 387079228, "2021-02-0509:25:00": 769832754, "2021-02-0508:30:00": 769932946, "2021-02-0508:35:00": 769932946, "2021-02-0509:40:00": 769783622, "2021-02-0510:25:00": 769620400, "2021-02-0509:35:00": 769783622, "2021-02-0508:55:00": 769932956, "2021-02-0510:45:00": 769525197, "2021-02-0508:20:00": 387079229, "2021-02-0508:50:00": 769932956, "2021-02-0509:55:00": 769783639, "2021-02-0509:05:00": 769832747, "2021-02-0510:20:00": 769620400, "2021-02-0509:50:00": 769783639, "2021-02-0510:00:00": 769620394, "2021-02-0509:00:00": 769832747, "2021-02-0510:05:00": 769620394, "2021-02-0509:20:00": 769832754, "2021-02-0509:30:00": 769783622}}}
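To confirm that the file is filling up again, you can count the samples and look at the most recent timestamp. The sketch below assumes Python 3 and the layout visible in the excerpt above (a top-level "capacity" object, keyed by an ID, mapping timestamps to values). The post does not document this format, so treat the structure and the summarize_capacity name as assumptions.

```python
import json


def summarize_capacity(text):
    """Return {entry_id: (sample_count, latest_timestamp)} for capacity.json text.

    Hypothetical helper; the layout is inferred from the excerpt in this post.
    """
    data = json.loads(text)
    summary = {}
    for entry_id, samples in data.get("capacity", {}).items():
        # Timestamps are fixed-width strings, so the lexicographic
        # maximum is also the most recent one.
        latest = max(samples) if samples else None
        summary[entry_id] = (len(samples), latest)
    return summary


if __name__ == "__main__":
    example = '{"capacity": {"example-id": {"2021-02-0508:25:00": 1, "2021-02-0510:50:00": 2}}}'
    for entry_id, (count, latest) in summarize_capacity(example).items():
        print(entry_id, count, latest)
```

Running it a few minutes apart against the real file should show the sample count going up and the latest timestamp moving forward if collection has resumed.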

Now, fifteen minutes later, both MyVxRail and CloudIQ display the cluster as connected:

VxRail – Connected

Additional resources:

CloudIQ Onboarding Procedure: https://www.dell.com/support/kbdoc/en-uk/000184396/cloudiq-general-procedures-to-onboard-vxrail-into-cloudiq
