Cluster Validation Suite (CVS) installation#
2025-11-12
System requirements#
CVS supports these GPUs:
AMD Instinct™ MI325X
AMD Instinct™ MI300X
CVS supports these Linux distributions:
| Operating system | Kernel | ROCm version (tested on) | Python version (tested on) |
|---|---|---|---|
| Ubuntu 24.04.3 | 6.8 [GA], 6.14 [HWE] | 7.0.2 | 3.10 |
| Ubuntu 22.04.5 | 5.15 [GA], 6.8 [HWE] | 7.0.2 | 3.10 |
Install CVS#
Run CVS from a head node, such as an Ubuntu virtual machine or bare-metal host. The head node does not need a GPU. It’s recommended to use a head node that is not part of the test cluster, so that data is not lost if a node under test requires a reboot (such as during a system failure).
Clone the repository:
git clone https://github.com/ROCm/cvs
The CVS GitHub repository is organized in these directories:
tests: This folder contains the PyTest scripts, which internally call the library functions under the ./lib directory. They’re in native Python and can be invoked from any Python scripts for reusability. The tests directory contains a subfolder based on the nature of the tests, such as health, RCCL, training, and more.
lib: This is a collection of Python modules with utility functions that can be reused in other Python scripts.
input: This is a collection of the input JSON files that are provided to the PyTest scripts using the two arguments --cluster_file and --config_file. The --cluster_file is a JSON file which captures all the aspects of the cluster test bed, such as the IP address/hostnames, username, keyfile, and more.
utils: This is a collection of standalone scripts that can be run natively without PyTest. They offer different utility functions.
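Once CVS is installed, the two arguments described for the input directory could be passed to PyTest from a small Python wrapper. The following is a minimal sketch and not part of CVS: the helper name `build_pytest_command` is hypothetical, and the test script path must be supplied by the caller from the tests directory. Only `--cluster_file` and `--config_file` come from the description above.

```python
import sys


def build_pytest_command(test_script, cluster_file, config_file):
    """Return the argument list for one PyTest run (hypothetical helper).

    test_script:  path to a PyTest script under the tests directory
    cluster_file: path to the cluster JSON file
    config_file:  path to the JSON configuration file for that test
    """
    return [
        sys.executable, "-m", "pytest",   # run PyTest with the active interpreter
        str(test_script),
        "--cluster_file", str(cluster_file),
        "--config_file", str(config_file),
    ]
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the cloned repository directory. Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.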
Navigate to the cloned directory:
cd cvs
Create a Python virtual environment, activate it, and install the dependencies:
python3 -m venv myenv
source myenv/bin/activate
pip3 install -r requirements.txt
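To confirm that the virtual environment is active before running any tests, a short check like the one below may help. It is a sketch under two assumptions: the names `in_virtualenv` and `environment_report` are illustrative, and the comparison against Python 3.10 only reflects the version listed as tested in the system requirements, not a hard requirement.

```python
import sys

# Python version listed as "tested on" in the system requirements table.
TESTED_PYTHON = (3, 10)


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def environment_report():
    """Summarize the running interpreter for a quick pre-flight check."""
    current = tuple(sys.version_info[:2])
    return {
        "python": f"{current[0]}.{current[1]}",
        "matches_tested_python": current == TESTED_PYTHON,
        "in_virtualenv": in_virtualenv(),
    }


print(environment_report())
```

Run it with the interpreter from the activated environment. If `in_virtualenv` is reported as false, the `source myenv/bin/activate` step was probably skipped in the current shell.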
Configure the CVS cluster file#
The cluster file is a JSON file containing the cluster’s IP addresses. You must configure the cluster file before you run any CVS tests.
Go to cvs/input/cluster_file/cluster.json in your cloned repo.
Edit the management IP ("mgmt_ip") and node dictionary ("node_dict") with the list of IPs of the available cluster.
Ensure the user-id ("{user-id}") and priv_key_file match.
Here’s a code snippet of the cluster.json file for reference:
cluster.json
{
    "_user_comment": " user-id will be resolved to current username in runtime. You can also change to your user-id here.",
    "username": "{user-id}",
    "_key_comment": " Change <priv_key_file> to your private key if it is different.",
    "priv_key_file": "/home/{user-id}/.ssh/id_rsa",
    "_node_comment": " Change to your node IPs. The Public IPs of the nodes will be the keys of the node_dict",
    "_vpc_comment": "If your cluster has a dedicated VPC IP that is reachable from other nodes in the cluster, set it to that (or) else set the same as the main host IP/Name",
    "head_node_dict":
    {
        "mgmt_ip": "{xx.xx.xx.xx|hostname}"
    },
    "node_dict":
    {
        "{xx.xx.xx.xx|hostname}":
        {
            "bmc_ip": "NA",
            "vpc_ip": "{xx.xx.xx.xx|hostname}"
        },
        "{xx.xx.xx.xx|hostname}":
        {
            "bmc_ip": "NA",
            "vpc_ip": "{xx.xx.xx.xx|hostname}"
        },
        "{xx.xx.xx.xx|hostname}":
        {
            "bmc_ip": "NA",
            "vpc_ip": "{xx.xx.xx.xx|hostname}"
        }
    }
}
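For clusters with many nodes, the file can be generated instead of edited by hand. The sketch below builds a dictionary with the same layout as the snippet above. It is an illustration, not a CVS utility: the function names are hypothetical, the username is written explicitly rather than left as the `{user-id}` placeholder, and `vpc_ip` is set to the node's own address, which the `_vpc_comment` describes as the fallback when there is no dedicated VPC IP.

```python
import getpass
import json


def build_cluster_config(mgmt_ip, node_ips, username=None, priv_key_file=None):
    """Build a dict with the same layout as the cluster.json snippet."""
    user = username or getpass.getuser()  # default to the current login name
    return {
        "username": user,
        "priv_key_file": priv_key_file or f"/home/{user}/.ssh/id_rsa",
        "head_node_dict": {"mgmt_ip": mgmt_ip},
        # One entry per node, keyed by the node's IP address or hostname.
        "node_dict": {
            node: {"bmc_ip": "NA", "vpc_ip": node} for node in node_ips
        },
    }


def write_cluster_file(path, config):
    """Write the configuration as indented JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=4)
        fh.write("\n")
```

For example, `build_cluster_config("192.0.2.10", ["192.0.2.11", "192.0.2.12"], username="alice")` produces a two-node dictionary (the addresses are documentation placeholders), which `write_cluster_file` can then save over your copy of cluster.json.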
Set up your tests#
There are JSON configuration files for each CVS test. You must configure the JSON file for each test you want to run in CVS.
The test configuration files are in the cvs/input/config_file directory of the cloned repo.
Tip
See Test configuration files for code snippets and parameters of each configuration file.
Follow these instructions for each test you’d like to conduct.
Platform#
In the cvs/input/config_file/platform/host_config.json file, modify these parameters to suit your use case:
os_version
kernel_version
rocm_version
bios_version
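Two of these values can be read from a running node with the Python standard library. The sketch below is only a starting point under stated assumptions: the function names are hypothetical, it inspects the machine it runs on (so run it on a cluster node, not the head node), and the exact string format that host_config.json expects is not specified here, so compare the output with the values shipped in the file. It does not collect `rocm_version` or `bios_version`.

```python
import platform


def read_os_release(path="/etc/os-release"):
    """Parse KEY=value lines from an os-release style file into a dict."""
    info = {}
    try:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue  # skip blank lines and comments
                key, sep, value = line.partition("=")
                if sep:
                    info[key] = value.strip("\"'")
    except OSError:
        pass  # file not present on this host: return an empty dict
    return info


def local_host_values():
    """Collect values that may help fill in os_version and kernel_version."""
    return {
        "os_version": read_os_release().get("PRETTY_NAME", ""),
        "kernel_version": platform.release(),
    }


print(local_host_values())
```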
Health#
In the cvs/input/config_file/health/mi300_health_config.json file, edit the paths to your desired location in these parameters:
Under agfhc: path, package_tar_ball, install_dir
Under transferbench: example_tests_path, git_install_path
Under rvs: git_install_path
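If the same path changes must be applied repeatedly, they could be scripted. The sketch below assumes that `agfhc`, `transferbench`, and `rvs` are top-level objects in the health configuration file that hold the listed parameters directly. That layout is an assumption, so check it against your copy of the file. The function name `set_health_paths` is hypothetical.

```python
import copy


def set_health_paths(config, updates):
    """Return a copy of config with selected parameters replaced.

    updates maps a section name (for example "agfhc") to a dict of
    {parameter: new_value}. A section or parameter that does not already
    exist raises KeyError, so typos are caught instead of silently
    adding new keys.
    """
    result = copy.deepcopy(config)  # leave the caller's dict untouched
    for section, params in updates.items():
        for name, value in params.items():
            if name not in result[section]:
                raise KeyError(f"{section}.{name} not found in config")
            result[section][name] = value
    return result
```

Load the file with `json.load`, pass the resulting dictionary and your updates to `set_health_paths`, and write the result back with `json.dump`. Refusing unknown keys is a deliberate choice here: a misspelled parameter would otherwise be added next to the real one and have no effect.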
InfiniBand (IB Perf)#
In the cvs/input/config_file/ibperf/ibperf_config.json file, update the install_dir parameter to your desired location.
Change any other parameters in the configuration file relevant to your testing requirements.
ROCm Communication Collectives Library (RCCL)#
In the cvs/input/config_file/rccl/rccl_config.json and cvs/input/config_file/rccl/single_node_mi355_rccl.json files, change the directory path to your desired location in these variables:
rccl_dir
rccl_tests_dir
mpi_dir
mpi_path_var
rccl_path_var
rocm_path_var
JAX / Megatron training configuration files#
Replace every <changeme> value in these files with a value that matches your setup.
Change any other parameters in the configuration file relevant to your testing requirements.
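Before running a training test, it can be useful to confirm that no `<changeme>` placeholder is left in a configuration file. The sketch below walks any parsed JSON structure and reports where the marker still appears. The function name `find_changeme` and the dotted-path output format are illustrative choices, and no assumption is made about the layout of the training configuration files.

```python
def find_changeme(node, marker="<changeme>", path=""):
    """Yield the location of every string value that still contains marker."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            yield from find_changeme(value, marker, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_changeme(value, marker, f"{path}[{index}]")
    elif isinstance(node, str) and marker in node:
        yield path
```

To use it, load a configuration file with `json.load` and pass the result to `list(find_changeme(config))`. An empty list means no placeholder remains. The check matches the marker anywhere inside a string, so a path such as `/data/<changeme>/logs` is also reported.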