Run Cluster Validation Suites tests#

2025-11-12

11 min read time

Applies to Linux

The CVS PyTest scripts are in the cvs/tests folder of the cloned GitHub repository. Run every test from the cvs root folder, because the library paths are set relative to it.

These are the arguments typically used in these test scripts:

  • --log-file: The text log file where the Python logger output is captured.

  • -s: Disables PyTest’s default output capturing (shorthand for --capture=no). The test script to run follows as a positional argument.

  • --cluster_file: The location of the cluster file, which contains the cluster details, node IPs, and access credentials.

  • --config_file: The configuration file used for the test; its contents vary depending on the test suite being run. Sample input files are organized in sub-directories under the cvs/input folder, mirroring the layout of the PyTest scripts.

  • --html: The HTML report file that PyTest generates at the end of the run. It summarizes how many test cases passed and failed, and you can open the logs directly in the browser from this report.

  • --capture=tee-sys: Captures stdout and stderr from the tests while still echoing them to the console.

  • --self-contained-html: Generates a single, self-contained HTML report with the styling and images for all test cases embedded.

You can also create a wrapper shell script that runs multiple test suites sequentially by putting the different PyTest commands in a bash script, as described in the README file under the cvs/tests/health folder.
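
For example, a wrapper along these lines runs two of the suites shown later on this page back to back (a minimal sketch; the script is assumed to live in, and be run from, the cvs root folder, and the commands and report paths are taken from the examples below):

    #!/bin/bash
    # Hypothetical wrapper: run CVS suites sequentially from the cvs root folder.
    # Commands and paths mirror the examples on this page; adjust them to your cluster.

    # Platform / host configuration checks
    pytest -vvv --log-file=/tmp/host.log -s ./tests/platform/host_configs_cvs.py \
        --cluster_file input/cluster_file/cluster.json \
        --config_file input/config_file/platform/host_config.json \
        --html=/var/www/html/cvs/host.html --capture=tee-sys --self-contained-html

    # RVS health tests (assumes the RVS install step has already been run)
    pytest -vvv --log-file=/tmp/rvs.log -s ./tests/health/rvs_cvs.py \
        --cluster_file input/cluster_file/cluster.json \
        --config_file input/config_file/health/mi300_health_config.json \
        --html=/var/www/html/cvs/rvs.html --capture=tee-sys --self-contained-html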

Run CVS test scripts#

Platform test script#

The host check scripts validate various host-side configurations, such as NUMA balancing, IOMMU settings, PCIe checks, kernel version, and ROCm version.

Here are the host check test cases:

  • check_os_release

  • check_kernel_version

  • check_bios_version

  • check_gpu_fw_version

  • check_online_memory

  • check_dmesg_driver_errors

  • check_rocm_version

  • check_pci_realloc

  • check_iommu_pt

  • check_numa_balancing

  • check_pci_accelarators

  • check_pci_speed_width

  • check_pci_acs
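
If you want to sanity-check a few of these items by hand before running the suite, the commands below roughly correspond to some of the checks listed above (a minimal sketch; the /opt/rocm paths assume a default ROCm install):

    cat /etc/os-release                     # check_os_release
    uname -r                                # check_kernel_version
    cat /opt/rocm/.info/version             # check_rocm_version (assumes a default /opt/rocm install)
    grep -o 'iommu=pt' /proc/cmdline        # check_iommu_pt (prints iommu=pt if passthrough is set)
    cat /proc/sys/kernel/numa_balancing     # check_numa_balancing (0 means automatic NUMA balancing is off)
    sudo dmesg | grep -iE 'amdgpu.*(error|fail)'   # check_dmesg_driver_errors
    sudo lspci -vvv | grep -i 'ACSCtl'      # check_pci_acs (shows per-device ACS control settings)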

Here’s the test script:

pytest -vvv --log-file=/tmp/test.log -s ./tests/platform/host_configs_cvs.py --cluster_file input/cluster_file/cluster.json  --config_file input/config_file/platform/host_config.json --html=/var/www/html/cvs/host.html --capture=tee-sys --self-contained-html

Burn-in health test scripts#

The burn-in health tests are single-node diagnostic tests that validate hardware and firmware functionality and performance. For performance validation, they use the reference bandwidth or latency numbers provided in the input config_file for the relevant test.

Use the following scripts to run each health test. Each CVS health test suite has two parts: installing the tool and running the tests.

AGFHC#

See the AGFHC (AMD GPU Field Health Check) docs for more information.

These are the test cases for AGFHC:

  • test_agfhc_hbm

  • test_agfhc_hbm1_lvl5

  • test_agfhc_hbm2_lvl5

  • test_agfhc_hbm3_lvl3

  • test_agfhc_dma_all_lvl1

  • test_agfhc_dma_lvl1

  • test_agfhc_gfx_lvl1

  • test_agfhc_pcie_lvl1

  • test_agfhc_pcie_lvl3

  • test_agfhc_xgmi_lvl1

  • test_agfhc_all_perf

  • test_agfhc_all_lvl5

Use these scripts to start the test:

  1. Run the installation:

    pytest -vvv --log-file=/tmp/test.log -s ./tests/health/install/install_agfhc.py --cluster_file input/cluster_file/cluster.json --config_file input/config_file/health/mi300_health_config.json --html=/var/www/html/cvs/agfhc.html --capture=tee-sys --self-contained-html
    
  2. Run the AGFHC test:

    pytest -vvv --log-file=/tmp/test.log -s ./tests/health/agfhc_cvs.py --cluster_file input/cluster_file/cluster.json --config_file input/config_file/health/mi300_health_config.json --html=/var/www/html/cvs/agfhc.html --capture=tee-sys --self-contained-html
    
  3. Run the CSP qualification test:

    pytest -vvv --log-file=/tmp/test.log -s ./tests/health/csp_qual_agfhc.py --cluster_file input/cluster_file/cluster.json --config_file input/config_file/health/mi300_health_config.json --html=/var/www/html/cvs/agfhc.html --capture=tee-sys --self-contained-html
    

TransferBench#

See the TransferBench docs for more information.

These are the test cases for TransferBench:

  • test_transfer_bench_example_tests_1_6_t

  • test_transfer_bench_a2a

  • test_transfer_bench_p2p

  • test_transfer_bench_healthcheck

  • test_transfer_bench_a2asweep

  • test_transfer_bench_scaling

  • test_transfer_bench_schmoo

Use these scripts to start the test:

  1. Run the installation:

    pytest -vvv --log-file=/tmp/test.log -s ./tests/health/install/install_transferbench.py --cluster_file input/cluster_file/cluster.json  --config_file input/config_file/health/mi300_health_config.json --html=/var/www/html/cvs/transferbench.html --capture=tee-sys --self-contained-html
    
  2. Start the TransferBench test:

    pytest -vvv --log-file=/tmp/test.log -s ./tests/health/transferbench_cvs.py --cluster_file input/cluster_file/cluster.json  --config_file input/config_file/health/mi300_health_config.json --html=/var/www/html/cvs/transferbench.html --capture=tee-sys --self-contained-html
    

RVS#

See the ROCm Validation Suite (RVS) docs for more information.

These are the test cases for RVS:

  • gpup_single

  • mem_test

  • gst_single

  • iet_single

  • pebb_single

  • pbqt_single

  • peqt_single

  • rcqt_single

  • tst_single

  • babel_stream
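
These names map onto the module configuration files that ship with RVS, so you can also run a single module directly when debugging a node (a minimal sketch; the binary and conf paths are assumptions based on a default ROCm package install):

    # Run just the GPU stress test (gst) module with verbose output
    /opt/rocm/bin/rvs -c /opt/rocm/share/rocm-validation-suite/conf/gst_single.conf -d 3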

Use these scripts to start the test:

  1. Run the installation:

    pytest -vvv --log-file=/tmp/test.log -s ./tests/health/install/install_rvs.py --cluster_file input/cluster_file/cluster.json  --config_file input/config_file/health/mi300_health_config.json --html=/var/www/html/cvs/rvs.html --capture=tee-sys --self-contained-html
    
  2. Start the RVS test:

    pytest -vvv --log-file=/tmp/test.log -s ./tests/health/rvs_cvs.py --cluster_file input/cluster_file/cluster.json  --config_file input/config_file/health/mi300_health_config.json --html=/var/www/html/cvs/rvs.html --capture=tee-sys --self-contained-html
    

InfiniBand (IB Perf) test script#

IB Perf and latency tests measure network performance. Perf tests measure throughput (bandwidth), and latency tests measure delay. Perf tests, such as ib_write_bw, evaluate the maximum data transfer rate under different message sizes. Latency tests, such as ib_send_lat, measure the time it takes for a message to travel between two nodes and often report results like minimum, median, and maximum latency.
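
The underlying perftest utilities can also be run directly between a pair of nodes to sanity-check a link before launching the suite (a minimal sketch; the device name mlx5_0 is an assumption, and the server address is a placeholder):

    # On the server node (list available devices with ibv_devices)
    ib_write_bw -d mlx5_0 -a -F

    # On the client node, pointing at the server
    ib_write_bw -d mlx5_0 -a -F <server_hostname_or_ip>

    # Latency follows the same pattern
    ib_send_lat -d mlx5_0 -a -F                          # server side
    ib_send_lat -d mlx5_0 -a -F <server_hostname_or_ip>  # client side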

These are the IB Perf test cases:

  • test_install_ib_perf

  • test_ib_bw_perf

  • test_ib_lat_perf

  • test_build_ib_bw_perf_chart

  • test_build_ib_lat_perf_chart

Use these scripts to start the test:

  1. Run the installation:

    pytest -vvv --log-file=/tmp/test.log -s ./tests/ibperf/install_ibperf_tools.py --cluster_file input/cluster_file/cluster.json --config_file input/config_file/ibperf/ibperf_config.json --html=/var/www/html/cvs/ib.html --capture=tee-sys --self-contained-html
    
  2. Start the IB Perf test:

    pytest -vvv --log-file=/tmp/test.log -s ./tests/ibperf/ib_perf_bw_test.py --cluster_file input/cluster_file/cluster.json --config_file input/config_file/ibperf/ibperf_config.json --html=/var/www/html/cvs/ib.html --capture=tee-sys --self-contained-html
    

ROCm Communication Collectives Library (RCCL) test script#

The RCCL tests are comprehensive benchmarks that validate distributed GPU communication performance across AMD GPU clusters, helping to ensure optimal performance for AI training, HPC, and other distributed workloads.

Here are the RCCL test cases:

  • hostinfo

  • networkinfo

  • all_reduce_perf

  • all_gather_perf

  • scatter_perf

  • gather_perf

  • reduce_scatter_perf0

  • reduce_scatter_perf1

  • sendrecv_perf

  • alltoall_perf

  • alltoallv_perf

  • broadcast_perf

Note

You should already have your RCCL environment set up on all of the nodes. If not, see Build collective tests for more information.
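
Once the collective tests are built, you can also run a single benchmark directly to confirm the environment before invoking the CVS scripts (a minimal sketch; the binary path, GPU count, and host names are assumptions for your cluster, and the multi-node form assumes an MPI-enabled build):

    # Single node: all-reduce across 8 GPUs, message sizes from 8 bytes to 8 GB, doubling each step
    ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 8

    # Multi-node via MPI: one rank per GPU across two hosts (host names are placeholders)
    mpirun -np 16 -H node1:8,node2:8 ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1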

Use these test scripts to run the RCCL tests.

Single node test MI35XX#

pytest -vvv --log-file=/tmp/test.log -s ./tests/rccl/rccl_singlenode_cvs.py --cluster_file input/cluster_file/cluster.json --config_file input/config_file/rccl/single_node_mi355_rccl.json --html=/var/www/html/cvs/rccl.html --capture=tee-sys --self-contained-html

Multi-node test MI3XX#

pytest -vvv --log-file=/tmp/test.log -s ./tests/rccl/rccl_multinode_cvs.py --cluster_file input/cluster_file/cluster.json --config_file input/config_file/rccl/rccl_config.json --html=/var/www/html/cvs/rccl.html --capture=tee-sys --self-contained-html

JAX training test scripts#

The JAX training tests perform end-to-end orchestration and validation of a distributed JAX training run (Llama 3.1) across a multi-node cluster using Docker containers, remote execution, and the project’s JAX library.

The test scripts use PyTest and parallel SSH to prepare the environment, launch the containers, and run and verify a short distributed training job.

These are the JAX training test cases:

  • test_llama_3_1_70b_training

  • test_llama_3_1_405b_distributed

Use these test scripts to run the JAX tests.

Note

Ensure all nodes are capable of running Docker containers.
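
A quick way to confirm this across the cluster is to query the Docker daemon on every node over SSH (a minimal sketch; the hosts.txt file and the use of parallel-ssh are assumptions, since CVS itself reads the node list from the cluster file):

    # hosts.txt: one node hostname or IP per line (placeholder file)
    parallel-ssh -h hosts.txt -i "docker info --format '{{.ServerVersion}}'"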

Single Node 70b MI3XX#

pytest -vvv --log-file=/tmp/test.log -s ./tests/training/jax/singlenode_llama_3_1_70b.py --cluster_file input/cluster_file/cluster.json  --config_file input/config_file/training/jax/mi300x_singlenode_llama3_1_70b.json --html=/var/www/html/cvs/jax.html --capture=tee-sys --self-contained-html

Single Node 70b MI35X#

pytest -vvv --log-file=/tmp/test.log -s ./tests/training/jax/singlenode_llama_3_1_70b.py --cluster_file input/cluster_file/cluster.json  --config_file input/config_file/training/jax/mi35x_singlenode_llama_3_1_70b.json --html=/var/www/html/cvs/jax.html --capture=tee-sys --self-contained-html

Distributed 70b MI3XX#

pytest -vvv --log-file=/tmp/test.log -s ./tests/training/jax/distributed_llama_3_1_70b.py --cluster_file input/cluster_file/cluster.json  --config_file input/config_file/training/jax/mi300x_distributed_llama3_1_70b.json --html=/var/www/html/cvs/jax.html --capture=tee-sys --self-contained-html

Distributed 405b MI3XX#

pytest -vvv --log-file=/tmp/test.log -s ./tests/training/jax/distributed_llama3_1_405b.py --cluster_file input/cluster_file/cluster.json  --config_file input/config_file/training/jax/mi300x_distributed_llama_3_1_405b.json --html=/var/www/html/cvs/jax.html --capture=tee-sys --self-contained-html

Megatron training test scripts#

Megatron training enables scaling transformer models from millions to trillions of parameters by efficiently utilizing hundreds or thousands of GPUs across multiple nodes.

The Megatron tests check:

  • Container orchestration: Docker setup with ROCm/RDMA

  • Multi-node communication: NCCL/RCCL initialization

  • Model convergence: Loss decreases and no NaN/Inf values

  • Performance targets: Throughput and memory usage within expected ranges

  • Result verification: Expected tokens/sec and TFLOPS metrics

Here are the Megatron training test cases:

  • test_disable_firewall

  • test_cleanup_stale_containers

  • test_launch_megatron_containers

  • test_llama_3_1_fp8_single_node
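
The first two test cases have straightforward manual equivalents if you need to prepare or recover a node by hand (a minimal sketch; the firewall command depends on your distribution, and the container name filter is an assumption):

    # Roughly what test_disable_firewall covers (use whichever applies to your distro)
    sudo ufw disable                 # Ubuntu
    sudo systemctl stop firewalld    # RHEL / SLES

    # Roughly what test_cleanup_stale_containers covers (the name filter is a guess)
    docker ps -aq --filter "name=megatron" | xargs -r docker rm -f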

Use these scripts to run the Megatron tests.

Single Node 8b MI3XX#

pytest -vvv --log-file=/tmp/test.log -s ./tests/training/megatron/singlenode_llama_3_1_8b.py --cluster_file input/cluster_file/cluster.json  --config_file input/config_file/training/megatron/mi3xx_singlenode_megatron_llama.json --html=/var/www/html/cvs/megatron.html --capture=tee-sys --self-contained-html

Single Node 8b MI35X#

pytest -vvv --log-file=/tmp/test.log -s ./tests/training/megatron/singlenode_llama_3_1_8b.py --cluster_file input/cluster_file/cluster.json  --config_file input/config_file/training/megatron/mi35x_singlenode_megatron_llama.json --html=/var/www/html/cvs/megatron.html --capture=tee-sys --self-contained-html

Single Node 70b MI3XX#

pytest -vvv --log-file=/tmp/test.log -s ./tests/training/megatron/singlenode_llama_3_1_70b.py --cluster_file input/cluster_file/cluster.json  --config_file input/config_file/training/megatron/mi3xx_singlenode_megatron_llama.json --html=/var/www/html/cvs/megatron.html --capture=tee-sys --self-contained-html

Single Node 70b MI35X#

pytest -vvv --log-file=/tmp/test.log -s ./tests/training/megatron/singlenode_llama_3_1_70b.py --cluster_file input/cluster_file/cluster.json  --config_file input/config_file/training/megatron/mi35x_singlenode_megatron_llama.json --html=/var/www/html/cvs/megatron.html --capture=tee-sys --self-contained-html

Distributed 8b#

pytest -vvv --log-file=/tmp/test.log -s ./tests/training/megatron/distributed_llama3_1_8b.py --cluster_file input/cluster_file/cluster.json  --config_file input/config_file/training/megatron/mi3xx_distributed_megatron_llama.json --html=/var/www/html/cvs/megatron.html --capture=tee-sys --self-contained-html

Distributed 70b#

pytest -vvv --log-file=/tmp/test.log -s ./tests/training/megatron/distributed_llama3_1_70b.py --cluster_file input/cluster_file/cluster.json  --config_file input/config_file/training/megatron/mi3xx_distributed_megatron_llama.json --html=/var/www/html/cvs/megatron.html --capture=tee-sys --self-contained-html

Test results#

The test output is captured in an HTML report generated by PyTest. It provides a summary of the number of test cases that have passed/failed. You can navigate to the logs directly in the browser from this report.

CVS creates a log folder and captures the output of every test in a log file here: $HOME/tmp/test.log
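
If you prefer to work from the shell, you can scan a log for problems and serve the HTML reports without copying them off the node (a minimal sketch; the log and report paths mirror the example commands above):

    # Quick scan of a pytest log for errors and failures
    grep -iE 'error|fail' /tmp/test.log

    # Serve the self-contained HTML reports on port 8080 (requires Python 3.7+)
    python3 -m http.server 8080 --directory /var/www/html/cvs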

Tip

If an error prevents the tests from starting, an error message with a suggested fix is printed to the console.

Test output examples#

Test results are also displayed on the console.

Here’s an example of the output when a test installs the functionality successfully:

[Screenshot: ../_images/install.png]

Here’s an example of the output when the test runs successfully:

[Screenshot: ../_images/success.png]

If the test fails, a message displays in the summary info:

[Screenshot: ../_images/failed.png]

You can also cancel tests by pressing Ctrl-C:

[Screenshot: ../_images/stop.png]