Health test configuration file#
2025-11-12
The burn-in health tests are single-node diagnostic tests that validate the functionality and performance of the hardware and of the installed firmware versions.
Here’s a code snippet of the mi300_health_config.json file for reference:
Note
In this configuration file, {user-id} is resolved to the current username at runtime. You can also replace this value manually with your username.
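As a rough illustration of that substitution, the sketch below walks a parsed configuration and replaces `{user-id}` in every string value. The helper names `resolve_user_id` and `load_health_config` are hypothetical, and using `getpass.getuser()` to obtain the current username is an assumption; the framework's own implementation may differ.

```python
import getpass
import json


def resolve_user_id(value, username):
    """Recursively replace the {user-id} placeholder in every string value."""
    if isinstance(value, str):
        return value.replace("{user-id}", username)
    if isinstance(value, dict):
        return {key: resolve_user_id(item, username) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_user_id(item, username) for item in value]
    return value  # numbers, booleans, and null pass through unchanged


def load_health_config(path, username=None):
    """Load the JSON file and substitute the placeholder (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        raw = json.load(handle)
    # Assumption: the current username comes from getpass.getuser().
    return resolve_user_id(raw, username or getpass.getuser())


# Small in-memory example using two values from mi300_health_config.json.
sample = {
    "agfhc": {
        "install_dir": "/home/{user-id}/INSTALL/agfhc/",
        "log_dir": "/root/agfhc_logs",
    }
}
resolved = resolve_user_id(sample, "alice")
print(resolved["agfhc"]["install_dir"])  # /home/alice/INSTALL/agfhc/
```

Values without the placeholder, such as `log_dir`, are returned unchanged.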
mi300_health_config.json
{
    "agfhc":
    {
        "path": "/opt/amd/agfhc",
        "package_tar_ball": "/home/{user-id}/PACKAGES/agfhc-mi300x_1.22.0_ub2204.tar.bz2",
        "install_dir": "/home/{user-id}/INSTALL/agfhc/",
        "_comments_log_dir": "log_dir has to be a NON NFS local file system",
        "log_dir": "/root/agfhc_logs",
        "nfs_install": "True",
        "hbm_test_duration": "00:01:30"
    },
    "transferbench":
    {
        "path": "/home/{user-id}/INSTALL/TransferBench",
        "example_tests_path": "/home/{user-id}/INSTALL/TransferBench/examples",
        "git_install_path": "/home/{user-id}/INSTALL/",
        "git_url": "https://github.com/ROCm/TransferBench.git",
        "nfs_install": "True",
        "results":
        {
            "bytes_to_transfer": "268435456",
            "path": "/home/{user-id}/INSTALL/TransferBench",
            "gpu_to_gpu_a2a_rtotal": "320.0",
            "avg_gpu_to_gpu_p2p_unidir_bw": "33.9",
            "avg_gpu_to_gpu_p2p_bidir_bw": "43.9",
            "best_gpu0_bw": "480.0",
            "32_cu_local_read": "1650",
            "32_cu_local_write": "1250.0",
            "32_cu_local_copy": "1250.0",
            "32_cu_rem_read": "48.0",
            "32_cu_rem_write": "48.0",
            "32_cu_rem_copy": "48.0",
            "example_results":
            {
                "test1": "47.1",
                "test2": "48.4",
                "test3_0_to_1": "31.9",
                "test3_1_to_0": "38.9",
                "test4": "1264",
                "test6": "48.6"
            }
        }
    },
    "rvs":
    {
        "path": "/opt/rocm/bin",
        "git_install_path": "/home/{user-id}/INSTALL/rvs",
        "git_url": "https://github.com/ROCm/ROCmValidationSuite.git",
        "nfs_install": "True",
        "config_path_mi300x": "/opt/rocm/share/rocm-validation-suite/conf/MI300X",
        "config_path_default": "/opt/rocm/share/rocm-validation-suite/conf",
        "_comment_rvs_test_level": "RVS test level configuration (0-5). 0: Run individual tests (skip level test), 1-5: Run LEVEL config test if RVS >= 1.3.0, else run individual tests. Default is 4.",
        "rvs_test_level": 4,
        "tests": [
            {
                "name": "level_config",
                "description": "RVS LEVEL Configuration Test - Runs all modules collectively",
                "timeout": 14400,
                "expected_pass": true,
                "fail_regex_patterns": [
                    "met:\\s*FALSE",
                    "pass:\\s*FALSE",
                    "\\[ERROR\\s*\\]",
                    "FAIL",
                    "ERROR:",
                    "peqt false",
                    "RVS-ERROR",
                    "Missing packages\\s*:\\s*([1-9]\\d*)",
                    "Version mismatch packages\\s*:\\s*([1-9]\\d*)"
                ]
            },
            {
                "name": "gpup_single",
                "config_file": "gpup_single.conf",
                "description": "GPU Properties Test",
                "timeout": 1800,
                "expected_pass": true,
                "fail_regex_pattern": "FAIL|ERROR|RVS-ERROR"
            },
            {
                "name": "mem_test",
                "config_file": "mem.conf",
                "description": "Memory Test",
                "timeout": 10000,
                "expected_pass": true,
                "fail_regex_pattern": "FAIL|\\[ERROR\\s*\\]|RVS-ERROR"
            },
            {
                "name": "gst_single",
                "config_file": "gst_single.conf",
                "description": "GPU Stress Test",
                "timeout": 18000,
                "expected_pass": true,
                "fail_regex_pattern": "met:\\s*FALSE|RVS-ERROR"
            },
            {
                "name": "iet_single",
                "config_file": "iet_single.conf",
                "description": "Input EDPp Test",
                "timeout": 3600,
                "expected_pass": true,
                "fail_regex_pattern": "pass:\\s*FALSE|RVS-ERROR"
            },
            {
                "name": "pebb_single",
                "config_file": "pebb_single.conf",
                "description": "PCI Express Bandwidth Benchmark",
                "timeout": 3600,
                "expected_pass": true,
                "fail_regex_pattern": "\\[ERROR\\s*\\]|RVS-ERROR"
            },
            {
                "name": "pbqt_single",
                "config_file": "pbqt_single.conf",
                "description": "P2P Benchmark and Qualification Tool",
                "timeout": 3600,
                "expected_pass": true,
                "fail_regex_pattern": "FAIL|ERROR:|RVS-ERROR"
            },
            {
                "name": "peqt_single",
                "config_file": "peqt_single.conf",
                "description": "PCI Express Qualification Tool",
                "timeout": 1800,
                "expected_pass": true,
                "fail_regex_pattern": "peqt false|RVS-ERROR"
            },
            {
                "name": "rcqt_single",
                "config_file": "rcqt_single.conf",
                "description": "ROCm Configuration Qualification Tool",
                "timeout": 1800,
                "expected_pass": true,
                "fail_regex_pattern": "\\[ERROR\\s*\\]|RVS-ERROR|Missing packages\\s*:\\s*([1-9]\\d*)|Version mismatch packages\\s*:\\s*([1-9]\\d*)"
            },
            {
                "name": "tst_single",
                "config_file": "tst_single.conf",
                "description": "Thermal Stress Test",
                "timeout": 1800,
                "expected_pass": true,
                "fail_regex_pattern": "pass:\\s*FALSE|RVS-ERROR"
            },
            {
                "name": "babel_stream",
                "config_file": "babel.conf",
                "description": "BABEL Benchmark Test",
                "timeout": 9000,
                "expected_pass": true,
                "fail_regex_pattern": "\\[ERROR\\s*\\]|RVS-ERROR"
            }
        ]
    }
}
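The `_comment_rvs_test_level` entry describes a small decision rule: level 0 runs the individual tests, and levels 1 to 5 run the LEVEL configuration test only when RVS is at version 1.3.0 or later. A minimal sketch of that rule follows. The function names are hypothetical, and the version is assumed to be a plain dotted numeric string such as "1.3.0"; how the framework actually detects the RVS version is not shown in the file.

```python
def parse_version(text):
    """Turn a dotted version such as "1.3.0" into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


def choose_rvs_mode(rvs_test_level, rvs_version, minimum="1.3.0"):
    """Return "level" or "individual" following the rule in _comment_rvs_test_level."""
    if not 0 <= rvs_test_level <= 5:
        raise ValueError("rvs_test_level must be between 0 and 5")
    if rvs_test_level == 0:
        # Level 0 skips the LEVEL test and runs the individual tests.
        return "individual"
    if parse_version(rvs_version) >= parse_version(minimum):
        return "level"
    # Older RVS releases fall back to the individual tests.
    return "individual"


print(choose_rvs_mode(4, "1.3.0"))  # level
print(choose_rvs_mode(4, "1.2.9"))  # individual
print(choose_rvs_mode(0, "1.3.0"))  # individual
```

Comparing tuples of integers rather than raw strings keeps a version such as "1.10.0" ordered after "1.3.0".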
Parameters#
The following tables list the parameters available in the health test configuration file.
AGFHC#
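| Configuration parameters | Default values | Description |
|---|---|---|
| `path` | `/opt/amd/agfhc` | Path where AGFHC is installed |
| `package_tar_ball` | `/home/{user-id}/PACKAGES/agfhc-mi300x_1.22.0_ub2204.tar.bz2` | Path where the tarball is downloaded |
| `install_dir` | `/home/{user-id}/INSTALL/agfhc/` | Path where AGFHC runs |
| `_comments_log_dir` | `log_dir has to be a NON NFS local file system` | Comment noting that `log_dir` must be on a local, non-NFS file system |
| `log_dir` | `/root/agfhc_logs` | Log directory |
| `nfs_install` | `True` | Set this flag to install on NFS |
| `hbm_test_duration` | `00:01:30` | HBM test duration |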
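The `hbm_test_duration` value is stored as a string. The sketch below converts it to seconds, assuming the format is `HH:MM:SS` (the file does not state the format explicitly, so treat this as an assumption). The function name `parse_duration` is hypothetical.

```python
from datetime import timedelta


def parse_duration(text):
    """Convert an "HH:MM:SS" string into a timedelta (format is an assumption)."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


duration = parse_duration("00:01:30")
print(int(duration.total_seconds()))  # 90
```

Under that assumption, the default of `00:01:30` corresponds to a 90-second HBM test.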
TransferBench#
| Configuration parameters | Default values | Description |
|---|---|---|
| `path` | `/home/{user-id}/INSTALL/TransferBench` | Path where TransferBench is installed |
| `example_tests_path` | `/home/{user-id}/INSTALL/TransferBench/examples` | Path where TransferBench examples are installed |
| `git_install_path` | `/home/{user-id}/INSTALL/` | Path where the Git repo is installed |
| `git_url` | https://github.com/ROCm/TransferBench.git | URL for Git repo |
| `nfs_install` | `True` | Set this flag to install on NFS |
| `bytes_to_transfer` | `268435456` | Amount of data to transfer in bytes (256 MiB); this is the payload size for bandwidth tests |
| `gpu_to_gpu_a2a_rtotal` | `320.0` | All-to-all communication total bandwidth in GB/s across all GPUs |
| `avg_gpu_to_gpu_p2p_unidir_bw` | `33.9` | Average peer-to-peer unidirectional bandwidth (GB/s) between GPU pairs |
| `avg_gpu_to_gpu_p2p_bidir_bw` | `43.9` | Average peer-to-peer bidirectional bandwidth (GB/s) between GPU pairs |
| `best_gpu0_bw` | `480.0` | Best measured bandwidth (GB/s) for GPU 0 |
| `32_cu_local_read` | `1650` | Local memory read bandwidth (GB/s) using 32 CUs |
| `32_cu_local_write` | `1250.0` | Local memory write bandwidth (GB/s) using 32 CUs |
| `32_cu_local_copy` | `1250.0` | Local memory copy bandwidth (GB/s) using 32 CUs |
| `32_cu_rem_read` | `48.0` | Remote memory read bandwidth (GB/s) using 32 CUs |
| `32_cu_rem_write` | `48.0` | Remote memory write bandwidth (GB/s) using 32 CUs |
| `32_cu_rem_copy` | `48.0` | Remote memory copy bandwidth (GB/s) using 32 CUs |
| `test1` | `47.1` | Result for example test 1 (likely bandwidth in GB/s) |
| `test2` | `48.4` | Result for example test 2 |
| `test3_0_to_1` | `31.9` | Result for example test 3, direction GPU 0 to GPU 1 |
| `test3_1_to_0` | `38.9` | Result for example test 3, direction GPU 1 to GPU 0 |
| `test4` | `1264` | Result for example test 4 (possibly local memory) |
| `test6` | `48.6` | Result for example test 6 |
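The values under `results` are stored as strings, so they need converting before any numeric comparison. This page does not say how the framework compares measurements against them. The sketch below assumes they act as lower bounds and applies a hypothetical 5% tolerance; the function name, the tolerance, and the measured numbers are all invented for illustration.

```python
# A few entries copied from the "results" section of the file.
expected = {
    "bytes_to_transfer": "268435456",
    "gpu_to_gpu_a2a_rtotal": "320.0",
    "avg_gpu_to_gpu_p2p_unidir_bw": "33.9",
    "32_cu_local_read": "1650",
}

# The payload size is a string; 268435456 bytes is 256 * 1024 * 1024.
payload_mib = int(expected["bytes_to_transfer"]) // (1024 * 1024)


def below_expected(expected, measured, tolerance=0.05):
    """Return names whose measured value is under expected * (1 - tolerance).

    Assumption: configured values are lower bounds; `tolerance` is hypothetical.
    """
    shortfalls = []
    for name, value in measured.items():
        floor = float(expected[name]) * (1.0 - tolerance)
        if value < floor:
            shortfalls.append(name)
    return shortfalls


# Made-up measurements for illustration only.
measured = {"gpu_to_gpu_a2a_rtotal": 331.2, "avg_gpu_to_gpu_p2p_unidir_bw": 30.0}
print(payload_mib)  # 256
print(below_expected(expected, measured))  # ['avg_gpu_to_gpu_p2p_unidir_bw']
```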
ROCm Validation Suite (RVS)#
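| Configuration parameters | Default values | Description |
|---|---|---|
| `path` | `/opt/rocm/bin` | Path where RVS is installed |
| `git_install_path` | `/home/{user-id}/INSTALL/rvs` | Path to installed Git repo |
| `git_url` | https://github.com/ROCm/ROCmValidationSuite.git | URL for Git repo |
| `nfs_install` | `True` | Set this flag to install on NFS |
| `config_path_mi300x` | `/opt/rocm/share/rocm-validation-suite/conf/MI300X` | Path for Instinct MI300X configuration |
| `config_path_default` | `/opt/rocm/share/rocm-validation-suite/conf` | Default path for RVS configuration |
| `_comment_rvs_test_level` | RVS test level configuration (0-5). 0: Run individual tests (skip level test), 1-5: Run LEVEL config test if RVS >= 1.3.0, else run individual tests. Default is 4. | Comment describing `rvs_test_level` |
| `rvs_test_level` | `4` | Test level |
| `name` | `level_config` | Test name |
| `description` | RVS LEVEL Configuration Test - Runs all modules collectively | Test description |
| `timeout` | `14400` | Timeout in seconds |
| `expected_pass` | `true` | Expected result |
| `fail_regex_patterns` | `met:\\s*FALSE`, `pass:\\s*FALSE`, `\\[ERROR\\s*\\]`, `FAIL`, `ERROR:`, `peqt false`, `RVS-ERROR`, `Missing packages\\s*:\\s*([1-9]\\d*)`, `Version mismatch packages\\s*:\\s*([1-9]\\d*)` | Regular expressions that indicate a failure |
| `name` | `gpup_single` | Test name |
| `config_file` | `gpup_single.conf` | Test config file |
| `description` | GPU Properties Test | Test description |
| `timeout` | `1800` | Timeout in seconds |
| `expected_pass` | `true` | Expected result |
| `fail_regex_pattern` | `FAIL\|ERROR\|RVS-ERROR` | Regular expression that indicates a failure |
| `name` | `mem_test` | Test name |
| `config_file` | `mem.conf` | Test config file |
| `description` | Memory Test | Test description |
| `timeout` | `10000` | Timeout in seconds |
| `expected_pass` | `true` | Expected result |
| `fail_regex_pattern` | `FAIL\|\\[ERROR\\s*\\]\|RVS-ERROR` | Regular expression that indicates a failure |
| `name` | `gst_single` | Test name |
| `config_file` | `gst_single.conf` | Test config file |
| `description` | GPU Stress Test | Test description |
| `timeout` | `18000` | Timeout in seconds |
| `expected_pass` | `true` | Expected result |
| `fail_regex_pattern` | `met:\\s*FALSE\|RVS-ERROR` | Regular expression that indicates a failure |
| `name` | `iet_single` | Test name |
| `config_file` | `iet_single.conf` | Test config file |
| `description` | Input EDPp Test | Test description |
| `timeout` | `3600` | Timeout in seconds |
| `expected_pass` | `true` | Expected result |
| `fail_regex_pattern` | `pass:\\s*FALSE\|RVS-ERROR` | Regular expression that indicates a failure |
| `name` | `pebb_single` | Test name |
| `config_file` | `pebb_single.conf` | Test config file |
| `description` | PCI Express Bandwidth Benchmark | Test description |
| `timeout` | `3600` | Timeout in seconds |
| `expected_pass` | `true` | Expected result |
| `fail_regex_pattern` | `\\[ERROR\\s*\\]\|RVS-ERROR` | Regular expression that indicates a failure |
| `name` | `pbqt_single` | Test name |
| `config_file` | `pbqt_single.conf` | Test config file |
| `description` | P2P Benchmark and Qualification Tool | Test description |
| `timeout` | `3600` | Timeout in seconds |
| `expected_pass` | `true` | Expected result |
| `fail_regex_pattern` | `FAIL\|ERROR:\|RVS-ERROR` | Regular expression that indicates a failure |
| `name` | `peqt_single` | Test name |
| `config_file` | `peqt_single.conf` | Test config file |
| `description` | PCI Express Qualification Tool | Test description |
| `timeout` | `1800` | Timeout in seconds |
| `expected_pass` | `true` | Expected result |
| `fail_regex_pattern` | `peqt false\|RVS-ERROR` | Regular expression that indicates a failure |
| `name` | `rcqt_single` | Test name |
| `config_file` | `rcqt_single.conf` | Test config file |
| `description` | ROCm Configuration Qualification Tool | Test description |
| `timeout` | `1800` | Timeout in seconds |
| `expected_pass` | `true` | Expected result |
| `fail_regex_pattern` | `\\[ERROR\\s*\\]\|RVS-ERROR\|Missing packages\\s*:\\s*([1-9]\\d*)\|Version mismatch packages\\s*:\\s*([1-9]\\d*)` | Regular expression that indicates a failure |
| `name` | `tst_single` | Test name |
| `config_file` | `tst_single.conf` | Test config file |
| `description` | Thermal Stress Test | Test description |
| `timeout` | `1800` | Timeout in seconds |
| `expected_pass` | `true` | Expected result |
| `fail_regex_pattern` | `pass:\\s*FALSE\|RVS-ERROR` | Regular expression that indicates a failure |
| `name` | `babel_stream` | Test name |
| `config_file` | `babel.conf` | Test config file |
| `description` | BABEL Benchmark Test | Test description |
| `timeout` | `9000` | Timeout in seconds |
| `expected_pass` | `true` | Expected result |
| `fail_regex_pattern` | `\\[ERROR\\s*\\]\|RVS-ERROR` | Regular expression that indicates a failure |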
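Each RVS test entry carries either a single `fail_regex_pattern` string or, for `level_config`, a `fail_regex_patterns` list. The sketch below shows one way such patterns could be applied to captured test output. It assumes a case-sensitive, line-by-line search with Python's `re` module; the function names and the sample log lines are invented, and real RVS output may look different.

```python
import re


def failure_patterns(test):
    """Collect patterns from fail_regex_patterns (list) or fail_regex_pattern (string)."""
    patterns = list(test.get("fail_regex_patterns", []))
    if "fail_regex_pattern" in test:
        patterns.append(test["fail_regex_pattern"])
    return [re.compile(pattern) for pattern in patterns]


def find_failures(test, output):
    """Return the output lines that match any failure pattern for this test."""
    compiled = failure_patterns(test)
    return [
        line
        for line in output.splitlines()
        if any(regex.search(line) for regex in compiled)
    ]


# Entry copied from the file; after JSON decoding, "\\s" becomes the regex "\s".
iet_single = {"name": "iet_single", "fail_regex_pattern": r"pass:\s*FALSE|RVS-ERROR"}

# Invented log lines for illustration only.
sample_output = "[RESULT] action iet pass: TRUE\n[RESULT] action iet pass:  FALSE\n"
print(find_failures(iet_single, sample_output))
# ['[RESULT] action iet pass:  FALSE']
```

An empty result means that no failure pattern matched. Whether the framework also checks the process exit code or `expected_pass` at this point is not covered here.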