RDC tool feature overview#
This topic provides information related to the features of the RDC tool.
 
Fig. 3 RDC components and framework for describing features#
Discovery#
The Discovery feature enables you to locate and display information about the GPUs present in the compute node.
Example:
$ rdci discovery <host_name> -l
2 GPUs found
| GPU Index | Device Information | 
| 0 | Name: AMD Radeon Instinct MI50 Accelerator | 
| 1 | Name: AMD Radeon Instinct MI50 Accelerator | 
-l list available GPUs; -u no SSL authentication
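When scripting around discovery, it can help to turn the listing into a single number. The following is a minimal sketch, assuming the pipe-delimited layout shown in the example above; `count_gpus` is a hypothetical helper name, not part of RDC, and the real output may use different borders.

```shell
# Count GPU rows in discovery output read from stdin.
# Assumption: data rows look like "| <index> | Name: ... |", as in the example.
count_gpus() {
    awk -F'|' '
        # A data row has a purely numeric GPU index in the second field.
        $2 ~ /^[ \t]*[0-9]+[ \t]*$/ { n++ }
        END { print n + 0 }
    '
}
```

To use it, pipe the discovery listing into the function, for example `rdci discovery <host_name> -l | count_gpus`.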
Groups#
This section explains the GPU and field groups features.
GPU Groups#
With the GPU groups feature, you can create, delete, and list logical groups of GPUs.
$ rdci group -c GPU_GROUP
Successfully created a group with a group ID 1
$ rdci group -g 1 -a 0,1
Successfully added the GPU 0,1 to group 1
$ rdci group -l
1 group found
| Group ID | Group Name | GPU Index | 
| 1 | GPU_GROUP | 0, 1 | 
$ rdci group -d 1
Successfully removed group 1
-c create; -g group id; -a add GPU index; -l list; -d delete group
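Later commands need the group ID that the create step reports, so a script has to capture it. The following is a minimal sketch, assuming the creation message ends with the numeric ID as in the example above; `parse_group_id` is a hypothetical helper name, not part of RDC.

```shell
# Extract the numeric ID from a creation message such as
# "Successfully created a group with a group ID 1" read from stdin.
# Assumption: the ID is the last whitespace-separated token on that line.
parse_group_id() {
    awk '
        /group ID/ { id = $NF }
        END { if (id ~ /^[0-9]+$/) print id; else exit 1 }
    '
}
```

The function exits with a non-zero status when no ID is found, so a calling script can stop before it passes an empty value to `-g`.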
Field Groups#
The Field Groups feature lets you create, delete, and list field groups.
$ rdci fieldgroup -c <fgroup> -f 150,155
Successfully created a field group with a group ID 1
$ rdci fieldgroup -l
1 group found
| Group ID | Group Name | Field IDs | 
| 1 | Fgroup | 150, 155 | 
$ rdci fieldgroup -d 1
Successfully removed field group 1
$ rdci dmon -l
Supported fields Ids:
100 RDC_FI_GPU_CLOCK:  Current GPU clock freq.
150 RDC_FI_GPU_TEMP:  GPU temp. in milli Celsius.
155 RDC_FI_POWER_USAGE:  Power usage in microwatts.
203 RDC_FI_GPU_UTIL:  GPU busy percentage.
525 RDC_FI_GPU_MEMORY_USAGE: VRAM Memory usage in bytes
-c create; -f field ids; -l list; -d delete field group
Monitor Errors#
To get the RAS Error-Correcting Code (ECC) counters, monitor the RDC_FI_ECC_CORRECT_TOTAL or RDC_FI_ECC_UNCORRECT_TOTAL field:
- 312 - RDC_FI_ECC_CORRECT_TOTAL: Accumulated correctable ECC errors
- 313 - RDC_FI_ECC_UNCORRECT_TOTAL: Accumulated uncorrectable ECC errors
Device Monitoring#
The RDC tool enables you to monitor the GPU fields.
$ rdci dmon -f <field_group> -g <gpu_group> -c 5 -d 1000
1 group found
| GPU Index | TEMP (m°C) | POWER (µW) | 
| 0 | 25000 | 520500 | 
$ rdci dmon -l
Supported fields Ids:
100 RDC_FI_GPU_CLOCK:  Current GPU clock freq.
150 RDC_FI_GPU_TEMP:  GPU temp. in milli Celsius.
155 RDC_FI_POWER_USAGE:  Power usage in microwatts.
203 RDC_FI_GPU_UTIL:  GPU busy percentage.
525 RDC_FI_GPU_MEMORY_USAGE: VRAM Memory usage in bytes
-e field ids; -i GPU index; -g GPU group id; -f fieldgroup id; -c count; -d delay; -l list
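Because the temperature and power fields are reported in millidegrees Celsius and microwatts, the raw values are awkward to read. The following is a minimal sketch of the conversion, assuming the column order from the example above (GPU index, temperature, power); `dmon_convert` is a hypothetical helper name, not part of RDC.

```shell
# Convert dmon rows "| <gpu> | <temp m°C> | <power µW> |" read from stdin
# into degrees Celsius and watts.
# Assumption: columns appear in the order shown in the example output.
dmon_convert() {
    awk -F'|' '
        # Only rows whose first column is a numeric GPU index are converted.
        $2 ~ /^[ \t]*[0-9]+[ \t]*$/ {
            printf "GPU %d: %.1f C, %.4f W\n", $2, $3 / 1000, $4 / 1000000
        }
    '
}
```

For the sample row above, 25000 m°C becomes 25.0 C and 520500 µW becomes 0.5205 W.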
Job Stats#
You can display GPU statistics for any given workload.
$ rdci stats -s 2 -g 1
Successfully started recording job 2 with a group ID 1
$ rdci stats -j 2
| Summary | Executive Status | 
| Start time | 1586795401 | 
| End time | 1586795445 | 
| Total execution time | 44 | 
| Energy Consumed (Joules) | 21682 | 
| Power Usage (Watts) | Max: 49 Min: 13 Avg: 34 | 
| GPU Clock (MHz) | Max: 1000 Min: 300 Avg: 903 | 
| GPU Utilization (%) | Max: 69 Min: 0 Avg: 2 | 
| Max GPU Memory Used (bytes) | 524320768 | 
| Memory Utilization (%) | Max: 12 Min: 11 Avg: 12 | 
$ rdci stats -x 2
Successfully stopped recording job 2
-s start recording on job id; -g group id; -j display job stats; -x stop recording.
Job Stats Use Case#
A common use case is to record GPU statistics associated with any job or workload. The following example shows how all these features can be put together for this use case:
 
Fig. 4 An example showing how job statistics can be recorded#
rdci commands#
$ rdci group -c group1
successfully created a group with a group ID 1
$ rdci group -g 1 -a 0,1
GPU 0,1 is added to group 1 successfully.
$ rdci stats -s 123 -g 1
job 123 recorded successfully with the group ID
$ rdci stats -x 123
job 123 stops recording successfully
$ rdci stats -j 123
job stats printed
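The start, stop, and display steps above can be wrapped around a workload in one shell function. The following is a minimal sketch that uses only the `stats` options shown in this section; `run_with_stats` and its argument order are hypothetical, and it assumes the GPU group already exists.

```shell
# Record GPU statistics around a workload.
# Usage: run_with_stats <job_id> <group_id> <command> [args...]
# Assumption: the GPU group was created beforehand with "rdci group".
run_with_stats() {
    job_id=$1
    group_id=$2
    shift 2

    # Start recording; give up if recording cannot be started.
    rdci stats -s "$job_id" -g "$group_id" || return 1

    # Run the workload and remember its exit status.
    "$@"
    status=$?

    # Stop recording, then print the collected job statistics.
    rdci stats -x "$job_id"
    rdci stats -j "$job_id"

    return "$status"
}
```

Recording is stopped even when the workload fails, and the workload's exit status is returned so that the caller can still detect the failure.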
Error-Correcting Code Output#
The job output also reports the Error-Correcting Code (ECC) errors that occurred while the job was running.
Diagnostic#
You can run diagnostics on a GPU group as shown below:
$ rdci diag -g <gpu_group>
No compute process:  Pass
Node topology check:  Pass
GPU parameters check:  Pass
Compute Queue ready:  Pass
System memory check:  Pass
=============== Diagnostic Details ==================
No compute process:  No processes running on any devices.
Node topology check:  No link detected.
GPU parameters check:  GPU 0 Critical Edge temperature in range.
Compute Queue ready:  Run binary search task on GPU 0 Pass.
System memory check:  Max Single Allocation Memory Test for GPU 0 Pass. CPUAccessToGPUMemoryTest for GPU 0 Pass. GPUAccessToCPUMemoryTest for GPU 0 Pass.
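In an automated health check, the summary lines can be reduced to a single pass or fail result. The following is a minimal sketch, assuming the `<check name>:  <result>` layout and the "Diagnostic Details" separator shown above; `diag_all_pass` is a hypothetical helper name, not part of RDC.

```shell
# Succeed only if every summary line before the "Diagnostic Details"
# separator ends in "Pass". Reads the diagnostic output from stdin.
# Assumption: summary lines have the form "<check name>:  <result>".
diag_all_pass() {
    awk '
        /^=+ Diagnostic Details/ { exit }    # ignore the details section
        /:/ { seen++; if ($NF != "Pass") failed++ }
        END { exit ((seen > 0 && failed == 0) ? 0 : 1) }
    '
}
```

Any result other than "Pass" counts as a failure, and empty input also fails, so a broken connection is not mistaken for a healthy node. A typical call would be `rdci diag -g <gpu_group> | diag_all_pass`.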