Introduction to the RDC tool
The ROCm Data Center tool (RDC) simplifies administration and addresses key infrastructure challenges of AMD GPUs in cluster and data center environments. The main features are:
GPU telemetry
GPU statistics for jobs
Integration with third-party tools
Open source
You can use the RDC tool in standalone mode if all components are installed. Alternatively, existing management tools can use the same set of features, which are also available in library form.
For details on different modes of operation, refer to Starting RDC in Installing and running RDC.
Target Audience
The audience for the AMD RDC tool consists of:
Administrators: RDC gives cluster administrators the ability to monitor, validate, and configure policies.
HPC users: RDC provides GPU-centric feedback for their workload submissions.
OEMs: RDC lets OEMs add GPU information to their existing cluster management software.
Open source contributors: RDC is open source and accepts contributions from the community.
Objective
This documentation will:
Introduce the tool features in RDC tool feature overview
Describe integration with external tools in Third party integration
Provide an open source handbook in Building and testing RDC
Introduce elements of the tool API in Introduction to RDC API
Terminology
Terms | Description
RDC | ROCm Data Center tool
Compute node (CN) | One of many nodes containing one or more GPUs in the Data Center on which compute jobs are run
Management node (MN) or Main console | A machine running system administration applications to administer and manage the Data Center
GPU Groups | Logical grouping of one or more GPUs in a compute node
Fields | A metric that can be monitored by the RDC, such as GPU temperature, memory usage, and power usage
Field Groups | Logical grouping of multiple fields
Job | A workload that is submitted to one or more compute nodes
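To make the relationships between these terms easier to picture, the following is a minimal Python sketch of how they fit together. It is not the RDC API: every class name, attribute, and sample value here (`Field`, `FieldGroup`, `GpuGroup`, `Job`, `compute_nodes`, the node names) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Field:
    """A single metric that can be monitored (hypothetical model)."""
    name: str
    unit: str


@dataclass
class FieldGroup:
    """A logical grouping of multiple fields."""
    name: str
    fields: List[Field]


@dataclass
class GpuGroup:
    """A logical grouping of one or more GPUs in one compute node."""
    name: str
    node: str                # compute node that hosts the GPUs
    gpu_indexes: List[int]   # GPU indexes within that node


@dataclass
class Job:
    """A workload submitted to one or more compute nodes."""
    job_id: str
    gpu_groups: List[GpuGroup]

    def compute_nodes(self) -> List[str]:
        # Distinct compute nodes touched by this job, in sorted order.
        return sorted({group.node for group in self.gpu_groups})


# Illustrative values only; none of these names come from RDC itself.
basic_metrics = FieldGroup(
    name="basic_metrics",
    fields=[
        Field("gpu_temperature", "C"),
        Field("memory_usage", "bytes"),
        Field("power_usage", "W"),
    ],
)
training_job = Job(
    job_id="job-1",
    gpu_groups=[
        GpuGroup("group_a", node="cn01", gpu_indexes=[0, 1]),
        GpuGroup("group_b", node="cn02", gpu_indexes=[0]),
    ],
)
```

In this sketch a job spanning several compute nodes holds one GPU group per node, because a GPU group is defined within a single compute node, while a field group can be reused to watch the same metrics on any GPU group.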