Installing and running RDC#
The ROCm Data Center tool (RDC) is part of the AMD ROCm software and is available on the distributions supported by AMD ROCm. To install RDC from prebuilt packages, follow the instructions in this section.
Prerequisites#
The installation dependencies are described in Dependencies in the README. To see the list of supported operating systems, refer to System requirements.
Install gRPC#
To see the instructions for building gRPC and protoc, refer to Building gRPC and protoc.
Authentication keys#
RDC can be used with or without authentication. If authentication is required, you must configure the authentication keys as described in Authentication in Building and testing RDC.
Prebuilt packages#
RDC is packaged as part of the ROCm software repository. You must install the AMD ROCm software before installing RDC, as described in ROCm installation.
To install RDC after installing the ROCm package, use the following instructions.
# Debian or Ubuntu (apt)
$ sudo apt-get install rdc
# or, to install a specific version
$ sudo apt-get install rdc<x.y.z>

# SLES (zypper)
$ sudo zypper install rdc
# or, to install a specific version
$ sudo zypper install rdc<x.y.z>
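As a minimal sketch, the install commands above can be wrapped in a small helper that prints the command for a given package manager. The function name `rdc_install_cmd` is hypothetical, and the optional version is appended directly to the package name, following the `rdc<x.y.z>` pattern shown above.

```shell
# Hypothetical helper: print (do not run) the RDC install command.
# $1: package manager (apt-get or zypper)
# $2: optional version, appended to the package name as in rdc<x.y.z>
rdc_install_cmd() {
    case "$1" in
        apt-get|zypper)
            printf 'sudo %s install rdc%s\n' "$1" "${2:-}"
            ;;
        *)
            printf 'unsupported package manager: %s\n' "$1" >&2
            return 1
            ;;
    esac
}
```

Printing the command instead of running it lets you review it before installing anything.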
Components#
The RDC tool consists of the following components:
RDC (API) library#
This library is the central piece. It interacts with the other modules and provides all the features described. This shared library provides a C API and Python bindings so that third-party tools can use it directly if required.
RDC daemon (rdcd)#
The rdcd daemon records telemetry information from GPUs. It also provides an interface to the RDC command-line tool (rdci), running locally or remotely. It relies on the RDC library described above for all the core features.
RDC command-line tool (rdci)#
A command-line tool to invoke all the features of the RDC tool. This CLI can be run locally or remotely.
AMDSMI library#
A stateless system management library that provides low-level interfaces to access GPU information.
Starting RDC#
The RDC tool can be run in two modes: standalone and embedded. The feature set is similar in both cases, so you can choose the mode that best fits your environment.
The capability in each mode depends on the privileges you have when starting the RDC tool. A normal user has access only to the monitoring (GPU telemetry) capabilities. A privileged user can run the tool with full capability. In full-capability mode, GPU configuration features can be invoked. Configuration changes might affect other users and processes sharing the GPU.
Standalone mode#
This is the preferred mode of operation, as it does not have any external dependencies. To start RDC in standalone mode, the RDC server daemon (rdcd) must run on each compute node. Refer to Terminology in Introduction to the RDC tool for more information. You can start rdcd as a systemd service or directly from the command line.
Start the RDC tool using systemd#
If multiple RDC versions are installed, copy /opt/rocm-<x.y.z>/rdc/lib/rdc.service, which is installed with the desired RDC version, to the systemd folder. The capability of RDC can be configured by modifying the rdc.service system configuration file. Use the systemctl command to start rdcd.
$ systemctl start rdc
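When several RDC versions are installed, the copy step described above can be sketched as follows. The helper name `install_rdc_unit` and its arguments are hypothetical. The default destination, /lib/systemd/system, is the location used elsewhere in this section, and `<x.y.z>` remains a placeholder for the installed ROCm version.

```shell
# Hypothetical helper: copy the rdc.service file of the desired RDC version
# into the systemd unit directory.
# $1: path to rdc.service, e.g. /opt/rocm-<x.y.z>/rdc/lib/rdc.service
# $2: systemd unit directory (defaults to /lib/systemd/system)
install_rdc_unit() {
    src=$1
    dest_dir=${2:-/lib/systemd/system}
    if [ ! -f "$src" ]; then
        printf 'service file not found: %s\n' "$src" >&2
        return 1
    fi
    cp "$src" "$dest_dir/rdc.service"
}
```

Copying into the system directory requires root privileges. After replacing a unit file, run `sudo systemctl daemon-reload` so that systemd picks up the change before the service is started.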
By default, rdcd starts with full capability. To switch to monitor-only capability, comment out the following two lines in rdc.service:
$ sudo vi /lib/systemd/system/rdc.service
# CapabilityBoundingSet=CAP_DAC_OVERRIDE
# AmbientCapabilities=CAP_DAC_OVERRIDE
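The same edit can be sketched non-interactively with sed. This is an illustrative helper, not part of RDC: the name `rdc_monitor_only` is hypothetical, and it assumes the two capability lines appear exactly as shown above. It prints the modified file and leaves the input untouched.

```shell
# Hypothetical helper: print a monitor-only version of an rdc.service file
# by commenting out the two capability lines. The input file is not modified.
# $1: path to rdc.service
rdc_monitor_only() {
    sed -E 's/^(CapabilityBoundingSet|AmbientCapabilities)=CAP_DAC_OVERRIDE/# &/' "$1"
}
```

Review the output, write it back to the service file as root, and then run `sudo systemctl daemon-reload` before restarting the service so that systemd rereads the unit file.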
Note
rdcd can be started by using the systemctl command.
$ systemctl start rdc
If the GPU reset fails, restart the server. Note that restarting the server also starts rdcd. You may then encounter one of the following two scenarios:
rdcd returns the correct GPU information to rdci.
rdcd returns the “No GPUs found on the system” error to rdci. To resolve this error, restart rdcd with the following instruction:
$ sudo systemctl restart rdc
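To detect the second scenario from a script, one option is to scan rdci output for the error text quoted above. The helper below is a sketch with a hypothetical name. It reads text on standard input, so it does not depend on any particular rdci subcommand.

```shell
# Hypothetical helper: read rdci output on stdin and succeed (exit 0) only
# if it contains the "No GPUs found on the system" error quoted above.
rdci_reports_no_gpus() {
    grep -q 'No GPUs found on the system'
}
```

For example, you could pipe the output of an rdci query into it (such as `rdci discovery -l 2>&1 | rdci_reports_no_gpus`, where the exact rdci invocation is an assumption) and restart rdcd as described above when the check succeeds.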
Start the RDC tool from the command-line#
While systemctl is the preferred way to start rdcd, you can also start it directly from the command line. The installation scripts create a default user, rdc. You can edit the service file (rdc.service, installed at /lib/systemd/system) and change these lines accordingly:
[Service]
User=rdc
Group=rdc
From the command line, start rdcd as a user such as rdc, or start it as root:
# Start as user rdc
$ sudo -u rdc rdcd
# Start as root
$ sudo rdcd
In this use case, the rdc.service file mentioned in the previous section is not involved. Here, the capability of RDC is determined by the privilege of the user starting rdcd. If rdcd is running under a normal user account, it has the monitor-only capability. If rdcd is running as root, it has the full capability.
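The rule above can be summarized in a short sketch. The function name is hypothetical and the code only restates the rule from this section (root has full capability, any other user is monitor-only). It does not query rdcd itself.

```shell
# Illustrative only: report the capability rdcd would have when started
# directly from the command line by the user with the given numeric UID
# (defaults to the current user). Root (UID 0) gets full capability;
# any other user gets monitor-only.
rdcd_capability_for_uid() {
    uid=${1:-$(id -u)}
    if [ "$uid" -eq 0 ]; then
        echo "full"
    else
        echo "monitor-only"
    fi
}
```

Calling `rdcd_capability_for_uid` with no argument reports the capability for the current user, which can serve as a quick check before starting the daemon by hand.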
Note
If a user other than rdc or root starts the rdcd daemon, the file ownership of the SSL keys mentioned in the Authentication section must be modified to give that user read and write access.
Troubleshoot rdcd#
When rdcd is started using systemctl, the logs can be viewed using the following command:
$ journalctl -u rdc
These messages provide useful status and debugging information. The logs can also help debug problems such as rdcd failing to start or communication issues with a client.
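When the log is long, it can help to narrow the output. The wrapper below is a sketch with a hypothetical name. It uses standard journalctl options (`--since`, `-p`, `--no-pager`) and assumes that rdcd messages carry syslog priorities. Drop `-p err` if the filter hides messages you need.

```shell
# Hypothetical helper: show recent error-level messages from the rdc unit.
# $1: time window understood by journalctl --since (default: "1 hour ago")
rdc_recent_errors() {
    journalctl -u rdc --no-pager --since "${1:-1 hour ago}" -p err
}
```

For example, `rdc_recent_errors today` limits the output to error-level messages logged since midnight.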
Embedded mode#
The embedded mode is useful if the end user has a monitoring agent running on the compute node. The monitoring agent can use the RDC library directly and has finer-grained control over how and when RDC features are invoked. For example, if the monitoring agent has a facility to synchronize across multiple nodes, it can synchronize GPU telemetry across these nodes.
The RDC daemon rdcd can be used as reference code for this purpose. The dependency on gRPC is also eliminated if the RDC library is used directly.
Caution
The RDC command-line tool rdci does not function in this mode. Third-party monitoring software is responsible for providing the user interface and remote access/monitoring.