Third-party integration#

This section lists the third-party plugins available for the RDC tool, such as Prometheus, Grafana, and Reliability, Availability, and Serviceability (RAS).

Python bindings#

The RDC tool provides a generic Python class RdcReader, which simplifies telemetry gathering by providing the following functionalities:

  • RdcReader creates the necessary groups and fieldgroups, watches the fields, and fetches the values of the telemetry fields specified by the user.

  • RdcReader supports both embedded and standalone modes. Standalone mode can run with or without authentication.

  • In standalone mode, the RdcReader can automatically reconnect to rdcd if the connection is lost.

  • Restarting rdcd can cause the loss of previously created groups and fieldgroups. RdcReader recreates them and re-watches the fields after reconnecting.

  • If the client is restarted, RdcReader can detect the previously created groups and fieldgroups and avoid recreating them.

  • A custom unit converter can be passed to RdcReader to override the default RDC unit.

Here is a sample program that monitors power usage and GPU utilization using RdcReader:

import time

from RdcReader import RdcReader
from RdcUtil import RdcUtil
from rdc_bootstrap import *

default_field_ids = [
        rdc_field_t.RDC_FI_POWER_USAGE,
        rdc_field_t.RDC_FI_GPU_UTIL
]

class SimpleRdcReader(RdcReader):
    def __init__(self):
        # ip_port=None runs RDC in embedded mode; update_freq is in microseconds.
        RdcReader.__init__(self, ip_port=None, field_ids=default_field_ids, update_freq=1000000)

    def handle_field(self, gpu_index, value):
        field_name = self.rdc_util.field_id_string(value.field_id).lower()
        print("%d %d:%s %d" % (value.ts, gpu_index, field_name, value.value.l_int))

if __name__ == '__main__':
    reader = SimpleRdcReader()
    while True:
        time.sleep(1)
        reader.process()

In the sample program:

  • The SimpleRdcReader class is derived from RdcReader.

  • Passing ip_port=None to RdcReader runs RDC in embedded mode.

  • SimpleRdcReader.process() fetches the fields specified in default_field_ids.
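
To use standalone mode instead, pass the rdcd address as ip_port when constructing the reader. Here is a minimal sketch of the changed constructor, with rdcd listening at its default localhost:50051; the rest of the class is identical to SimpleRdcReader:

class StandaloneRdcReader(RdcReader):
    def __init__(self):
        # Connect to a running rdcd instead of loading RDC in-process.
        RdcReader.__init__(self,
                           ip_port="localhost:50051",
                           field_ids=default_field_ids,
                           update_freq=1000000)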

Note

RdcReader.py can be found in the python_binding folder located in the RDC installation path.

To run the example, use:

# Ensure that RDC shared libraries are in the library path and
# RdcReader.py is in PYTHONPATH

$ python SimpleReader.py
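
Each line printed by handle_field has the form <timestamp> <gpu_index>:<field_name> <value>. The output resembles the following; the values shown here are illustrative only:

1692345678000 0:power_usage 35
1692345678000 0:gpu_util 0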

Prometheus plugin#

The Prometheus plugin helps to monitor events and send alerts. Prometheus installation and integration details are explained in the following sections.

Prometheus plugin installation#

RDC’s Prometheus plugin rdc_prometheus.py can be found in the python_binding folder.

Here are the steps to install Prometheus:

  1. Install Prometheus client:

    $ pip install prometheus_client
    
  2. Run the Prometheus plugin:

    $ python rdc_prometheus.py
    
  3. Verify the plugin:

    $ curl localhost:5000
    
    gpu_util{gpu_index="0"} 0.0
    gpu_clock{gpu_index="0"} 300.0
    gpu_memory_total{gpu_index="0"} 4294.0
    power_usage{gpu_index="0"} 9.0
    gpu_memory_usage{gpu_index="0"} 134.0
    

By default, the plugin runs in standalone mode and connects to rdcd at localhost:50051 to fetch fields. Ensure that the plugin uses the same authentication mode as rdcd; for example, if rdcd runs with the -u/--unauth option, the plugin must also use the --rdc_unauth option.
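
For example, if rdcd was started with authentication disabled, run the plugin with --rdc_unauth as well (assuming rdcd is on your PATH):

# rdcd started without authentication
$ rdcd -u

# the plugin must then also skip authentication
$ python rdc_prometheus.py --rdc_unauth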

Useful options:

  • To run the plugin in unauthenticated mode, use the --rdc_unauth option.

  • To use the plugin in the embedded mode without rdcd, set the --rdc_embedded option.

  • To override the default fields that are monitored, use the --rdc_fields option to specify the list of fields.

  • If the list of fields is long, use the --rdc_fields_file option to read it from a file, with one field name per line.

  • To control how the fields are cached, use the --rdc_max_keep_age and --rdc_max_keep_samples options.

  • To collect metrics of the plugin process itself, including CPU, memory, file descriptor usage, native thread count, and process start time and uptime, set the --enable_plugin_monitoring option.

To view the options provided with the plugin, use --help.

% python rdc_prometheus.py --help
usage: rdc_prometheus.py [-h] [--listen_port LISTEN_PORT] [--rdc_embedded]
  [--rdc_ip_port RDC_IP_PORT] [--rdc_unauth]
  [--rdc_update_freq RDC_UPDATE_FREQ]
  [--rdc_max_keep_age RDC_MAX_KEEP_AGE]
  [--rdc_max_keep_samples RDC_MAX_KEEP_SAMPLES]
  [--rdc_fields RDC_FIELDS [RDC_FIELDS ...]]
  [--rdc_fields_file RDC_FIELDS_FILE]
  [--rdc_gpu_indexes RDC_GPU_INDEXES [RDC_GPU_INDEXES ...]]
  [--enable_plugin_monitoring]

RDC Prometheus plugin.

optional arguments:
  -h, --help  show this help message and exit
  --listen_port LISTEN_PORT
  The listen port of the plugin (default: 5000)
  --rdc_embedded  Run RDC in embedded mode (default: standalone mode)
  --rdc_ip_port RDC_IP_PORT
  The rdcd IP and port in standalone mode (default:
  localhost:50051)
  --rdc_unauth  Set this option if the rdcd is running with unauth in
  standalone mode (default: false)
  --rdc_update_freq RDC_UPDATE_FREQ
  The fields update frequency in seconds (default: 10)
  --rdc_max_keep_age RDC_MAX_KEEP_AGE
  The max keep age of the fields in seconds (default:
  3600)
  --rdc_max_keep_samples RDC_MAX_KEEP_SAMPLES
  The max samples to keep for each field in the cache
  (default: 1000)
  --rdc_fields RDC_FIELDS [RDC_FIELDS ...]
  The list of fields name needs to be watched, for
  example, " --rdc_fields RDC_FI_GPU_TEMP
  RDC_FI_POWER_USAGE " (default: fields in the
  plugin)
  --rdc_fields_file RDC_FIELDS_FILE
  The list of fields name can also be read from a file
  with each field name in a separated line (default:
  None)
  --rdc_gpu_indexes RDC_GPU_INDEXES [RDC_GPU_INDEXES ...]
  The list of GPUs to be watched (default: All GPUs)
  --enable_plugin_monitoring
  Set this option to collect process metrics of
  the plugin itself (default: false)
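
For example, to run the plugin in embedded mode, watch only the GPU temperature and power usage, and update the fields every 5 seconds, combine the options listed above:

$ python rdc_prometheus.py --rdc_embedded \
    --rdc_fields RDC_FI_GPU_TEMP RDC_FI_POWER_USAGE \
    --rdc_update_freq 5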

Prometheus integration#

To integrate the Prometheus plugin with RDC, follow these steps:

  1. Download and install Prometheus on the management machine.

  2. Configure Prometheus targets:

    Use the example configuration file rdc_prometheus_example.yml in the python_binding folder. This file refers to prometheus_targets.json; modify prometheus_targets.json to point to the correct compute nodes.

    // Sample file: prometheus_targets.json
    // Replace rdc_test*.amd.com with your compute node host names
    // Add as many compute nodes as necessary
    [
      {
        "targets": [
          "rdc_test1.amd.com:5000",
          "rdc_test2.amd.com:5000"
        ]
      }
    ]
    

    Note

    In the above example, there are two compute nodes, rdc_test1.amd.com and rdc_test2.amd.com. Ensure that the Prometheus plugin is running on those compute nodes.
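
    For reference, Prometheus consumes such a targets file through its file_sd_configs service-discovery mechanism. The following is only a minimal sketch of what the relevant scrape_configs entry in rdc_prometheus_example.yml may look like; consult the shipped file for the exact contents:

    scrape_configs:
      - job_name: 'rdc_prometheus'
        file_sd_configs:
          - files:
              - 'prometheus_targets.json'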

  3. Start Prometheus:

    % prometheus --config.file=<full path of the rdc_prometheus_example.yml>
    
  4. From the management node, open the URL http://localhost:9090 in the browser.

  5. Select one of the available metrics.

Fig. 4 Prometheus image showing the GPU clock for both rdc_test1 and rdc_test2.#

Grafana plugin#

Grafana is a popular tool for visualizing time-series data in monitoring stacks. In this setup, Prometheus acts as the storage backend, and Grafana is used as the interface for analysis and visualization. Grafana offers a plethora of visualization options and can be integrated with Prometheus to build an RDC dashboard.

Grafana plugin installation#

To install Grafana, follow these steps:

  1. Download Grafana.

  2. Follow the instructions to install Grafana.

  3. To start Grafana, use:

    $ sudo systemctl start grafana-server
    $ sudo systemctl status grafana-server
    
  4. Open http://localhost:3000/ in the browser.

  5. Log in using the default username and password (admin/admin).

Grafana integration#

As prerequisites, ensure that:

  • The RDC Prometheus plugin is running on each compute node.

  • Prometheus is set up to collect metrics from the plugin.

Grafana configuration#

First, add Prometheus as a data source using the following steps:

  1. Go to “Configuration”.

  2. Select “Data Sources”.

  3. Go to “Add data source”.

  4. Select “Prometheus”.


Note

Ensure that the name of the data source is Prometheus. If Prometheus and Grafana are running on the same machine, use the default URL http://localhost:9090. Otherwise, set the URL to your Prometheus server's URL. Then save and test the data source.


Then, import the RDC dashboard using the following steps:

  1. Go to “+” and select “Import”.

  2. Upload rdc_grafana_dashboard_example.json from the python_binding folder.

  3. Select the desired compute node for visualization.


Prometheus (Grafana) integration with automatic node detection#

RDC can use Consul to discover the rdc_prometheus service automatically. Consul is a service mesh solution providing a fully featured control plane with service discovery, configuration, and segmentation functionality. For more information, see Consul.

RDC uses Consul to run health checks against the Prometheus plugin (rdc_prometheus), providing visibility into whether the plugin on each node is up and responding.

With the Consul agent integration, new compute nodes can be discovered automatically, which saves users from manually updating prometheus_targets.json.

Installing the Consul agent for compute and management nodes#

To install the latest Consul agent on the compute and management nodes, follow these steps:

  1. To download and install the Consul agent, set up the apt repository:

    $ curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
    $ sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
    $ sudo apt-get update && sudo apt-get install consul
    
  2. Generate a key to encrypt the communication between Consul agents. The same key is used by both the compute and management nodes for communication.

    $ consul keygen
    

    For demonstration purposes, the following key is used in the configuration file:

    $ consul keygen
    4lgGQXr3/R2QeTi5vEp7q5Xs1KoYBhCsk9+VgJZZHAo=
    

Setting up the Consul server in management nodes#

While Consul can function with one server, it’s recommended to use three to five servers to avoid failure scenarios leading to data loss.

Note

For demonstration purposes, the configuration settings documented below are for a single server.

To set up Consul server, follow these steps:

  1. Create a configuration file /etc/consul.d/server.hcl.

    server = true
    encrypt = "<CONSUL_ENCRYPTION_KEY>"
    bootstrap_expect = 1
    ui = true
    client_addr = "0.0.0.0"
    bind_addr = "<The IP address that can be reached by clients>"
    

    Here is how the variables in the configuration file are used:

    • Setting server to true runs the agent in server mode.

    • Set encrypt to the key generated earlier.

    • The bootstrap_expect variable indicates the number of servers required to form the first Consul cluster. Set this variable to 1 to allow a cluster with a single server.

    • Setting the ui variable to true enables the Consul web UI.

    • The client_addr variable sets the address the agent binds for client interfaces, such as the API and UI.

    • The bind_addr variable is the address other agents use to reach this server. If the machine has multiple private IP addresses, use the one that the clients can reach.
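
    For example, using the demonstration key generated earlier and a management node whose reachable address is 10.4.22.70 (the server address shown in the consul members output later in this section), the file would look like this:

    server = true
    encrypt = "4lgGQXr3/R2QeTi5vEp7q5Xs1KoYBhCsk9+VgJZZHAo="
    bootstrap_expect = 1
    ui = true
    client_addr = "0.0.0.0"
    bind_addr = "10.4.22.70"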

  2. Start the agent.

    $ sudo consul agent -config-dir=/etc/consul.d/
    
  3. Browse to http://localhost:8500/ on the management node to see a single instance running.

Setting up the Consul client in compute nodes#

To set up Consul client, follow these steps:

  1. Create a configuration file /etc/consul.d/client.hcl.

    server = false
    encrypt = "<CONSUL_ENCRYPTION_KEY>"
    retry_join = ["<The consul server address>"]
    client_addr = "0.0.0.0"
    bind_addr = "<The IP address that can reach the server>"
    

    Note

    Use the same CONSUL_ENCRYPTION_KEY as on the servers. In retry_join, use the IP address of the management node.
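
    For example, with the demonstration key, the management node at 10.4.22.70, and a compute node at 10.4.22.112 (both addresses appear in the consul members output below), the file would look like this:

    server = false
    encrypt = "4lgGQXr3/R2QeTi5vEp7q5Xs1KoYBhCsk9+VgJZZHAo="
    retry_join = ["10.4.22.70"]
    client_addr = "0.0.0.0"
    bind_addr = "10.4.22.112"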

  2. Start the Consul agent.

    $ sudo consul agent -config-dir=/etc/consul.d/
    

    To see if the client has joined the Consul cluster, use:

    $ consul members
    Node             Address           Status  Type    Build  Protocol  DC   Segment
    management-node  10.4.22.70:8301   alive   server  1.9.3  2         dc1  <all>
    compute-node     10.4.22.112:8301  alive   client  1.9.3  2         dc1  <default>
    
  3. Set up the Consul client to monitor the health of the RDC Prometheus plugin.

  4. Start the RDC Prometheus plugin.

    $ python rdc_prometheus.py --rdc_embedded
    
  5. Add the configuration file /etc/consul.d/rdc_prometheus.hcl.

    {
      "service": {
        "name": "rdc_prometheus",
        "tags": [
          "rdc_prometheus",
          "rdc"
        ],
        "port": 5000,
        "check": {
          "id": "rdc_plugin",
          "name": "RDC Prometheus plugin on port 5000",
          "http": "http://localhost:5000",
          "method": "GET",
          "interval": "15s",
          "timeout": "1s"
        }
      }
    }
    

    Note

    By default, the Prometheus plugin uses port 5000. If you don’t use the default setting, change the configuration file accordingly.

  6. After updating the configuration file, restart the Consul client agent.

    $ sudo consul agent -config-dir=/etc/consul.d/
    
  7. Enable the Prometheus integration in the management node.

  8. In the management node, inspect the service.

    $ consul catalog nodes -service=rdc_prometheus
    
    Node              ID        Address      DC
    compute-node      76694ab1  10.4.22.112  dc1
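
    You can also confirm that the plugin's health check is passing through Consul's standard health API:

    $ curl http://localhost:8500/v1/health/checks/rdc_prometheus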
    
  9. Create a new Prometheus configuration file rdc_prometheus_consul.yml for the Consul integration.

    global:
      scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
    scrape_configs:
      - job_name: 'consul'
        consul_sd_configs:
          - server: 'localhost:8500'
        relabel_configs:
          - source_labels: [__meta_consul_tags]
            regex: .*,rdc,.*
            action: keep
          - source_labels: [__meta_consul_service]
            target_label: job
    

    Note

    This example assumes that the Consul server and Prometheus run on the same machine. If they run on different machines, change the server under consul_sd_configs to your Consul server's address.

  10. Start Prometheus.

    $ ./prometheus --config.file="rdc_prometheus_consul.yml"
    
  11. Browse the Prometheus UI at http://localhost:9090 on the management node and query RDC Prometheus metrics. Ensure that the plugin is started before running the query.
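
    For example, querying gpu_util (one of the metrics shown in the curl output earlier) displays the GPU utilization reported by every discovered compute node:

    gpu_util{gpu_index="0"}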

Reliability, Availability, and Serviceability plugin#

The Reliability, Availability, and Serviceability (RAS) plugin helps to monitor and count ECC (Error-Correcting Code) errors. The following sections provide information on integrating RAS with RDC.

RAS plugin installation#

With the RAS feature enabled in the graphics card, you can use RDC to monitor RAS errors.

Prerequisite#

  • Ensure that the GPU supports RAS.

    Note

    The RAS library is installed as part of the RDC installation. No additional configuration is required for RDC.

  • The RDC installation dynamically loads the RAS library librdc_ras.so. The configuration files required by the RAS library are installed in the sp3 and config folders.

    % ls /opt/rocm-4.2.0/rdc/lib
    ... librdc_ras.so ...
    ... sp3 ... config ...
    

RAS integration#

RAS exposes a list of ECC correctable and uncorrectable errors for different IP blocks and helps to troubleshoot issues.

Example:

$ rdci dmon -i 0 -e 600,601

Here, the dmon command monitors GPU index 0 and fields 600 and 601, where 600 is the field ID for the ECC_CORRECT counter and 601 is the field ID for the ECC_UNCORRECT counter.

% rdci dmon -l
... ...
600 RDC_FI_ECC_CORRECT_TOTAL   : Accumulated Single Error Correction
601 RDC_FI_ECC_UNCORRECT_TOTAL : Accumulated Double Error Detection
602 RDC_FI_ECC_SDMA_CE         : SDMA Correctable Error
603 RDC_FI_ECC_SDMA_UE         : SDMA Uncorrectable Error
604 RDC_FI_ECC_GFX_CE          : GFX Correctable Error
605 RDC_FI_ECC_GFX_UE          : GFX Uncorrectable Error
606 RDC_FI_ECC_MMHUB_CE        : MMHUB Correctable Error
607 RDC_FI_ECC_MMHUB_UE        : MMHUB Uncorrectable Error
608 RDC_FI_ECC_ATHUB_CE        : ATHUB Correctable Error
609 RDC_FI_ECC_ATHUB_UE        : ATHUB Uncorrectable Error
610 RDC_FI_ECC_PCIE_BIF_CE     : PCIE_BIF Correctable Error
611 RDC_FI_ECC_PCIE_BIF_UE     : PCIE_BIF Uncorrectable Error
612 RDC_FI_ECC_HDP_CE          : HDP Correctable Error
613 RDC_FI_ECC_HDP_UE          : HDP Uncorrectable Error
614 RDC_FI_ECC_XGMI_WAFL_CE    : XGMI WAFL Correctable Error
615 RDC_FI_ECC_XGMI_WAFL_UE    : XGMI WAFL Uncorrectable Error
616 RDC_FI_ECC_DF_CE           : DF Correctable Error
617 RDC_FI_ECC_DF_UE           : DF Uncorrectable Error
618 RDC_FI_ECC_SMN_CE          : SMN Correctable Error
619 RDC_FI_ECC_SMN_UE          : SMN Uncorrectable Error
620 RDC_FI_ECC_SEM_CE          : SEM Correctable Error
621 RDC_FI_ECC_SEM_UE          : SEM Uncorrectable Error
622 RDC_FI_ECC_MP0_CE          : MP0 Correctable Error
623 RDC_FI_ECC_MP0_UE          : MP0 Uncorrectable Error
624 RDC_FI_ECC_MP1_CE          : MP1 Correctable Error
625 RDC_FI_ECC_MP1_UE          : MP1 Uncorrectable Error
626 RDC_FI_ECC_FUSE_CE         : FUSE Correctable Error
627 RDC_FI_ECC_FUSE_UE         : FUSE Uncorrectable Error
628 RDC_FI_ECC_UMC_CE          : UMC Correctable Error
629 RDC_FI_ECC_UMC_UE          : UMC Uncorrectable Error
630 RDC_FI_ECC_MCA_CE          : MCA Correctable Error
631 RDC_FI_ECC_MCA_UE          : MCA Uncorrectable Error
632 RDC_FI_ECC_VCN_CE          : VCN Correctable Error
633 RDC_FI_ECC_VCN_UE          : VCN Uncorrectable Error
634 RDC_FI_ECC_JPEG_CE         : JPEG Correctable Error
635 RDC_FI_ECC_JPEG_UE         : JPEG Uncorrectable Error
636 RDC_FI_ECC_IH_CE           : IH Correctable Error
637 RDC_FI_ECC_IH_UE           : IH Uncorrectable Error
638 RDC_FI_ECC_MPIO_CE         : MPIO Correctable Error
639 RDC_FI_ECC_MPIO_UE         : MPIO Uncorrectable Error
... ...

To access the ECC correctable and uncorrectable error counters, use:

% rdci dmon -i 0 -e 600,601

GPU     ECC_CORRECT         ECC_UNCORRECT
0       0                   0
0       0                   0
0       0                   0
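
The same counters can also be watched from Python using the RdcReader class introduced earlier. Here is a minimal sketch, assuming the rdc_bootstrap module exposes the ECC field enums under the names listed by rdci dmon -l above:

import time

from RdcReader import RdcReader
from rdc_bootstrap import *

# Field IDs 600 and 601, as listed by `rdci dmon -l`.
ecc_field_ids = [
        rdc_field_t.RDC_FI_ECC_CORRECT_TOTAL,
        rdc_field_t.RDC_FI_ECC_UNCORRECT_TOTAL
]

class EccReader(RdcReader):
    def __init__(self):
        # Embedded mode; update the counters once per second (microseconds).
        RdcReader.__init__(self, ip_port=None, field_ids=ecc_field_ids, update_freq=1000000)

    def handle_field(self, gpu_index, value):
        field_name = self.rdc_util.field_id_string(value.field_id).lower()
        print("%d %d:%s %d" % (value.ts, gpu_index, field_name, value.value.l_int))

if __name__ == '__main__':
    reader = EccReader()
    while True:
        time.sleep(1)
        reader.process()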