This page contains proposed changes for a future release of ROCm. Read the latest Linux release of ROCm documentation for your production environments.

Using TransferBench#

You can control the SRC and DST memory locations by indicating the memory type followed by the device index. TransferBench supports the following memory types:

  • Coarse-grained pinned host

  • Unpinned host

  • Fine-grained host

  • Coarse-grained global device

  • Fine-grained global device

  • Null (for an empty transfer)

In addition, you can determine the size of the transfer (number of bytes to copy) for the tests.

You can also specify transfer executors. The options are CPU, kernel-based GPU, and SDMA-based GPU (DMA) executors. TransferBench also provides the option to choose the number of Sub-Executors (SE). The number of SEs specifies the number of CPU threads in the case of a CPU executor and the number of compute units (CU) for a GPU executor. For a DMA executor, the SE argument determines the number of streams to be used.

You can specify the transfers in a configuration file or use preset configurations for transfers.

Specifying transfers in a configuration file#

A transfer is defined as a single operation where an executor reads and adds together values from SRC memory locations, followed by writing the sum to the DST memory locations. This simplifies to a copy operation when using a single SRC or DST. Here’s a copy operation from a single SRC to DST:

SRC 0                DST 0
SRC 1 -> Executor -> DST 1
SRC X                DST Y

Three executors are supported by TransferBench:

Executor:        SubExecutor:
1. CPU           CPU thread
2. GPU           GPU threadblock/Compute Unit (CU)
3. DMA           N/A (Can only be used for a single SRC to DST copy)

Each line in the configuration file defines a set of transfers, also known as a test, to run in parallel.

There are two ways to specify a test:

  • Basic

    The basic specification assumes the same number of SEs used per transfer. A positive number of transfers is specified, followed by the number of SEs and triplets describing each transfer:

    Transfers SEs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL)
    

    The arguments used to specify transfers in the config file are described in the arguments table.

    Example:

    1 4 (G0->G0->G1)                   Uses 4 CUs on GPU0 to copy from GPU0 to GPU1
    1 4 (G2->C1->G0)                   Uses 4 CUs on GPU2 to copy from CPU1 to GPU0
    2 4 G0->G0->G1 G1->G1->G0          Copies from GPU0 to GPU1, and GPU1 to GPU0, each with 4 SEs
    
  • Advanced

    In the advanced specification, a negative number of transfers is specified, followed by quintuplets describing each transfer. Specifying a non-zero number of bytes overrides any provided value.

    Transfers (srcMem1->Executor1->dstMem1 SEs1 Bytes1) ... (srcMemL->ExecutorL->dstMemL SEsL BytesL)
    

    The arguments used to specify transfers in the config file are described in the arguments table.

    Example:

    -2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M) Copies 1Mb from GPU0 to GPU1 with 4 SEs and 2Mb from GPU1 to GPU0 with 2 SEs
    

Here is the list of arguments used to specify transfers in the config file:

Argument

Description

Transfers

Number of transfers to be run in parallel

SE

Number of SEs to use (CPU threads or GPU threadblocks)

srcMemL

Source memory locations (where the data is read)

Executor

Executor is specified by a character indicating type, followed by the device index (0-indexed):
- C: CPU-executed (indexed from 0 to NUMA nodes - 1)
- G: GPU-executed (indexed from 0 to GPUs - 1)
- D: DMA-executor (indexed from 0 to GPUs - 1)

dstMemL

Destination memory locations (where the data is written)

bytesL

Number of bytes to copy (use command-line specified size when 0).
Must be a multiple of four and can be suffixed with (‘K’,’M’, or ‘G’).
Memory locations are specified by one or more device characters or device index pairs.
Characters indicate memory type and are followed by device index (0-indexed).
Here are the characters and their respective memory locations:
- C: Pinned host memory (on NUMA node, indexed from 0 to [NUMA nodes-1])
- U: Unpinned host memory (on NUMA node, indexed from 0 to [NUMA nodes-1])
- B: Fine-grain host memory (on NUMA node, indexed from 0 to [NUMA nodes-1])
- G: Global device memory (on GPU device, indexed from 0 to [GPUs - 1])
- F: Fine-grain device memory (on GPU device, indexed from 0 to [GPUs - 1])
- N: Null memory (index ignored)

Round brackets and arrows “->” can be included for human clarity, but will be ignored. Lines starting with # are ignored while lines starting with ## are echoed to the output.

Transfer examples:

Single GPU-executed transfer between GPU 0 and 1 using 4 CUs:

1 4 (G0->G0->G1)

Single DMA-executed transfer between GPU 0 and 1:

1 1 (G0->D0->G1)

Copying 1Mb from GPU 0 to GPU 1 with 4 CUs, and 2Mb from GPU 1 to GPU 0 with 8 CUs:

-2 (G0->G0->G1 4 1M) (G1->G1->G0 8 2M)

“Memset” by GPU 0 to GPU 0 memory:

1 32 (N0->G0->G0)

“Read-only” by CPU 0:

1 4 (C0->C0->N0)

Broadcast from GPU 0 to GPU 0 and GPU 1:

1 16 (G0->G0->G0G1)

Note

Running TransferBench with no arguments displays usage instructions and detected topology information.

Using preset configurations#

Here is the list of preset configurations that can be used instead of configuration files:

Configuration

Description

a2a

All-to-all benchmark test

cmdline

Allows transfers to run from the command line instead of a configuration file

healthcheck

Simple health check (supported on AMD Instinct MI300 series only)

p2p

Peer-to-peer benchmark test

pcopy

Benchmark parallel copies from a single GPU to other GPUs

rsweep

Random sweep across possible sets of transfers

rwrite

Benchmark parallel remote writes from a single GPU to other GPUs

scaling

GPU subexecutor scaling tests

schmoo

Read, write, or copy operation on local or remote between two GPUs

sweep

Sweep across possible sets of transfers

Performance tuning#

When you use the same GPU executor in multiple simultaneous transfers on separate streams by setting USE_SINGLE_STREAM=0, the performance might be serialized due to the maximum number of hardware queues available. To improve the performance, adjust the number of maximum hardware queues using GPU_MAX_HW_QUEUES.