rocSHMEM environment variables
This section describes the important environment variables used to control the behavior of rocSHMEM.
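As a minimal sketch of how these variables are typically applied, the Python snippet below builds a child-process environment with two of the variables listed in the table below and passes it to a launched process. The helper name `rocshmem_env`, the chosen values, and the stand-in child command are illustrative assumptions; a real run would launch the rocSHMEM application instead.

```python
import os
import subprocess
import sys


def rocshmem_env(overrides):
    """Return a copy of the current environment with ROCSHMEM_* overrides applied."""
    env = dict(os.environ)
    for name, value in overrides.items():
        if not name.startswith("ROCSHMEM_"):
            raise ValueError(f"not a rocSHMEM variable: {name}")
        env[name] = str(value)  # environment values are always strings
    return env


# Hypothetical settings: the values are arbitrary illustrations, not recommendations.
env = rocshmem_env({
    "ROCSHMEM_DEBUG_LEVEL": "trace:noversion",
    "ROCSHMEM_HEAP_SIZE": 1 << 30,  # 1 GiB per PE, expressed in bytes
})

# Stand-in child process that only echoes one variable back.
child = [sys.executable, "-c", "import os; print(os.environ['ROCSHMEM_HEAP_SIZE'])"]
result = subprocess.run(child, env=env, capture_output=True, text=True, check=True)
print(result.stdout.strip())
```

Copying `os.environ` rather than starting from an empty dictionary keeps unrelated settings (such as `PATH`) visible to the child process.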
Environment variable |
Default value |
Value |
|---|---|---|
ROCSHMEM_DEBUG_LEVEL
Debug output level.
|
|
Levels (from least to most verbose):
NONE: Suppress all output.
ERROR: Print error messages only.
WARN: Print warnings and errors (default).
ENV: Print modified environment variables at startup.
VERSION: Print build/version information at startup.
INFO: Print informational messages and above.
API: Print API call tracing (requires BUILD_DEBUG_TRACE_HOST/BUILD_DEBUG_TRACE_DEVICE).
TRACE: Print all messages including internal traces (requires BUILD_DEBUG_TRACE_HOST/BUILD_DEBUG_TRACE_DEVICE).
Modifiers can be appended with : to suppress specific categories:
:noerror, :nowarn, :noenv, :noversion, :noinfo, :noapi, :notrace
Appending :full or :all after the env level or the :env modifier controls the detail of the environment printout.
:color (default) or :nocolor enables or disables ANSI color output.
Examples:
trace:noversion, env:full, api:noenv, trace:nocolor |
ROCSHMEM_HEAP_SIZE
Defines the size of the rocSHMEM symmetric heap in bytes (per PE).
|
|
Size in bytes (per PE).
Note: the heap is on GPU memory.
|
ROCSHMEM_MAX_NUM_HOST_CONTEXTS
Maximum number of host-side communication contexts.
|
|
Maximum number of host-side contexts. |
ROCSHMEM_MAX_NUM_CONTEXTS
Defines the number of contexts an application can use.
|
|
Maximum number of contexts. |
ROCSHMEM_MAX_NUM_TEAMS
Defines the number of teams an application can use.
|
|
Maximum number of teams. |
ROCSHMEM_BACKEND
When rocSHMEM is compiled for all backends, this environment variable
selects which backend to use. The default value is an empty string, in which case rocSHMEM auto-selects the most appropriate backend.
|
`` `` |
ipc: IPC Backend
ro: Reverse Offload Backend
gda: GPU Direct Async Backend |
ROCSHMEM_UNIQUEID_WITH_MPI
Defines whether rocSHMEM is expected to use MPI internally when using the uniqueId-based initialization.
|
|
0: Do not use MPI.
1: Use MPI. |
ROCSHMEM_DISABLE_MIXED_IPC
Defines whether to force using the network conduit even when IPC is available.
|
|
0: Use IPC when available.
1: Force network conduit. |
ROCSHMEM_USE_IB_HCA
Forces the NIC that this PE uses. When this value is set, NIC auto-detection and mapping are disabled and the NIC specified in the variable
is selected. The default value is an empty string, in which case rocSHMEM auto-detects the most appropriate NIC.
|
`` `` |
Example value:
bnxt_re0 |
ROCSHMEM_HCA_LIST
Comma-separated list of NIC names that can be used by rocSHMEM. Unlike
ROCSHMEM_USE_IB_HCA, when this variable is set, NIC auto-detection and mapping still execute, but NICs that are not in the list are discarded before auto-detection runs.
Prefixing the list with ^ turns it into an exclude list: NICs that are in the list are discarded before auto-detection runs.
The default value is an empty string, in which case rocSHMEM auto-detects the most appropriate NIC.
|
`` `` |
Example values:
bnxt_re1,bnxt_re11
^mlx5_0,mlx5_3 |
ROCSHMEM_BOOTSTRAP_SOCKET_IFNAME
Chooses the network interface used to bootstrap rocSHMEM.
Only valid when not using MPI.
The default value is an empty string, in which case rocSHMEM auto-detects the most appropriate interface.
|
`` `` |
Example value:
eno8303 |
ROCSHMEM_GDA_PROVIDER
When rocSHMEM is compiled with support for multiple NIC vendors,
this environment variable selects the desired provider.
The default value is an empty string, in which case rocSHMEM auto-detects the most appropriate NIC.
|
`` `` |
bnxt: Broadcom Thor 2
pensando: AMD Pensando Pollara
ionic: AMD Pensando Pollara (alias)
mlx5: Mellanox ConnectX-7 |
ROCSHMEM_GDA_ALTERNATE_QP_PORTS
Enables or disables alternating QP mappings across rocSHMEM contexts.
|
|
0: Disabled.
1: Enabled. This helps saturate bandwidth on multiport bonded interfaces. |
ROCSHMEM_GDA_TRAFFIC_CLASS
When using a NIC with an Ethernet link layer, this sets the traffic class for the QPs.
|
|
The traffic class number. |
ROCSHMEM_GDA_PCIE_RELAXED_ORDERING
Enables PCIe Relaxed Ordering when registering the symmetric heap with the RDMA NICs.
|
|
0: Disabled.
1: Enabled. |
ROCSHMEM_GDA_ENABLE_DMABUF
Enables dmabuf support for memory registration.
|
|
0: Disabled.
1: Enabled. |
ROCSHMEM_GDA_ALLTOALLV_WG_ALGO
Selects between two algorithms for GDA-based alltoallv.
The GET algorithm uses an initial round of alltoallv
communication to distribute displacements, then a second round to
get the transfer data. This algorithm has higher latency but
performs better for large messages.
The COPY algorithm performs an alltoallv communication
pattern into a staging buffer, then copies into the destination
buffers. This reduces latency but requires more memory, and the
algorithm only works for small messages.
|
|
GET: GET-based alltoallv algorithm
COPY: Copy alltoallv algorithm |
ROCSHMEM_GDA_OVERRIDE_NIC_FIRMWARE_CHECK
This environment variable should be used with caution.
It overrides the NIC firmware check if
a user wants to use an unsupported NIC firmware.
If the firmware check is disabled, rocSHMEM is not guaranteed to work.
|
|
0: Disabled.
1: Enabled. |
ROCSHMEM_GDA_SQ_SIZE
This environment variable sets the length of the SQ for GDA.
|
|
Maximum number of Work Queue Entries (WQEs) posted on the Send Queue (SQ)
|
ROCSHMEM_GDA_NUM_QPS_PER_PE_DEFAULT_CTX
Sets the number of Queue Pairs (QPs) to create per PE for the default context.
|
|
Number of QPs per PE for the default context. |
ROCSHMEM_GDA_NUM_QPS_PER_PE_USR_CTX
Sets the number of Queue Pairs (QPs) to create per PE for each user context.
|
|
Number of QPs per PE for each user context. |
ROCSHMEM_GDA_NUM_USER_BUFFERS
GDA supports rocshmem_buffer_register and rocshmem_buffer_unregister
for user buffers. This variable sets the number of user buffers an
application may register when using the GDA backend.
If the application registers more user buffers than this variable
allows, the behavior is undefined.
|
|
Maximum number of user buffer registrations for GDA |
ROCSHMEM_MAX_WF_BUFFERS
Maximum number of wavefront buffer arrays in the default context (determines the size of the status, return, and atomic return buffers).
|
|
|
ROCSHMEM_BOOTSTRAP_TIMEOUT
Bootstrap initialization timeout in seconds.
|
|
|
ROCSHMEM_BOOTSTRAP_HOSTID
Overrides the host identifier used for bootstrap. An empty string uses the hostname.
|
`` `` |
|
ROCSHMEM_BOOTSTRAP_SOCKET_FAMILY
Socket family for bootstrap (AF_UNSPEC, AF_INET, AF_INET6).
|
|
|
ROCSHMEM_SDMA_ENABLED
Enables or disables the SDMA transport at runtime (requires the
USE_SDMA build option). |
|
0: Disabled. All transfers use GPU load/store (IPC path).
1: Enabled. Transfers at or above ROCSHMEM_SDMA_THRESHOLD use the SDMA engine. |
ROCSHMEM_SDMA_THRESHOLD
Minimum transfer size in bytes to route through the SDMA engine.
Transfers smaller than this threshold use GPU load/store instead.
|
|
Size in bytes. |
ROCSHMEM_SDMA_NUM_CHANNELS
Number of SDMA channels (ring buffers) allocated per destination PE.
More channels reduce CAS contention when many wavefronts submit concurrently,
at the cost of additional queue memory and SDMA engine resources.
|
|
Number of channels per destination PE. |
ROCSHMEM_SDMA_SPREAD_CHANNELS
When enabled, each wavefront within a workgroup selects its SDMA channel
using an offset based on its wavefront index:
effective_channel = (ctx_channel + wf_id) % num_channels.
This reduces CAS contention when multiple wavefronts in the same workgroup
target the same destination PE on a shared context.
By default, spreading is automatically enabled only for the default context
(ctx_id=0), which is shared by all workgroups when contexts are not
created per-workgroup. Per-workgroup contexts (created via
rocshmem_wg_ctx_create) already distribute workgroups across channels
via their context ID; enabling spreading for them reshuffles contention
without reducing it.
|
|
0: Apply wf_id spreading only for the default (shared) context.
1: Apply wf_id spreading for all contexts. |
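The ROCSHMEM_DEBUG_LEVEL syntax (a level followed by colon-separated modifiers) can be illustrated with a small Python parser. This is a sketch of the documented syntax only, not rocSHMEM's own parser: the function name `parse_debug_level`, the case-insensitive matching, the assumption that a level enables every less verbose category, and the treatment of a bare `:env` modifier as enabling that category are all assumptions made for illustration.

```python
# Level names as listed in the table, from least to most verbose.
LEVELS = ["none", "error", "warn", "env", "version", "info", "api", "trace"]
CATEGORIES = LEVELS[1:]  # "none" is a level but not a printable category


def parse_debug_level(value):
    """Parse a ROCSHMEM_DEBUG_LEVEL-style string (illustrative only)."""
    base, *modifiers = value.strip().lower().split(":")
    if base not in LEVELS:
        raise ValueError(f"unknown level: {base!r}")

    # Assumption: a level enables its own category and every less verbose one.
    enabled = set(LEVELS[1 : LEVELS.index(base) + 1])
    color = True       # the table lists :color as the default
    env_full = False   # set by :full or :all

    for mod in modifiers:
        if mod in ("color", "nocolor"):
            color = mod == "color"
        elif mod in ("full", "all"):
            env_full = True
        elif mod in CATEGORIES:
            enabled.add(mod)           # e.g. ":env" (assumed to enable the category)
        elif mod.startswith("no") and mod[2:] in CATEGORIES:
            enabled.discard(mod[2:])   # e.g. ":noversion"
        else:
            raise ValueError(f"unknown modifier: {mod!r}")

    return {"level": base, "enabled": enabled, "color": color, "env_full": env_full}


# The four example values from the table.
for example in ("trace:noversion", "env:full", "api:noenv", "trace:nocolor"):
    print(example, "->", parse_debug_level(example))
```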
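The include and exclude semantics of ROCSHMEM_HCA_LIST can be sketched in a few lines of Python. The function `filter_nics` and the device names passed to it are hypothetical, and exact name matching is an assumption; the actual matching rules used by rocSHMEM may differ.

```python
def filter_nics(available, hca_list):
    """Return the NICs that would remain candidates for auto-detection.

    available: NIC names found on the node (hypothetical input).
    hca_list:  a ROCSHMEM_HCA_LIST-style string; "" keeps every NIC.
    """
    if not hca_list:
        return list(available)

    # A leading "^" turns the list into an exclude list.
    exclude = hca_list.startswith("^")
    body = hca_list[1:] if exclude else hca_list
    names = {name.strip() for name in body.split(",") if name.strip()}

    # Assumption: names are compared exactly, with no prefix matching.
    return [nic for nic in available if (nic in names) != exclude]


nics = ["bnxt_re0", "bnxt_re1", "bnxt_re11", "mlx5_0", "mlx5_3"]
print(filter_nics(nics, "bnxt_re1,bnxt_re11"))  # include list
print(filter_nics(nics, "^mlx5_0,mlx5_3"))      # exclude list
```

In both cases the surviving NICs would still go through auto-detection, unlike ROCSHMEM_USE_IB_HCA, which bypasses it.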
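The SDMA routing and channel-spreading rules described in the last three rows can be modeled on the host in Python. This is a sketch under stated assumptions, not device code from rocSHMEM: the function names are hypothetical, and the non-spread case is assumed to use `ctx_channel % num_channels` unchanged.

```python
def uses_sdma(nbytes, threshold, enabled=True):
    """Transfers at or above the threshold use SDMA when it is enabled."""
    return enabled and nbytes >= threshold


def effective_channel(ctx_channel, wf_id, num_channels, ctx_id, spread_all=False):
    """Pick an SDMA channel for one wavefront.

    spread_all models ROCSHMEM_SDMA_SPREAD_CHANNELS=1; with the default (0),
    only the shared default context (ctx_id == 0) applies the wf_id offset.
    """
    spread = spread_all or ctx_id == 0
    offset = wf_id if spread else 0  # assumption: no offset when not spreading
    return (ctx_channel + offset) % num_channels


# Four wavefronts on the default context land on four different channels.
print([effective_channel(0, wf, 4, ctx_id=0) for wf in range(4)])
# The same wavefronts on a per-workgroup context share that context's channel.
print([effective_channel(1, wf, 4, ctx_id=5) for wf in range(4)])
```

The second call shows why spreading is off by default for per-workgroup contexts: each context already maps to its own channel, so the wavefront offset is not needed to separate workgroups.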