Collective Communication Operations
Rocprofiler SDK Developer API 0.5.0
ROCm Profiling API and tools
Collective communication operations must be called separately for each communicator in a communicator clique.
Functions
| Return type | Function | Description |
| --- | --- | --- |
| ncclResult_t | ncclReduce (const void *sendbuff, void *recvbuff, unsigned long count, ncclDataType_t datatype, ncclRedOp_t op, int root, ncclComm_t comm, hipStream_t stream) | Reduce. |
| ncclResult_t | ncclBcast (void *buff, unsigned long count, ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream) | (Deprecated) Broadcast (in-place) |
| ncclResult_t | ncclBroadcast (const void *sendbuff, void *recvbuff, unsigned long count, ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream) | Broadcast. |
| ncclResult_t | ncclAllReduce (const void *sendbuff, void *recvbuff, unsigned long count, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, hipStream_t stream) | All-Reduce. |
| ncclResult_t | ncclReduceScatter (const void *sendbuff, void *recvbuff, unsigned long recvcount, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, hipStream_t stream) | Reduce-Scatter. |
| ncclResult_t | ncclAllGather (const void *sendbuff, void *recvbuff, unsigned long sendcount, ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream) | All-Gather. |
| ncclResult_t | ncclSend (const void *sendbuff, unsigned long count, ncclDataType_t datatype, int peer, ncclComm_t comm, hipStream_t stream) | Send. |
| ncclResult_t | ncclRecv (void *recvbuff, unsigned long count, ncclDataType_t datatype, int peer, ncclComm_t comm, hipStream_t stream) | Receive. |
| ncclResult_t | ncclGather (const void *sendbuff, void *recvbuff, unsigned long sendcount, ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream) | Gather. |
| ncclResult_t | ncclScatter (const void *sendbuff, void *recvbuff, unsigned long recvcount, ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream) | Scatter. |
| ncclResult_t | ncclAllToAll (const void *sendbuff, void *recvbuff, unsigned long count, ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream) | All-To-All. |
| ncclResult_t | ncclAllToAllv (const void *sendbuff, const unsigned long sendcounts[], const unsigned long sdispls[], void *recvbuff, const unsigned long recvcounts[], const unsigned long rdispls[], ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream) | All-To-Allv. |
Detailed Description
Collective communication operations must be called separately for each communicator in a communicator clique.
They return once the operations have been enqueued on the HIP stream. Because they may perform inter-CPU synchronization, each call must be made from a different thread or process, or must use Group Semantics (see below).
Function Documentation
◆ ncclAllGather()
ncclResult_t ncclAllGather (const void *sendbuff, void *recvbuff, unsigned long sendcount, ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream)
All-Gather.
Each device gathers sendcount values from all ranks into recvbuff, receiving data from rank i at offset i*sendcount. The receive count is implicitly nranks*sendcount, so recvbuff must hold at least nranks*sendcount elements. The operation is in-place if sendbuff == recvbuff + rank * sendcount.
- Returns
- Result code. See Result Codes for more details.
- Parameters
- [in] sendbuff: Input data array to send
- [out] recvbuff: Data array to store the gathered result
- [in] sendcount: Number of elements each rank sends
- [in] datatype: Data buffer element datatype
- [in] comm: Communicator group object to execute on
- [in] stream: HIP stream to execute collective on
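The offset rule above can be checked on the host without a GPU. The following is a minimal sketch in plain C that models only the data layout. It does not call RCCL, and the function name `model_all_gather` and the back-to-back `[rank][element]` storage are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side model of the All-Gather layout (no RCCL calls).
 * send: nranks send buffers of sendcount ints each, stored back to back.
 * recv: nranks receive buffers of nranks*sendcount ints each, back to back. */
void model_all_gather(const int *send, int *recv,
                      size_t nranks, size_t sendcount)
{
    size_t recvcount = nranks * sendcount;  /* implied receive size per rank */
    for (size_t r = 0; r < nranks; ++r) {       /* receiving rank */
        for (size_t i = 0; i < nranks; ++i) {   /* sending rank */
            /* Data from rank i lands at offset i*sendcount on every rank. */
            memcpy(recv + r * recvcount + i * sendcount,
                   send + i * sendcount,
                   sendcount * sizeof(int));
        }
    }
}
```

With three ranks and sendcount of 2, every modeled receive buffer ends up holding the six input values in rank order.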
◆ ncclAllReduce()
ncclResult_t ncclAllReduce (const void *sendbuff, void *recvbuff, unsigned long count, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, hipStream_t stream)
All-Reduce.
Reduces data arrays of length count in sendbuff using the op operation and leaves identical copies of the result in each recvbuff. The operation is in-place if sendbuff == recvbuff.
- Returns
- Result code. See Result Codes for more details.
- Parameters
- [in] sendbuff: Input data array to reduce
- [out] recvbuff: Data array to store reduced result array
- [in] count: Number of elements in data buffer
- [in] datatype: Data buffer element datatype
- [in] op: Reduction operator
- [in] comm: Communicator group object to execute on
- [in] stream: HIP stream to execute collective on
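As a sketch of the semantics only, the reduction can be modeled on the host with a sum operator standing in for op. The function name `model_all_reduce_sum` and the back-to-back per-rank storage are illustrative assumptions, and no RCCL call is made.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side model of All-Reduce with a sum operator.
 * send: nranks input arrays of count ints, stored back to back.
 * recv: nranks output arrays of count ints, stored back to back. */
void model_all_reduce_sum(const int *send, int *recv,
                          size_t nranks, size_t count)
{
    for (size_t k = 0; k < count; ++k) {
        int acc = 0;
        for (size_t r = 0; r < nranks; ++r)
            acc += send[r * count + k];  /* reduce element k across ranks */
        for (size_t r = 0; r < nranks; ++r)
            recv[r * count + k] = acc;   /* every rank gets the same value */
    }
}
```

Because element k is fully read before it is written, the model also behaves correctly when send and recv point to the same storage, which mirrors the in-place case.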
◆ ncclAllToAll()
ncclResult_t ncclAllToAll (const void *sendbuff, void *recvbuff, unsigned long count, ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream)
All-To-All.
Device i sends its j-th block of data to device j, where it is placed as the i-th block. Each block sent or received has count elements, so sendbuff and recvbuff must each hold nranks*count elements. In-place operation is NOT supported. It is the user's responsibility to ensure that sendbuff and recvbuff are distinct.
- Returns
- Result code. See Result Codes for more details.
- Parameters
- [in] sendbuff: Data array to send (contains blocks for each other rank)
- [out] recvbuff: Data array to receive (contains blocks from each other rank)
- [in] count: Number of elements to send between each pair of ranks
- [in] datatype: Data buffer element datatype
- [in] comm: Communicator group object to execute on
- [in] stream: HIP stream to execute collective on
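The block exchange amounts to a transpose of blocks across ranks. A minimal host-side sketch follows. It models the layout only, `model_all_to_all` is a hypothetical name, and the per-rank buffers are assumed to be stored back to back.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side model of All-To-All (no RCCL calls).
 * send and recv each hold nranks buffers of nranks*count ints.
 * They must not overlap, matching the out-of-place requirement. */
void model_all_to_all(const int *send, int *recv,
                      size_t nranks, size_t count)
{
    size_t per_rank = nranks * count;
    for (size_t i = 0; i < nranks; ++i) {      /* sending rank */
        for (size_t j = 0; j < nranks; ++j) {  /* receiving rank */
            /* Block j of rank i becomes block i of rank j. */
            memcpy(recv + j * per_rank + i * count,
                   send + i * per_rank + j * count,
                   count * sizeof(int));
        }
    }
}
```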
◆ ncclAllToAllv()
ncclResult_t ncclAllToAllv (const void *sendbuff, const unsigned long sendcounts[], const unsigned long sdispls[], void *recvbuff, const unsigned long recvcounts[], const unsigned long rdispls[], ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream)
All-To-Allv.
Device i sends sendcounts[j] elements starting at offset sdispls[j] to device j. At the same time, device i receives recvcounts[j] elements from device j and places them at offset rdispls[j]. sendcounts, sdispls, recvcounts and rdispls are all measured in units of datatype, not bytes. The operation is in-place if sendbuff == recvbuff.
- Returns
- Result code. See Result Codes for more details.
- Parameters
- [in] sendbuff: Data array to send (contains blocks for each other rank)
- [in] sendcounts: Array containing number of elements to send to each participating rank
- [in] sdispls: Array of offsets into sendbuff for each participating rank
- [out] recvbuff: Data array to receive (contains blocks from each other rank)
- [in] recvcounts: Array containing number of elements to receive from each participating rank
- [in] rdispls: Array of offsets into recvbuff for each participating rank
- [in] datatype: Data buffer element datatype
- [in] comm: Communicator group object to execute on
- [in] stream: HIP stream to execute collective on
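How the count and displacement arrays interact is easier to see in a small host-side model. The sketch below is not RCCL code: `model_all_to_allv`, the `stride` parameter, and the flattening of each rank's arrays into one `[rank * nranks + peer]` table are assumptions for illustration. The model also assumes that the count a rank expects from a peer equals the count that peer sends, and it reports a mismatch instead of copying.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side model of All-To-Allv (no RCCL calls).
 * Each rank owns a buffer of stride ints inside send and recv.
 * Entry [i * nranks + j] of a table is rank i's value for peer j.
 * All counts and displacements are in elements, not bytes.
 * Returns 0 on success, -1 if a send/receive count pair disagrees. */
int model_all_to_allv(const int *send, const size_t *sendcounts,
                      const size_t *sdispls, int *recv,
                      const size_t *recvcounts, const size_t *rdispls,
                      size_t nranks, size_t stride)
{
    for (size_t i = 0; i < nranks; ++i) {      /* sending rank */
        for (size_t j = 0; j < nranks; ++j) {  /* receiving rank */
            size_t n = sendcounts[i * nranks + j];
            if (recvcounts[j * nranks + i] != n)
                return -1;  /* modeled as an error; an assumption */
            memcpy(recv + j * stride + rdispls[j * nranks + i],
                   send + i * stride + sdispls[i * nranks + j],
                   n * sizeof(int));
        }
    }
    return 0;
}
```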
◆ ncclBcast()
ncclResult_t ncclBcast (void *buff, unsigned long count, ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream)
(Deprecated) Broadcast (in-place)
Copies count values from root to all other devices. root is the rank (not the HIP device) where the data resides before the operation starts. This operation is implicitly in-place.
- Returns
- Result code. See Result Codes for more details.
- Parameters
- [in,out] buff: Input array on root to be copied to other ranks. Output array for all ranks.
- [in] count: Number of elements in data buffer
- [in] datatype: Data buffer element datatype
- [in] root: Rank owning buffer to be copied to others
- [in] comm: Communicator group object to execute on
- [in] stream: HIP stream to execute collective on
◆ ncclBroadcast()
ncclResult_t ncclBroadcast (const void *sendbuff, void *recvbuff, unsigned long count, ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream)
Broadcast.
Copies count values from sendbuff on root to recvbuff on all devices. root is the rank (not the HIP device) where the data resides before the operation starts. sendbuff may be NULL on ranks other than root. The operation is in-place if sendbuff == recvbuff.
- Returns
- Result code. See Result Codes for more details.
- Parameters
- [in] sendbuff: Data array to copy (if root). May be NULL for other ranks
- [out] recvbuff: Data array to store received array
- [in] count: Number of elements in data buffer
- [in] datatype: Data buffer element datatype
- [in] root: Rank of broadcast root
- [in] comm: Communicator group object to execute on
- [in] stream: HIP stream to execute collective on
◆ ncclGather()
ncclResult_t ncclGather (const void *sendbuff, void *recvbuff, unsigned long sendcount, ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream)
Gather.
The root device gathers sendcount values from all ranks into recvbuff, receiving data from rank i at offset i*sendcount. The receive count is implicitly nranks*sendcount, so recvbuff on root must hold at least nranks*sendcount elements. The operation is in-place if sendbuff == recvbuff + rank * sendcount. recvbuff may be NULL on ranks other than root.
- Returns
- Result code. See Result Codes for more details.
- Parameters
- [in] sendbuff: Data array to send
- [out] recvbuff: Data array to receive into on root.
- [in] sendcount: Number of elements to send per rank
- [in] datatype: Data buffer element datatype
- [in] root: Rank that receives data from all other ranks
- [in] comm: Communicator group object to execute on
- [in] stream: HIP stream to execute collective on
◆ ncclRecv()
ncclResult_t ncclRecv (void *recvbuff, unsigned long count, ncclDataType_t datatype, int peer, ncclComm_t comm, hipStream_t stream)
Receive.
Receives data from rank peer into recvbuff. Rank peer needs to call ncclSend with the same datatype and the same count as this rank. This operation is blocking for the GPU. If multiple ncclSend and ncclRecv operations need to progress concurrently to complete, they must be fused within an ncclGroupStart/ncclGroupEnd section.
- Returns
- Result code. See Result Codes for more details.
- Parameters
- [out] recvbuff: Data array to receive
- [in] count: Number of elements to receive
- [in] datatype: Data buffer element datatype
- [in] peer: Peer rank to receive from
- [in] comm: Communicator group object to execute on
- [in] stream: HIP stream to execute collective on
◆ ncclReduce()
ncclResult_t ncclReduce (const void *sendbuff, void *recvbuff, unsigned long count, ncclDataType_t datatype, ncclRedOp_t op, int root, ncclComm_t comm, hipStream_t stream)
Reduce.
Reduces data arrays of length count in sendbuff into recvbuff using the op operation. recvbuff may be NULL on all calls except on the root device. root is the rank (not the HIP device) where the data will reside after the operation completes. The operation is in-place if sendbuff == recvbuff.
- Returns
- Result code. See Result Codes for more details.
- Parameters
- [in] sendbuff: Local device data buffer to be reduced
- [out] recvbuff: Data buffer where result is stored (only for root rank). May be NULL for other ranks.
- [in] count: Number of elements in every send buffer
- [in] datatype: Data buffer element datatype
- [in] op: Reduction operator type
- [in] root: Rank where result data array will be stored
- [in] comm: Communicator group object to execute on
- [in] stream: HIP stream to execute collective on
◆ ncclReduceScatter()
ncclResult_t ncclReduceScatter (const void *sendbuff, void *recvbuff, unsigned long recvcount, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, hipStream_t stream)
Reduce-Scatter.
Reduces data in sendbuff using the op operation and leaves the reduced result scattered over the devices so that recvbuff on rank i contains the i-th block of the result. The send count is implicitly nranks*recvcount, so sendbuff must hold at least nranks*recvcount elements. The operation is in-place if recvbuff == sendbuff + rank * recvcount.
- Returns
- Result code. See Result Codes for more details.
- Parameters
- [in] sendbuff: Input data array to reduce
- [out] recvbuff: Data array to store reduced result subarray
- [in] recvcount: Number of elements each rank receives
- [in] datatype: Data buffer element datatype
- [in] op: Reduction operator
- [in] comm: Communicator group object to execute on
- [in] stream: HIP stream to execute collective on
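The relationship between the reduced array and the block each rank keeps can be sketched on the host. This is a model of the semantics under a sum operator, not RCCL code. `model_reduce_scatter_sum` and the back-to-back buffer layout are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side model of Reduce-Scatter with a sum operator.
 * send: nranks input arrays of nranks*recvcount ints, back to back.
 * recv: nranks output arrays of recvcount ints, back to back. */
void model_reduce_scatter_sum(const int *send, int *recv,
                              size_t nranks, size_t recvcount)
{
    size_t sendcount = nranks * recvcount;  /* implied send size per rank */
    for (size_t r = 0; r < nranks; ++r) {          /* receiving rank */
        for (size_t k = 0; k < recvcount; ++k) {
            int acc = 0;
            for (size_t s = 0; s < nranks; ++s)    /* contributing rank */
                acc += send[s * sendcount + r * recvcount + k];
            /* Rank r keeps only block r of the reduced array. */
            recv[r * recvcount + k] = acc;
        }
    }
}
```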
◆ ncclScatter()
ncclResult_t ncclScatter (const void *sendbuff, void *recvbuff, unsigned long recvcount, ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream)
Scatter.
Scatters the data in sendbuff on root over the devices so that recvbuff on rank i contains the i-th block of the data on root. The send count is implicitly nranks*recvcount, so sendbuff on root must hold at least nranks*recvcount elements. The operation is in-place if recvbuff == sendbuff + rank * recvcount.
- Returns
- Result code. See Result Codes for more details.
- Parameters
- [in] sendbuff: Data array to send (on root rank). May be NULL on other ranks.
- [out] recvbuff: Data array to receive partial subarray into
- [in] recvcount: Number of elements to receive per rank
- [in] datatype: Data buffer element datatype
- [in] root: Rank that scatters data to all other ranks
- [in] comm: Communicator group object to execute on
- [in] stream: HIP stream to execute collective on
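The in-place condition can be illustrated with a small host-side model in which each rank supplies its own receive pointer. This sketch does not call RCCL. `model_scatter` and the array of per-rank pointers are hypothetical, and skipping the copy in the in-place case is a modeling choice rather than a statement about the library's internals.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side model of Scatter (no RCCL calls).
 * root_send: the root's send buffer of nranks*recvcount ints.
 * recv: one receive pointer per rank, each with room for recvcount ints. */
void model_scatter(const int *root_send, int **recv,
                   size_t nranks, size_t recvcount)
{
    for (size_t r = 0; r < nranks; ++r) {
        const int *block = root_send + r * recvcount;  /* block r on root */
        if (recv[r] == block)
            continue;  /* in-place: recvbuff == sendbuff + rank * recvcount */
        memcpy(recv[r], block, recvcount * sizeof(int));
    }
}
```

If rank 0 is the root and passes a receive pointer equal to the start of its send buffer, the model leaves that block untouched and copies blocks 1 and 2 to the other ranks.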
◆ ncclSend()
ncclResult_t ncclSend (const void *sendbuff, unsigned long count, ncclDataType_t datatype, int peer, ncclComm_t comm, hipStream_t stream)
Send.
Sends data from sendbuff to rank peer. Rank peer needs to call ncclRecv with the same datatype and the same count as this rank. This operation is blocking for the GPU. If multiple ncclSend and ncclRecv operations need to progress concurrently to complete, they must be fused within an ncclGroupStart/ncclGroupEnd section.
- Returns
- Result code. See Result Codes for more details.
- Parameters
- [in] sendbuff: Data array to send
- [in] count: Number of elements to send
- [in] datatype: Data buffer element datatype
- [in] peer: Peer rank to send to
- [in] comm: Communicator group object to execute on
- [in] stream: HIP stream to execute collective on