Collective Communication Operations


Rocprofiler SDK Developer API 0.5.0
ROCm Profiling API and tools

Collective communication operations must be called separately for each communicator in a communicator clique.

Functions

ncclResult_t ncclReduce (const void *sendbuff, void *recvbuff, unsigned long count, ncclDataType_t datatype, ncclRedOp_t op, int root, ncclComm_t comm, hipStream_t stream)
 Reduce.
 
ncclResult_t ncclBcast (void *buff, unsigned long count, ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream)
 (Deprecated) Broadcast (in-place)
 
ncclResult_t ncclBroadcast (const void *sendbuff, void *recvbuff, unsigned long count, ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream)
 Broadcast.
 
ncclResult_t ncclAllReduce (const void *sendbuff, void *recvbuff, unsigned long count, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, hipStream_t stream)
 All-Reduce.
 
ncclResult_t ncclReduceScatter (const void *sendbuff, void *recvbuff, unsigned long recvcount, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, hipStream_t stream)
 Reduce-Scatter.
 
ncclResult_t ncclAllGather (const void *sendbuff, void *recvbuff, unsigned long sendcount, ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream)
 All-Gather.
 
ncclResult_t ncclSend (const void *sendbuff, unsigned long count, ncclDataType_t datatype, int peer, ncclComm_t comm, hipStream_t stream)
 Send.
 
ncclResult_t ncclRecv (void *recvbuff, unsigned long count, ncclDataType_t datatype, int peer, ncclComm_t comm, hipStream_t stream)
 Receive.
 
ncclResult_t ncclGather (const void *sendbuff, void *recvbuff, unsigned long sendcount, ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream)
 Gather.
 
ncclResult_t ncclScatter (const void *sendbuff, void *recvbuff, unsigned long recvcount, ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream)
 Scatter.
 
ncclResult_t ncclAllToAll (const void *sendbuff, void *recvbuff, unsigned long count, ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream)
 All-To-All.
 
ncclResult_t ncclAllToAllv (const void *sendbuff, const unsigned long sendcounts[], const unsigned long sdispls[], void *recvbuff, const unsigned long recvcounts[], const unsigned long rdispls[], ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream)
 All-To-Allv.
 

Detailed Description

Collective communication operations must be called separately for each communicator in a communicator clique.

They return as soon as the operations have been enqueued on the HIP stream. Because they may perform inter-CPU synchronization, each call must either be made from a different thread or process, or use Group Semantics (see below).

Function Documentation

◆ ncclAllGather()

ncclResult_t ncclAllGather ( const void *  sendbuff,
void *  recvbuff,
unsigned long  sendcount,
ncclDataType_t  datatype,
ncclComm_t  comm,
hipStream_t  stream 
)

All-Gather.

Each device gathers sendcount values from other GPUs into recvbuff, receiving data from rank i at offset i*sendcount. Assumes recvcount is equal to nranks*sendcount, which means that recvbuff should have a size of at least nranks*sendcount elements. In-place operations will happen if sendbuff == recvbuff + rank * sendcount.

Returns
Result code. See Result Codes for more details.
Parameters
[in] sendbuff: Input data array to send
[out] recvbuff: Data array to store the gathered result
[in] sendcount: Number of elements each rank sends
[in] datatype: Data buffer element datatype
[in] comm: Communicator group object to execute on
[in] stream: HIP stream to execute collective on
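
The offset rule above can be checked on the host without a GPU. The sketch below is a plain C model of the data placement only. It does not call RCCL, and `model_all_gather` and `all_gather_inplace_offset` are hypothetical helper names. It assumes `int` elements and stores every rank's buffer back to back in one host array.

```c
#include <assert.h>
#include <stddef.h>

/* Host-only model of the all-gather layout; this is not an RCCL call.
 * send: nranks buffers of sendcount ints, stored back to back.
 * recv: nranks buffers of nranks*sendcount ints, stored back to back. */
void model_all_gather(const int *send, int *recv, size_t nranks, size_t sendcount)
{
    assert(send != NULL && recv != NULL);
    size_t recvcount = nranks * sendcount; /* elements in each recvbuff */
    for (size_t dst = 0; dst < nranks; ++dst)
        for (size_t src = 0; src < nranks; ++src)
            for (size_t k = 0; k < sendcount; ++k)
                /* data from rank src lands at offset src*sendcount */
                recv[dst * recvcount + src * sendcount + k] = send[src * sendcount + k];
}

/* Element offset at which a rank's own data sits inside its recvbuff.
 * Passing sendbuff == recvbuff + this offset selects the in-place form. */
size_t all_gather_inplace_offset(size_t rank, size_t sendcount)
{
    return rank * sendcount;
}
```

With three ranks and `sendcount` of 2, every modelled `recvbuff` ends up holding the six values in rank order, and rank 2's in-place send region starts at element 4.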

◆ ncclAllReduce()

ncclResult_t ncclAllReduce ( const void *  sendbuff,
void *  recvbuff,
unsigned long  count,
ncclDataType_t  datatype,
ncclRedOp_t  op,
ncclComm_t  comm,
hipStream_t  stream 
)

All-Reduce.

Reduces data arrays of length count in sendbuff using op operation, and leaves identical copies of result on each recvbuff. In-place operation will happen if sendbuff == recvbuff.

Returns
Result code. See Result Codes for more details.
Parameters
[in] sendbuff: Input data array to reduce
[out] recvbuff: Data array to store reduced result array
[in] count: Number of elements in data buffer
[in] datatype: Data buffer element datatype
[in] op: Reduction operator
[in] comm: Communicator group object to execute on
[in] stream: HIP stream to execute collective on
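
As an illustration of the result each rank should observe, here is a host-only C model of an all-reduce. It is a sketch under assumptions, not RCCL code: `model_all_reduce_sum` is a hypothetical name, elements are `int`, and the reduction is fixed to summation (the role `ncclSum` plays as an `op` value).

```c
#include <assert.h>
#include <stddef.h>

/* Host-only model of a sum all-reduce; this is not an RCCL call.
 * send and recv each hold nranks buffers of count ints, back to back.
 * Passing send == recv models the in-place form on every rank. */
void model_all_reduce_sum(const int *send, int *recv, size_t nranks, size_t count)
{
    assert(send != NULL && recv != NULL);
    for (size_t k = 0; k < count; ++k) {
        int acc = 0;
        for (size_t r = 0; r < nranks; ++r)
            acc += send[r * count + k];   /* combine element k across ranks */
        for (size_t r = 0; r < nranks; ++r)
            recv[r * count + k] = acc;    /* every rank gets an identical copy */
    }
}
```

All reads of element `k` finish before any write to it, so the model also behaves correctly when the same array is passed for both arguments.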

◆ ncclAllToAll()

ncclResult_t ncclAllToAll ( const void *  sendbuff,
void *  recvbuff,
unsigned long  count,
ncclDataType_t  datatype,
ncclComm_t  comm,
hipStream_t  stream 
)

All-To-All.

Device (i) sends the (j)th block of its data to device (j), where it is placed as the (i)th block. Each block sent or received has count elements, which means that sendbuff and recvbuff should each have a size of nranks*count elements. In-place operation is NOT supported. It is the user's responsibility to ensure that sendbuff and recvbuff are distinct.

Returns
Result code. See Result Codes for more details.
Parameters
[in] sendbuff: Data array to send (contains blocks for each other rank)
[out] recvbuff: Data array to receive (contains blocks from each other rank)
[in] count: Number of elements to send between each pair of ranks
[in] datatype: Data buffer element datatype
[in] comm: Communicator group object to execute on
[in] stream: HIP stream to execute collective on
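
The block exchange described above amounts to a transpose of blocks across ranks. The following host-only C sketch models that placement. It does not call RCCL, `model_all_to_all` is a hypothetical name, and `int` elements with back-to-back per-rank buffers are assumptions of the model.

```c
#include <assert.h>
#include <stddef.h>

/* Host-only model of the all-to-all layout; this is not an RCCL call.
 * send and recv each hold nranks buffers of nranks*count ints, back to back. */
void model_all_to_all(const int *send, int *recv, size_t nranks, size_t count)
{
    /* The operation has no in-place form, so the buffers must differ. */
    assert(send != NULL && recv != NULL && send != recv);
    size_t per_rank = nranks * count; /* elements in each sendbuff/recvbuff */
    for (size_t i = 0; i < nranks; ++i)          /* sending rank */
        for (size_t j = 0; j < nranks; ++j)      /* receiving rank */
            for (size_t k = 0; k < count; ++k)
                /* block j of rank i becomes block i of rank j */
                recv[j * per_rank + i * count + k] = send[i * per_rank + j * count + k];
}
```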

◆ ncclAllToAllv()

ncclResult_t ncclAllToAllv ( const void *  sendbuff,
const unsigned long  sendcounts[],
const unsigned long  sdispls[],
void *  recvbuff,
const unsigned long  recvcounts[],
const unsigned long  rdispls[],
ncclDataType_t  datatype,
ncclComm_t  comm,
hipStream_t  stream 
)

All-To-Allv.

Device (i) sends sendcounts[j] elements from offset sdispls[j] of sendbuff to device (j). At the same time, device (i) receives recvcounts[j] elements from device (j) and places them at offset rdispls[j] of recvbuff. sendcounts, sdispls, recvcounts and rdispls are all measured in units of datatype, not bytes. In-place operation will happen if sendbuff == recvbuff.

Returns
Result code. See Result Codes for more details.
Parameters
[in] sendbuff: Data array to send (contains blocks for each other rank)
[in] sendcounts: Array containing number of elements to send to each participating rank
[in] sdispls: Array of offsets into sendbuff for each participating rank
[out] recvbuff: Data array to receive (contains blocks from each other rank)
[in] recvcounts: Array containing number of elements to receive from each participating rank
[in] rdispls: Array of offsets into recvbuff for each participating rank
[in] datatype: Data buffer element datatype
[in] comm: Communicator group object to execute on
[in] stream: HIP stream to execute collective on
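
To make the count and displacement bookkeeping concrete, the sketch below models the exchange on the host in plain C. It is not RCCL code and `model_all_to_allv` is a hypothetical name. The model assumes `int` elements, a fixed `stride` between per-rank buffers, and per-rank count and displacement arrays flattened so that entry `[i * nranks + j]` is rank i's value for peer j. It also assumes that what rank i sends to rank j equals what rank j expects from rank i, which the text above implies but does not state.

```c
#include <assert.h>
#include <stddef.h>

/* Host-only model of all-to-allv; this is not an RCCL call.
 * All counts and displacements are in elements, not bytes. */
void model_all_to_allv(const int *send, const size_t *sendcounts, const size_t *sdispls,
                       int *recv, const size_t *recvcounts, const size_t *rdispls,
                       size_t nranks, size_t stride)
{
    assert(send != NULL && recv != NULL);
    for (size_t i = 0; i < nranks; ++i) {        /* sending rank */
        for (size_t j = 0; j < nranks; ++j) {    /* receiving rank */
            size_t n = sendcounts[i * nranks + j];
            /* Model assumption: both sides agree on the transfer size. */
            assert(n == recvcounts[j * nranks + i]);
            const int *src = send + i * stride + sdispls[i * nranks + j];
            int *dst = recv + j * stride + rdispls[j * nranks + i];
            for (size_t k = 0; k < n; ++k)
                dst[k] = src[k];
        }
    }
}
```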

◆ ncclBcast()

ncclResult_t ncclBcast ( void *  buff,
unsigned long  count,
ncclDataType_t  datatype,
int  root,
ncclComm_t  comm,
hipStream_t  stream 
)

(Deprecated) Broadcast (in-place)

Copies count values from root to all other devices. root is the rank (not the HIP device) where data resides before the operation is started. This operation is implicitly in-place.

Returns
Result code. See Result Codes for more details.
Parameters
[in,out] buff: Input array on root to be copied to other ranks. Output array for all ranks.
[in] count: Number of elements in data buffer
[in] datatype: Data buffer element datatype
[in] root: Rank owning buffer to be copied to others
[in] comm: Communicator group object to execute on
[in] stream: HIP stream to execute collective on

◆ ncclBroadcast()

ncclResult_t ncclBroadcast ( const void *  sendbuff,
void *  recvbuff,
unsigned long  count,
ncclDataType_t  datatype,
int  root,
ncclComm_t  comm,
hipStream_t  stream 
)

Broadcast.

Copies count values from sendbuff on root to recvbuff on all devices. root is the rank (not the HIP device) where data resides before the operation is started. sendbuff may be NULL on ranks other than root. In-place operation will happen if sendbuff == recvbuff.

Returns
Result code. See Result Codes for more details.
Parameters
[in] sendbuff: Data array to copy (if root). May be NULL for other ranks
[out] recvbuff: Data array to store received array
[in] count: Number of elements in data buffer
[in] datatype: Data buffer element datatype
[in] root: Rank of broadcast root
[in] comm: Communicator group object to execute on
[in] stream: HIP stream to execute collective on
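
The sketch below models the broadcast rules on the host: only the root's send buffer is read, other ranks may pass NULL, and the root may use the same buffer for sending and receiving. It is plain C rather than RCCL code, `model_broadcast` is a hypothetical name, and `int` elements are an assumption of the model.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Host-only model of broadcast; this is not an RCCL call.
 * sendbufs[r] and recvbufs[r] stand for the pointers rank r would pass.
 * Only sendbufs[root] is read, so other entries may be NULL. */
void model_broadcast(const int *const *sendbufs, int *const *recvbufs,
                     size_t nranks, size_t count, size_t root)
{
    assert(root < nranks && sendbufs[root] != NULL);
    for (size_t r = 0; r < nranks; ++r) {
        assert(recvbufs[r] != NULL);
        /* In-place case: the data is already where it needs to be. */
        if (recvbufs[r] == sendbufs[root])
            continue;
        memcpy(recvbufs[r], sendbufs[root], count * sizeof(int));
    }
}
```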

◆ ncclGather()

ncclResult_t ncclGather ( const void *  sendbuff,
void *  recvbuff,
unsigned long  sendcount,
ncclDataType_t  datatype,
int  root,
ncclComm_t  comm,
hipStream_t  stream 
)

Gather.

Root device gathers sendcount values from other GPUs into recvbuff, receiving data from rank i at offset i*sendcount. Assumes recvcount is equal to nranks*sendcount, which means that recvbuff should have a size of at least nranks*sendcount elements. In-place operations will happen if sendbuff == recvbuff + rank * sendcount. recvbuff may be NULL on ranks other than root.

Returns
Result code. See Result Codes for more details.
Parameters
[in] sendbuff: Data array to send
[out] recvbuff: Data array to receive into on root.
[in] sendcount: Number of elements to send per rank
[in] datatype: Data buffer element datatype
[in] root: Rank that receives data from all other ranks
[in] comm: Communicator group object to execute on
[in] stream: HIP stream to execute collective on
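
A host-only C model of the gather placement, including the in-place case on root, might look like the sketch below. It does not call RCCL. `model_gather` is a hypothetical name, and the `int` element type is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Host-only model of gather; this is not an RCCL call.
 * sendbufs[r] stands for the sendbuff pointer rank r would pass.
 * root_recv is the root's recvbuff, sized for nranks*sendcount ints. */
void model_gather(const int *const *sendbufs, int *root_recv,
                  size_t nranks, size_t sendcount)
{
    assert(sendbufs != NULL && root_recv != NULL);
    for (size_t r = 0; r < nranks; ++r) {
        int *dst = root_recv + r * sendcount;  /* rank r lands at offset r*sendcount */
        /* In-place on root: sendbuff == recvbuff + rank*sendcount, nothing to copy. */
        if (sendbufs[r] != dst)
            memcpy(dst, sendbufs[r], sendcount * sizeof(int));
    }
}
```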

◆ ncclRecv()

ncclResult_t ncclRecv ( void *  recvbuff,
unsigned long  count,
ncclDataType_t  datatype,
int  peer,
ncclComm_t  comm,
hipStream_t  stream 
)

Receive.

Receives data from rank peer into recvbuff. Rank peer needs to call ncclSend with the same datatype and the same count as this rank. This operation is blocking for the GPU. If multiple ncclSend and ncclRecv operations need to progress concurrently to complete, they must be fused within an ncclGroupStart / ncclGroupEnd section.

Returns
Result code. See Result Codes for more details.
Parameters
[out] recvbuff: Data array to receive
[in] count: Number of elements to receive
[in] datatype: Data buffer element datatype
[in] peer: Peer rank to receive from
[in] comm: Communicator group object to execute on
[in] stream: HIP stream to execute collective on

◆ ncclReduce()

ncclResult_t ncclReduce ( const void *  sendbuff,
void *  recvbuff,
unsigned long  count,
ncclDataType_t  datatype,
ncclRedOp_t  op,
int  root,
ncclComm_t  comm,
hipStream_t  stream 
)

Reduce.

Reduces data arrays of length count in sendbuff into recvbuff using op operation. recvbuff may be NULL on all calls except for root device. root is the rank (not the HIP device) where data will reside after the operation is complete. In-place operation will happen if sendbuff == recvbuff.

Returns
Result code. See Result Codes for more details.
Parameters
[in] sendbuff: Local device data buffer to be reduced
[out] recvbuff: Data buffer where result is stored (only for root rank). May be NULL for other ranks.
[in] count: Number of elements in every send buffer
[in] datatype: Data buffer element datatype
[in] op: Reduction operator type
[in] root: Rank where result data array will be stored
[in] comm: Communicator group object to execute on
[in] stream: HIP stream to execute collective on
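
To show how the choice of operator changes the value stored on root, here is a host-only C model of a rooted reduction. It is a sketch, not RCCL code. `model_reduce` and the `model_op_t` enum are hypothetical stand-ins (the real operator type is `ncclRedOp_t`), and only sum and max over `int` elements are modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for two reduction operators; not an RCCL type. */
typedef enum { MODEL_SUM, MODEL_MAX } model_op_t;

/* Host-only model of reduce; this is not an RCCL call.
 * send: nranks buffers of count ints, back to back.
 * root_recv: the root's recvbuff, sized for count ints. */
void model_reduce(const int *send, int *root_recv,
                  size_t nranks, size_t count, model_op_t op)
{
    assert(send != NULL && root_recv != NULL && nranks > 0);
    for (size_t k = 0; k < count; ++k) {
        int acc = send[k];                       /* start from rank 0's element k */
        for (size_t r = 1; r < nranks; ++r) {
            int v = send[r * count + k];
            if (op == MODEL_SUM)
                acc += v;
            else if (v > acc)
                acc = v;
        }
        root_recv[k] = acc;                      /* only root holds the result */
    }
}
```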

◆ ncclReduceScatter()

ncclResult_t ncclReduceScatter ( const void *  sendbuff,
void *  recvbuff,
unsigned long  recvcount,
ncclDataType_t  datatype,
ncclRedOp_t  op,
ncclComm_t  comm,
hipStream_t  stream 
)

Reduce-Scatter.

Reduces data in sendbuff using op operation and leaves reduced result scattered over the devices so that recvbuff on rank i will contain the i-th block of the result. Assumes sendcount is equal to nranks*recvcount, which means that sendbuff should have a size of at least nranks*recvcount elements. In-place operations will happen if recvbuff == sendbuff + rank * recvcount.

Returns
Result code. See Result Codes for more details.
Parameters
[in] sendbuff: Input data array to reduce
[out] recvbuff: Data array to store reduced result subarray
[in] recvcount: Number of elements each rank receives
[in] datatype: Data buffer element datatype
[in] op: Reduction operator
[in] comm: Communicator group object to execute on
[in] stream: HIP stream to execute collective on
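
The relationship between the send size and the per-rank result block can be illustrated with a host-only C model. The sketch below does not call RCCL. `model_reduce_scatter_sum` is a hypothetical name, elements are assumed to be `int`, and the reduction is fixed to summation.

```c
#include <assert.h>
#include <stddef.h>

/* Host-only model of a sum reduce-scatter; this is not an RCCL call.
 * send: nranks buffers of nranks*recvcount ints, back to back.
 * recv: nranks buffers of recvcount ints, back to back. */
void model_reduce_scatter_sum(const int *send, int *recv,
                              size_t nranks, size_t recvcount)
{
    assert(send != NULL && recv != NULL);
    size_t sendcount = nranks * recvcount;       /* elements in each sendbuff */
    for (size_t i = 0; i < nranks; ++i)          /* rank i keeps block i */
        for (size_t k = 0; k < recvcount; ++k) {
            int acc = 0;
            for (size_t r = 0; r < nranks; ++r)
                acc += send[r * sendcount + i * recvcount + k];
            recv[i * recvcount + k] = acc;
        }
}
```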

◆ ncclScatter()

ncclResult_t ncclScatter ( const void *  sendbuff,
void *  recvbuff,
unsigned long  recvcount,
ncclDataType_t  datatype,
int  root,
ncclComm_t  comm,
hipStream_t  stream 
)

Scatter.

Scatters the data in sendbuff on root over the devices so that recvbuff on rank i will contain the i-th block of the data on root. Assumes sendcount is equal to nranks*recvcount, which means that sendbuff should have a size of at least nranks*recvcount elements. In-place operations will happen if recvbuff == sendbuff + rank * recvcount.

Returns
Result code. See Result Codes for more details.
Parameters
[in] sendbuff: Data array to send (on root rank). May be NULL on other ranks.
[out] recvbuff: Data array to receive partial subarray into
[in] recvcount: Number of elements to receive per rank
[in] datatype: Data buffer element datatype
[in] root: Rank that scatters data to all other ranks
[in] comm: Communicator group object to execute on
[in] stream: HIP stream to execute collective on
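
A host-only C model of the scatter placement is sketched below, including the in-place case in which the root's receive pointer already points at its own block. It is not RCCL code. `model_scatter` is a hypothetical name and `int` elements are an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Host-only model of scatter; this is not an RCCL call.
 * root_send: the root's sendbuff, holding nranks*recvcount ints.
 * recvbufs[r] stands for the recvbuff pointer rank r would pass. */
void model_scatter(const int *root_send, int *const *recvbufs,
                   size_t nranks, size_t recvcount)
{
    assert(root_send != NULL && recvbufs != NULL);
    for (size_t r = 0; r < nranks; ++r) {
        const int *src = root_send + r * recvcount;  /* block r goes to rank r */
        /* In-place on root: recvbuff == sendbuff + rank*recvcount, nothing to copy. */
        if (recvbufs[r] != src)
            memcpy(recvbufs[r], src, recvcount * sizeof(int));
    }
}
```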

◆ ncclSend()

ncclResult_t ncclSend ( const void *  sendbuff,
unsigned long  count,
ncclDataType_t  datatype,
int  peer,
ncclComm_t  comm,
hipStream_t  stream 
)

Send.

Sends data from sendbuff to rank peer. Rank peer needs to call ncclRecv with the same datatype and the same count as this rank. This operation is blocking for the GPU. If multiple ncclSend and ncclRecv operations need to progress concurrently to complete, they must be fused within an ncclGroupStart / ncclGroupEnd section.

Returns
Result code. See Result Codes for more details.
Parameters
[in] sendbuff: Data array to send
[in] count: Number of elements to send
[in] datatype: Data buffer element datatype
[in] peer: Peer rank to send to
[in] comm: Communicator group object to execute on
[in] stream: HIP stream to execute collective on
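
The pairing requirement between ncclSend and ncclRecv can be expressed as a small host-side check. The sketch below is illustrative only and does not call RCCL. The `model_p2p_op` record and `model_p2p_match` function are hypothetical, and the `datatype` field is a plain `int` standing in for an `ncclDataType_t` value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one posted point-to-point call; not an RCCL type. */
typedef struct {
    bool   is_send;   /* true for a send, false for a receive */
    int    rank;      /* rank that issues the call */
    int    peer;      /* peer argument of the call */
    size_t count;     /* count argument, in elements */
    int    datatype;  /* stand-in for the datatype argument */
} model_p2p_op;

/* Returns true when snd and rcv form a matching pair: each names the other
 * as peer, and both use the same count and datatype. */
bool model_p2p_match(const model_p2p_op *snd, const model_p2p_op *rcv)
{
    assert(snd != NULL && rcv != NULL);
    return snd->is_send && !rcv->is_send
        && snd->peer == rcv->rank
        && rcv->peer == snd->rank
        && snd->count == rcv->count
        && snd->datatype == rcv->datatype;
}
```

A send of 4 elements from rank 0 to rank 1 matches a receive of 4 elements on rank 1 from rank 0. Changing the receive count to 8 makes the check fail.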