Collective Communication Operations

Rocprofiler SDK Developer API 0.6.0 (ROCm Profiling API and tools)
Collective communication operations must be called separately for each communicator in a communicator clique.

Functions

ncclResult_t ncclReduce (const void *sendbuff, void *recvbuff, unsigned long count, ncclDataType_t datatype, ncclRedOp_t op, int root, ncclComm_t comm, hipStream_t stream)
 Reduce.
 
ncclResult_t ncclBcast (void *buff, unsigned long count, ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream)
 (Deprecated) Broadcast (in-place)
 
ncclResult_t ncclBroadcast (const void *sendbuff, void *recvbuff, unsigned long count, ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream)
 Broadcast.
 
ncclResult_t ncclAllReduce (const void *sendbuff, void *recvbuff, unsigned long count, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, hipStream_t stream)
 All-Reduce.
 
ncclResult_t ncclReduceScatter (const void *sendbuff, void *recvbuff, unsigned long recvcount, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, hipStream_t stream)
 Reduce-Scatter.
 
ncclResult_t ncclAllGather (const void *sendbuff, void *recvbuff, unsigned long sendcount, ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream)
 All-Gather.
 
ncclResult_t ncclSend (const void *sendbuff, unsigned long count, ncclDataType_t datatype, int peer, ncclComm_t comm, hipStream_t stream)
 Send.
 
ncclResult_t ncclRecv (void *recvbuff, unsigned long count, ncclDataType_t datatype, int peer, ncclComm_t comm, hipStream_t stream)
 Receive.
 
ncclResult_t ncclGather (const void *sendbuff, void *recvbuff, unsigned long sendcount, ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream)
 Gather.
 
ncclResult_t ncclScatter (const void *sendbuff, void *recvbuff, unsigned long recvcount, ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream)
 Scatter.
 
ncclResult_t ncclAllToAll (const void *sendbuff, void *recvbuff, unsigned long count, ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream)
 All-To-All.
 
ncclResult_t ncclAllToAllv (const void *sendbuff, const unsigned long sendcounts[], const unsigned long sdispls[], void *recvbuff, const unsigned long recvcounts[], const unsigned long rdispls[], ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream)
 All-To-Allv.
 

Detailed Description

Collective communication operations must be called separately for each communicator in a communicator clique.

They return when the operations have been enqueued on the HIP stream. Since they may perform inter-CPU synchronization, each call must be made from a different thread or process, or must use group semantics (see below).

Function Documentation

◆ ncclAllGather()

ncclResult_t ncclAllGather ( const void *  sendbuff,
void *  recvbuff,
unsigned long  sendcount,
ncclDataType_t  datatype,
ncclComm_t  comm,
hipStream_t  stream 
)

All-Gather.

Each device gathers sendcount values from other GPUs into recvbuff, receiving data from rank i at offset i*sendcount. Assumes recvcount is equal to nranks*sendcount, which means that recvbuff should have a size of at least nranks*sendcount elements. In-place operations will happen if sendbuff == recvbuff + rank * sendcount.

Returns
Result code. See Result Codes for more details.
Parameters
  [in]   sendbuff   Input data array to send
  [out]  recvbuff   Data array to store the gathered result
  [in]   sendcount  Number of elements each rank sends
  [in]   datatype   Data buffer element datatype
  [in]   comm       Communicator group object to execute on
  [in]   stream     HIP stream to execute collective on

◆ ncclAllReduce()

ncclResult_t ncclAllReduce ( const void *  sendbuff,
void *  recvbuff,
unsigned long  count,
ncclDataType_t  datatype,
ncclRedOp_t  op,
ncclComm_t  comm,
hipStream_t  stream 
)

All-Reduce.

Reduces data arrays of length count in sendbuff using op operation, and leaves identical copies of result on each recvbuff. In-place operation will happen if sendbuff == recvbuff.

Returns
Result code. See Result Codes for more details.
Parameters
  [in]   sendbuff  Input data array to reduce
  [out]  recvbuff  Data array to store reduced result array
  [in]   count     Number of elements in data buffer
  [in]   datatype  Data buffer element datatype
  [in]   op        Reduction operator
  [in]   comm      Communicator group object to execute on
  [in]   stream    HIP stream to execute collective on

◆ ncclAllToAll()

ncclResult_t ncclAllToAll ( const void *  sendbuff,
void *  recvbuff,
unsigned long  count,
ncclDataType_t  datatype,
ncclComm_t  comm,
hipStream_t  stream 
)

All-To-All.

Device i sends the j-th block of its data to device j, where it is placed as the i-th block. Each block holds count elements, so sendbuff and recvbuff must each have a size of at least nranks*count elements. In-place operation is NOT supported; it is the user's responsibility to ensure that sendbuff and recvbuff are distinct.

Returns
Result code. See Result Codes for more details.
Parameters
  [in]   sendbuff  Data array to send (contains blocks for each other rank)
  [out]  recvbuff  Data array to receive (contains blocks from each other rank)
  [in]   count     Number of elements to send between each pair of ranks
  [in]   datatype  Data buffer element datatype
  [in]   comm      Communicator group object to execute on
  [in]   stream    HIP stream to execute collective on

◆ ncclAllToAllv()

ncclResult_t ncclAllToAllv ( const void *  sendbuff,
const unsigned long  sendcounts[],
const unsigned long  sdispls[],
void *  recvbuff,
const unsigned long  recvcounts[],
const unsigned long  rdispls[],
ncclDataType_t  datatype,
ncclComm_t  comm,
hipStream_t  stream 
)

All-To-Allv.

Device i sends sendcounts[j] elements of data starting at offset sdispls[j] to device j. At the same time, device i receives recvcounts[j] elements of data from device j, placing them at offset rdispls[j]. sendcounts, sdispls, recvcounts, and rdispls are all measured in units of datatype, not bytes. In-place operation will happen if sendbuff == recvbuff.

Returns
Result code. See Result Codes for more details.
Parameters
  [in]   sendbuff    Data array to send (contains blocks for each other rank)
  [in]   sendcounts  Array containing number of elements to send to each participating rank
  [in]   sdispls     Array of offsets into sendbuff for each participating rank
  [out]  recvbuff    Data array to receive (contains blocks from each other rank)
  [in]   recvcounts  Array containing number of elements to receive from each participating rank
  [in]   rdispls     Array of offsets into recvbuff for each participating rank
  [in]   datatype    Data buffer element datatype
  [in]   comm        Communicator group object to execute on
  [in]   stream      HIP stream to execute collective on

◆ ncclBcast()

ncclResult_t ncclBcast ( void *  buff,
unsigned long  count,
ncclDataType_t  datatype,
int  root,
ncclComm_t  comm,
hipStream_t  stream 
)

(Deprecated) Broadcast (in-place)

Copies count values from root to all other devices. root is the rank (not the HIP device) where data resides before the operation is started. This operation is implicitly in-place.

Returns
Result code. See Result Codes for more details.
Parameters
  [in,out]  buff      Input array on root to be copied to other ranks. Output array for all ranks.
  [in]      count     Number of elements in data buffer
  [in]      datatype  Data buffer element datatype
  [in]      root      Rank owning buffer to be copied to others
  [in]      comm      Communicator group object to execute on
  [in]      stream    HIP stream to execute collective on

◆ ncclBroadcast()

ncclResult_t ncclBroadcast ( const void *  sendbuff,
void *  recvbuff,
unsigned long  count,
ncclDataType_t  datatype,
int  root,
ncclComm_t  comm,
hipStream_t  stream 
)

Broadcast.

Copies count values from sendbuff on root to recvbuff on all devices. root is the rank (not the HIP device) where data resides before the operation is started. sendbuff may be NULL on ranks other than root. In-place operation will happen if sendbuff == recvbuff.

Returns
Result code. See Result Codes for more details.
Parameters
  [in]   sendbuff  Data array to copy (if root). May be NULL for other ranks
  [out]  recvbuff  Data array to store received array
  [in]   count     Number of elements in data buffer
  [in]   datatype  Data buffer element datatype
  [in]   root      Rank of broadcast root
  [in]   comm      Communicator group object to execute on
  [in]   stream    HIP stream to execute collective on

◆ ncclGather()

ncclResult_t ncclGather ( const void *  sendbuff,
void *  recvbuff,
unsigned long  sendcount,
ncclDataType_t  datatype,
int  root,
ncclComm_t  comm,
hipStream_t  stream 
)

Gather.

Root device gathers sendcount values from other GPUs into recvbuff, receiving data from rank i at offset i*sendcount. Assumes recvcount is equal to nranks*sendcount, which means that recvbuff should have a size of at least nranks*sendcount elements. In-place operations will happen if sendbuff == recvbuff + rank * sendcount. recvbuff may be NULL on ranks other than root.

Returns
Result code. See Result Codes for more details.
Parameters
  [in]   sendbuff   Data array to send
  [out]  recvbuff   Data array to receive into on root
  [in]   sendcount  Number of elements to send per rank
  [in]   datatype   Data buffer element datatype
  [in]   root       Rank that receives data from all other ranks
  [in]   comm       Communicator group object to execute on
  [in]   stream     HIP stream to execute collective on

◆ ncclRecv()

ncclResult_t ncclRecv ( void *  recvbuff,
unsigned long  count,
ncclDataType_t  datatype,
int  peer,
ncclComm_t  comm,
hipStream_t  stream 
)

Receive.

Receive data from rank peer into recvbuff. Rank peer needs to call ncclSend with the same datatype and the same count as this rank. This operation is blocking for the GPU. If multiple ncclSend and ncclRecv operations need to progress concurrently to complete, they must be fused within an ncclGroupStart/ncclGroupEnd section.

Returns
Result code. See Result Codes for more details.
Parameters
  [out]  recvbuff  Data array to receive
  [in]   count     Number of elements to receive
  [in]   datatype  Data buffer element datatype
  [in]   peer      Peer rank to receive from
  [in]   comm      Communicator group object to execute on
  [in]   stream    HIP stream to execute collective on

◆ ncclReduce()

ncclResult_t ncclReduce ( const void *  sendbuff,
void *  recvbuff,
unsigned long  count,
ncclDataType_t  datatype,
ncclRedOp_t  op,
int  root,
ncclComm_t  comm,
hipStream_t  stream 
)

Reduce.

Reduces data arrays of length count in sendbuff into recvbuff using the op operation. recvbuff may be NULL on all ranks except the root. root is the rank (not the HIP device) where data will reside after the operation is complete. In-place operation will happen if sendbuff == recvbuff.

Returns
Result code. See Result Codes for more details.
Parameters
  [in]   sendbuff  Local device data buffer to be reduced
  [out]  recvbuff  Data buffer where result is stored (only for root rank). May be NULL for other ranks
  [in]   count     Number of elements in every send buffer
  [in]   datatype  Data buffer element datatype
  [in]   op        Reduction operator type
  [in]   root      Rank where result data array will be stored
  [in]   comm      Communicator group object to execute on
  [in]   stream    HIP stream to execute collective on

◆ ncclReduceScatter()

ncclResult_t ncclReduceScatter ( const void *  sendbuff,
void *  recvbuff,
unsigned long  recvcount,
ncclDataType_t  datatype,
ncclRedOp_t  op,
ncclComm_t  comm,
hipStream_t  stream 
)

Reduce-Scatter.

Reduces data in sendbuff using op operation and leaves reduced result scattered over the devices so that recvbuff on rank i will contain the i-th block of the result. Assumes sendcount is equal to nranks*recvcount, which means that sendbuff should have a size of at least nranks*recvcount elements. In-place operations will happen if recvbuff == sendbuff + rank * recvcount.

Returns
Result code. See Result Codes for more details.
Parameters
  [in]   sendbuff   Input data array to reduce
  [out]  recvbuff   Data array to store reduced result subarray
  [in]   recvcount  Number of elements each rank receives
  [in]   datatype   Data buffer element datatype
  [in]   op         Reduction operator
  [in]   comm       Communicator group object to execute on
  [in]   stream     HIP stream to execute collective on

◆ ncclScatter()

ncclResult_t ncclScatter ( const void *  sendbuff,
void *  recvbuff,
unsigned long  recvcount,
ncclDataType_t  datatype,
int  root,
ncclComm_t  comm,
hipStream_t  stream 
)

Scatter.

The root device scatters its sendbuff over the devices so that recvbuff on rank i will contain the i-th block of the data on root. Assumes sendcount is equal to nranks*recvcount, which means that sendbuff should have a size of at least nranks*recvcount elements. In-place operations will happen if recvbuff == sendbuff + rank * recvcount.

Returns
Result code. See Result Codes for more details.
Parameters
  [in]   sendbuff   Data array to send (on root rank). May be NULL on other ranks
  [out]  recvbuff   Data array to receive partial subarray into
  [in]   recvcount  Number of elements to receive per rank
  [in]   datatype   Data buffer element datatype
  [in]   root       Rank that scatters data to all other ranks
  [in]   comm       Communicator group object to execute on
  [in]   stream     HIP stream to execute collective on

◆ ncclSend()

ncclResult_t ncclSend ( const void *  sendbuff,
unsigned long  count,
ncclDataType_t  datatype,
int  peer,
ncclComm_t  comm,
hipStream_t  stream 
)

Send.

Send data from sendbuff to rank peer. Rank peer needs to call ncclRecv with the same datatype and the same count as this rank. This operation is blocking for the GPU. If multiple ncclSend and ncclRecv operations need to progress concurrently to complete, they must be fused within an ncclGroupStart/ncclGroupEnd section.

Returns
Result code. See Result Codes for more details.
Parameters
  [in]  sendbuff  Data array to send
  [in]  count     Number of elements to send
  [in]  datatype  Data buffer element datatype
  [in]  peer      Peer rank to send to
  [in]  comm      Communicator group object to execute on
  [in]  stream    HIP stream to execute collective on