Coherence control#
Memory coherence describes how memory of a specific part of the system is visible to the other parts of the system. For example, how GPU memory is visible to the CPU and vice versa. In HIP, host and device memory can be allocated with two different types of coherence:
Coarse-grained coherence: The memory is considered up-to-date only after synchronization performed using
hipDeviceSynchronize()
,hipStreamSynchronize()
, or any blocking operation that acts on the null stream such ashipMemcpy()
. To avoid the cache from being accessed by a part of the system while simultaneously being written by another, the memory is made visible only after the caches have been flushed.Fine-grained coherence: The memory is coherent even while being modified by a part of the system. Fine-grained coherence ensures that up-to-date data is visible to others regardless of kernel boundaries. This can be useful if both host and device operate on the same data.
Note
To achieve fine-grained coherence, many AMD GPUs use a limited cache policy, such as leaving these allocations uncached by the GPU or making them read-only.
Mi200 accelerator’s hardware based floating point instructions work on coarse-grained memory regions. Coarse-grained coherence is typically useful in reducing host-device interconnect communication.
To check the availability of fine- and coarse-grained memory pools, use
rocminfo
:
$ rocminfo
...
*******
Agent 1
*******
Name: AMD EPYC 7742 64-Core Processor
...
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
...
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
...
*******
Agent 9
*******
Name: gfx90a
...
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
...
The APIs, flags and respective memory coherence control are listed in the following table:
API |
Flag |
|
Coherence |
---|---|---|---|
|
|
Fine-grained |
|
|
|
Coarse-grained |
|
|
|
Coarse-grained |
|
|
|
Fine-grained |
|
|
Fine-grained |
||
|
|
Coarse-grained |
|
|
Fine-grained |
||
|
|
Coarse-grained |
1 The hipHostMalloc()
memory allocation coherence mode can be
affected by the HIP_HOST_COHERENT
environment variable, if the
hipHostMallocCoherent
, hipHostMallocNonCoherent
, and
hipHostMallocMapped
are unset. If neither these flags nor the
HIP_HOST_COHERENT
environment variable is set, or set as 0, the host memory
allocation is coarse-grained.
Note
When
hipHostMallocMapped
flag is set, the allocated host memory is fine-grained and thehipHostMallocNonCoherent
flag is ignored.Setting both the
hipHostMallocCoherent
andhipHostMallocNonCoherent
flags leads to an illegal state.
Visibility of synchronization functions#
The fine-grained coherence memory is visible at the synchronization points, however the visibility of coarse-grained memory depends on the synchronization function used. The effect and visibility of various synchronization functions on fine- and coarse-grained memory types are listed here:
HIP API |
||||
Synchronization effect |
Host waits for all commands in the specified stream to complete |
Host waits for all commands in all streams on the specified device to complete |
Host waits for the specified event to complete |
Stream waits for the specified event to complete |
Fence |
System-scope release |
System-scope release |
System-scope release |
None |
Fine-grained host memory visibility |
Yes |
Yes |
Yes |
Yes |
Coarse-grained host memory visibility |
Yes |
Yes |
Depends on the used event. |
No |
You can control the release scope for hipEvents
. By default, the GPU
performs a device-scope acquire and release operation with each recorded event.
This makes the host and device memory visible to other commands executing on the
same device.
hipEventCreateWithFlags()
: You can specify a stronger system-level
fence by creating the event with hipEventCreateWithFlags
:
hipEventReleaseToSystem
: Performs a system-scope release operation when the event is recorded. This makes both fine-grained and coarse-grained host memory visible to other agents in the system, which might also involve heavyweight operations such as cache flushing. Fine-grained memory typically uses lighter-weight in-kernel synchronization mechanisms such as an atomic operation and thus doesn’t need to usehipEventReleaseToSystem
.hipEventDisableTiming
: Events created with this flag don’t record profiling data, which significantly improves synchronization performance.