Host memory#
Host memory is the “normal” memory residing in host RAM, allocated through standard C or C++ mechanisms. Host memory can be allocated in two different ways:
Pageable memory
Pinned memory
The following figure explains how data is transferred in pageable and pinned memory.
Pageable and pinned memory let you exercise direct control over memory operations, which is known as explicit memory management. When using unified memory, you get a simplified memory model with less control over low-level memory operations.
The difference in memory transfers between explicit and unified memory management is highlighted in the following figure:
For more details on unified memory management, see Unified memory management.
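For contrast with the explicit examples below, the following is a minimal sketch of unified memory management, assuming a device that supports managed memory: hipMallocManaged() returns a single pointer that is valid on both host and device, so no explicit copies are needed (error checking is omitted for brevity).
#include <hip/hip_runtime.h>
#include <iostream>

__global__ void increment(int *data, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += 1;
    }
}

int main()
{
    const int element_number = 100;
    int *data;

    // A single managed allocation is accessible from both host and device.
    hipMallocManaged(reinterpret_cast<void**>(&data), element_number * sizeof(int));

    for (int i = 0; i < element_number; i++) {
        data[i] = i;
    }

    increment<<<1, element_number>>>(data, element_number);
    hipDeviceSynchronize(); // wait for the kernel before reading on the host

    std::cout << data[element_number - 1] << std::endl; // prints 100

    hipFree(data);
}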
Pageable memory#
Pageable memory resides in memory blocks known as “pages” that can be migrated to other memory storage. For example, pages can migrate between CPU sockets on a motherboard, or, when the system runs out of RAM, get dumped into the swap partition of the hard drive.
Pageable memory is usually allocated with a call to malloc or new in a C++ application.
Example: Using pageable host memory in HIP
#include <hip/hip_runtime.h>

#include <cstring>
#include <iostream>

#define HIP_CHECK(expression)                  \
{                                              \
    const hipError_t status = expression;      \
    if(status != hipSuccess){                  \
        std::cerr << "HIP error "              \
                  << status << ": "            \
                  << hipGetErrorString(status) \
                  << " at " << __FILE__ << ":" \
                  << __LINE__ << std::endl;    \
    }                                          \
}

int main()
{
    const int element_number = 100;

    int *host_input, *host_output;
    // Host allocation in pageable memory
    host_input = new int[element_number];
    host_output = new int[element_number];

    // Host data preparation
    for (int i = 0; i < element_number; i++) {
        host_input[i] = i;
    }
    std::memset(host_output, 0, element_number * sizeof(int));

    int *device_input, *device_output;
    // Device allocation
    HIP_CHECK(hipMalloc(&device_input, element_number * sizeof(int)));
    HIP_CHECK(hipMalloc(&device_output, element_number * sizeof(int)));

    // Device data preparation
    HIP_CHECK(hipMemcpy(device_input, host_input, element_number * sizeof(int), hipMemcpyHostToDevice));
    HIP_CHECK(hipMemset(device_output, 0, element_number * sizeof(int)));

    // Run the kernel
    // ...

    // Copy the results back to the host
    HIP_CHECK(hipMemcpy(host_output, device_output, element_number * sizeof(int), hipMemcpyDeviceToHost));

    // Free host memory
    delete[] host_input;
    delete[] host_output;

    // Free device memory
    HIP_CHECK(hipFree(device_input));
    HIP_CHECK(hipFree(device_output));
}
Note
hipMalloc() and hipFree() are blocking calls. However, HIP also provides non-blocking versions hipMallocAsync() and hipFreeAsync(), which require a stream as an additional argument.
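The following is a minimal sketch of these stream-ordered variants; hipMallocAsync() and hipFreeAsync() take the same arguments as their blocking counterparts plus the stream, and error checking with the HIP_CHECK macro from the example above is omitted for brevity.
#include <hip/hip_runtime.h>

int main()
{
    const size_t size = 100 * sizeof(int);
    int *device_data;

    hipStream_t stream;
    hipStreamCreate(&stream);

    // The allocation is ordered within the stream and doesn't block the host.
    hipMallocAsync(reinterpret_cast<void**>(&device_data), size, stream);

    // ... enqueue kernels or copies that use device_data on this stream ...

    // The deallocation is also stream-ordered: the memory is released once
    // preceding work in the stream has completed.
    hipFreeAsync(device_data, stream);

    hipStreamSynchronize(stream);
    hipStreamDestroy(stream);
}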
Pinned memory#
Pinned memory, or page-locked memory, is stored in pages that are locked in RAM and can’t be migrated or paged out. The pointer can be used on both host and device. Accessing host-resident pinned memory from device kernels is generally not recommended for performance, as it forces the data to traverse the host-device interconnect, such as PCIe, which is much slower than the on-device bandwidth.
The advantage of pinned memory is the improved transfer time between host and device. For transfer operations such as hipMemcpy() or hipMemcpyAsync(), using pinned memory instead of pageable memory on the host can lead to a threefold improvement in bandwidth.
The disadvantage of pinned memory is the reduced availability of RAM for other processes, which can negatively impact the overall performance of the host.
Example: Using pinned memory in HIP
#include <hip/hip_runtime.h>

#include <cstring>
#include <iostream>

#define HIP_CHECK(expression)                  \
{                                              \
    const hipError_t status = expression;      \
    if(status != hipSuccess){                  \
        std::cerr << "HIP error "              \
                  << status << ": "            \
                  << hipGetErrorString(status) \
                  << " at " << __FILE__ << ":" \
                  << __LINE__ << std::endl;    \
    }                                          \
}

int main()
{
    const int element_number = 100;

    int *host_input, *host_output;
    // Host allocation in pinned (page-locked) memory
    HIP_CHECK(hipHostMalloc(&host_input, element_number * sizeof(int)));
    HIP_CHECK(hipHostMalloc(&host_output, element_number * sizeof(int)));

    // Host data preparation
    for (int i = 0; i < element_number; i++) {
        host_input[i] = i;
    }
    std::memset(host_output, 0, element_number * sizeof(int));

    int *device_input, *device_output;
    // Device allocation
    HIP_CHECK(hipMalloc(&device_input, element_number * sizeof(int)));
    HIP_CHECK(hipMalloc(&device_output, element_number * sizeof(int)));

    // Device data preparation
    HIP_CHECK(hipMemcpy(device_input, host_input, element_number * sizeof(int), hipMemcpyHostToDevice));
    HIP_CHECK(hipMemset(device_output, 0, element_number * sizeof(int)));

    // Run the kernel
    // ...

    // Copy the results back to the host
    HIP_CHECK(hipMemcpy(host_output, device_output, element_number * sizeof(int), hipMemcpyDeviceToHost));

    // Pinned host memory must be freed with hipHostFree(), not delete[]
    HIP_CHECK(hipHostFree(host_input));
    HIP_CHECK(hipHostFree(host_output));

    // Free device memory
    HIP_CHECK(hipFree(device_input));
    HIP_CHECK(hipFree(device_output));
}
Memory allocation flags for pinned memory#
The memory allocation for pinned memory can be controlled using hipHostMalloc flags:
hipHostMallocPortable: The memory allocation is not restricted to the context making the allocation.
hipHostMallocMapped: The memory is allocated into the address space of the current device, and the device pointer can be obtained with hipHostGetDevicePointer().
hipHostMallocNumaUser: The host memory allocation follows the NUMA policy specified by the user. The goal of the NUMA policy is to select a CPU that is closest to each GPU; the NUMA distance is the distance between GPU and CPU devices.
hipHostMallocWriteCombined: The memory is allocated as write-combined. Although most CPUs read write-combined memory inefficiently, it might be transferred faster across the PCIe bus on some system configurations. It’s a good option for data transfer from host to device via mapped pinned memory.
hipHostMallocCoherent: Fine-grained memory is allocated. Overrides the HIP_HOST_COHERENT environment variable for the specific allocation. For details, see Coherence control.
hipHostMallocNonCoherent: Coarse-grained memory is allocated. Overrides the HIP_HOST_COHERENT environment variable for the specific allocation. For details, see Coherence control.
All allocation flags are independent and can be set in any combination. The only exception is setting hipHostMallocCoherent and hipHostMallocNonCoherent together, which leads to an illegal state. An example of a valid flag combination is calling hipHostMalloc() with both the hipHostMallocPortable and hipHostMallocMapped flags set. Both flags use the same model and differ only in how the surrounding code uses the host memory.
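Below is a minimal sketch of this valid combination, assuming a device that can map host memory: it allocates portable, mapped pinned memory and retrieves the matching device pointer with hipHostGetDevicePointer() (error checking is omitted for brevity).
#include <hip/hip_runtime.h>

int main()
{
    const int element_number = 100;
    int *host_pointer, *device_pointer;

    // Portable: usable from any device context.
    // Mapped: accessible from the device through a pointer obtained
    // with hipHostGetDevicePointer().
    hipHostMalloc(reinterpret_cast<void**>(&host_pointer),
                  element_number * sizeof(int),
                  hipHostMallocPortable | hipHostMallocMapped);

    // The flags argument of hipHostGetDevicePointer() must be 0.
    hipHostGetDevicePointer(reinterpret_cast<void**>(&device_pointer),
                            host_pointer, 0);

    // device_pointer can now be passed to kernels; the data itself stays in
    // pinned host memory and is read over the host-device interconnect.

    hipHostFree(host_pointer);
}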
Note
By default, each GPU selects the NUMA CPU node with the least NUMA distance between them; that is, host memory is automatically allocated on the memory pool closest to the current GPU device’s NUMA node. Using the hipSetDevice() API to set a different GPU increases the NUMA distance but still allows you to access the host allocation.
NUMA policy is implemented on Linux and is under development on Microsoft Windows.
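A minimal sketch of this behavior, assuming a Linux system with at least two GPUs; any user NUMA policy for hipHostMallocNumaUser is set outside of HIP, for example by launching the process under numactl (error checking is omitted for brevity).
#include <hip/hip_runtime.h>

int main()
{
    const size_t size = 100 * sizeof(int);
    int *near_gpu0, *user_policy;

    // Default behavior: the allocation lands on the NUMA node closest
    // to the current device.
    hipSetDevice(0);
    hipHostMalloc(reinterpret_cast<void**>(&near_gpu0), size);

    // With hipHostMallocNumaUser, placement follows the user's NUMA policy
    // instead, e.g. one set by running the process under numactl.
    hipHostMalloc(reinterpret_cast<void**>(&user_policy), size,
                  hipHostMallocNumaUser);

    // Switching devices increases the NUMA distance to near_gpu0, but the
    // allocation remains accessible from the new device.
    hipSetDevice(1);

    hipHostFree(near_gpu0);
    hipHostFree(user_policy);
}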