CPU-GPU cooperative computing: K-means clustering tutorial#

Modern heterogeneous systems combine CPUs and GPUs to maximize computational throughput. GPUs provide massive data-parallel performance, whereas CPUs excel at complex control logic and latency-sensitive serial tasks. Many real-world algorithms—including unsupervised clustering—contain both parallelizable and inherently sequential components.

This tutorial demonstrates a hybrid CPU–GPU cooperative execution model using the K-means clustering algorithm, showcasing partitioned workload distribution, memory management, and performance optimization strategies in heterogeneous architectures.

Prerequisites#

To follow this tutorial, you’ll need the AMD GPU drivers and a HIP compiler toolchain installed so you can compile the code. HIP supports compiling and running on Linux and Windows with AMD GPUs; the full combination of install instructions is more than this tutorial can cover. For more information about installing HIP development packages, see Install HIP.

CPU–GPU cooperative computing#

Modern computing platforms integrate central processing units (CPUs) and graphics processing units (GPUs) into unified heterogeneous systems. Each processor type offers distinct architectural strengths:

  • CPUs feature a small number of complex cores optimized for sequential execution, branching, and latency-sensitive control flow.

  • GPUs consist of thousands of simpler cores optimized for throughput and massive data-level parallelism.

Cooperative computing refers to a programming paradigm that combines these capabilities in a coordinated execution model, where both CPU and GPU perform complementary portions of a workload. Rather than treating the GPU as a passive accelerator, this approach distributes computation dynamically between devices to maximize total system utilization.

K-means clustering#

K-means is an unsupervised machine learning algorithm that partitions a dataset into \(k\) clusters (groups of similar data points) by minimizing intra-cluster variance. It iteratively refines cluster assignments and centroid (center point of a cluster) locations until convergence.

The optimization objective is defined as:

\[\min_{C_1, \dots, C_k} \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j - \mu_i\|^2\]

where \(\mu_i\) is the centroid of cluster \(C_i\).

K-means is frequently used in:

  • Customer segmentation: Grouping customers by behavior patterns

  • Image compression: Reducing color palette by clustering similar colors

  • Anomaly detection: Identifying outliers that don’t fit clusters

  • Document clustering: Organizing similar documents together

  • Feature engineering: Creating new features based on cluster membership

Algorithm#

The K-means algorithm iteratively refines a partition of a dataset into \(k\) clusters by alternating between two primary computational phases: assignment and update. Each iteration minimizes the total within-cluster variance, driving the system toward convergence where centroid movement or membership changes fall below a defined threshold.

Iterative procedure#

  1. Initialization: Select initial centroids (random or via K-means++).

  2. Assignment: Assign each data point to the nearest centroid.

  3. Update: Recalculate each centroid as the mean of its assigned members.

  4. Convergence: Repeat until centroids stabilize or a maximum iteration limit is reached.

These phases alternate until the solution converges or a maximum iteration count is reached.

Initialization#

The algorithm begins by selecting initial centroid positions. This step significantly influences both convergence rate and clustering quality.

  • Random initialization: centroids are randomly chosen from the dataset.

  • K-means++: probabilistic seeding to spread centroids across data space.

  • Domain-specific heuristics: custom initializations for structured data.

Initial centroid diversity reduces the likelihood of poor local minima and improves overall stability.
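
As a concrete example, random initialization can be sketched as copying \(k\) distinct data points into the centroid array. The helper below (randomInitCentroids) is purely illustrative and is not the initializeCentroids function used by the sample code later in this tutorial:

#include <algorithm>
#include <random>
#include <vector>

// Illustrative sketch: pick k distinct data points as the initial centroids.
// data is a flattened array of length * dimension floats.
std::vector<float> randomInitCentroids(const std::vector<float>& data,
                                       int length, int dimension, int k) {
    std::vector<float> centroids(k * dimension);
    std::mt19937 rng(std::random_device{}());
    std::uniform_int_distribution<int> pick(0, length - 1);

    // Draw k distinct point indices.
    std::vector<int> chosen;
    while (static_cast<int>(chosen.size()) < k) {
        int idx = pick(rng);
        if (std::find(chosen.begin(), chosen.end(), idx) == chosen.end()) {
            chosen.push_back(idx);
        }
    }

    // Copy the selected points into the flattened centroid array.
    for (int c = 0; c < k; ++c) {
        for (int d = 0; d < dimension; ++d) {
            centroids[c * dimension + d] = data[chosen[c] * dimension + d];
        }
    }
    return centroids;
}

K-means++ replaces the uniform draw with a distance-weighted one, which typically yields better-separated seeds and faster convergence.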

Assignment#

For each data point \(x_i\), the algorithm computes the Euclidean distance to each centroid \(\mu_j\) and assigns the point to the nearest cluster:

\[C_i = \arg \min_j \|x_i - \mu_j\|^2\]

  • Each point–centroid distance calculation is independent.

  • Ideal for SIMD/SIMT architectures (GPU execution).

  • Dominated by dense floating-point arithmetic and memory bandwidth utilization.

This phase represents the embarrassingly parallel portion of the algorithm and provides the largest opportunity for GPU acceleration.

Update#

Once all data points are assigned, new centroid positions are computed as the mean of their corresponding cluster members:

\[\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i\]

  • Requires aggregation and division per cluster.

  • Involves variable membership counts across clusters.

  • Reduction-heavy and branch-divergent.

Although reduction can be implemented on GPUs, small \(k\) values and irregular membership sizes typically make the CPU more efficient for this phase, given its superior cache hierarchy and flexible control flow.

Convergence#

After the update phase, convergence is evaluated using one of the following criteria:

  • No change in point memberships.

  • Centroid displacement below a user-defined threshold.

  • Iteration count exceeds maximum limit.

If the convergence condition is not met, the assignment and update phases repeat.
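
For example, the centroid-displacement criterion can be sketched as follows; the helper name hasConverged and the epsilon threshold are illustrative and do not appear in the sample code later in this tutorial:

#include <cmath>
#include <vector>

// Illustrative sketch: true when every centroid moved less than epsilon
// between two consecutive iterations.
bool hasConverged(const std::vector<float>& oldCentroids,
                  const std::vector<float>& newCentroids,
                  int k, int dimension, float epsilon) {
    for (int c = 0; c < k; ++c) {
        float displacementSq = 0.0f;
        for (int d = 0; d < dimension; ++d) {
            float diff = newCentroids[c * dimension + d] -
                         oldCentroids[c * dimension + d];
            displacementSq += diff * diff;
        }
        if (std::sqrt(displacementSq) >= epsilon) {
            return false;
        }
    }
    return true;
}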

Summary of cooperative execution#

  1. GPU: performs parallel distance evaluations and membership assignments.

  2. CPU: executes centroid averaging and convergence checks.

  3. Synchronization: only essential data (centroids and membership arrays) is exchanged per iteration.

This division of labor minimizes data movement and maximizes hardware utilization. The resulting hybrid implementation combines GPU throughput for massive data-parallel operations with CPU efficiency for aggregation and control tasks, enabling scalable clustering performance on heterogeneous systems.

Implementation#

This implementation follows a hybrid CPU/GPU design to accelerate the most computationally expensive phase of K-means: assigning each data point to its nearest centroid. The GPU performs distance calculations in parallel, while the CPU handles the centroid recomputation, where sequential reductions are efficient and data sizes are small. The algorithm iterates between GPU-based membership updates and CPU-based centroid averaging until convergence or a maximum iteration count is reached.

Data structures#

The implementation stores all data in simple, contiguous arrays for efficient memory access on both CPU and GPU. Data points and centroids are represented as flattened std::vector<float> arrays, while cluster assignments are stored as std::vector<int>.

// length: Number of data points to cluster
// dimension: Number of features per data point
// k: Number of clusters

std::vector<float> data;           // Data points (length * dimension)
std::vector<float> centroids;      // Centroid positions (k * dimension)
std::vector<int> memberships;      // Cluster assignments (length)
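
With this flattened, row-major layout, element access reduces to a single index computation. The small helpers below are purely illustrative and are not used by the tutorial code:

#include <vector>

// Feature d of data point i in the flattened data array.
inline float pointFeature(const std::vector<float>& data,
                          int i, int dimension, int d) {
    return data[i * dimension + d];
}

// Feature d of centroid c in the flattened centroid array.
inline float centroidFeature(const std::vector<float>& centroids,
                             int c, int dimension, int d) {
    return centroids[c * dimension + d];
}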

Main loop#

The core K-means iteration:

// length is an integer for the number of entries to be clustered.
// dimension is an integer for the number of properties of each entry.
// k is an integer that determines the number of clusters.

std::vector<float> centroids = initializeCentroids(length * dimension, k);
std::vector<int> memberships(length, 0);

for (int iteration = 0; iteration < maxIterations; ++iteration) {
    // Determine the cluster that each entry belongs to (GPU phase).
    // The function returns how many entries changed membership.
    int membershipChanges = updateMembership(data.data(), centroids.data(),
                                             memberships.data(), length,
                                             dimension, k);

    // Convergence check: stop when no entry changed cluster.
    if (membershipChanges == 0) {
        break;
    }

    // Calculate new centroids (CPU phase).
    std::vector<float> newCentroids = centroids;
    updateCentroid(data.data(), newCentroids.data(), memberships.data(),
                   length, dimension, k);
    centroids = newCentroids;
}

Loop structure:

  1. Update memberships using GPU

  2. Check for convergence (no changes)

  3. Update centroids using CPU

  4. Repeat until convergence or max iterations

GPU membership update function#

The updateMembership function serves as the interface between CPU and GPU:

int updateMembership(float* data, float* centroids, int* membership,
                    int dataSize, int dimension, int k) {
    // gpuData is allocated and copied to the GPU earlier.

    float *gpuCentroids;
    int *gpuMembership;

    // Allocate GPU memory
    hipMalloc(&gpuCentroids, k * dimension * sizeof(float));
    hipMalloc(&gpuMembership, dataSize * sizeof(int));

    // Copy data from CPU to GPU
    hipMemcpy(gpuCentroids, centroids, k * dimension * sizeof(float),
              hipMemcpyHostToDevice);

    // Calculate the sizes for the kernel launch
    int localSize = 256;
    int globalSize = (dataSize + localSize - 1) / localSize;

    // Launch the kernel
    updateMembershipGPU<<<globalSize, localSize>>>(
        gpuData, gpuCentroids, gpuMembership, dataSize, dimension, k);
    hipDeviceSynchronize();

    // Create CPU Data to hold the results
    std::vector<int> cpuNewMembership(dataSize);

    // Copy GPU data back to CPU
    hipMemcpy(cpuNewMembership.data(), gpuMembership,
              dataSize * sizeof(int), hipMemcpyDeviceToHost);

    // Count membership updates
    int membershipUpdate = 0;
    for (int i = 0; i < dataSize; ++i) {
        if (membership[i] != cpuNewMembership[i]) {
            membershipUpdate++;
            membership[i] = cpuNewMembership[i]; // Update the original
        }
    }

    // Free GPU memory
    hipFree(gpuCentroids);
    hipFree(gpuMembership);

    return membershipUpdate;
}

Step 1: GPU memory allocation#

float *gpuCentroids;
int *gpuMembership;
hipMalloc(&gpuCentroids, k * dimension * sizeof(float));
hipMalloc(&gpuMembership, dataSize * sizeof(int));

Allocate space on the GPU for:

  • Centroids: \(k\) centroids, each with dimension features

  • Membership assignments: One cluster ID per data point

Step 2: Data transfer to GPU#

hipMemcpy(gpuCentroids, centroids, k * dimension * sizeof(float),
          hipMemcpyHostToDevice);

Transfer the current centroid positions from CPU to GPU. Note that the data points (gpuData) are already on the GPU from earlier initialization.
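
That earlier initialization is not shown in the function above. A minimal sketch of it, assuming the flattened std::vector<float> layout described earlier, might look like this:

#include <hip/hip_runtime.h>
#include <vector>

// Sketch: one-time upload of the data points to the GPU before the iteration
// loop starts. The returned pointer stays valid across all iterations and is
// released with hipFree() once clustering has finished.
float* uploadDataToGPU(const std::vector<float>& data) {
    float* gpuData = nullptr;
    hipMalloc(&gpuData, data.size() * sizeof(float));
    hipMemcpy(gpuData, data.data(), data.size() * sizeof(float),
              hipMemcpyHostToDevice);
    return gpuData;
}

Keeping the full dataset resident on the GPU means only the small centroid buffer has to be re-uploaded each iteration.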

Step 3: Kernel configuration#

int localSize = 256;
int globalSize = (dataSize + localSize - 1) / localSize;

  • localSize: 256 threads per block (common choice)

  • globalSize: Number of blocks needed to cover all data points

  • Rounding up ensures we process all data points
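
For example, with dataSize = 1,000,000 and localSize = 256 (illustrative values), globalSize = ⌈1,000,000 / 256⌉ = 3,907 blocks, which launches 3,907 × 256 = 1,000,192 threads; the tid < dataSize guard in the kernel simply ignores the 192 surplus threads.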

Step 4: Kernel launch#

updateMembershipGPU<<<globalSize, localSize>>>(
    gpuData, gpuCentroids, gpuMembership, dataSize, dimension, k);
hipDeviceSynchronize();

Launch the GPU kernel to compute cluster assignments in parallel, then wait for completion.

Step 5: Retrieve results#

std::vector<int> cpuNewMembership(dataSize);
hipMemcpy(cpuNewMembership.data(), gpuMembership,
          dataSize * sizeof(int), hipMemcpyDeviceToHost);

Copy the new membership assignments back from GPU to CPU.

Step 6: Count changes#

int membershipUpdate = 0;
for (int i = 0; i < dataSize; ++i) {
    if (membership[i] != cpuNewMembership[i]) {
        membershipUpdate++;
        membership[i] = cpuNewMembership[i];
    }
}

Count how many data points changed clusters. This value determines if the algorithm has converged.

Step 7: Cleanup#

hipFree(gpuCentroids);
hipFree(gpuMembership);
return membershipUpdate;

Free temporary GPU memory and return the change count.

Kernel implementation#

__global__ void updateMembershipGPU(
    float* data, float* centroids, int* membership,
    int dataSize, int dimension, int k)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid < dataSize) {
        float minDistance = INFINITY;
        int bestCluster = 0;

        // Find nearest centroid
        for (int cluster = 0; cluster < k; ++cluster) {
            float distance = 0.0f;

            // Calculate Euclidean distance
            for (int d = 0; d < dimension; ++d) {
                float diff = data[tid * dimension + d] -
                            centroids[cluster * dimension + d];
                distance += diff * diff;
            }

            if (distance < minDistance) {
                minDistance = distance;
                bestCluster = cluster;
            }
        }

        membership[tid] = bestCluster;
    }
}

Kernel operation:

  1. Each thread processes one data point

  2. Calculates distance to all \(k\) centroids

  3. Assigns point to nearest centroid

  4. Stores result in membership array

CPU centroid update#

The CPU handles the averaging operation:

void updateCentroid(float* data, float* centroids, int* membership,
                   int dataSize, int dimension, int k)
{
    std::vector<int> counts(k, 0);
    std::vector<float> sums(k * dimension, 0.0f);

    // Accumulate sums for each cluster
    for (int i = 0; i < dataSize; ++i) {
        int cluster = membership[i];
        counts[cluster]++;

        for (int d = 0; d < dimension; ++d) {
            sums[cluster * dimension + d] += data[i * dimension + d];
        }
    }

    // Calculate averages (new centroids)
    for (int cluster = 0; cluster < k; ++cluster) {
        if (counts[cluster] > 0) {
            for (int d = 0; d < dimension; ++d) {
                centroids[cluster * dimension + d] =
                    sums[cluster * dimension + d] / counts[cluster];
            }
        }
    }
}

The centroid update requires:

  1. Reduction operation: Sum all points in each cluster.

  2. Variable-sized groups: Clusters have different numbers of points.

  3. Division: Calculate average (not easily parallelized).

While GPUs can perform reductions, for this workload:

  • The number of clusters (\(k\)) is typically small (< 100).

  • The CPU can efficiently handle this sequential aggregation.

  • Avoiding GPU complexity keeps code simpler.
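
For comparison, a GPU-side version of the accumulation could be sketched with global atomics, as below. The kernel name and buffers are illustrative, sums and counts would have to be zeroed before the launch, and a follow-up step would still perform the division; contention on only \(k\) accumulators is exactly what makes this less attractive for small \(k\):

// Sketch: accumulate per-cluster sums and counts on the GPU using atomics.
__global__ void accumulateCentroidsGPU(
    const float* data, const int* membership,
    float* sums, int* counts, int dataSize, int dimension)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid < dataSize) {
        int cluster = membership[tid];
        atomicAdd(&counts[cluster], 1);

        for (int d = 0; d < dimension; ++d) {
            atomicAdd(&sums[cluster * dimension + d],
                      data[tid * dimension + d]);
        }
    }
}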

Data transfer considerations#

Data already on CPU:

  • Membership array was just copied back.

  • Data points can be kept on CPU for small datasets.

  • Centroids are small (k × dimension values).

Transfer costs:

  • Only the centroids need to be copied back to the GPU.

  • Much smaller than the full dataset.

  • Overhead justified by parallel speedup in assignment phase.
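
As a rough illustration with hypothetical sizes, clustering 1,000,000 points with dimension = 16 and k = 8 keeps a 1,000,000 × 16 × 4 B ≈ 64 MB dataset resident on the GPU, while the per-iteration centroid upload is only 8 × 16 × 4 B = 512 B.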

Best practices#

This section outlines recommended practices for implementing efficient GPU-accelerated K-means clustering. It highlights design principles, memory-management strategies, and debugging techniques that help ensure correctness, maintainability, and high performance when mapping K-means onto modern GPU architectures.

Design principles#

  1. Identify parallelism

    Decompose the K-means workflow into independent, data-parallel kernels such as distance computation. Minimize sequential dependencies in assignment and reduction steps.

  2. Minimize host–device data movement

    Maintain dataset and intermediate buffers resident on the GPU across iterations. Transfer only convergence metrics or centroids when absolutely necessary.

  3. Batch and fuse operations

    Combine smaller kernels such as distance computation and assignment, or process multiple mini-batches per kernel launch to reduce kernel invocation and PCIe transfer overhead.

  4. Overlap communication with computation

    Utilize asynchronous HIP streams to overlap host–device memory transfers with GPU kernel execution. Employ pinned (page-locked) memory for faster DMA transfers.
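
A minimal sketch of this pattern, reusing the buffer names from the updateMembership function above, could look like the following; pinned memory is allocated with hipHostMalloc, and the copy and kernel are issued on the same stream so they execute in order while the host thread stays free:

// Sketch: asynchronous centroid upload from pinned host memory, followed by
// the membership kernel on the same HIP stream.
float* pinnedCentroids = nullptr;
hipHostMalloc(&pinnedCentroids, k * dimension * sizeof(float),
              hipHostMallocDefault);

hipStream_t stream;
hipStreamCreate(&stream);

// Fill pinnedCentroids with the current centroid values, then enqueue the
// copy; hipMemcpyAsync returns immediately.
hipMemcpyAsync(gpuCentroids, pinnedCentroids, k * dimension * sizeof(float),
               hipMemcpyHostToDevice, stream);

// The kernel is ordered after the copy because it uses the same stream.
updateMembershipGPU<<<globalSize, localSize, 0, stream>>>(
    gpuData, gpuCentroids, gpuMembership, dataSize, dimension, k);

// Synchronize only when the results are actually needed on the host.
hipStreamSynchronize(stream);

hipStreamDestroy(stream);
hipHostFree(pinnedCentroids);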

Memory strategy#

  1. Persistent device allocations

    Allocate GPU memory once before iterative processing. Reuse buffers such as those for centroids, assignments, and temporary reductions to avoid the overhead of repeated hipMalloc() and hipFree() calls.

  2. Data locality optimization

    Co-locate data structures according to access frequency:

    • Store high-throughput arrays (points, centroids) in global memory with coalesced access.

    • Cache small, frequently reused values in shared memory or registers.

  3. Transfer scheduling

    Schedule host–device transfers asynchronously while the GPU executes kernels. Use HIP events or streams to synchronize only when necessary.

  4. Memory pooling and reuse

    Use memory pools such as hipMallocAsync(), hipMallocFromPoolAsync(), or custom allocators to mitigate fragmentation and reduce allocation latency, especially in iterative or batched workloads.
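
As one illustration of stream-ordered allocation (available in recent ROCm releases, and assuming a hipStream_t stream created as in the earlier sketch), a temporary device buffer can be taken from and returned to the default memory pool on a stream; treat this as a sketch rather than a drop-in snippet:

// Sketch: allocate a temporary device buffer from the stream's memory pool,
// use it in kernels enqueued on the same stream, then return it to the pool.
float* scratch = nullptr;
hipMallocAsync(reinterpret_cast<void**>(&scratch),
               dataSize * sizeof(float), stream);

// ... enqueue kernels that use scratch on the same stream ...

hipFreeAsync(scratch, stream);
hipStreamSynchronize(stream);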

Performance considerations#

  • Large datasets: GPU acceleration scales linearly with dataset size. Prioritize keeping data on the GPU to avoid PCIe bottlenecks.

  • Many iterations: Use persistent buffers and asynchronous reduction to minimize per-iteration synchronization overhead.

  • Small number of clusters (k): Kernel execution may be underutilized. CPU execution or hybrid scheduling may be more efficient.

  • High-dimensional data: Distance computation dominates cost. Leverage GPU shared memory for centroid caching and ensure memory coalescing for point vectors.
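
For instance, when k × dimension is small enough to fit in shared memory, a variant of the membership kernel can stage all centroids in shared memory once per block. The kernel below is a sketch along those lines, not part of the tutorial's reference implementation:

__global__ void updateMembershipSharedGPU(
    const float* data, const float* centroids, int* membership,
    int dataSize, int dimension, int k)
{
    // Dynamically sized shared memory holds all k centroids for this block.
    extern __shared__ float sharedCentroids[];

    // Cooperatively copy the centroids into shared memory.
    for (int idx = threadIdx.x; idx < k * dimension; idx += blockDim.x) {
        sharedCentroids[idx] = centroids[idx];
    }
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < dataSize) {
        float minDistance = INFINITY;
        int bestCluster = 0;

        for (int cluster = 0; cluster < k; ++cluster) {
            float distance = 0.0f;
            for (int d = 0; d < dimension; ++d) {
                float diff = data[tid * dimension + d] -
                             sharedCentroids[cluster * dimension + d];
                distance += diff * diff;
            }
            if (distance < minDistance) {
                minDistance = distance;
                bestCluster = cluster;
            }
        }
        membership[tid] = bestCluster;
    }
}

// Launch with the shared-memory size as the third launch parameter:
// updateMembershipSharedGPU<<<globalSize, localSize,
//                             k * dimension * sizeof(float)>>>(
//     gpuData, gpuCentroids, gpuMembership, dataSize, dimension, k);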

When to use CPU-GPU cooperation#

  • Hybrid computational structure

    The algorithm exhibits a clear division between compute-intensive, data-parallel kernels such as distance computation in K-means and lightweight, control-heavy serial stages such as centroid updates or convergence checks.

  • High arithmetic intensity in parallel regions

    The GPU-executed portion performs substantial arithmetic operations per memory access, ensuring efficient utilization of GPU cores and minimizing the impact of memory latency.

  • Low-complexity reduction or synchronization on CPU

    The CPU phase primarily aggregates results or performs decision logic that does not justify GPU kernel execution overhead.

  • Large-scale data processing

    Dataset size exceeds the threshold where PCIe transfer costs can be amortized, enabling sustained GPU throughput and effective memory reuse.

  • Iterative or streaming workloads

    Algorithms involving multiple iterations benefit from persistent GPU memory allocations and overlapped data transfer, significantly reducing per-iteration latency.

When to avoid CPU-GPU cooperation#

  • Fully parallelizable workloads

    Algorithms with uniform, independent computations across all data points are better suited for GPU-only execution without CPU coordination overhead.

  • Low computational density per element

    If each data element requires few operations relative to transfer cost, CPU–GPU communication latency will dominate, leading to suboptimal performance.

  • High inter-phase data dependency

    Frequent synchronization or data exchange between CPU and GPU phases prevents effective pipeline overlap and leads to idle compute units.

  • Small problem sizes

    When datasets fit comfortably in CPU cache or system memory, the GPU launch overhead and transfer latency outweigh any computational gains from offloading.

Conclusion#

The K-means implementation illustrates heterogeneous workload partitioning between CPU and GPU. By mapping compute-intensive, data-parallel operations to the GPU and reduction-heavy serial logic to the CPU, total runtime is minimized while code complexity remains manageable.

This CPU–GPU cooperative execution paradigm generalizes to many algorithms combining reduction, aggregation, and distance-based computation—enabling scalable, efficient utilization of modern heterogeneous hardware.