Multi-kernel programming: breadth-first search tutorial#

Many real-world GPU workloads involve multiple kernels cooperating to solve a single problem. This tutorial explores multi-kernel GPU programming using the breadth-first search (BFS) algorithm, a foundational graph traversal method widely used in networking, path-finding, and social network analysis.

The implementation is adapted from the Rodinia benchmark suite, a well-known collection of heterogeneous computing workloads that demonstrate different parallel programming strategies.

Prerequisites#

To follow this tutorial, you’ll need installed drivers and a HIP compiler toolchain to compile your code. HIP supports compiling and running on Linux and Windows with AMD GPUs; covering every platform and installation combination is beyond the scope of this tutorial. For more information about installing HIP development packages, see Install HIP.

Multi-kernel GPU programming#

In GPU computing, some algorithms cannot be efficiently expressed using a single kernel due to synchronization or dependency constraints. Instead, they are decomposed into multiple kernels that execute sequentially, with each kernel responsible for a specific computation phase.

This approach, called multi-kernel programming, is essential when:

  • Results from one kernel determine the input for the next.

  • Global synchronization between thread blocks is required.

  • Control flow depends on runtime conditions.

  • The algorithm involves iterative or level-wise processing.
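
In practice, this takes the form of a host-driven loop that launches one kernel per phase and checks a completion flag between iterations. The following toy program is a minimal, hedged sketch of that pattern; the kernels, names, and doubling workload are illustrative placeholders, unrelated to BFS:

#include <hip/hip_runtime.h>
#include <cstdio>

// Phase 1 of the toy workload: transform the state.
__global__ void phase1(int* value) { *value *= 2; }

// Phase 2: inspect the state and decide whether another round is needed.
__global__ void phase2(const int* value, bool* done, int limit)
{
    if (*value >= limit) *done = true;
}

int main()
{
    int* d_value;
    bool* d_done;
    hipMalloc((void**)&d_value, sizeof(int));
    hipMalloc((void**)&d_done, sizeof(bool));

    int h_value = 1;
    hipMemcpy(d_value, &h_value, sizeof(int), hipMemcpyHostToDevice);

    bool h_done = false;
    while (!h_done) {
        hipMemcpy(d_done, &h_done, sizeof(bool), hipMemcpyHostToDevice);
        phase1<<<1, 1>>>(d_value);                // phase 1: compute
        phase2<<<1, 1>>>(d_value, d_done, 1000);  // phase 2: termination check
        // The blocking copy waits for both kernels on the default stream.
        hipMemcpy(&h_done, d_done, sizeof(bool), hipMemcpyDeviceToHost);
    }

    hipMemcpy(&h_value, d_value, sizeof(int), hipMemcpyDeviceToHost);
    printf("final value: %d\n", h_value);  // 1024, the first doubling >= 1000
    hipFree(d_value);
    hipFree(d_done);
    return 0;
}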

Breadth-first search (BFS)#

Breadth-first search (BFS) is a layered graph traversal algorithm that explores nodes level by level, starting from a root node. It guarantees finding the shortest path (in edge count) to all reachable nodes in an unweighted graph.

Applications of BFS include:

  • Path-finding: Finding shortest paths between nodes.

  • Peer-to-peer networking: Network topology discovery.

  • GPS navigation: Route planning and optimization.

  • Social networks: Friend recommendations and connection analysis.

  • Web crawling: Systematic website exploration.

Algorithm characteristics#

BFS is structured as a level-synchronous algorithm:

  • Nodes in the same graph level are processed concurrently.

  • A queue (or “frontier”) tracks which nodes to explore next.

  • Each node is visited once to prevent redundant processing.

Sequential BFS is straightforward but inherently serial due to the level-by-level dependency between nodes. GPU parallelization requires restructuring the traversal to exploit data parallelism across nodes within the same frontier.

Sequential BFS algorithm#

Let’s first understand how BFS works sequentially before parallelizing it.

Consider a simple graph with four nodes:

  R (root)
 / \
A   B
 \ /
  C

Step 1: Start at the root node R

  • Mark R as visited

  • Enqueue R

  • Queue: [R]

Step 2: Process R

  • Dequeue R

  • Discover neighbors: A and B

  • Enqueue both, mark as visited

  • Queue: [A, B]

Step 3: Process A

  • Dequeue A

  • Neighbors: R (visited) and C (new)

  • Enqueue C

  • Queue: [B, C]

Step 4: Process B

  • Dequeue B

  • Neighbors: R (visited) and C (visited)

  • Queue: [C]

Step 5: Process C

  • Dequeue C

  • All neighbors visited

  • Queue becomes empty — traversal complete
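
For reference, here is the same level-by-level traversal in plain host code. This is a minimal sketch using a simple adjacency-list representation (std::vector<std::vector<int>>), not the array layout the GPU version uses later:

#include <queue>
#include <vector>

// Sequential BFS: returns the edge-count distance from src to every node,
// or -1 for unreachable nodes.
std::vector<int> bfs(const std::vector<std::vector<int>>& adj, int src)
{
    std::vector<int> cost(adj.size(), -1);
    std::queue<int> frontier;
    cost[src] = 0;
    frontier.push(src);

    while (!frontier.empty()) {
        int node = frontier.front();
        frontier.pop();
        for (int next : adj[node]) {
            if (cost[next] == -1) {  // not yet visited
                cost[next] = cost[node] + 1;
                frontier.push(next);
            }
        }
    }
    return cost;
}

For the four-node example above (with R, A, B, C numbered 0 to 3), bfs(adj, 0) returns the costs {0, 1, 1, 2}.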

Parallel BFS on GPU#

Unlike dense linear algebra, BFS is an irregular algorithm. The amount of work per node varies, and the connectivity pattern of the graph drives execution. The main challenges are:

  1. Data dependencies: nodes in the next level depend on the previous level.

  2. Irregular parallelism: each frontier may contain a very different number of nodes.

  3. Dynamic workload: the size of the next frontier is unknown at runtime.

  4. Synchronization: all nodes in one frontier must complete before the next begins.

The frontier is the set of nodes being processed at a given BFS level. Parallel BFS executes all frontier nodes simultaneously, using one thread per node to discover new neighbors and mark them for the next iteration.

Implementation strategy#

The GPU implementation performs BFS using two cooperating kernels:

  1. Kernel 1: processes all nodes in the current frontier.

  2. Kernel 2: updates the next frontier and checks if work remains.

This design provides implicit synchronization between levels while avoiding race conditions. The host (CPU) manages the iterative control loop, launching kernels repeatedly until no more frontier nodes exist.

Data structures#

The graph is represented using adjacency lists stored in arrays:

struct Node {
    int starting;       // starting index in the edge list
    int no_of_edges;    // number of outgoing edges
};

Main arrays:

  • g_graph_nodes: node array storing offsets into the edge list.

  • g_graph_edges: flattened list of edge destinations.

  • g_graph_mask: boolean array indicating active frontier nodes.

  • g_updating_graph_mask: marks nodes to be added to the next frontier.

  • g_graph_visited: tracks which nodes were visited.

  • g_graph_cost: stores the distance (edge count) from the source node.

Control flow flags:

  • g_over: device-side flag indicating whether another iteration is needed.

  • The host resets this flag each iteration and checks it after kernel execution.
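
To make the layout concrete, here is one way the four-node example graph from the sequential walkthrough could be encoded in these arrays. This is a hedged sketch: the node numbering R=0, A=1, B=2, C=3 and the host-side h_ prefix are assumptions, not part of the original code:

// Undirected graph, so every edge is stored in both directions.
Node h_graph_nodes[4] = {
    {0, 2},  // R: edges [0..1] -> A, B
    {2, 2},  // A: edges [2..3] -> R, C
    {4, 2},  // B: edges [4..5] -> R, C
    {6, 2},  // C: edges [6..7] -> A, B
};
int h_graph_edges[8] = {1, 2, 0, 3, 0, 3, 1, 2};

// Source node R starts as the only frontier member, already visited, cost 0.
bool h_graph_mask[4]          = {true, false, false, false};
bool h_updating_graph_mask[4] = {false, false, false, false};
bool h_graph_visited[4]       = {true, false, false, false};
int  h_graph_cost[4]          = {0, -1, -1, -1};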

The two-kernel approach#

The two-kernel structure ensures correctness and efficient synchronization:

  • Exploration kernel (Kernel 1) discovers new nodes.

  • Update kernel (Kernel 2) finalizes state for the next iteration.

This separation:

  • Avoids race conditions between threads of different levels.

  • Provides synchronization between BFS levels.

  • Keeps control logic simple on the host side.

Kernel 1: process current frontier#

Each thread processes one node from the current frontier, examining all of its outgoing edges:

__global__ void Kernel1(
    Node* g_graph_nodes,
    int* g_graph_edges,
    bool* g_graph_mask,
    bool* g_updating_graph_mask,
    bool* g_graph_visited,
    int* g_graph_cost,
    int no_of_nodes)
{
    // Global thread index: one thread per graph node.
    int tid = hipBlockIdx_x * MAX_THREADS_PER_BLOCK + hipThreadIdx_x;

    // Only threads mapped to active frontier nodes do any work.
    if (tid < no_of_nodes && g_graph_mask[tid]) {
        g_graph_mask[tid] = false;  // remove the node from the current frontier

        // Walk this node's slice of the flattened edge list.
        for (int i = g_graph_nodes[tid].starting;
             i < g_graph_nodes[tid].starting + g_graph_nodes[tid].no_of_edges;
             i++) {
            int id = g_graph_edges[i];
            if (!g_graph_visited[id]) {
                // Unvisited neighbor: it sits one level further from the source.
                g_graph_cost[id] = g_graph_cost[tid] + 1;
                // Stage it for the next frontier; Kernel 2 finalizes the state.
                g_updating_graph_mask[id] = true;
            }
        }
    }
}

Kernel 1 responsibilities:

  • Clear the node’s mask (mark processed).

  • Explore all edges.

  • For each unvisited neighbor:

    • Compute cost (distance).

    • Add to the next frontier.

Kernel 2: update frontier#

This kernel finalizes the next frontier:

__global__ void Kernel2(
    bool* g_graph_mask,
    bool* g_updating_graph_mask,
    bool* g_graph_visited,
    bool* g_over,
    int no_of_nodes)
{
    int tid = hipBlockIdx_x * MAX_THREADS_PER_BLOCK + hipThreadIdx_x;

    // Promote every staged node into the active frontier.
    if (tid < no_of_nodes && g_updating_graph_mask[tid]) {
        g_graph_mask[tid] = true;            // node joins the next frontier
        g_graph_visited[tid] = true;         // mark it visited exactly once
        *g_over = true;                      // signal the host: more work remains
        g_updating_graph_mask[tid] = false;  // clear the staging mask for reuse
    }
}

Kernel 2 responsibilities:

  • Move newly discovered nodes into the active frontier.

  • Mark them as visited.

  • Signal continuation via *g_over.

Host-side control loop#

do {
    // Reset the continuation flag before each level.
    h_over = false;
    hipMemcpy(d_over, &h_over, sizeof(bool), hipMemcpyHostToDevice);

    // Phase 1: expand every node in the current frontier.
    Kernel1<<<num_blocks, MAX_THREADS_PER_BLOCK>>>(
        d_graph_nodes, d_graph_edges, d_graph_mask,
        d_graph_updating_graph_mask, d_graph_visited,
        d_graph_cost, no_of_nodes);
    hipDeviceSynchronize();

    // Phase 2: promote staged nodes into the next frontier.
    Kernel2<<<num_blocks, MAX_THREADS_PER_BLOCK>>>(
        d_graph_mask, d_graph_updating_graph_mask,
        d_graph_visited, d_over, no_of_nodes);
    hipDeviceSynchronize();

    // Continue while Kernel2 discovered at least one new node.
    hipMemcpy(&h_over, d_over, sizeof(bool), hipMemcpyDeviceToHost);
} while (h_over);

The loop exits when no new nodes are discovered: if Kernel 2 never sets the device flag g_over, the host-side copy h_over remains false after a full iteration and the traversal is complete.
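
The loop assumes the device buffers and grid dimensions were prepared beforehand. The following is a hedged sketch of that setup, reusing the host arrays from the data-structure example above; no_of_edges (the total edge-list length) and the omitted error checking are simplifications:

// One thread per node, rounded up to whole blocks.
int num_blocks = (no_of_nodes + MAX_THREADS_PER_BLOCK - 1) / MAX_THREADS_PER_BLOCK;

// Allocate device buffers; they stay resident on the GPU for the whole traversal.
Node* d_graph_nodes;
int*  d_graph_edges;
bool* d_graph_mask;
bool* d_graph_updating_graph_mask;
bool* d_graph_visited;
int*  d_graph_cost;
bool* d_over;

hipMalloc((void**)&d_graph_nodes, sizeof(Node) * no_of_nodes);
hipMalloc((void**)&d_graph_edges, sizeof(int) * no_of_edges);
hipMalloc((void**)&d_graph_mask, sizeof(bool) * no_of_nodes);
hipMalloc((void**)&d_graph_updating_graph_mask, sizeof(bool) * no_of_nodes);
hipMalloc((void**)&d_graph_visited, sizeof(bool) * no_of_nodes);
hipMalloc((void**)&d_graph_cost, sizeof(int) * no_of_nodes);
hipMalloc((void**)&d_over, sizeof(bool));

// Copy the graph and the initial frontier state over once.
hipMemcpy(d_graph_nodes, h_graph_nodes, sizeof(Node) * no_of_nodes, hipMemcpyHostToDevice);
hipMemcpy(d_graph_edges, h_graph_edges, sizeof(int) * no_of_edges, hipMemcpyHostToDevice);
hipMemcpy(d_graph_mask, h_graph_mask, sizeof(bool) * no_of_nodes, hipMemcpyHostToDevice);
hipMemcpy(d_graph_updating_graph_mask, h_updating_graph_mask, sizeof(bool) * no_of_nodes, hipMemcpyHostToDevice);
hipMemcpy(d_graph_visited, h_graph_visited, sizeof(bool) * no_of_nodes, hipMemcpyHostToDevice);
hipMemcpy(d_graph_cost, h_graph_cost, sizeof(int) * no_of_nodes, hipMemcpyHostToDevice);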

Performance characteristics#

Parallelism patterns#

Within each iteration:

  • High parallelism: All frontier nodes are processed simultaneously.

  • Work distribution: One thread per node.

Across iterations:

  • Sequential: One level must complete before the next begins.

  • Variable parallelism: Different levels may contain different numbers of nodes.

Workload characteristics#

  • Irregular: Frontier size varies dramatically across levels.

  • Data-dependent: The graph structure determines the parallel work available.

  • Dynamic: The workload cannot be predicted statically.

  • Memory-bound: Many memory accesses per unit of computation.

Best practices#

This section outlines recommended practices for implementing an efficient GPU-accelerated breadth-first search (BFS). It highlights design principles, memory-management strategies, and debugging techniques that help ensure correctness, maintainability, and high performance when mapping BFS onto modern GPU architectures.

Design principles#

  1. Define clear kernel roles

    Decompose BFS into well-defined GPU kernels, each responsible for a specific phase of computation. For example:

    • Kernel 1: frontier expansion (discovering new nodes)

    • Kernel 2: frontier update (marking next-level nodes)

    This separation simplifies synchronization and ensures that each kernel operates on independent data regions.

  2. Minimize host–device communication

    Keep graph data structures (nodes, edges, masks) resident on the GPU across iterations. Only transfer lightweight control flags such as g_over to the host each loop iteration to check termination conditions.

  3. Kernel boundaries as synchronization points

    Kernel launch boundaries on the same stream naturally enforce global synchronization across all threads on the GPU. Each kernel invocation completes before the next begins, ensuring that:

    • All nodes in the current frontier are fully processed before updating the next frontier.

    • Memory updates to arrays like g_graph_cost or g_graph_mask are visible to all threads in subsequent kernels.

    This avoids the need for costly device-wide barriers or explicit synchronization primitives within a single kernel. Leverage kernel sequencing to structure iterative algorithms cleanly: each kernel represents one computation phase per BFS level (a simplified loop variant follows this list).

  4. Flag-based control

    Use device-side flags for dynamic termination and conditional control flow. In BFS, the Boolean flag g_over serves as a device-to-host signal indicating whether new nodes were discovered during the current iteration.

    • Initialize g_over to false on the host at the start of each iteration.

    • Allow GPU threads in Kernel 2 to set *g_over = true when adding new nodes to the next frontier.

    • After kernel completion, copy the flag back to the host using hipMemcpy(). If g_over remains false, the traversal is complete.

    This mechanism avoids repeated host intervention and enables a tight CPU–GPU control loop that dynamically adapts to workload size without transferring large data structures.
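
Because launches on one stream execute in order, the two intermediate hipDeviceSynchronize() calls in the host loop shown earlier can be collapsed into a single synchronization point just before the host reads the flag. A hedged variant of the loop (a sketch, not the Rodinia original):

do {
    h_over = false;
    hipMemcpy(d_over, &h_over, sizeof(bool), hipMemcpyHostToDevice);

    // Back-to-back launches on the default stream: Kernel2 cannot start
    // before Kernel1 has finished.
    Kernel1<<<num_blocks, MAX_THREADS_PER_BLOCK>>>(
        d_graph_nodes, d_graph_edges, d_graph_mask,
        d_graph_updating_graph_mask, d_graph_visited,
        d_graph_cost, no_of_nodes);
    Kernel2<<<num_blocks, MAX_THREADS_PER_BLOCK>>>(
        d_graph_mask, d_graph_updating_graph_mask,
        d_graph_visited, d_over, no_of_nodes);

    // Synchronize once, just before the host reads the flag.
    hipDeviceSynchronize();
    hipMemcpy(&h_over, d_over, sizeof(bool), hipMemcpyDeviceToHost);
} while (h_over);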

Memory strategy#

  1. Persistent device allocations

    Allocate all required device buffers once prior to traversal. Reuse these allocations across multiple BFS runs or multiple source nodes to minimize the overhead of repeated hipMalloc() and hipFree() calls.

  2. Use pinned host memory for control flags

    When copying g_over or other control signals between host and device, allocate the host-side buffer as pinned (page-locked) memory to accelerate DMA transfers, as shown in the sketch after this list.
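
A hedged sketch of that allocation (hipHostMalloc and hipHostFree are the HIP calls for pinned host memory; the h_over_pinned name is an assumption):

// Allocate the control flag in pinned (page-locked) host memory.
bool* h_over_pinned;
hipHostMalloc((void**)&h_over_pinned, sizeof(bool), hipHostMallocDefault);

// Use *h_over_pinned wherever the control loop used h_over:
*h_over_pinned = false;
hipMemcpy(d_over, h_over_pinned, sizeof(bool), hipMemcpyHostToDevice);
// ... launch kernels ...
hipMemcpy(h_over_pinned, d_over, sizeof(bool), hipMemcpyDeviceToHost);

hipHostFree(h_over_pinned);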

Debugging and validation#

  1. Frontier validation

    After each iteration, verify the number of nodes marked in g_graph_mask. Unexpected empty or overfull frontiers often indicate incorrect synchronization or uninitialized masks (see the sketch after this list).

  2. Termination condition check

    Confirm that the host-side loop terminates when g_over remains false for one iteration. If the loop never ends, ensure g_over is reset on the host before each kernel launch.

  3. Result verification

    Compare computed distances in g_graph_cost against a CPU reference implementation for small graphs to validate correctness; the sketch after this list combines this with the frontier check.

  4. Profiling and bottleneck detection

    Use tools such as rocprofv3 or the ROCm Compute Profiler to measure per-kernel execution times, memory throughput, and synchronization overhead.

  5. Logging and debug builds

    Enable optional logging for iteration counts, frontier sizes, and synchronization states during development. Disable logging in production builds to avoid performance impact.
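
The following helpers sketch checks 1 and 3 from this list. They are assumptions layered on the tutorial’s variables: count_frontier and verify_costs are hypothetical names, bfs() is the sequential reference sketched earlier, and sizeof(bool) == 1 is assumed for the byte-wise mask copy:

#include <cstdio>
#include <vector>

// Check 1: copy the frontier mask back and count the active nodes.
int count_frontier(const bool* d_graph_mask, int no_of_nodes)
{
    std::vector<char> h_mask(no_of_nodes);  // char buffer: std::vector<bool> has no data()
    hipMemcpy(h_mask.data(), d_graph_mask, sizeof(bool) * no_of_nodes,
              hipMemcpyDeviceToHost);
    int frontier_size = 0;
    for (int i = 0; i < no_of_nodes; i++)
        frontier_size += h_mask[i] ? 1 : 0;
    return frontier_size;
}

// Check 3: compare GPU distances against the sequential CPU reference.
void verify_costs(const int* d_graph_cost,
                  const std::vector<std::vector<int>>& adj, int source)
{
    std::vector<int> h_cost(adj.size());
    hipMemcpy(h_cost.data(), d_graph_cost, sizeof(int) * adj.size(),
              hipMemcpyDeviceToHost);
    std::vector<int> ref = bfs(adj, source);
    for (size_t i = 0; i < adj.size(); i++)
        if (h_cost[i] != ref[i])
            printf("mismatch at node %zu: GPU %d vs CPU %d\n", i, h_cost[i], ref[i]);
}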

Conclusion#

Multi-kernel GPU programming is essential for complex algorithms that require:

  • Multiple phases of computation.

  • Data dependencies between phases.

  • Dynamic control flow based on intermediate results.

The BFS example demonstrates:

  • How to decompose algorithms into multiple cooperating kernels.

  • Techniques for managing frontiers and iterative processing.

  • Strategies for handling irregular and dynamic parallelism.

  • Proper synchronization between kernel launches.

Key takeaways:

  1. Kernel boundaries provide synchronization: Use them strategically to ensure correctness.

  2. Separate exploration from update: Prevents race conditions in level-based algorithms.

  3. Host controls iteration: CPU manages the overall loop while GPU does heavy lifting.

  4. Flags enable dynamic control: Device-side flags allow work-dependent termination.

Understanding multi-kernel patterns enables developers to implement sophisticated algorithms like graph processing, dynamic programming, and iterative refinement methods efficiently on GPUs.