ROCprofiler-SDK counter collection services#

There are two modes of counter collection service:

  • Dispatch profiling: In this mode, counters are collected on a per-kernel launch basis. This mode is useful for collecting highly detailed counters for a specific kernel execution in isolation. Note that dispatch profiling allows only a single kernel to execute in hardware at a time.

  • Agent profiling: In this mode, counters are collected at the device level. This mode is useful for collecting device-level counters that are not tied to a specific kernel execution, for example counter values for a specific time range.

This topic explains how to set up dispatch and agent profiling and use the common counter collection APIs. For details on the APIs, including the less commonly used counter collection APIs, see the API library. For fully functional examples of both dispatch and agent profiling, see Samples.

Definitions#

Profile Config: A configuration to specify the counters to be collected on an agent. This must be supplied to various counter collection APIs to initiate collection of counter data. Profiles are agent-specific and can’t be used on different agents.

Counter ID: Unique Id (per-architecture) that specifies the counter. The counter Id can be used to fetch counter information such as its name or expression.

Instance ID: Unique record Id that encodes the counter Id and dimension for a collected value.

Dimension: Dimensions help to provide context to the raw counter values by specifying the hardware register that is the source of counter collection such as a shader engine. All counter values have dimension data encoded in their instance Id, which allows you to extract the values for individual dimensions using functions in the counter interface. The following dimensions are supported:

    ROCPROFILER_DIMENSION_XCC,            ///< XCC dimension of result
    ROCPROFILER_DIMENSION_AID,            ///< AID dimension of result
    ROCPROFILER_DIMENSION_SHADER_ENGINE,  ///< SE dimension of result
    ROCPROFILER_DIMENSION_AGENT,          ///< Agent dimension
    ROCPROFILER_DIMENSION_SHADER_ARRAY,   ///< Number of shader arrays
    ROCPROFILER_DIMENSION_WGP,            ///< Number of workgroup processors
    ROCPROFILER_DIMENSION_INSTANCE,       ///< From unspecified hardware register
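
To make the relationship between counter Id, instance Id, and dimensions more concrete, the following sketch packs a counter Id and per-dimension positions into one 64-bit value and reads them back. The bit layout (counter Id in the top 16 bits, 8 bits per dimension), the Dim enum, and all helper names are invented for illustration and are not the SDK's actual encoding. In a real tool, decode instance Ids only through the counter interface functions used in the Buffered callback section.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical dimension slots for this sketch (not the SDK enum values).
enum class Dim : unsigned { XCC = 0, ShaderEngine = 1, ShaderArray = 2, Instance = 3 };

// Hypothetical layout: bits 48-63 hold the counter Id and each dimension
// owns an 8-bit slot starting at bit 0.
constexpr unsigned kDimBits      = 8;
constexpr unsigned kCounterShift = 48;

// Start an instance Id that carries only the counter Id.
inline std::uint64_t make_instance_id(std::uint64_t counter_id)
{
    return counter_id << kCounterShift;
}

// Store the position within one dimension (for example, shader engine 3).
inline std::uint64_t set_dim(std::uint64_t instance_id, Dim dim, std::uint64_t pos)
{
    assert(pos <= 0xFF);
    const unsigned      shift = static_cast<unsigned>(dim) * kDimBits;
    const std::uint64_t mask  = std::uint64_t{0xFF} << shift;
    return (instance_id & ~mask) | (pos << shift);
}

// Recover the counter Id from an instance Id.
inline std::uint64_t get_counter_id(std::uint64_t instance_id)
{
    return instance_id >> kCounterShift;
}

// Recover the position within one dimension from an instance Id.
inline std::uint64_t get_dim(std::uint64_t instance_id, Dim dim)
{
    return (instance_id >> (static_cast<unsigned>(dim) * kDimBits)) & 0xFF;
}
```

The point of the sketch is that one record Id is enough to tell which counter a value belongs to and which hardware instance produced it, so values can be filtered or grouped per dimension without extra metadata.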

Using the counter collection service#

The setup for dispatch and agent profiling is similar, and only minor changes are needed to adapt code from one to the other. Here are the steps required to configure the counter collection services:

tool_init() setup#

Similar to tracing services, you must create a context and a buffer to collect the output when initializing the tool.

Note

The buffered_callback passed to rocprofiler_create_buffer is invoked with a vector of collected counter samples when the buffer is full. For details, see the Buffered callback section.

rocprofiler_context_id_t ctx{0};
rocprofiler_buffer_id_t buff;
ROCPROFILER_CALL(rocprofiler_create_context(&ctx), "context creation failed");
ROCPROFILER_CALL(rocprofiler_create_buffer(ctx,
                                            4096,
                                            2048,
                                            ROCPROFILER_BUFFER_POLICY_LOSSLESS,
                                            buffered_callback, // Callback to process data
                                            user_data,
                                            &buff),
                    "buffer creation failed");

After creating a context and a buffer to store results in tool_init, it is highly recommended, though not mandatory, to construct the profiles for each agent. A profile contains the counters to collect. Avoid creating profiles in the time-critical dispatch profiling callback, because profile creation validates whether the counters can be collected on the agent. After profile setup, set up the collection service for either dispatch or agent profiling (only one can be used at a time):

    /* For Dispatch Profiling */
    // Setup the dispatch profile counting service. This service will trigger the dispatch_callback
    // when a kernel dispatch is enqueued into the HSA queue. The callback will specify what
    // counters to collect by returning a profile config id.
    ROCPROFILER_CALL(rocprofiler_configure_buffered_dispatch_counting_service(
                         ctx, buff, dispatch_callback, nullptr),
                     "Could not setup buffered service");

    /* For Agent Profiling */
    // set_profile is a callback that is used to select the profile to use when
    // the context is started. It is called at every rocprofiler_start_context() call.
    ROCPROFILER_CALL(rocprofiler_configure_device_counting_service(
                         ctx, buff, agent_id, set_profile, nullptr),
                     "Could not setup buffered service");

Profile setup#

  1. The first step in constructing a counter collection profile is to find the GPU agents on the machine. You must create a profile for each set of counters to be collected on every agent on the machine. You can use rocprofiler_query_available_agents to find agents on the system. The following example collects all GPU agents on the device and stores them in the vector agents:

    std::vector<rocprofiler_agent_v0_t> agents;

    // Callback used by rocprofiler_query_available_agents to return
    // agents on the device. This can include CPU agents as well. We
    // select GPU agents only (i.e. type == ROCPROFILER_AGENT_TYPE_GPU)
    rocprofiler_query_available_agents_cb_t iterate_cb = [](rocprofiler_agent_version_t agents_ver,
                                                            const void**                agents_arr,
                                                            size_t                      num_agents,
                                                            void*                       udata) {
        if(agents_ver != ROCPROFILER_AGENT_INFO_VERSION_0)
            throw std::runtime_error{"unexpected rocprofiler agent version"};
        auto* agents_v = static_cast<std::vector<rocprofiler_agent_v0_t>*>(udata);
        for(size_t i = 0; i < num_agents; ++i)
        {
            const auto* agent = static_cast<const rocprofiler_agent_v0_t*>(agents_arr[i]);
            if(agent->type == ROCPROFILER_AGENT_TYPE_GPU) agents_v->emplace_back(*agent);
        }
        return ROCPROFILER_STATUS_SUCCESS;
    };

    // Query the agents. Only a single callback is made, and it receives an
    // array of all agents.
    ROCPROFILER_CALL(
        rocprofiler_query_available_agents(ROCPROFILER_AGENT_INFO_VERSION_0,
                                           iterate_cb,
                                           sizeof(rocprofiler_agent_t),
                                           const_cast<void*>(static_cast<const void*>(&agents))),
        "query available agents");

  1. To identify the counters supported by an agent, query the available counters with rocprofiler_iterate_agent_supported_counters. The following example queries a single agent and stores its available counters in gpu_counters:

    std::vector<rocprofiler_counter_id_t> gpu_counters;

    // Iterate all the counters on the agent and store them in gpu_counters.
    ROCPROFILER_CALL(rocprofiler_iterate_agent_supported_counters(
                         agent,
                         [](rocprofiler_agent_id_t,
                            rocprofiler_counter_id_t* counters,
                            size_t                    num_counters,
                            void*                     user_data) {
                             std::vector<rocprofiler_counter_id_t>* vec =
                                 static_cast<std::vector<rocprofiler_counter_id_t>*>(user_data);
                             for(size_t i = 0; i < num_counters; i++)
                             {
                                 vec->push_back(counters[i]);
                             }
                             return ROCPROFILER_STATUS_SUCCESS;
                         },
                         static_cast<void*>(&gpu_counters)),
                     "Could not fetch supported counters");

  1. rocprofiler_counter_id_t is a handle to a counter. To fetch information about the counter such as its name, use rocprofiler_query_counter_info:

    for(auto& counter : gpu_counters)
    {
        // Contains name and other attributes about the counter.
        // See API documentation for more info on the contents of this struct.
        rocprofiler_counter_info_v0_t info;
        ROCPROFILER_CALL(
            rocprofiler_query_counter_info(
                counter, ROCPROFILER_COUNTER_INFO_VERSION_0, static_cast<void*>(&info)),
            "Could not query info for counter");
    }

  1. After identifying the counters to be collected, construct a profile by passing a list of these counters to rocprofiler_create_profile_config.

    // Create and return the profile
    // Zero-initialize the handle before creating the profile.
    rocprofiler_profile_config_id_t profile = {.handle = 0};
    ROCPROFILER_CALL(rocprofiler_create_profile_config(
                         agent, counters_array, counters_array_count, &profile),
                     "Could not construct profile cfg");

  1. You can use the created profile for both dispatch and agent counter collection services.

Note

Points to note on profile behavior:

  • A profile is valid only for the agent it was created for.

  • Profiles are immutable. To collect a new counter set, construct a new profile.

  • A single profile can be used multiple times on the same agent.

  • Counter Ids supplied to rocprofiler_create_profile_config are agent-specific and can’t be used to construct profiles for other agents.

Dispatch profiling callback#

When a kernel is dispatched, a dispatch callback is issued to the tool to allow selection of counters to be collected for the dispatch by supplying a profile.

void
dispatch_callback(rocprofiler_dispatch_counting_service_data_t dispatch_data,
                  rocprofiler_profile_config_id_t*             config,
                  rocprofiler_user_data_t*                     user_data,
                  void*                                        /*callback_data_args*/)

dispatch_data contains information about the dispatch being launched such as its name. config is used by the tool to specify the profile, which allows counter collection for the dispatch. If no profile is supplied, no counters are collected for this dispatch. user_data contains user data supplied to rocprofiler_configure_buffered_dispatch_counting_service.
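
Because profile creation is best kept out of the dispatch callback, one possible arrangement is to build the profiles once during tool_init and have the callback perform only a lookup. The sketch below shows that lookup in isolation. The agent_id_t and profile_id_t structs are stand-ins for the SDK handle types so that the code compiles on its own, and profile_cache_t and select_profile are hypothetical names, not SDK APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Stand-ins for the SDK handle types (assumption: both wrap a 64-bit handle).
struct agent_id_t   { std::uint64_t handle; };
struct profile_id_t { std::uint64_t handle; };

// Hypothetical cache: agent handle -> profile built once during tool_init().
using profile_cache_t = std::unordered_map<std::uint64_t, profile_id_t>;

// Intended to be called from the dispatch callback. Writes *config and returns
// true when a profile exists for the agent. Otherwise *config is left
// untouched, which corresponds to collecting no counters for the dispatch.
inline bool select_profile(const profile_cache_t& cache, agent_id_t agent, profile_id_t* config)
{
    assert(config != nullptr);
    const auto it = cache.find(agent.handle);
    if(it == cache.end()) return false;
    *config = it->second;
    return true;
}
```

In this arrangement the map is filled before any kernel is dispatched and is only read afterwards, so the lookups need no locking under that assumption.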

Agent set profile callback#

This callback is invoked after the context starts and allows the tool to specify the profile to be used.

void
set_profile(rocprofiler_context_id_t                 context_id,
            rocprofiler_agent_id_t                   agent,
            rocprofiler_agent_set_profile_callback_t set_config,
            void*)

The profile to be used for this agent is specified by calling set_config(agent, profile).

Buffered callback#

Collected counter values are returned through a buffered callback. The buffered callback routines are similar for dispatch and agent profiling, except that some data such as kernel launch Ids are not available in agent profiling mode. Here is a sample iteration to print out counter collection data:

    for(size_t i = 0; i < num_headers; ++i)
    {
        auto* header = headers[i];
        if(header->category == ROCPROFILER_BUFFER_CATEGORY_COUNTERS &&
           header->kind == ROCPROFILER_COUNTER_RECORD_PROFILE_COUNTING_DISPATCH_HEADER)
        {
            // Print the dispatch header record.
            auto* record =
                static_cast<rocprofiler_dispatch_counting_service_record_t*>(header->payload);
            ss << "[Dispatch_Id: " << record->dispatch_info.dispatch_id
               << " Kernel_ID: " << record->dispatch_info.kernel_id
               << " Corr_Id: " << record->correlation_id.internal << "]\n";
        }
        else if(header->category == ROCPROFILER_BUFFER_CATEGORY_COUNTERS &&
                header->kind == ROCPROFILER_COUNTER_RECORD_VALUE)
        {
            // Print the returned counter data.
            auto* record = static_cast<rocprofiler_record_counter_t*>(header->payload);
            rocprofiler_counter_id_t counter_id = {.handle = 0};

            rocprofiler_query_record_counter_id(record->id, &counter_id);

            ss << "  (Dispatch_Id: " << record->dispatch_id << " Counter_Id: " << counter_id.handle
               << " Record_Id: " << record->id << " Dimensions: [";

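            // counter_dimensions() is a tool-defined helper (not shown here)
            // that returns the dimensions of the given counter.
            for(auto& dim : counter_dimensions(counter_id))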
            {
                size_t pos = 0;
                rocprofiler_query_record_dimension_position(record->id, dim.id, &pos);
                ss << "{" << dim.name << ": " << pos << "},";
            }
            ss << "] Value [D]: " << record->counter_value << "),";
        }
    }
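
A common next step after printing is to combine the per-dimension values of each counter into one total per dispatch. The following sketch shows one way to do that, assuming the records have already been decoded into a plain struct. The counter_sample struct, its field names, and sum_per_dispatch are illustrative stand-ins and not SDK types or functions.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Stand-in for a decoded counter value record (field names are illustrative).
struct counter_sample
{
    std::uint64_t dispatch_id;  // dispatch that produced the value
    std::uint64_t counter_id;   // counter handle decoded from the record Id
    double        value;        // value reported for one dimension instance
};

// Key: (dispatch Id, counter Id). Value: sum over all dimension instances.
using totals_t = std::map<std::pair<std::uint64_t, std::uint64_t>, double>;

// Add up every sample that shares the same dispatch and counter.
inline totals_t sum_per_dispatch(const std::vector<counter_sample>& samples)
{
    totals_t totals;
    for(const auto& sample : samples)
        totals[std::make_pair(sample.dispatch_id, sample.counter_id)] += sample.value;
    return totals;
}
```

Summation is only one choice. Counters for which an average or maximum is more meaningful would need a different combining step.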

Counter definitions#

Counters are defined in YAML format in the counter_defs.yaml file. A counter definition has the following format:

counter_name:       # Counter name
  architectures:
    gfx90a:         # Architecture name
      block:        # Block information (SQ/etc)
      event:        # Event ID (used by AQLProfile to identify counter register)
      expression:   # Formula for the counter (if derived counter)
      description:  # Per-arch description (optional)
    gfx1010:
       ...
  description:      # Description of the counter

You can define the counters separately for different architectures, as shown in the preceding example for gfx90a and gfx1010. If two or more architectures share the same block, event, or expression definition, they can be specified together using the “/” delimiter (“gfx90a/gfx1010:”). Hardware metrics have the elements block, event, and description defined. Derived metrics have the element expression defined and can’t have block or event defined.
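
The two rules above (shared architecture keys and the hardware/derived distinction) can be sketched in code as follows. This is a minimal illustration that assumes the YAML has already been loaded into strings. The counter_entry struct and the split_architectures and classify helpers are hypothetical and do not reflect how the SDK parses counter_defs.yaml.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Split a shared architecture key such as "gfx90a/gfx1010" into its parts.
inline std::vector<std::string> split_architectures(const std::string& key)
{
    std::vector<std::string> archs;
    std::istringstream       stream{key};
    std::string              arch;
    while(std::getline(stream, arch, '/'))
    {
        if(!arch.empty()) archs.push_back(arch);
    }
    return archs;
}

// Illustrative in-memory form of one per-architecture entry. An empty string
// means the element is not defined.
struct counter_entry
{
    std::string block;
    std::string event;
    std::string expression;
};

enum class counter_kind { hardware, derived, invalid };

// Hardware metrics define block and event. Derived metrics define expression
// and must not define block or event. Anything else is rejected.
inline counter_kind classify(const counter_entry& entry)
{
    const bool has_block = !entry.block.empty();
    const bool has_event = !entry.event.empty();
    const bool has_expr  = !entry.expression.empty();
    if(has_expr && !has_block && !has_event) return counter_kind::derived;
    if(has_block && has_event && !has_expr) return counter_kind::hardware;
    return counter_kind::invalid;
}
```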

Derived metrics#

Derived metrics are expressions that perform computation on collected hardware metrics. These expressions produce results similar to a real hardware counter.

GPU_UTIL:
  architectures:
    gfx942/gfx941/gfx10/gfx1010/gfx1030/gfx1031/gfx11/gfx1032/gfx1102/gfx906/gfx1100/gfx1101/gfx940/gfx908/gfx90a/gfx9:
      expression: 100*GRBM_GUI_ACTIVE/GRBM_COUNT
  description: Percentage of the time that GUI is active

In the preceding example, GPU_UTIL is a derived metric that uses a mathematical expression to calculate the utilization rate of the GPU from the values of two GRBM hardware counters, GRBM_GUI_ACTIVE and GRBM_COUNT. Expressions support the standard set of math operators (/,*,-,+) along with a set of special functions such as reduce and accumulate.
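
As a worked illustration of the GPU_UTIL expression, the following sketch evaluates 100*GRBM_GUI_ACTIVE/GRBM_COUNT for two already collected values. The function name gpu_util and the choice to return 0 when GRBM_COUNT is 0 are assumptions of this sketch, not documented SDK behavior.

```cpp
#include <cassert>
#include <cstdint>

// Evaluates 100*GRBM_GUI_ACTIVE/GRBM_COUNT as a percentage.
// Returns 0 when GRBM_COUNT is 0 to avoid dividing by zero (sketch-only choice).
inline double gpu_util(std::uint64_t grbm_gui_active, std::uint64_t grbm_count)
{
    if(grbm_count == 0) return 0.0;
    return 100.0 * static_cast<double>(grbm_gui_active) / static_cast<double>(grbm_count);
}
```

For example, with made-up values of 750 for GRBM_GUI_ACTIVE and 1000 for GRBM_COUNT, the expression yields a utilization of 75 percent.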

Reduce function#

Expression: 100*reduce(GL2C_HIT,sum)/(reduce(GL2C_HIT,sum)+reduce(GL2C_MISS,sum))

The reduce function reduces counter values across all dimensions such as shader engine, SIMD, and so on, to produce a single output value. This helps to collect and compare values across the entire device. Here are the common reduction operations:

  • sum: Sums to create a single output. For example, reduce(GL2C_HIT,sum) sums all GL2C_HIT hardware register values.

  • avr: Calculates the average across all dimensions.

  • min: Selects the minimum value across all dimensions.

  • max: Selects the maximum value across all dimensions.
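
The four reduction operations can be pictured as ordinary folds over the per-dimension values of one counter. The sketch below models them on a plain vector and then evaluates the hit-rate expression shown at the start of this section. The reduce_op enum and the reduce and l2_hit_rate functions are illustrative names, and returning 0 for an empty denominator is a choice made for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

enum class reduce_op { sum, avr, min, max };

// Collapse the per-dimension values of one counter into a single value.
inline double reduce(const std::vector<double>& values, reduce_op op)
{
    assert(!values.empty());
    switch(op)
    {
        case reduce_op::sum: return std::accumulate(values.begin(), values.end(), 0.0);
        case reduce_op::avr:
            return std::accumulate(values.begin(), values.end(), 0.0) /
                   static_cast<double>(values.size());
        case reduce_op::min: return *std::min_element(values.begin(), values.end());
        case reduce_op::max: return *std::max_element(values.begin(), values.end());
    }
    return 0.0;  // not reached for valid reduce_op values
}

// 100*reduce(GL2C_HIT,sum)/(reduce(GL2C_HIT,sum)+reduce(GL2C_MISS,sum))
inline double l2_hit_rate(const std::vector<double>& hits, const std::vector<double>& misses)
{
    const double hit  = reduce(hits, reduce_op::sum);
    const double miss = reduce(misses, reduce_op::sum);
    return (hit + miss) == 0.0 ? 0.0 : 100.0 * hit / (hit + miss);
}
```

With made-up per-instance values of {30, 50} hits and {10, 10} misses, the sums are 80 and 20, so the expression evaluates to a hit rate of 80.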

Accumulate function#

Expression: accumulate(<basic_level_counter>, <resolution>)

The accumulate function sums the values of a basic level counter over the specified number of cycles. The resolution parameter controls the frequency of the summing operation:

    • HIGH_RES: Sums up the basic level counter every clock cycle. Captures the value every cycle for higher accuracy, which helps in fine-grained analysis.

    • LOW_RES: Sums up the basic level counter every four clock cycles. Reduces the data points and provides less detailed summing, which helps in reducing data volume.

    • NONE: Does nothing and is equivalent to collecting basic level counter. Outputs the value of the basic level counter without performing any summing operation.
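
The difference between HIGH_RES and LOW_RES can be pictured with a small software model. The sketch below sums a made-up per-cycle trace of a level counter either at every cycle or at every fourth cycle. It is only a mental model: the real summing happens in hardware, the names resolution and accumulate_level are hypothetical, and the choice of which cycle in each group of four is sampled is an assumption. NONE is omitted because it performs no summing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class resolution { high_res, low_res };

// Sum a per-cycle trace of a level counter. high_res adds the value at every
// cycle. low_res adds the value at every fourth cycle, taking the first cycle
// of each group of four (an assumption of this sketch).
inline std::uint64_t accumulate_level(const std::vector<std::uint64_t>& per_cycle,
                                      resolution                        res)
{
    const std::size_t stride = (res == resolution::high_res) ? 1 : 4;
    std::uint64_t     total  = 0;
    for(std::size_t cycle = 0; cycle < per_cycle.size(); cycle += stride)
    {
        total += per_cycle[cycle];
    }
    return total;
}
```

For the trace {1, 2, 3, 4, 5, 6, 7, 8}, the high-resolution sum covers all eight cycles, while the low-resolution sum covers only two of them, which shows the trade-off between accuracy and data volume.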

Example:

MeanOccupancyPerCU:
  architectures:
    gfx942/gfx941/gfx940:
      expression: accumulate(SQ_LEVEL_WAVES,HIGH_RES)/reduce(GRBM_GUI_ACTIVE,max)/CU_NUM
  description: Mean occupancy per compute unit.

In the preceding example, the MeanOccupancyPerCU metric calculates the mean occupancy per compute unit. It uses the accumulate function with HIGH_RES to sum the SQ_LEVEL_WAVES counter every clock cycle. This sum is then divided by the maximum value of GRBM_GUI_ACTIVE and by the number of compute units, CU_NUM, to derive the mean occupancy.
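
The arithmetic of the MeanOccupancyPerCU expression can be checked with a short sketch. The function below takes an already accumulated SQ_LEVEL_WAVES value, the per-instance GRBM_GUI_ACTIVE values, and CU_NUM. The function name, the parameter names, and the zero result for an idle device are assumptions of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// accumulate(SQ_LEVEL_WAVES,HIGH_RES)/reduce(GRBM_GUI_ACTIVE,max)/CU_NUM
inline double mean_occupancy_per_cu(double                     accumulated_waves,
                                    const std::vector<double>& grbm_gui_active,
                                    double                     cu_num)
{
    assert(!grbm_gui_active.empty() && cu_num > 0.0);
    // reduce(GRBM_GUI_ACTIVE,max): largest active-cycle count across instances.
    const double active_max = *std::max_element(grbm_gui_active.begin(), grbm_gui_active.end());
    if(active_max == 0.0) return 0.0;  // idle device (sketch-only choice)
    return accumulated_waves / active_max / cu_num;
}
```

With made-up inputs of 960000 accumulated waves, GRBM_GUI_ACTIVE values of {900, 1000}, and 120 compute units, the result is 960000 / 1000 / 120 = 8 waves per compute unit on average.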