Comparing ROCprofiler-SDK to other ROCm profiling tools

ROCprofiler-SDK is an improved version of the ROCm profiling tools. It enables more efficient implementations and better thread safety while avoiding problems that affected the earlier ROCProfiler and ROCTracer implementations. The following ROCprofiler-SDK features highlight the improvements over ROCProfiler and ROCTracer:

  • Improved tool initialization

  • Support for simultaneous use of the same services by multiple tools

  • Simplified control of one or more data collection services

  • Improved error checking and logging

  • Backward ABI compatibility

  • PC sampling (beta implementation)

The former implementations allow a tool to access any of the services provided by ROCProfiler or ROCTracer, such as API tracing and kernel tracing, by calling roctracer_init() when a ROCm runtime is initially loaded. Because the calling tool is not required to specify during initialization which services it needs, the libraries must effectively be prepared for any service to be requested at any time. This behavior introduces unnecessary overhead and makes thread-safe data management difficult, because tools generally don’t use all the available services. For example, ROCTracer always installs wrappers around every runtime API and adds indirection overhead through the ROCTracer library to check the current service configuration in a thread-safe manner.

ROCprofiler-SDK introduces contexts to solve the preceding issues. A context is effectively a bundle of service configurations. ROCprofiler-SDK gives a tool a single opportunity to create as many contexts as it requires. A tool can group all services into one context, create one context per service, or choose a mix. With this design, ROCprofiler-SDK knows up front which services a tool might request at any given time, which allows it to:

  • Avoid unnecessary preparation for services that are never used. If no registered contexts request HSA API tracing, no wrappers need to be generated.

  • Perform more extensive checks during service specification and inform a tool about potential issues early.

  • Allow multiple tools to use certain services simultaneously.

  • Improve thread safety without introducing parallel bottlenecks.

  • Manage internal data and allocations more efficiently.
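The idea of contexts as bundles of service configurations can be illustrated with a small conceptual model. This is only a sketch in Python, not the ROCprofiler-SDK API: the `Context` and `Registry` classes, the `create_context` and `services_to_prepare` methods, and the service names are all hypothetical and exist purely to show why knowing the requested services up front lets a library skip setup for unused services and reject bad requests early.

```python
from dataclasses import dataclass, field

# Hypothetical service names, used only for this illustration.
ALL_SERVICES = {
    "hsa_api_tracing",
    "hip_api_tracing",
    "kernel_tracing",
    "memory_copy_tracing",
}


@dataclass(frozen=True)
class Context:
    """A bundle of service configurations requested by one tool."""
    name: str
    services: frozenset


@dataclass
class Registry:
    """Toy stand-in for the library: collects contexts, then decides what to prepare."""
    contexts: list = field(default_factory=list)

    def create_context(self, name, services):
        unknown = set(services) - ALL_SERVICES
        if unknown:
            # Problems are reported when the service is specified, not later.
            raise ValueError(f"unknown services: {sorted(unknown)}")
        ctx = Context(name, frozenset(services))
        self.contexts.append(ctx)
        return ctx

    def services_to_prepare(self):
        """Only services requested by at least one context need any setup."""
        requested = set()
        for ctx in self.contexts:
            requested |= ctx.services
        return requested


registry = Registry()
# Two tools register contexts; both ask for kernel tracing.
registry.create_context("tool_a", {"kernel_tracing"})
registry.create_context("tool_b", {"kernel_tracing", "memory_copy_tracing"})

prepared = registry.services_to_prepare()
print(sorted(prepared))
```

In this model no context requests `hsa_api_tracing`, so it never appears in the set of services to prepare, which mirrors the point that no HSA API wrappers need to be generated when nothing asks for them.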

Comparing command-line tool options: ROCprofiler (rocprof, rocprofv2) and ROCprofiler-SDK (rocprofv3)

ROCprofiler-SDK introduces a new command-line tool, rocprofv3, which is a more efficient and flexible version of the ROCprofiler tool.

Table 11 Comparison of ROCprofiler command-line tool options

Category

Feature

rocprof

rocprofv2

rocprofv3

Improvements

Notes

Basic tracing options

HIP Trace

--hip-trace

--hip-api, --hip-trace

--hip-trace

No change

In rocprof and rocprofv2, --hip-trace also includes kernel dispatches and memory copy activities; in rocprofv3 it does not.

Basic tracing options

HSA Trace

--hsa-trace

--hsa-trace

--hsa-trace

No change

In rocprof and rocprofv2, --hsa-trace also includes kernel dispatches and memory copy activities; in rocprofv3 it does not.

Basic tracing options

Scratch Memory Trace

Not Available

Not Available

--scratch-memory-trace

New option to trace scratch memory operations

Basic tracing options

Marker Trace (ROCTx)

--roctx-trace

--roctx-trace

--marker-trace

Improved ROCTx library with more features

Basic tracing options

Memory Copy Trace

Part of HIP and HSA Traces

Part of HIP and HSA Traces

--memory-copy-trace

Provides granularity for memory move operations

Basic tracing options

Kernel Trace

--kernel-trace

--kernel-trace

--kernel-trace

Performance improvement.

Granular tracing options

HIP runtime trace

Part of --hip-trace option

Part of --hip-trace option

--hip-runtime-trace

Collects HIP runtime API traces, that is, public HIP API functions starting with ‘hip’ (e.g. hipSetDevice).

Granular tracing options

HIP compiler trace

Not Available

Not Available

--hip-compiler-trace

Collects traces of HIP compiler-generated code, that is, HIP API functions starting with ‘__hip’ (e.g. __hipRegisterFatBinary).

Granular tracing options

HSA core API trace

Part of --hsa-trace option

Part of --hsa-trace option

--hsa-core-trace

New option for collecting only HSA core API traces, that is, HSA functions prefixed with only hsa_ (e.g. hsa_init).

Granular tracing options

HSA AMD trace

Part of --hsa-trace option

Part of --hsa-trace option

--hsa-amd-trace

Collects HSA AMD-extension API traces, that is, HSA functions prefixed with hsa_amd_ (e.g. hsa_amd_coherency_get_type).

Granular tracing options

HSA Image Extension trace

Part of --hsa-trace option

Part of --hsa-trace option

--hsa-image-trace

New option for collecting HSA image-extension API traces, that is, HSA functions prefixed with only hsa_ext_image_ (e.g. hsa_ext_image_get_capability).

Granular tracing options

HSA Finalizer trace

Part of --hsa-trace option

Part of --hsa-trace option

--hsa-finalizer-trace

New option for collecting HSA finalizer-extension API traces, that is, HSA functions prefixed with only hsa_ext_program_ (e.g. hsa_ext_program_create).

Aggregate tracing options

Sys Trace

--sys-trace [hip-trace|hsa-trace|roctx-trace|kernel-trace]

--sys-trace [hip-trace|hsa-trace|roctx-trace|kernel-trace]

-s, --sys-trace [hip-trace|hsa-trace|scratch-trace|memory-copy-trace|roctx-trace|kernel-trace]

Extends the sys trace options with more features

Aggregate tracing options

Runtime Trace

Not available

Not available

-r, --runtime-trace [hip-runtime-trace|scratch-trace|memory-copy-trace|roctx-trace|kernel-trace]

New option to aggregate trace operations

Kernel naming options

Kernel Name Mangling

Not Available

Not Available

-M, --mangled-kernels

New option for mangled kernel names

Kernel naming options

Kernel Name Truncation

--basenames <on|off>

--basenames

-T, --truncate-kernels

New option for truncating the demangled kernel names

Kernel naming options

Kernel Rename

--roctx-rename

Not available

--kernel-rename

New option to use region names defined by roctxRangePush/roctxRangePop regions to rename the kernels

Post-processing tracing options

Statistics

--stats

Not Available

--stats

Statistics for the collected traces

Post-processing tracing options

Summary

Not available

Not available

-S, --summary

New option to output a single summary of tracing data after the profiling session

In rocprof, the post-processing step generated summary, stats, JSON, and database files with much less information.

Post-processing tracing options

Summary Per Domain

Not available

Not available

-D, --summary-per-domain

New option to output summary for each tracing domain after the profiling session

The rocprof --stats option covered fewer domains in its summary reports than rocprofv3 does.

Post-processing tracing options

Summary Groups

Not available

Not available

--summary-groups REGULAR_EXPRESSION

New option to output a summary for each set of domains matching the regular expression, e.g. ‘KERNEL_DISPATCH|MEMORY_COPY’ will generate a summary from all the tracing data in the KERNEL_DISPATCH and MEMORY_COPY domains

Summary options

Summary Output File

Not available

Not available

--summary-output-file SUMMARY_OUTPUT_FILE

New option to output summary to a file, stdout, or stderr (default: stderr)

Summary options

Summary Units

Not available

Not available

-u, --summary-units

New option to output summary in desired time units {sec,msec,usec,nsec}

Display options

List Metrics

--list-basic, --list-derived

--list-counters

-L, --list-metrics

Valid YAML is now supported for this option

Perfetto-specific options

Perfetto data collection backend

Not available

Not available

--perfetto-backend {inprocess,system}

New option for selecting the Perfetto data collection backend. The ‘system’ mode requires starting the traced and perfetto daemons.

rocprofv2 used only in-process collection for the Perfetto plugin; rocprofv3 lets the user choose.

Perfetto-specific options

Perfetto Buffer Size

Not available

Set the environment variable rocprofiler_PERFETTO_MAX_BUFFER_SIZE_KIB to the desired buffer size

--perfetto-buffer-size {KB}

New option to define the size of the buffer for Perfetto output in KB (default: 1 GB)

Perfetto-specific options

Perfetto Buffer fill Policy

Not available

Not available

--perfetto-buffer-fill-policy {discard,ring_buffer}

New option for handling new records when Perfetto has reached the buffer limit

rocprofv2 always used the TraceConfig_BufferConfig_FillPolicy_RING_BUFFER fill policy.

Perfetto-specific options

Perfetto shared memory size

Not available

Not available

--perfetto-shmem-size-hint KB

New option to define the Perfetto shared memory size hint in KB (default: 64 KB)

Filtering options

Kernel Filtration options for Counter Collection

Supported in input.xml file (supports range, gpu and kernel filtration)

kernel: <kernel_name> (can only be provided in input.txt file)

--kernel-include-regex, --kernel-exclude-regex, --kernel-iteration-range

Extensive control over output options using regular expressions

I/O options

Output Directory

-d <data directory>

-d | --output-directory

-d OUTPUT_DIRECTORY, --output-directory OUTPUT_DIRECTORY

rocprofv3 supports special keys for runtime values, e.g. %pid% gets replaced by the process ID

I/O options

Output File

-o <output file>

-o | --output-file-name

-o OUTPUT_FILE, --output-file OUTPUT_FILE

rocprofv3 supports special keys for runtime values, e.g. %pid% gets replaced by the process ID

I/O options

Logging

Minimal logging via environment variable

Minimal logging via environment variable

--log-level {fatal,error,warning,info,trace,env}

Extensive logging options

I/O options

Plugins

Not Available

Plugin support for different output formats

Replaced by --output-format option

Not needed as rocprofv3 supports multiple output formats

I/O options

Output Formats

CSV, JSON (Chrome-Tracing format)

CSV, JSON (Chrome-Tracing format), Perfetto, CTF

CSV, JSON (custom schema), Perfetto, OTF2

1. Multiple output formats can be produced in a single run.
2. OTF2 can visualize larger trace files than Perfetto.

The Perfetto UI does not accept the JSON output format produced by rocprofv3. Perfetto is dropping support for the JSON Chrome tracing format in favor of the binary Perfetto protobuf format (.pftrace extension), which is supported by rocprofv3.

I/O options

Counter Collection

Supports input text and XML format

Only supports input text format

Input support for text, YAML and JSON formats

1. A plain-text file cannot be checked for validity, so rocprofv3 also supports strongly typed input formats.
2. YAML and JSON formats are more readable and easier to maintain.
3. These formats leave room to add more features to the tool input.

I/O options

Providing Custom metrics file

-m <metric file>

-m <metric file>

Not available

Not yet in rocprofv3

Advanced options

Preload

Not Available

Not Available

--preload

Libraries to prepend to LD_PRELOAD (usually for sanitizers)

Trace Control options

Trace Period

--trace-period

-tp | --trace-period

Not available

Not yet in rocprofv3

Trace Control options

Trace start

--trace-start <on|off>

Not available

Not available

Not yet in rocprofv3

Trace Control options

Flush Interval

--flush-rate

--flush-interval

Not available

Not applicable for rocprofv3

Trace Control options

Merge Traces

--merge-traces

Not available

Not available

Not yet in rocprofv3

Legacy options

Timestamp On/Off

--timestamp <on|off>

Not available

Not available

Not applicable for rocprofv3

Legacy options

Context wait

--ctx-wait

Not available

Not available

Not applicable for rocprofv3

Legacy options

Context Limit

--ctx-limit <max number>

Not available

Not available

Not applicable for rocprofv3

Legacy options

Code Object Tracking

--obj-tracking <on|off>

Always ON in rocprofv2

Always ON in rocprofv3

Legacy options

Heartbeat

--heartbeat <rate sec>

Not available

Not available

Not applicable for rocprofv3
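When migrating an existing rocprof or rocprofv2 invocation, the renames and removals listed in the table can be applied mechanically. The following is a minimal sketch that assumes the table above is accurate and complete for the flags it handles. The `migrate_flags` helper and its lookup tables are hypothetical, it ignores flag values such as `<on|off>`, and it is not part of any ROCm tool.

```python
# Flag names come from the comparison table; the helper itself is hypothetical.
RENAMED = {
    "--roctx-trace": "--marker-trace",
    "--basenames": "--truncate-kernels",
    "--roctx-rename": "--kernel-rename",
    "--list-basic": "--list-metrics",
    "--list-derived": "--list-metrics",
    "--list-counters": "--list-metrics",
}
UNCHANGED = {"--hip-trace", "--hsa-trace", "--kernel-trace", "--stats"}
NOT_IN_V3 = {
    "--trace-period", "--trace-start", "--flush-rate", "--flush-interval",
    "--merge-traces", "--timestamp", "--ctx-wait", "--ctx-limit", "--heartbeat",
}


def migrate_flags(old_flags):
    """Translate rocprof/rocprofv2 flag names to rocprofv3 equivalents.

    Returns (new_flags, unsupported). 'unsupported' holds flags the table
    marks as not available in rocprofv3, plus any flag this sketch does not know.
    """
    new_flags, unsupported = [], []
    for flag in old_flags:
        if flag in NOT_IN_V3:
            unsupported.append(flag)
            continue
        translated = RENAMED.get(flag, flag if flag in UNCHANGED else None)
        if translated is None:
            unsupported.append(flag)
            continue
        candidates = [translated]
        if flag in ("--hip-trace", "--hsa-trace"):
            # In rocprofv3 these no longer include kernel dispatches or memory
            # copies, so request them explicitly to keep the old coverage.
            candidates += ["--kernel-trace", "--memory-copy-trace"]
        for candidate in candidates:
            if candidate not in new_flags:
                new_flags.append(candidate)
    return new_flags, unsupported


new_flags, unsupported = migrate_flags(["--hip-trace", "--roctx-trace", "--merge-traces"])
print("rocprofv3 flags:", " ".join(new_flags))
print("no rocprofv3 equivalent:", " ".join(unsupported))
```

Adding `--kernel-trace` and `--memory-copy-trace` alongside `--hip-trace` or `--hsa-trace` is a design choice of this sketch; drop them if the narrower rocprofv3 behavior is what you want.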

Timing Difference Between rocprofv3 and rocprofv1/v2

rocprofv3 improves the accuracy of timing information by reducing the tool overhead required to collect data and by reducing interference with the timing of the kernel being measured. The result is lower variance in the times reported for the same kernel and more accurate timing in general. These changes have not been, and will not be, backported to rocprofv1/v2, so there can be substantial differences (20%) in the execution time reported by v1/v2 and v3 for a single kernel execution. Over a large number of samples of the same kernel, the difference in average execution time is in the low single-digit percentage range, with a much tighter variance of results on rocprofv3. The test suite includes tests that verify the timing information output by rocprofv3 to ensure that the reported values are accurate.
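The distinction between a large single-sample difference and a small difference in averages can be shown with a short calculation. The durations below are made-up numbers chosen only to demonstrate the arithmetic; they are not measurements from rocprof, rocprofv2, or rocprofv3, and the variable names are hypothetical.

```python
from statistics import fmean, pstdev

# Made-up kernel durations in microseconds, chosen only to illustrate the
# arithmetic. They are NOT measurements from any profiler.
v1v2_us = [100.0, 120.0, 85.0, 110.0, 95.0, 102.0]
v3_us = [100.0, 101.0, 99.0, 100.5, 99.5, 100.0]


def percent_diff(a, b):
    """Absolute difference of a relative to b, in percent."""
    return abs(a - b) / b * 100.0


# Compare the two tools sample by sample, then compare their averages.
largest_single = max(percent_diff(a, b) for a, b in zip(v1v2_us, v3_us))
mean_diff = percent_diff(fmean(v1v2_us), fmean(v3_us))

print(f"largest single-sample difference: {largest_single:.1f}%")
print(f"difference of the averages: {mean_diff:.1f}%")
print(f"spread (population std dev): v1/v2={pstdev(v1v2_us):.2f} us, v3={pstdev(v3_us):.2f} us")
```

With these illustrative inputs, individual samples disagree far more than the averages do, and the second series has the smaller spread, which is the pattern the paragraph above describes.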