MI300 and MI200 series performance counters and metrics#
2024-09-11
52 min read time
This document lists and describes the hardware performance counters and derived metrics available for the AMD Instinct™ MI300 and MI200 GPU. You can also access this information using the ROCProfiler tool.
MI300 and MI200 series performance counters#
Series performance counters include the following categories:
The following sections provide additional details for each category.
Note
Preliminary validation of all MI300 and MI200 series performance counters is in progress. Those with an asterisk (*) require further evaluation.
Command processor counters#
Command processor counters are further classified into command processor-fetcher and command processor-compute.
Command processor-fetcher counters#
Hardware counter |
Unit |
Definition |
---|---|---|
|
Cycles |
Number of cycles one of the compute unified translation caches (L1) is stalled waiting on translation |
|
Cycles |
Number of cycles command processor-fetcher is busy |
|
Cycles |
Number of cycles command processor-fetcher is idle |
|
Cycles |
Number of cycles command processor-fetcher is stalled |
|
Cycles |
Number of cycles command processor-fetcher texture cache interface unit interface is busy |
|
Cycles |
Number of cycles command processor-fetcher texture cache interface unit interface is idle |
|
Cycles |
Number of cycles command processor-fetcher texture cache interface unit interface is stalled waiting on free tags |
The texture cache interface unit is the interface between the command processor and the memory system.
Command processor-compute counters#
Hardware counter |
Unit |
Definition |
---|---|---|
|
Cycles |
Number of cycles command processor-compute micro engine is busy decoding packets |
|
Cycles |
Number of cycles one of the unified translation caches (L1) is stalled waiting on translation |
|
Cycles |
Number of cycles command processor-compute is busy |
|
Cycles |
Number of cycles command processor-compute is idle |
|
Cycles |
Number of cycles command processor-compute is stalled |
|
Cycles |
Number of cycles command processor-compute texture cache interface unit interface is busy |
|
Cycles |
Number of cycles command processor-compute texture cache interface unit interface is idle |
|
Cycles |
Number of cycles command processor-compute unified translation cache (L2) interface is busy |
|
Cycles |
Number of cycles command processor-compute unified translation cache (L2) interface is idle |
|
Cycles |
Number of cycles command processor-compute unified translation cache (L2) interface is stalled |
|
Cycles |
Number of cycles command processor-compute micro engine processor is busy |
The micro engine runs packet-processing firmware on the command processor-compute counter.
Graphics register bus manager counters#
Hardware counter |
Unit |
Definition |
---|---|---|
|
Cycles |
Number of free-running GPU cycles |
|
Cycles |
Number of GPU active cycles |
|
Cycles |
Number of cycles any of the command processor blocks are busy |
|
Cycles |
Number of cycles any of the shader processor input is busy in the shader engines |
|
Cycles |
Number of cycles any of the texture addressing unit is busy in the shader engines |
|
Cycles |
Number of cycles any of the texture cache blocks are busy |
|
Cycles |
Number of cycles the command processor-compute is busy |
|
Cycles |
Number of cycles the command processor-fetcher is busy |
|
Cycles |
Number of cycles the unified translation cache (Level 2 [L2]) block is busy |
|
Cycles |
Number of cycles the efficiency arbiter block is busy |
Texture cache blocks include:
Texture cache arbiter
Texture cache per pipe, also known as vector Level 1 (L1) cache
Texture cache per channel, also known as known as L2 cache
Texture cache interface
Shader processor input counters#
Hardware counter |
Unit |
Definition |
---|---|---|
|
Cycles |
Number of cycles with outstanding waves |
|
Cycles |
Number of cycles enabled by |
|
Workgroups |
Number of dispatched workgroups |
|
Wavefronts |
Number of dispatched wavefronts |
|
Cycles |
Number of arbiter cycles with requests but no allocation |
|
Cycles |
Number of arbiter cycles with compute shader (nth pipe) requests but no compute shader (nth pipe) allocation |
|
Cycles |
Number of arbiter stall cycles due to shortage of compute shader (nth pipe) pipeline slots |
|
Cycles |
Number of stall cycles due to shortage of temp space |
|
SIMD-cycles |
Accumulated number of single instruction, multiple data (SIMD) per cycle affected by shortage of wave slots for compute shader (nth pipe) wave dispatch |
|
SIMD-cycles |
Accumulated number of SIMDs per cycle affected by shortage of vector general-purpose register (VGPR) slots for compute shader (nth pipe) wave dispatch |
|
SIMD-cycles |
Accumulated number of SIMDs per cycle affected by shortage of scalar general-purpose register (SGPR) slots for compute shader (nth pipe) wave dispatch |
|
CU |
Number of compute units affected by shortage of local data share (LDS) space for compute shader (nth pipe) wave dispatch |
|
CU |
Number of compute units with compute shader (nth pipe) waves waiting at a BARRIER |
|
CU |
Number of compute units with compute shader (nth pipe) waves waiting for BULKY resource |
|
Cycles |
Number of compute shader (nth pipe) wave stall cycles due to restriction of |
|
Cycles |
Number of cycles compute shader (nth pipe) is stalled due to |
|
Qcycles |
Number of quad-cycles taken to initialize VGPRs when launching waves |
|
Qcycles |
Number of quad-cycles taken to initialize SGPRs when launching waves |
Compute unit counters#
The compute unit counters are further classified into instruction mix, matrix fused multiply-add (FMA) operation counters, level counters, wavefront counters, wavefront cycle counters, and LDS counters.
Instruction mix#
Hardware counter |
Unit |
Definition |
---|---|---|
|
Instr |
Number of instructions issued |
|
Instr |
Number of vector arithmetic logic unit (VALU) instructions including matrix FMA issued |
|
Instr |
Number of VALU half-precision floating-point (F16) |
|
Instr |
Number of VALU F16 Multiply instructions issued |
|
Instr |
Number of VALU F16 FMA or multiply-add instructions issued |
|
Instr |
Number of VALU F16 Transcendental instructions issued |
|
Instr |
Number of VALU full-precision floating-point (F32) |
|
Instr |
Number of VALU F32 Multiply instructions issued |
|
Instr |
Number of VALU F32 FMAor multiply-add instructions issued |
|
Instr |
Number of VALU F32 Transcendental instructions issued |
|
Instr |
Number of VALU F64 |
|
Instr |
Number of VALU F64 Multiply instructions issued |
|
Instr |
Number of VALU F64 FMA or multiply-add instructions issued |
|
Instr |
Number of VALU F64 Transcendental instructions issued |
|
Instr |
Number of VALU 32-bit integer instructions (signed or unsigned) issued |
|
Instr |
Number of VALU 64-bit integer instructions (signed or unsigned) issued |
|
Instr |
Number of VALU Conversion instructions issued |
|
Instr |
Number of 8-bit Integer matrix FMA instructions issued |
|
Instr |
Number of F16 matrix FMA instructions issued |
|
Instr |
Number of F32 matrix FMA instructions issued |
|
Instr |
Number of F64 matrix FMA instructions issued |
|
Instr |
Number of matrix FMA instructions issued |
|
Instr |
Number of vector memory write instructions (including flat) issued |
|
Instr |
Number of vector memory read instructions (including flat) issued |
|
Instr |
Number of vector memory instructions issued, including both flat and buffer instructions |
|
Instr |
Number of scalar arithmetic logic unit (SALU) instructions issued |
|
Instr |
Number of scalar memory instructions issued |
|
Instr |
Number of scalar memory instructions normalized to match |
|
Instr |
Number of flat instructions issued |
|
Instr |
MI200 series only Number of FLAT instructions that read/write only from/to LDS issued. Works only if |
|
Instr |
Number of LDS instructions issued (MI200: includes flat; MI300: does not include flat) |
|
Instr |
Number of global data share instructions issued |
|
Instr |
Number of EXP and global data share instructions excluding skipped export instructions issued |
|
Instr |
Number of branch instructions issued |
|
Instr |
Number of |
|
Instr |
Number of vector instructions skipped |
Flat instructions allow read, write, and atomic access to a generic memory address pointer that can resolve to any of the following physical memories:
Global Memory
Scratch (“private”)
LDS (“shared”)
Invalid -
MEM_VIOL
TrapStatus
Matrix fused multiply-add operation counters#
Hardware counter |
Unit |
Definition |
---|---|---|
|
IOP |
Number of 8-bit integer matrix FMA ops in the unit of 512 |
|
FLOP |
Number of F16 floating matrix FMA ops in the unit of 512 |
|
FLOP |
Number of BF16 floating matrix FMA ops in the unit of 512 |
|
FLOP |
Number of F32 floating matrix FMA ops in the unit of 512 |
|
FLOP |
Number of F64 floating matrix FMA ops in the unit of 512 |
Level counters#
Note
All level counters must be followed by SQ_ACCUM_PREV_HIRES
counter to measure average latency.
Hardware counter |
Unit |
Definition |
---|---|---|
|
Count |
Accumulated counter sample value where accumulation takes place once every four cycles |
|
Count |
Accumulated counter sample value where accumulation takes place once every cycle |
|
Waves |
Number of inflight waves |
|
Instr |
Number of inflight vector memory (including flat) instructions |
|
Instr |
Number of inflight scalar memory instructions |
|
Instr |
Number of inflight LDS (including flat) instructions |
|
Instr |
Number of inflight instruction fetch requests from the cache |
Use the following formulae to calculate latencies:
Vector memory latency =
SQ_ACCUM_PREV_HIRES
divided bySQ_INSTS_VMEM
Wave latency =
SQ_ACCUM_PREV_HIRES
divided bySQ_WAVE
LDS latency =
SQ_ACCUM_PREV_HIRES
divided bySQ_INSTS_LDS
Scalar memory latency =
SQ_ACCUM_PREV_HIRES
divided bySQ_INSTS_SMEM_NORM
Instruction fetch latency =
SQ_ACCUM_PREV_HIRES
divided bySQ_IFETCH
Wavefront counters#
Hardware counter |
Unit |
Definition |
---|---|---|
|
Waves |
Number of wavefronts dispatched to sequencers, including both new and restored wavefronts |
|
Waves |
Number of context-saved waves |
|
Waves |
Number of context-restored waves sent to sequencers |
|
Waves |
Number of wavefronts with exactly 64 active threads sent to sequencers |
|
Waves |
Number of wavefronts with less than 64 active threads sent to sequencers |
|
Waves |
Number of wavefronts with less than 48 active threads sent to sequencers |
|
Waves |
Number of wavefronts with less than 32 active threads sent to sequencers |
|
Waves |
Number of wavefronts with less than 16 active threads sent to sequencers |
Wavefront cycle counters#
Hardware counter |
Unit |
Definition |
---|---|---|
|
Cycles |
Clock cycles |
|
Cycles |
Number of cycles while sequencers reports it to be busy |
|
Qcycles |
Number of quad-cycles each compute unit is busy |
|
Cycles |
Number of cycles the matrix FMA arithmetic logic unit (ALU) is busy |
|
Qcycles |
Number of quad-cycles spent by waves in the compute units |
|
Qcycles |
Number of quad-cycles spent waiting for anything |
|
Qcycles |
Number of quad-cycles spent waiting for any instruction to be issued |
|
Qcycles |
Number of quad-cycles spent by each wave to work on an instruction |
|
Qcycles |
Number of quad-cycles spent by the sequencer instruction arbiter to work on a vector memory instruction |
|
Qcycles |
Number of quad-cycles spent by the sequencer instruction arbiter to work on an LDS instruction |
|
Qcycles |
Number of quad-cycles spent by the sequencer instruction arbiter to work on a VALU instruction |
|
Qcycles |
Number of quad-cycles spent by the sequencer instruction arbiter to work on a SALU or scalar memory instruction |
|
Qcycles |
Number of quad-cycles spent by the sequencer instruction arbiter to work on an |
|
Qcycles |
Number of quad-cycles spent by the sequencer instruction arbiter to work on a |
|
Qcycles |
Number of quad-cycles spent by the sequencer instruction arbiter to work on a flat instruction |
|
Qcycles |
Number of quad-cycles spent to send addr and cmd data for vector memory write instructions |
|
Qcycles |
Number of quad-cycles spent to send addr and cmd data for vector memory read instructions |
|
Qcycles |
Number of quad-cycles spent to execute scalar memory reads |
|
Qcycles |
Number of quad-cycles spent to execute non-memory read scalar operations |
|
Qcycles |
Number of quad-cycles spent to execute VALU operations on active threads |
|
Qcycles |
Number of quad-cycles spent waiting for LDS instruction to be issued |
SQ_THREAD_CYCLES_VALU
is similar to INST_CYCLES_VALU
, but it’s multiplied by the number of
active threads.
LDS counters#
Hardware counter |
Unit |
Definition |
---|---|---|
|
Cycles |
Number of atomic return cycles in LDS |
|
Cycles |
Number of cycles LDS is stalled by bank conflicts |
|
Cycles |
Number of cycles LDS is stalled by address conflicts |
|
Cycles |
Number of cycles LDS is stalled processing flat unaligned load or store operations |
|
Count |
Number of threads that have a memory violation in the LDS |
|
Cycles |
Number of cycles LDS is used for indexed operations |
Miscellaneous counters#
Hardware counter |
Unit |
Definition |
---|---|---|
|
Count |
Number of instruction fetch requests from L1i, in 32-byte width |
|
Threads |
Number of valid items per wave |
L1 instruction cache (L1i) and scalar L1 data cache (L1d) counters#
Hardware counter |
Unit |
Definition |
---|---|---|
|
Req |
Number of L1 instruction (L1i) cache requests |
|
Count |
Number of L1i cache hits |
|
Count |
Number of non-duplicate L1i cache misses including uncached requests |
|
Count |
Number of duplicate L1i cache misses whose previous lookup miss on the same cache line is not fulfilled yet |
|
Req |
Number of scalar L1d requests |
|
Cycles |
Number of cycles while sequencer input is valid but scalar L1d is not ready |
|
Count |
Number of scalar L1d hits |
|
Count |
Number of non-duplicate scalar L1d misses including uncached requests |
|
Count |
Number of duplicate scalar L1d misses |
|
Req |
Number of constant cache read requests in a single 32-bit data word |
|
Req |
Number of constant cache read requests in two 32-bit data words |
|
Req |
Number of constant cache read requests in four 32-bit data words |
|
Req |
Number of constant cache read requests in eight 32-bit data words |
|
Req |
Number of constant cache read requests in 16 32-bit data words |
|
Req |
Number of atomic requests |
|
Req |
Number of texture cache requests that were issued by instruction and constant caches |
|
Req |
Number of instruction requests to the L2 cache |
|
Req |
Number of data Read requests to the L2 cache |
|
Req |
Number of data write requests to the L2 cache |
|
Req |
Number of data atomic requests to the L2 cache |
|
Cycles |
Number of cycles while the valid requests to the L2 cache are stalled |
Vector L1 cache subsystem counters#
The vector L1 cache subsystem counters are further classified into texture addressing unit, texture data unit, vector L1d or texture cache per pipe, and texture cache arbiter counters.
Texture addressing unit counters#
Hardware counter |
Unit |
Definition |
Value range for |
---|---|---|---|
|
Cycles |
Texture addressing unit busy cycles |
0-15 |
|
Instr |
Number of wavefronts processed by texture addressing unit |
0-15 |
|
Instr |
Number of buffer wavefronts processed by texture addressing unit |
0-15 |
|
Instr |
Number of buffer read wavefronts processed by texture addressing unit |
0-15 |
|
Instr |
Number of buffer write wavefronts processed by texture addressing unit |
0-15 |
|
Instr |
Number of buffer atomic wavefronts processed by texture addressing unit |
0-15 |
|
Cycles |
Number of buffer cycles (including read and write) issued to texture cache |
0-15 |
|
Cycles |
Number of coalesced buffer read cycles issued to texture cache |
0-15 |
|
Cycles |
Number of coalesced buffer write cycles issued to texture cache |
0-15 |
|
Cycles |
Number of cycles texture addressing unit address path is stalled by texture cache |
0-15 |
|
Cycles |
Number of cycles texture addressing unit data path is stalled by texture cache |
0-15 |
|
Cycles |
Number of cycles texture addressing unit address path is stalled by texture data unit |
0-15 |
|
Instr |
Number of flat opcode wavefronts processed by texture addressing unit |
0-15 |
|
Instr |
Number of flat opcode read wavefronts processed by texture addressing unit |
0-15 |
|
Instr |
Number of flat opcode write wavefronts processed by texture addressing unit |
0-15 |
|
Instr |
Number of flat opcode atomic wavefronts processed by texture addressing unit |
0-15 |
Texture data unit counters#
Hardware counter |
Unit |
Definition |
Value range for |
---|---|---|---|
|
Cycle |
Texture data unit busy cycles while it is processing or waiting for data |
0-15 |
|
Cycle |
Number of cycles texture data unit is stalled waiting for texture cache data |
0-15 |
|
Cycle |
Number of cycles texture data unit is stalled by shader processor input |
0-15 |
|
Instr |
Number of wavefront instructions (read, write, atomic) |
0-15 |
|
Instr |
Number of write wavefront instructions |
0-15 |
|
Instr |
Number of atomic wavefront instructions |
0-15 |
|
Instr |
Number of coalescable wavefronts according to texture addressing unit |
0-15 |
Texture cache per pipe counters#
Hardware counter |
Unit |
Definition |
Value range for |
---|---|---|---|
|
Cycles |
Number of cycles vector L1d interface clocks are turned on |
0-15 |
|
Cycles |
Number of cycles vector L1d core clocks are turned on |
0-15 |
|
Cycles |
Number of cycles texture data unit stalls vector L1d |
0-15 |
|
Cycles |
Number of cycles texture cache router stalls vector L1d |
0-15 |
|
Cycles |
Number of cycles tag RAM conflict stalls on a read |
0-15 |
|
Cycles |
Number of cycles tag RAM conflict stalls on a write |
0-15 |
|
Cycles |
Number of cycles tag RAM conflict stalls on an atomic |
0-15 |
|
Cycles |
Number of cycles vector L1d is stalled due to data pending from L2 Cache |
0-15 |
|
Cycles |
Number of cycles texture cache per pipe stalls texture addressing unit data interface |
NA |
|
Req |
Number of state reads |
0-15 |
|
Req |
Number of L1 volatile pixels or buffers from texture addressing unit |
0-15 |
|
Req |
Number of vector L1d accesses. Equals |
0-15 |
|
Req |
Number of vector L1d read accesses |
0-15 |
|
Req |
Number of vector L1d write accesses |
0-15 |
|
Req |
Number of vector L1d atomic requests with return |
0-15 |
|
Req |
Number of vector L1d atomic without return |
0-15 |
|
Count |
Total number of vector L1d writebacks and invalidates |
0-15 |
|
Req |
Number of address translation requests to unified translation cache (L1) |
0-15 |
|
Req |
Number of unified translation cache (L1) translation hits |
0-15 |
|
Req |
Number of unified translation cache (L1) translation misses |
0-15 |
|
Req |
Number of unified translation cache (L1) permission misses |
0-15 |
|
Req |
Number of vector L1d cache accesses including hits and misses |
0-15 |
|
Cycles |
MI200 series only Accumulated wave access latency to vL1D over all wavefronts |
0-15 |
|
Cycles |
MI200 series only Total vL1D to L2 request latency over all wavefronts for reads and atomics with return |
0-15 |
|
Cycles |
MI200 series only Total vL1D to L2 request latency over all wavefronts for writes and atomics without return |
0-15 |
|
Req |
Number of read requests to L2 cache |
0-15 |
|
Req |
Number of write requests to L2 cache |
0-15 |
|
Req |
Number of atomic requests to L2 cache with return |
0-15 |
|
Req |
Number of atomic requests to L2 cache without return |
0-15 |
|
Req |
Number of non-coherently cached read requests to L2 cache |
0-15 |
|
Req |
Number of uncached read requests to L2 cache |
0-15 |
|
Req |
Number of coherently cached read requests to L2 cache |
0-15 |
|
Req |
Number of coherently cached with write read requests to L2 cache |
0-15 |
|
Req |
Number of non-coherently cached write requests to L2 cache |
0-15 |
|
Req |
Number of uncached write requests to L2 cache |
0-15 |
|
Req |
Number of coherently cached write requests to L2 cache |
0-15 |
|
Req |
Number of coherently cached with write write requests to L2 cache |
0-15 |
|
Req |
Number of non-coherently cached atomic requests to L2 cache |
0-15 |
|
Req |
Number of uncached atomic requests to L2 cache |
0-15 |
|
Req |
Number of coherently cached atomic requests to L2 cache |
0-15 |
|
Req |
Number of coherently cached with write atomic requests to L2 cache |
0-15 |
Note that:
TCP_TOTAL_READ[n]
=TCP_PERF_SEL_TOTAL_HIT_LRU_READ
+TCP_PERF_SEL_TOTAL_MISS_LRU_READ
+TCP_PERF_SEL_TOTAL_MISS_EVICT_READ
TCP_TOTAL_WRITE[n]
=TCP_PERF_SEL_TOTAL_MISS_LRU_WRITE``+ ``TCP_PERF_SEL_TOTAL_MISS_EVICT_WRITE
TCP_TOTAL_WRITEBACK_INVALIDATES[n]
=TCP_PERF_SEL_TOTAL_WBINVL1``+ ``TCP_PERF_SEL_TOTAL_WBINVL1_VOL``+ ``TCP_PERF_SEL_CP_TCP_INVALIDATE``+ ``TCP_PERF_SEL_SQ_TCP_INVALIDATE_VOL
Texture cache arbiter counters#
Hardware counter |
Unit |
Definition |
Value range for |
---|---|---|---|
|
Cycles |
Number of texture cache arbiter cycles |
0-31 |
|
Cycles |
Number of cycles texture cache arbiter has a pending request |
0-31 |
L2 cache access counters#
L2 cache is also known as texture cache per channel.
Hardware counter |
Unit |
Definition |
Value range for |
---|---|---|---|
|
Cycles |
Number of L2 cache free-running clocks |
0-31 |
|
Cycles |
Number of L2 cache busy cycles |
0-31 |
|
Req |
Number of L2 cache requests of all types (measured at the tag block) |
0-31 |
|
Req |
Number of L2 cache streaming requests (measured at the tag block) |
0-31 |
|
Req |
Number of non-coherently cached requests (measured at the tag block) |
0-31 |
|
Req |
Number of uncached requests. This is measured at the tag block |
0-31 |
|
Req |
Number of coherently cached requests. This is measured at the tag block |
0-31 |
|
Req |
Number of coherently cached with write requests. This is measured at the tag block |
0-31 |
|
Req |
Number of probe requests |
0-31 |
|
Req |
Number of external probe requests with |
0-31 |
|
Req |
Number of L2 cache read requests (includes compressed reads but not metadata reads) |
0-31 |
|
Req |
Number of L2 cache write requests |
0-31 |
|
Req |
Number of L2 cache atomic requests of all types |
0-31 |
|
Req |
Number of L2 cache hits |
0-31 |
|
Req |
Number of L2 cache misses |
0-31 |
|
Req |
Number of lines written back to the main memory, including writebacks of dirty lines and uncached write or atomic requests |
0-31 |
|
Req |
Number of 32-byte and 64-byte transactions going over the |
0-31 |
|
Req |
Total number of 64-byte transactions (write or |
0-31 |
|
Req |
Number of 32 or 64-byte write or atomic going over the |
0-31 |
|
Cycles |
Number of cycles a write request is stalled |
0-31 |
|
Cycles |
Number of cycles an efficiency arbiter write request is stalled due to the interface running out of input-output (IO) credits |
0-31 |
|
Cycles |
Number of cycles an efficiency arbiter write request is stalled due to the interface running out of GMI credits |
0-31 |
|
Cycles |
Number of cycles an efficiency arbiter write request is stalled due to the interface running out of DRAM credits |
0-31 |
|
Cycles |
Number of cycles the L2 cache is unable to send an efficiency arbiter write request due to it reaching its maximum capacity of pending efficiency arbiter write requests |
0-31 |
|
Req |
The accumulated number of efficiency arbiter write requests in flight |
0-31 |
|
Req |
Number of 32-byte or 64-byte atomic requests going over the |
0-31 |
|
Req |
The accumulated number of efficiency arbiter atomic requests in flight |
0-31 |
|
Req |
Number of 32-byte or 64-byte read requests to efficiency arbiter |
0-31 |
|
Req |
Number of 32-byte read requests to efficiency arbiter |
0-31 |
|
Req |
Number of 32-byte efficiency arbiter reads due to uncached traffic. A 64-byte request is counted as 2 |
0-31 |
|
Cycles |
Number of cycles there is a stall due to the read request interface running out of IO credits |
0-31 |
|
Cycles |
Number of cycles there is a stall due to the read request interface running out of GMI credits |
0-31 |
|
Cycles |
Number of cycles there is a stall due to the read request interface running out of DRAM credits |
0-31 |
|
Req |
The accumulated number of efficiency arbiter read requests in flight |
0-31 |
|
Req |
Number of 32-byte or 64-byte efficiency arbiter read requests to High Bandwidth Memory (HBM) |
0-31 |
|
Req |
Number of 32-byte or 64-byte efficiency arbiter write requests to HBM |
0-31 |
|
Cycles |
Number of cycles the normal request pipeline in the tag is stalled for any reason |
0-31 |
|
Req |
Number of writebacks due to requests that are not writeback requests |
0-31 |
|
Req |
Number of writebacks due to all |
0-31 |
|
Req |
Number of evictions due to requests that are not invalidate or probe requests |
0-31 |
|
Req |
Number of evictions due to all |
0-31 |
Hardware counter |
Unit |
Definition |
Value range for |
---|---|---|---|
|
Cycles |
Number of L2 cache free-running clocks |
0-31 |
|
Cycles |
Number of L2 cache busy cycles |
0-31 |
|
Req |
Number of L2 cache requests of all types (measured at the tag block) |
0-31 |
|
Req |
Number of L2 cache streaming requests (measured at the tag block) |
0-31 |
|
Req |
Number of non-coherently cached requests (measured at the tag block) |
0-31 |
|
Req |
Number of uncached requests. This is measured at the tag block |
0-31 |
|
Req |
Number of coherently cached requests. This is measured at the tag block |
0-31 |
|
Req |
Number of coherently cached with write requests. This is measured at the tag block |
0-31 |
|
Req |
Number of probe requests |
0-31 |
|
Req |
Number of external probe requests with |
0-31 |
|
Req |
Number of L2 cache read requests (includes compressed reads but not metadata reads) |
0-31 |
|
Req |
Number of L2 cache write requests |
0-31 |
|
Req |
Number of L2 cache atomic requests of all types |
0-31 |
|
Req |
Number of L2 cache hits |
0-31 |
|
Req |
Number of L2 cache misses |
0-31 |
|
Req |
Number of lines written back to the main memory, including writebacks of dirty lines and uncached write or atomic requests |
0-31 |
|
Req |
Number of 32-byte and 64-byte transactions going over the |
0-31 |
|
Req |
Total number of 64-byte transactions (write or |
0-31 |
|
Req |
Number of 32 write or atomic going over the |
0-31 |
|
Cycles |
Number of cycles a write request is stalled |
0-31 |
|
Cycles |
Number of cycles an efficiency arbiter write request is stalled due to the interface running out of input-output (IO) credits |
0-31 |
|
Cycles |
Number of cycles an efficiency arbiter write request is stalled due to the interface running out of GMI credits |
0-31 |
|
Cycles |
Number of cycles an efficiency arbiter write request is stalled due to the interface running out of DRAM credits |
0-31 |
|
Cycles |
Number of cycles the L2 cache is unable to send an efficiency arbiter write request due to it reaching its maximum capacity of pending efficiency arbiter write requests |
0-31 |
|
Req |
The accumulated number of efficiency arbiter write requests in flight |
0-31 |
|
Req |
Number of 32-byte or 64-byte atomic requests going over the |
0-31 |
|
Req |
The accumulated number of efficiency arbiter atomic requests in flight |
0-31 |
|
Req |
Number of 32-byte or 64-byte read requests to efficiency arbiter |
0-31 |
|
Req |
Number of 32-byte read requests to efficiency arbiter |
0-31 |
|
Req |
Number of 32-byte efficiency arbiter reads due to uncached traffic. A 64-byte request is counted as 2 |
0-31 |
|
Cycles |
Number of cycles there is a stall due to the read request interface running out of IO credits |
0-31 |
|
Cycles |
Number of cycles there is a stall due to the read request interface running out of GMI credits |
0-31 |
|
Cycles |
Number of cycles there is a stall due to the read request interface running out of DRAM credits |
0-31 |
|
Req |
The accumulated number of efficiency arbiter read requests in flight |
0-31 |
|
Req |
Number of 32-byte or 64-byte efficiency arbiter read requests to High Bandwidth Memory (HBM) |
0-31 |
|
Req |
Number of 32-byte or 64-byte efficiency arbiter write requests to HBM |
0-31 |
|
Cycles |
Number of cycles the normal request pipeline in the tag is stalled for any reason |
0-31 |
|
Req |
Number of writebacks due to requests that are not writeback requests |
0-31 |
|
Req |
Number of writebacks due to all |
0-31 |
|
Req |
Number of evictions due to requests that are not invalidate or probe requests |
0-31 |
|
Req |
Number of evictions due to all |
0-31 |
Note the following:
TCC_REQ[n]
may be more than the number of requests arriving at the texture cache per channel, but it’s a good indication of the total amount of work that needs to be performed.For
TCC_EA0_WRREQ[n]
, atomics may travel over the same interface and are generally classified as write requests.CC mtypes can produce uncached requests, and those are included in
TCC_EA0_WR_UNCACHED_32B[n]
TCC_EA0_WRREQ_LEVEL[n]
is primarily intended to measure average efficiency arbiter write latency.Average write latency =
TCC_PERF_SEL_EA0_WRREQ_LEVEL
divided byTCC_PERF_SEL_EA0_WRREQ
TCC_EA0_ATOMIC_LEVEL[n]
is primarily intended to measure average efficiency arbiter atomic latencyAverage atomic latency =
TCC_PERF_SEL_EA0_WRREQ_ATOMIC_LEVEL
divided byTCC_PERF_SEL_EA0_WRREQ_ATOMIC
TCC_EA0_RDREQ_LEVEL[n]
is primarily intended to measure average efficiency arbiter read latency.Average read latency =
TCC_PERF_SEL_EA0_RDREQ_LEVEL
divided byTCC_PERF_SEL_EA0_RDREQ
Stalls can occur regardless of the need for a read to be performed
Normally, stalls are measured exactly at one point in the pipeline however in the case of
TCC_TAG_STALL[n]
, probes can stall the pipeline at a variety of places. There is no single point that can accurately measure the total stalls
MI300 and MI200 series derived metrics list#
Hardware counter |
Definition |
---|---|
|
Percentage of GPU time ALU units are stalled due to the LDS input queue being full or the output queue not being ready (value range: 0% (optimal) to 100%) |
|
Total kilobytes fetched from the video memory; measured with all extra fetches and any cache or memory effects taken into account |
|
Average number of flat instructions that read from or write to LDS, run per work item (affected by flow control) |
|
Average number of flat instructions that read from or write to the video memory, run per work item (affected by flow control). Includes flat instructions that read from or write to scratch |
|
Average number of global data share read or write instructions run per work item (affected by flow control) |
|
Percentage of time GPU is busy |
|
Percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache (value range: 0% (no hit) to 100% (optimal)) |
|
Percentage of GPU time LDS is stalled by bank conflicts (value range: 0% (optimal) to 100%) |
|
Average number of LDS read or write instructions run per work item (affected by flow control). Excludes flat instructions that read from or write to LDS. |
|
Percentage of GPU time the memory unit is active, which is measured with all extra fetches and writes and any cache or memory effects taken into account (value range: 0% to 100% (fetch-bound)) |
|
Percentage of GPU time the memory unit is stalled (value range: 0% (optimal) to 100%) |
|
Total number of effective 32B write transactions to the memory |
|
Total number of cycles texture cache arbiter has a pending request, over all texture cache arbiter instances |
|
Total number of cycles over all texture cache arbiter instances |
|
Percentage of GPU time scalar ALU instructions are processed (value range: 0% to 100% (optimal)) |
|
Average number of scalar ALU instructions run per work item (affected by flow control) |
|
Average number of scalar fetch instructions from the video memory run per work item (affected by flow control) |
|
Percentage of GPU time vector ALU instructions are processed (value range: 0% to 100% (optimal)) |
|
Average number of vector ALU instructions run per work item (affected by flow control) |
|
Percentage of active vector ALU threads in a wave, where a lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64 (value range: 0%, 100% (optimal - no thread divergence)) |
|
Average number of vector fetch instructions from the video memory run per work-item (affected by flow control); excludes flat instructions that fetch from video memory |
|
Average number of vector write instructions to the video memory run per work-item (affected by flow control); excludes flat instructions that write to video memory |
|
Total wavefronts |
|
Total number of 32-byte effective memory writes |
|
Total kilobytes written to the video memory; measured with all extra fetches and any cache or memory effects taken into account |
|
Percentage of GPU time the write unit is stalled (value range: 0% (optimal) to 100%) |
You can lower ALUStalledByLDS
by reducing LDS bank conflicts or number of LDS accesses.
You can lower MemUnitStalled
by reducing the number or size of fetches and writes.
MemUnitBusy
includes the stall time (MemUnitStalled
).
Hardware counters by and over all texture addressing unit instances#
The following table shows the hardware counters by all texture addressing unit instances.
Hardware counter |
Definition |
---|---|
|
Total number of buffer wavefronts processed |
|
Total number of buffer read wavefronts processed |
|
Total number of buffer write wavefronts processed |
|
Total number of buffer atomic wavefronts processed |
|
Total number of buffer cycles (including read and write) issued to texture cache |
|
Total number of coalesced buffer read cycles issued to texture cache |
|
Total number of coalesced buffer write cycles issued to texture cache |
|
Sum of flat opcode reads processed |
|
Sum of flat opcode writes processed |
|
Total number of flat opcode wavefronts processed |
|
Total number of flat opcode read wavefronts processed |
|
Total number of flat opcode atomic wavefronts processed |
|
Total number of wavefronts processed |
The following table shows the hardware counters over all texture addressing unit instances.
Hardware counter |
Definition |
---|---|
|
Total number of cycles texture addressing unit address path is stalled by texture cache |
|
Total number of cycles texture addressing unit address path is stalled by texture data unit |
|
Average number of busy cycles |
|
Maximum number of texture addressing unit busy cycles |
|
Minimum number of texture addressing unit busy cycles |
|
Total number of cycles texture addressing unit data path is stalled by texture cache |
|
Total number of texture addressing unit busy cycles |
Hardware counters over all texture cache per channel instances#
Hardware counter |
Definition |
---|---|
|
Total number of writebacks due to all |
|
Total number of evictions due to all |
|
Total number of L2 cache atomic requests of all types. |
|
Average number of L2 cache busy cycles. |
|
Total number of L2 cache busy cycles. |
|
Total number of coherently cached requests. |
|
Total number of L2 cache free running clocks. |
|
Total number of 32-byte and 64-byte transactions going over the |
|
Total number of 64-byte transactions (write or CMPSWAP) going over the |
|
Total Number of 32-byte write or atomic going over the |
|
Total Number of cycles a write request is stalled, over all instances. |
|
Total number of cycles an efficiency arbiter write request is stalled due to the interface running out of IO credits, over all instances. |
|
Total number of cycles an efficiency arbiter write request is stalled due to the interface running out of GMI credits, over all instances. |
|
Total number of cycles an efficiency arbiter write request is stalled due to the interface running out of DRAM credits, over all instances. |
|
Total number of efficiency arbiter write requests in flight. |
|
Total number of efficiency arbiter read requests in flight. |
|
Total Number of 32-byte or 64-byte atomic requests going over the |
|
Total number of efficiency arbiter atomic requests in flight. |
|
Total number of 32-byte or 64-byte read requests to efficiency arbiter. |
|
Total number of 32-byte read requests to efficiency arbiter. |
|
Total number of 32-byte efficiency arbiter reads due to uncached traffic. |
|
Total number of cycles there is a stall due to the read request interface running out of IO credits. |
|
Total number of cycles there is a stall due to the read request interface running out of GMI credits. |
|
Total number of cycles there is a stall due to the read request interface running out of DRAM credits. |
|
Total number of 32-byte or 64-byte efficiency arbiter read requests to HBM. |
|
Total number of 32-byte or 64-byte efficiency arbiter write requests to HBM. |
|
Total number of L2 cache hits. |
|
Total number of L2 cache misses. |
|
Total number of non-coherently cached requests. |
|
Total number of writebacks due to requests that are not writeback requests. |
|
Total number of evictions due to requests that are not invalidate or probe requests. |
|
Total number of probe requests. |
|
Total number of external probe requests with |
|
Total number of L2 cache read requests (including compressed reads but not metadata reads). |
|
Total number of all types of L2 cache requests. |
|
Total number of coherently cached with write requests. |
|
Total number of L2 cache streaming requests. |
|
Total number of cycles the normal request pipeline in the tag is stalled for any reason. |
|
Total number of cycles L2 cache is unable to send an efficiency arbiter write request due to it reaching its maximum capacity of pending efficiency arbiter write requests. |
|
Total number of uncached requests. |
|
Total number of L2 cache write requests. |
|
Total number of lines written back to the main memory including writebacks of dirty lines and uncached write or atomic requests. |
|
Maximum number of cycles a write request is stalled. |
Hardware counters by, for, or over all texture cache per pipe instances#
The following table shows the hardware counters by all texture cache per pipe instances.
Hardware counter |
Definition |
---|---|
|
Total number of state reads by ATCPPI |
|
Total number of vector L1d accesses (including hits and misses) |
|
Total number of unified translation cache (L1) permission misses |
|
Total number of address translation requests to unified translation cache (L1) |
|
Total number of unified translation cache (L1) translation misses |
|
Total number of unified translation cache (L1) translation hits |
The following table shows the hardware counters for all texture cache per pipe instances.
Hardware counter |
Definition |
---|---|
|
Total vector L1d to L2 request latency over all wavefronts for reads and atomics with return |
|
Total vector L1d to L2 request latency over all wavefronts for writes and atomics without return |
|
Total wave access latency to vector L1d over all wavefronts |
The following table shows the hardware counters over all texture cache per pipe instances.
Hardware counter |
Definition |
---|---|
|
Total number of cycles tag RAM conflict stalls on an atomic |
|
Total number of cycles vector L1d interface clocks are turned on |
|
Total number of cycles vector L1d core clocks are turned on |
|
Total number of cycles vector L1d cache is stalled due to data pending from L2 Cache |
|
Total number of cycles tag RAM conflict stalls on a read |
|
Total number of atomic requests to L2 cache with return |
|
Total number of atomic requests to L2 cache without return |
|
Total number of coherently cached read requests to L2 cache |
|
Total number of coherently cached write requests to L2 cache |
|
Total number of coherently cached atomic requests to L2 cache |
|
Total number of non-coherently cached read requests to L2 cache |
|
Total number of non-coherently cached write requests to L2 cache |
|
Total number of non-coherently cached atomic requests to L2 cache |
|
Total number of read requests to L2 cache |
|
Total number of coherently cached with write read requests to L2 cache |
|
Total number of coherently cached with write write requests to L2 cache |
|
Total number of coherently cached with write atomic requests to L2 cache |
|
Total number of uncached read requests to L2 cache |
|
Total number of uncached write requests to L2 cache |
|
Total number of uncached atomic requests to L2 cache |
|
Total number of write requests to L2 cache |
|
Total number of cycles texture cache router stalls vector L1d |
|
Total number of cycles texture data unit stalls vector L1d |
|
Total number of vector L1d accesses |
|
Total number of vector L1d read accesses |
|
Total number of vector L1d write accesses |
|
Total number of vector L1d atomic requests with return |
|
Total number of vector L1d atomic requests without return |
|
Total number of vector L1d writebacks and invalidates |
|
Total number of L1 volatile pixels or buffers from texture addressing unit |
|
Total number of cycles tag RAM conflict stalls on a write |
Hardware counter over all texture data unit instances#
Hardware counter |
Definition |
---|---|
|
Total number of atomic wavefront instructions |
|
Total number of coalescable wavefronts according to texture addressing unit |
|
Total number of wavefront instructions (read, write, atomic) |
|
Total number of cycles texture data unit is stalled by shader processor input |
|
Total number of write wavefront instructions |
|
Total number of cycles texture data unit is stalled waiting for texture cache data |
|
Total number of texture data unit busy cycles while it is processing or waiting for data |