MI350 series performance counters

2025-08-28

Applies to Linux and Windows

This topic lists and describes the hardware performance counters and derived metrics available on the AMD Instinct MI350 and MI355 accelerators. These counters are available for profiling using ROCprofiler-SDK and ROCm Compute Profiler.

The following sections list the performance counters based on the IP blocks.

Command processor packet processor counters (CPC)

Hardware counter

Definition

CPC_ALWAYS_COUNT

Always count.

CPC_ADC_VALID_CHUNK_NOT_AVAIL

ADC valid chunk is not available when dispatch walking is in progress in multi-XCC mode.

CPC_ADC_DISPATCH_ALLOC_DONE

ADC dispatch allocation is done.

CPC_ADC_VALID_CHUNK_END

End of the ADC crawler's valid chunk in multi-XCC mode.

CPC_SYNC_FIFO_FULL_LEVEL

Number of cycles the SYNC FIFO stays full.

CPC_SYNC_FIFO_FULL

Number of times the SYNC FIFO becomes full.

CPC_GD_BUSY

ADC busy.

CPC_TG_SEND

ADC thread group send.

CPC_WALK_NEXT_CHUNK

ADC is walking the next valid chunk in multi-XCC mode.

CPC_STALLED_BY_SE0_SPI

ADC CSDATA stalled by SE0SPI.

CPC_STALLED_BY_SE1_SPI

ADC CSDATA stalled by SE1SPI.

CPC_STALLED_BY_SE2_SPI

ADC CSDATA stalled by SE2SPI.

CPC_STALLED_BY_SE3_SPI

ADC CSDATA stalled by SE3SPI.

CPC_LTE_ALL

CPC sync counter LteAll. Only the master XCD manages LteAll.

CPC_SYNC_WRREQ_FIFO_BUSY

CPC sync counter request FIFO is not empty.

CPC_CANE_BUSY

CPC CANE bus is busy, which indicates the presence of inflight sync counter requests.

CPC_CANE_STALL

CPC sync counter sending is stalled by CANE.
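
The four CPC_STALLED_BY_SE*_SPI counters can be compared with one another to see which shader engine's SPI stalls the ADC most often. The following is a minimal sketch, assuming the counter values have already been collected into a dictionary keyed by counter name. The function name `se_stall_shares` and the sample values are hypothetical and not part of any ROCm tool.

```python
def se_stall_shares(counters):
    """Return each shader engine's share of the summed ADC CSDATA stall counts."""
    names = [f"CPC_STALLED_BY_SE{se}_SPI" for se in range(4)]
    total = sum(counters.get(name, 0) for name in names)
    if total == 0:
        # No stalls recorded: every share is zero.
        return {name: 0.0 for name in names}
    return {name: counters.get(name, 0) / total for name in names}


# Hypothetical sample values, for illustration only.
sample = {
    "CPC_STALLED_BY_SE0_SPI": 300,
    "CPC_STALLED_BY_SE1_SPI": 100,
    "CPC_STALLED_BY_SE2_SPI": 100,
    "CPC_STALLED_BY_SE3_SPI": 0,
}
shares = se_stall_shares(sample)
```

A share close to 1.0 for a single shader engine suggests that its SPI accounts for most of the recorded stalls.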

Shader processor input (SPI) counters

Hardware counter

Definition

SPI_CS0_WINDOW_VALID

Clock count enabled by PIPE0 perfcounter_start event.

SPI_CS0_BUSY

Number of clocks with outstanding waves for PIPE0 (SPI or SH).

SPI_CS0_NUM_THREADGROUPS

Number of thread groups launched for PIPE0.

SPI_CS0_CRAWLER_STALL

Number of clocks when PIPE0 event or wave order FIFO is full.

SPI_CS0_EVENT_WAVE

Number of PIPE0 events and waves.

SPI_CS0_WAVE

Number of PIPE0 waves.

SPI_CS1_WINDOW_VALID

Clock count enabled by PIPE1 perfcounter_start event.

SPI_CS1_BUSY

Number of clocks with outstanding waves for PIPE1 (SPI or SH).

SPI_CS1_NUM_THREADGROUPS

Number of thread groups launched for PIPE1.

SPI_CS1_CRAWLER_STALL

Number of clocks when PIPE1 event or wave order FIFO is full.

SPI_CS1_EVENT_WAVE

Number of PIPE1 events and waves.

SPI_CS1_WAVE

Number of PIPE1 waves.

SPI_CS2_WINDOW_VALID

Clock count enabled by PIPE2 perfcounter_start event.

SPI_CS2_BUSY

Number of clocks with outstanding waves for PIPE2 (SPI or SH).

SPI_CS2_NUM_THREADGROUPS

Number of thread groups launched for PIPE2.

SPI_CS2_CRAWLER_STALL

Number of clocks when PIPE2 event or wave order FIFO is full.

SPI_CS2_EVENT_WAVE

Number of PIPE2 events and waves.

SPI_CS2_WAVE

Number of PIPE2 waves.

SPI_CS3_WINDOW_VALID

Clock count enabled by PIPE3 perfcounter_start event.

SPI_CS3_BUSY

Number of clocks with outstanding waves for PIPE3 (SPI or SH).

SPI_CS3_NUM_THREADGROUPS

Number of thread groups launched for PIPE3.

SPI_CS3_CRAWLER_STALL

Number of clocks when PIPE3 event or wave order FIFO is full.

SPI_CS3_EVENT_WAVE

Number of PIPE3 events and waves.

SPI_CS3_WAVE

Number of PIPE3 waves.

SPI_CSQ_P0_Q0_OCCUPANCY

Sum of occupancy info for PIPE0 Queue0.

SPI_CSQ_P0_Q1_OCCUPANCY

Sum of occupancy info for PIPE0 Queue1.

SPI_CSQ_P0_Q2_OCCUPANCY

Sum of occupancy info for PIPE0 Queue2.

SPI_CSQ_P0_Q3_OCCUPANCY

Sum of occupancy info for PIPE0 Queue3.

SPI_CSQ_P0_Q4_OCCUPANCY

Sum of occupancy info for PIPE0 Queue4.

SPI_CSQ_P0_Q5_OCCUPANCY

Sum of occupancy info for PIPE0 Queue5.

SPI_CSQ_P0_Q6_OCCUPANCY

Sum of occupancy info for PIPE0 Queue6.

SPI_CSQ_P0_Q7_OCCUPANCY

Sum of occupancy info for PIPE0 Queue7.

SPI_CSQ_P1_Q0_OCCUPANCY

Sum of occupancy info for PIPE1 Queue0.

SPI_CSQ_P1_Q1_OCCUPANCY

Sum of occupancy info for PIPE1 Queue1.

SPI_CSQ_P1_Q2_OCCUPANCY

Sum of occupancy info for PIPE1 Queue2.

SPI_CSQ_P1_Q3_OCCUPANCY

Sum of occupancy info for PIPE1 Queue3.

SPI_CSQ_P1_Q4_OCCUPANCY

Sum of occupancy info for PIPE1 Queue4.

SPI_CSQ_P1_Q5_OCCUPANCY

Sum of occupancy info for PIPE1 Queue5.

SPI_CSQ_P1_Q6_OCCUPANCY

Sum of occupancy info for PIPE1 Queue6.

SPI_CSQ_P1_Q7_OCCUPANCY

Sum of occupancy info for PIPE1 Queue7.

SPI_CSQ_P2_Q0_OCCUPANCY

Sum of occupancy info for PIPE2 Queue0.

SPI_CSQ_P2_Q1_OCCUPANCY

Sum of occupancy info for PIPE2 Queue1.

SPI_CSQ_P2_Q2_OCCUPANCY

Sum of occupancy info for PIPE2 Queue2.

SPI_CSQ_P2_Q3_OCCUPANCY

Sum of occupancy info for PIPE2 Queue3.

SPI_CSQ_P2_Q4_OCCUPANCY

Sum of occupancy info for PIPE2 Queue4.

SPI_CSQ_P2_Q5_OCCUPANCY

Sum of occupancy info for PIPE2 Queue5.

SPI_CSQ_P2_Q6_OCCUPANCY

Sum of occupancy info for PIPE2 Queue6.

SPI_CSQ_P2_Q7_OCCUPANCY

Sum of occupancy info for PIPE2 Queue7.

SPI_CSQ_P3_Q0_OCCUPANCY

Sum of occupancy info for PIPE3 Queue0.

SPI_CSQ_P3_Q1_OCCUPANCY

Sum of occupancy info for PIPE3 Queue1.

SPI_CSQ_P3_Q2_OCCUPANCY

Sum of occupancy info for PIPE3 Queue2.

SPI_CSQ_P3_Q3_OCCUPANCY

Sum of occupancy info for PIPE3 Queue3.

SPI_CSQ_P3_Q4_OCCUPANCY

Sum of occupancy info for PIPE3 Queue4.

SPI_CSQ_P3_Q5_OCCUPANCY

Sum of occupancy info for PIPE3 Queue5.

SPI_CSQ_P3_Q6_OCCUPANCY

Sum of occupancy info for PIPE3 Queue6.

SPI_CSQ_P3_Q7_OCCUPANCY

Sum of occupancy info for PIPE3 Queue7.

SPI_CSQ_P0_OCCUPANCY

Sum of occupancy info for all PIPE0 queues.

SPI_CSQ_P1_OCCUPANCY

Sum of occupancy info for all PIPE1 queues.

SPI_CSQ_P2_OCCUPANCY

Sum of occupancy info for all PIPE2 queues.

SPI_CSQ_P3_OCCUPANCY

Sum of occupancy info for all PIPE3 queues.

SPI_VWC0_VDATA_VALID_WR

Number of clocks VGPR bus_0 writes VGPRs.

SPI_VWC1_VDATA_VALID_WR

Number of clocks VGPR bus_1 writes VGPRs.

SPI_CSC_WAVE_CNT_BUSY

Number of cycles when there is any wave in the pipe.
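
The SPI counter names follow a regular pattern across four pipes and eight queues per pipe, so they can be generated rather than typed out. The sketch below builds the per-queue occupancy names for a pipe, sums their values, and derives an average number of waves per thread group from SPI_CS<n>_WAVE and SPI_CS<n>_NUM_THREADGROUPS. It assumes the counter values are already available in a dictionary. The helper names and sample values are hypothetical, and the ratio is an illustrative derived value rather than a metric defined by this page.

```python
NUM_QUEUES = 8  # Queue0 through Queue7, as listed in the table above


def queue_occupancy_names(pipe):
    """Names of the eight per-queue occupancy counters for one pipe."""
    return [f"SPI_CSQ_P{pipe}_Q{queue}_OCCUPANCY" for queue in range(NUM_QUEUES)]


def pipe_occupancy_from_queues(counters, pipe):
    """Add up the per-queue occupancy values for one pipe (missing counters count as 0)."""
    return sum(counters.get(name, 0) for name in queue_occupancy_names(pipe))


def waves_per_thread_group(counters, pipe):
    """Average waves per thread group for one pipe, or None if no groups were launched."""
    groups = counters.get(f"SPI_CS{pipe}_NUM_THREADGROUPS", 0)
    if groups == 0:
        return None
    return counters.get(f"SPI_CS{pipe}_WAVE", 0) / groups


# Hypothetical sample values, for illustration only.
sample = {name: 10 for name in queue_occupancy_names(0)}
sample.update({"SPI_CS0_NUM_THREADGROUPS": 16, "SPI_CS0_WAVE": 64})
print(pipe_occupancy_from_queues(sample, 0))  # 80
print(waves_per_thread_group(sample, 0))      # 4.0
```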

Compute unit (SQ) counters

Hardware counter

Definition

SQ_INSTS_VALU_MFMA_F6F4

Number of VALU V_MFMA_*_F6F4 instructions.

SQ_INSTS_VALU_MFMA_MOPS_F6F4

Number of math operations (add or mul) performed by VALU matrix instructions on the F6 or F4 data types, divided by 512, assuming a full EXEC mask.

SQ_ACTIVE_INST_VALU2

Number of quad-cycles when two VALU instructions are issued (per-simd, nondeterministic).

SQ_INSTS_LDS_LOAD

Number of LDS load instructions issued (per-simd, emulated).

SQ_INSTS_LDS_STORE

Number of LDS store instructions issued (per-simd, emulated).

SQ_INSTS_LDS_ATOMIC

Number of LDS atomic instructions issued (per-simd, emulated).

SQ_INSTS_LDS_LOAD_BANDWIDTH

Total number of 64-byte units loaded, computed as (instrSize * CountOnes(EXEC))/64 (per-simd, emulated).

SQ_INSTS_LDS_STORE_BANDWIDTH

Total number of 64-byte units written, computed as (instrSize * CountOnes(EXEC))/64 (per-simd, emulated).

SQ_INSTS_LDS_ATOMIC_BANDWIDTH

Total number of 64-byte units accessed by atomics, computed as (instrSize * CountOnes(EXEC))/64 (per-simd, emulated).

SQ_INSTS_VALU_FLOPS_FP16

Counts FLOPS per instruction on float 16 excluding MFMA/SMFMA.

SQ_INSTS_VALU_FLOPS_FP32

Counts FLOPS per instruction on float 32 excluding MFMA/SMFMA.

SQ_INSTS_VALU_FLOPS_FP64

Counts FLOPS per instruction on float 64 excluding MFMA/SMFMA.

SQ_INSTS_VALU_FLOPS_FP16_TRANS

Counts FLOPS per instruction on float 16 trans excluding MFMA/SMFMA.

SQ_INSTS_VALU_FLOPS_FP32_TRANS

Counts FLOPS per instruction on float 32 trans excluding MFMA/SMFMA.

SQ_INSTS_VALU_FLOPS_FP64_TRANS

Counts FLOPS per instruction on float 64 trans excluding MFMA/SMFMA.

SQ_INSTS_VALU_IOPS

Counts OPS per instruction on integer, unsigned, or bit data (per-simd, emulated).

SQ_LDS_DATA_FIFO_FULL

Number of cycles LDS data FIFO is full (nondeterministic, unwindowed).

SQ_LDS_CMD_FIFO_FULL

Number of cycles LDS command FIFO is full (nondeterministic, unwindowed).

SQ_VMEM_TA_ADDR_FIFO_FULL

Number of cycles texture requests are stalled due to full address FIFO in TA (nondeterministic, unwindowed).

SQ_VMEM_TA_CMD_FIFO_FULL

Number of cycles texture requests are stalled due to full cmd FIFO in TA (nondeterministic, unwindowed).

SQ_VMEM_WR_TA_DATA_FIFO_FULL

Number of cycles texture writes are stalled due to full data FIFO in TA (nondeterministic, unwindowed).

SQC_ICACHE_MISSES_DUPLICATE

Number of duplicate instruction cache misses (access to a non-resident cache line with a miss already pending) (per-SQ, per-Bank, nondeterministic).

SQC_DCACHE_MISSES_DUPLICATE

Number of duplicate data cache misses (access to a non-resident cache line with a miss already pending) (per-SQ, per-Bank, nondeterministic).
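
The scaled SQ counters above can be converted back into bytes or operations with simple arithmetic. The sketch below applies the (instrSize * CountOnes(EXEC))/64 expression from the LDS bandwidth definitions to a single instruction, converts a bandwidth counter value to bytes, and undoes the divide-by-512 scaling of SQ_INSTS_VALU_MFMA_MOPS_F6F4. The function names are hypothetical. Treating instrSize as bytes per lane and keeping fractional results are assumptions, since this page does not state how the hardware rounds.

```python
def lds_bandwidth_increment(instr_size_bytes, exec_mask):
    """Evaluate (instrSize * CountOnes(EXEC)) / 64 for one LDS instruction.

    instr_size_bytes is assumed to be the per-lane access size in bytes,
    and exec_mask an integer whose set bits mark the active lanes.
    """
    active_lanes = bin(exec_mask).count("1")  # CountOnes(EXEC)
    return (instr_size_bytes * active_lanes) / 64


def lds_counter_to_bytes(counter_value):
    """Convert an SQ_INSTS_LDS_*_BANDWIDTH value (64-byte units) to bytes."""
    return counter_value * 64


def mfma_f6f4_math_ops(mops_counter_value):
    """Undo the divide-by-512 scaling of SQ_INSTS_VALU_MFMA_MOPS_F6F4."""
    return mops_counter_value * 512


full_mask = (1 << 64) - 1  # all 64 lanes active
print(lds_bandwidth_increment(4, full_mask))   # 4.0
print(lds_bandwidth_increment(4, 0xFFFFFFFF))  # 2.0
print(lds_counter_to_bytes(1000))              # 64000
print(mfma_f6f4_math_ops(3))                   # 1536
```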

Texture addressing (TA) unit counters

Hardware counter

Definition

TA_BUFFER_READ_LDS_WAVEFRONTS

Number of buffer read wavefronts for LDS return processed by the TA.

TA_FLAT_READ_LDS_WAVEFRONTS

Number of flat opcode reads for LDS return processed by the TA.

Texture data (TD) unit counters

Hardware counter

Definition

TD_WRITE_ACKT_WAVEFRONT

Number of write acknowledgments, sent to SQ and not to SP.

TD_TD_SP_TRAFFIC

Number of times this TD sends data to the SP.

Texture cache per pipe (TCP) counters

Hardware counter

Definition

TCP_TCP_TA_ADDR_STALL_CYCLES

TCP stalls TA addr interface.

TCP_TCP_TA_DATA_STALL_CYCLES

TCP stalls TA data interface. Windowed.

TCP_LFIFO_STALL_CYCLES

Memory latency FIFOs full stall.

TCP_RFIFO_STALL_CYCLES

Memory request FIFOs full stall.

TCP_TCR_RDRET_STALL

Write into cache stalled by read return from TCR.

TCP_PENDING_STALL_CYCLES

Stall due to data pending from L2.

TCP_UTCL1_SERIALIZATION_STALL

Total number of stalls caused by serializing translation requests through the UTCL1.

TCP_UTCL1_THRASHING_STALL

Stall caused by the thrashing feature in any probe. The count loses accuracy when the stall signal overlaps between probe0 and probe1, and the inaccuracy is worse with the MECO of thrashing deadlock. Some probe0 events might go uncounted with MECO on. This counter provides only a rough thrashing estimate.

TCP_UTCL1_TRANSLATION_MISS_UNDER_MISS

Translation miss_under_miss.

TCP_UTCL1_STALL_INFLIGHT_MAX

Total UTCL1 stalls due to inflight counter saturation.

TCP_UTCL1_STALL_LRU_INFLIGHT

Total UTCL1 stalls due to LRU cache line with inflight traffic.

TCP_UTCL1_STALL_MULTI_MISS

Total UTCL1 stalls due to arbitrated multiple misses.

TCP_UTCL1_LFIFO_FULL

Total number of cycles the UTCL1 and UTCL2 latency-hiding FIFO is full.

TCP_UTCL1_STALL_LFIFO_NOT_RES

Total UTCL1 stalls due to the UTCL2 latency-hiding FIFO output (not resident).

TCP_UTCL1_STALL_UTCL2_REQ_OUT_OF_CREDITS

Total UTCL1 stalls due to UTCL2_req being out of credits.

TCP_CLIENT_UTCL1_INFLIGHT

The sum of inflight client to UTCL1 requests per cycle.

TCP_TAGRAM0_REQ

Total L2 requests mapping to TagRAM 0 from this TCP to all TCCs.

TCP_TAGRAM1_REQ

Total L2 requests mapping to TagRAM 1 from this TCP to all TCCs.

TCP_TAGRAM2_REQ

Total L2 requests mapping to TagRAM 2 from this TCP to all TCCs.

TCP_TAGRAM3_REQ

Total L2 requests mapping to TagRAM 3 from this TCP to all TCCs.

TCP_TCP_LATENCY

Total TCP wave latency (from the first clock of wave entering to the first clock of wave leaving). Divide by TA_TCP_STATE_READ to find average wave latency.

TCP_TCC_READ_REQ_LATENCY

Total TCP to TCC request latency for reads and atomics with return. Not windowed.

TCP_TCC_WRITE_REQ_LATENCY

Total TCP to TCC request latency for writes and atomics without return. Not windowed.

TCP_TCC_WRITE_REQ_HOLE_LATENCY

Total TCP request to TCC hole latency for writes and atomics. Not windowed.
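
The TCP_TCP_LATENCY definition above says to divide by TA_TCP_STATE_READ to find the average wave latency. That division can be sketched as follows, assuming both counter values were collected over the same run. The function name is hypothetical, and returning None when the divisor is zero is a design choice of this sketch rather than documented behavior.

```python
def average_wave_latency(tcp_tcp_latency, ta_tcp_state_read):
    """Average TCP wave latency in clocks: TCP_TCP_LATENCY / TA_TCP_STATE_READ."""
    if ta_tcp_state_read == 0:
        # No state reads recorded, so the average is undefined.
        return None
    return tcp_tcp_latency / ta_tcp_state_read


# Hypothetical sample values, for illustration only.
print(average_wave_latency(12000, 40))  # 300.0
```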

Texture cache per channel (TCC) counters

Hardware counter

Definition

TCC_READ_SECTORS

Total number of 32B data sectors in read requests.

TCC_WRITE_SECTORS

Total number of 32B data sectors in write requests.

TCC_ATOMIC_SECTORS

Total number of 32B data sectors in atomic requests.

TCC_BYPASS_REQ

Number of bypass requests. This is measured at the tag block.

TCC_LATENCY_FIFO_FULL

Number of cycles when the latency FIFO is full.

TCC_SRC_FIFO_FULL

Number of cycles when the SRC FIFO is assumed to be full as measured at the IB block.

TCC_EA0_RDREQ_64B

Number of 64-byte TCC/EA read requests.

TCC_EA0_RDREQ_128B

Number of 128-byte TCC/EA read requests.

TCC_IB_REQ

Number of requests through the IB. This measures the number of raw requests from graphics clients to this TCC.

TCC_IB_STALL

Number of cycles when the IB output is stalled.

TCC_EA0_WRREQ_WRITE_DRAM

Number of TCC/EA write requests (32-byte or 64-byte) destined for DRAM (MC).

TCC_EA0_WRREQ_ATOMIC_DRAM

Number of TCC/EA atomic requests (32-byte or 64-byte) destined for DRAM (MC).

TCC_EA0_RDREQ_DRAM_32B

Number of 32-byte TCC/EA read requests due to DRAM traffic. One 64-byte request is counted as two and one 128-byte as four.

TCC_EA0_RDREQ_GMI_32B

Number of 32-byte TCC/EA read requests due to GMI traffic. One 64-byte request is counted as two and one 128-byte as four.

TCC_EA0_RDREQ_IO_32B

Number of 32-byte TCC/EA read requests due to IO traffic. One 64-byte request is counted as two and one 128-byte as four.

TCC_EA0_WRREQ_WRITE_DRAM_32B

Number of 32-byte TCC/EA write requests due to DRAM traffic. One 64-byte request is counted as two.

TCC_EA0_WRREQ_ATOMIC_DRAM_32B

Number of 32-byte TCC/EA atomic requests due to DRAM traffic. One 64-byte request is counted as two.

TCC_EA0_WRREQ_WRITE_GMI_32B

Number of 32-byte TCC/EA write requests due to GMI traffic. One 64-byte request is counted as two.

TCC_EA0_WRREQ_ATOMIC_GMI_32B

Number of 32-byte TCC/EA atomic requests due to GMI traffic. One 64-byte request is counted as two.

TCC_EA0_WRREQ_WRITE_IO_32B

Number of 32-byte TCC/EA write requests due to IO traffic. One 64-byte request is counted as two.

TCC_EA0_WRREQ_ATOMIC_IO_32B

Number of 32-byte TCC/EA atomic requests due to IO traffic. One 64-byte request is counted as two.
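
The *_32B counters above normalize requests of different sizes to 32-byte units: one 64-byte request counts as two and one 128-byte request counts as four. The sketch below reproduces that weighting and converts sector or 32-byte request counts to bytes. The function names and sample values are hypothetical.

```python
SECTOR_BYTES = 32  # size of one data sector or one normalized request unit


def requests_to_32b_units(count_32b=0, count_64b=0, count_128b=0):
    """Weight requests as the *_32B counters do: 64-byte counts as two, 128-byte as four."""
    return count_32b + 2 * count_64b + 4 * count_128b


def units_to_bytes(units):
    """Convert a count of 32-byte sectors or 32-byte request units to bytes."""
    return units * SECTOR_BYTES


# Hypothetical sample: one request of each size.
units = requests_to_32b_units(count_32b=1, count_64b=1, count_128b=1)
print(units)                  # 7
print(units_to_bytes(units))  # 224
```

The same conversion applies to TCC_READ_SECTORS, TCC_WRITE_SECTORS, and TCC_ATOMIC_SECTORS, which count 32B data sectors.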