GL0 (TCP Vector Cache)#

GL0 is the vector L0 cache (TCP) that sits immediately in front of GL1 in each shader engine datapath. On gfx1151, the hardware counters keep the TCP_* prefix.

For GL1 panels and the GL1 Cache Memory Chart table, see GL1. The handoff toward the GL2 cache is covered under GL2.

Note

On RDNA 3.5, GL0 and TCP refer to the same cache. Hardware counter names (for example, TCP_REQ_sum) retain the TCP prefix.

GL0 cache panels#

GL0 utilization#

Metric

Description

Unit

GL0 Busy (TCP)

Percentage of cycles the GL0 vector cache (TCP) is actively processing requests. Each Workgroup Processor has its own GL0/TCP instance. Low utilization may indicate compute-bound workloads with minimal memory traffic or idle shader engines.

Percent

GL0 request statistics#

Metric

Description

Unit

Total Requests

Total number of requests to the GL0 vector cache, including reads, writes, and atomics. This represents the overall memory traffic generated by vector memory instructions from the shader cores.

Count per Normalization Unit

Read Requests

Number of read requests to the GL0 vector cache. High read counts indicate memory-intensive load operations. Compare with hit rate to assess cache effectiveness for read traffic.

Count per Normalization Unit

Write Requests

Number of write requests to the GL0 vector cache. Write traffic may include global memory stores and cache writebacks. High write counts indicate write-intensive workloads.

Count per Normalization Unit

Miss Requests

Number of GL0 cache requests that missed and required fetching from the GL1 cache. High miss counts increase memory access latency. Consider improving data locality or access patterns to reduce misses.

Count per Normalization Unit

GL0 cache performance#

Metric

Description

Unit

Hit Rate

Percentage of GL0 cache requests serviced from cache without accessing GL1 cache. Higher hit rates indicate better data locality and lower memory access latency. Low hit rates may indicate working sets exceeding GL0 capacity or poor access patterns.

Percent
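As a rough illustration of how a hit rate relates to the request and miss counts, the following Python sketch assumes that the hit rate is the complement of the miss ratio, using the TCP_REQ_sum and TCP_REQ_MISS_sum counters named elsewhere on this page. The exact expression the profiler evaluates is not stated here, so the formula and the function name `gl0_hit_rate` are assumptions made for illustration.

```python
def gl0_hit_rate(tcp_req_sum: float, tcp_req_miss_sum: float) -> float:
    """Return an estimated GL0 hit rate in percent.

    Assumption: hit rate = 100 * (requests - misses) / requests.
    This is an illustrative model, not the profiler's documented formula.
    """
    if tcp_req_sum <= 0:
        # No requests were observed, so report 0 instead of dividing by zero.
        return 0.0
    # Clamp at zero in case sampled counters are slightly inconsistent.
    hits = max(tcp_req_sum - tcp_req_miss_sum, 0.0)
    return 100.0 * hits / tcp_req_sum


# Hypothetical counter values, for illustration only.
print(gl0_hit_rate(1000, 250))  # 75.0
```

Under this model, a kernel with 1000 requests and 250 misses reports a 75% hit rate. The same 250 misses are what appear as GL0 miss traffic toward GL1.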

GL0-GL1 interface#

Metric

Description

Unit

GL1 Read Requests

Number of read requests forwarded from GL0 (TCP) to GL1 cache due to misses. This is the GL0 miss traffic that must be serviced by GL1 or a cache level beyond it.

Count per Normalization Unit

GL1 Read 128B Requests

Number of 128-byte read requests forwarded from GL0 (TCP) to GL1 cache. These are reads issued at 128-byte granularity. Compare with GL1 Read Requests to see how much of the read traffic uses large requests.

Count per Normalization Unit

GL1 Write Requests

Number of write requests forwarded from GL0 (TCP) to GL1 cache. This includes writebacks and stores that missed in GL0.

Count per Normalization Unit

GL0 stalls#

Metric

Description

Unit

TA Req Stall

Cycles the Texture Addresser was stalled waiting for the GL0 cache to accept requests. High stall counts indicate GL0 cache backpressure limiting memory request throughput.

Cycles per Normalization Unit

GL1 Back Pressure

Cycles the GL0 cache was stalled due to backpressure from the GL1 cache. High values indicate GL1 cache contention or bandwidth limitations impacting GL0 throughput.

Cycles per Normalization Unit

Data FIFO Stall

Cycles the GL0 cache data FIFO was stalled. High stall counts may indicate data path congestion or insufficient buffering for high-throughput workloads.

Cycles per Normalization Unit

Memory chart: path up to GL1#

The following Memory Chart tables align with the on-screen flow through the instruction and scalar paths, GL0 (TCP), LDS, and the TCP-GL1 interface.

Memory chart - instruction cache#

Metric

Description

Unit

ICache Utilization

Percentage of shader busy cycles spent actively servicing instruction fetch requests. High utilization indicates the instruction cache is keeping pace with shader execution. Low utilization may indicate instruction cache misses causing stalls, or idle shaders.

Percent

ICache Hit Rate

Percentage of instruction cache accesses that are serviced from cache without fetching from the GL1 cache. High hit rates indicate good instruction locality. Low hit rates may indicate large kernels exceeding instruction cache capacity or divergent branches.

Percent

ICache Miss Rate

Percentage of instruction cache accesses that miss and require fetching from the GL1 cache. High miss rates increase instruction fetch latency and may cause shader stalls. Consider reducing kernel code size or improving branch coherence.

Percent

ICache Requests

Count of instruction-cache (SQC ICache) requests issued, per normalization unit. Formula: SQC_ICACHE_REQ_sum / $denom.

Requests per Normalization Unit

ICache Request Stall Rate

Percent of shader busy cycles where the instruction cache input interface was stalled (valid without ready). Formula: 100 * SQC_ICACHE_INPUT_VALID_READYB_sum / SQ_BUSY_CYCLES_sum.

Percent

ICache-GL1 Read Bandwidth

Bytes per second of read traffic from the instruction path toward GL1 (texture-cache path instruction requests), using 128 B per SQC_TC_INST_REQ event.

Bytes/s
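The two formulas quoted in the table above (a counter divided by $denom, and a stall counter expressed as a percentage of SQ_BUSY_CYCLES_sum) can be sketched in Python as follows. The function names and sample values are hypothetical. The value of $denom depends on the normalization unit selected in the profiler and is passed in here as a plain number.

```python
def per_normalization_unit(counter_sum: float, denom: float) -> float:
    """Apply the 'counter / $denom' form, e.g. SQC_ICACHE_REQ_sum / $denom."""
    if denom <= 0:
        raise ValueError("denom must be positive")
    return counter_sum / denom


def request_stall_rate(valid_readyb_sum: float, sq_busy_cycles_sum: float) -> float:
    """Apply 100 * SQC_ICACHE_INPUT_VALID_READYB_sum / SQ_BUSY_CYCLES_sum."""
    if sq_busy_cycles_sum <= 0:
        # The shader was never busy, so there is nothing to attribute stalls to.
        return 0.0
    return 100.0 * valid_readyb_sum / sq_busy_cycles_sum


# Hypothetical counter values, for illustration only.
print(per_normalization_unit(4096, 64))   # 64.0 requests per unit
print(request_stall_rate(5000, 200000))   # 2.5 percent
```

The scalar data cache rows use the same two forms with the SQC_DCACHE_* counters substituted.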

Memory chart - scalar data cache#

Metric

Description

Unit

Dcache Utilization

Percentage of shader busy cycles spent actively servicing scalar data cache requests. The scalar data cache holds uniform data accessed by scalar instructions. High utilization indicates active use of scalar memory operations.

Percent

Dcache Hit Rate

Percentage of scalar data cache accesses that hit in cache. High hit rates indicate efficient reuse of constant and uniform data. Low hit rates may indicate excessive unique constant data or poor temporal locality.

Percent

Dcache Requests

Count of scalar data-cache (SQC DCache) requests, per normalization unit. Formula: SQC_DCACHE_REQ_sum / $denom.

Requests per Normalization Unit

Dcache Request Stall Rate

Percent of shader busy cycles where the scalar data cache input interface was stalled. Formula: 100 * SQC_DCACHE_INPUT_VALID_READYB_sum / SQ_BUSY_CYCLES_sum.

Percent

Dcache-GL1 Read Bandwidth

Bytes per second of scalar read traffic from SQC toward GL1 (SQC_TC_DATA_READ_REQ at 128 B per request).

Bytes/s

Memory chart - TCP cache (GL0 vector cache)#

Metric

Description

Unit

GL0 Cache Hit Rate (TCP Cache)

Percentage of GL0 vector cache (TCP Cache) requests serviced from cache. TCP Cache is the first-level cache for vector memory operations. Higher hit rates reduce traffic to the GL1 cache and lower memory access latency for vector operations.

Percent

TCP Total Requests

Total TCP (vector L0) requests per normalization unit (reads, writes, and related traffic aggregated in TCP_REQ_sum).

Requests per Normalization Unit

TCP Read Requests

TCP read requests per normalization unit (TCP_REQ_READ_sum).

Requests per Normalization Unit

TCP Write Requests

TCP write requests per normalization unit (TCP_REQ_WRITE_sum).

Requests per Normalization Unit

TCP Miss Requests

TCP requests that missed in the L0 vector cache and required a fill from GL1C (TCP_REQ_MISS_sum / $denom).

Requests per Normalization Unit

Memory chart - LDS (local data share)#

Metric

Description

Unit

LDS Atomic Instructions

Number of atomic operations executed on Local Data Share memory. Atomic operations provide thread-safe read-modify-write operations but may serialize if multiple work-items access the same address.

Instructions per Normalization Unit

LDS Bank Conflict Rate

Percentage of LDS accesses that experienced bank conflicts. Conflicts occur when multiple work-items in a wavefront access different addresses in the same LDS bank. High conflict rates reduce effective LDS bandwidth. Restructuring data layouts can help.

Percent

LDS Estimated Bandwidth

Estimated achieved bandwidth for Local Data Share operations. This reflects the effective throughput after accounting for bank conflicts. RDNA 3.5 provides 32 LDS banks per compute unit for high-bandwidth shared memory access.

Bytes/s

LDS Instructions

Total LDS (Local Data Share) instructions executed per normalization unit (SQ_INSTS_LDS_sum).

Instructions per Normalization Unit

LDS Instruction Cycles

Cycles spent executing LDS instructions per normalization unit (SQ_INST_CYCLES_LDS_sum).

Cycles per Normalization Unit

LDS Wait Cycles

Cycles waves spent waiting on LDS (wait-state attribution), per normalization unit (SQ_WAIT_INST_LDS_sum).

Cycles per Normalization Unit
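To make the bank conflict description above concrete, the following Python sketch models which addresses collide. It uses the 32 banks quoted in the LDS Estimated Bandwidth row and assumes each bank is one 32-bit word wide. That width, the function name `conflict_degree`, and the way addresses are grouped are simplifying assumptions. This is not how the hardware counters are computed.

```python
from collections import defaultdict

NUM_BANKS = 32        # bank count quoted in the LDS Estimated Bandwidth row
BANK_WIDTH_BYTES = 4  # assumption: one 32-bit word per bank


def conflict_degree(byte_addresses):
    """Return the largest number of distinct words that map to a single bank.

    A result of 1 means no conflict. A result of N means the worst bank has
    N different addresses competing for it in this simplified model.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // BANK_WIDTH_BYTES
        # Identical words collapse in the set, so repeated reads of the same
        # address are not counted as a conflict.
        words_per_bank[word % NUM_BANKS].add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)


lanes = range(32)
print(conflict_degree([4 * lane for lane in lanes]))       # unit stride: 1
print(conflict_degree([4 * 32 * lane for lane in lanes]))  # 32-word stride: 32
print(conflict_degree([0 for _ in lanes]))                 # same address: 1
```

In this model, a stride of 32 words sends every work-item to bank 0 with a different address, which is the worst case. Padding the row length of a shared array by one word is a common way to break such a pattern.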

Memory chart - TCP-GL1 interface#

Metric

Description

Unit

TCP-GL1 Read Requests

Read requests from TCP to GL1C (miss/refill path), per normalization unit.

Requests per Normalization Unit

TCP-GL1 Write Requests

Write-related requests from TCP toward GL1C, per normalization unit.

Requests per Normalization Unit

TCP-GL1 Read Bandwidth

Bytes per second for TCP→GL1 read traffic (64 B per TCP_GL1_REQ_READ event).

Bytes/s

TCP-GL1 Write Bandwidth

Bytes per second for TCP→GL1 write traffic (64 B per TCP_GL1_REQ_WRITE event).

Bytes/s
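The bandwidth rows above multiply a request count by a fixed request size and divide by elapsed time. A minimal Python sketch of that arithmetic follows, using the 64 B per request stated for the TCP-GL1 events. The function name, the nanosecond duration argument, and the sample values are hypothetical. How the profiler measures the elapsed time is not described here.

```python
TCP_GL1_BYTES_PER_REQUEST = 64  # request size stated in the table above


def bandwidth_bytes_per_second(request_count: int,
                               bytes_per_request: int,
                               duration_ns: int) -> float:
    """Convert a request count into bytes per second over duration_ns."""
    if duration_ns <= 0:
        raise ValueError("duration_ns must be positive")
    total_bytes = request_count * bytes_per_request
    # Scale nanoseconds to seconds: bytes / (ns / 1e9) == bytes * 1e9 / ns.
    return total_bytes * 1_000_000_000 / duration_ns


# Hypothetical values: 2,000,000 read requests over a 1 ms kernel.
read_bw = bandwidth_bytes_per_second(2_000_000, TCP_GL1_BYTES_PER_REQUEST, 1_000_000)
print(read_bw)  # 128000000000.0 bytes/s, i.e. 128 GB/s
```

The instruction and scalar rows earlier on this page follow the same pattern with 128 B per request, so passing `bytes_per_request=128` covers those cases.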