GL0 (TCP Vector Cache)#

GL0 is the vector-side cache (TCP) that sits immediately in front of GL1 in the shader datapath, instantiated per CU (WGP-pair). Hardware counters keep the TCP_* prefix on gfx115x.

For GL1 panels and the GL1 Cache Memory Chart table, see GL1. The handoff toward the GL2 cache is covered under GL2 cache.

Note

On RDNA 3.5, GL0 and TCP refer to the same cache. Hardware counter names (for example, TCP_REQ_sum) retain the TCP prefix.

GL0 cache panels#

GL0 utilization#

RDNA 3.5 (gfx115x)

Metric	Description	Unit
GL0 Busy (TCP)	Percentage of cycles the GL0 vector cache (TCP) is actively processing requests. Each Workgroup Processor has its own GL0/TCP instance. Low utilization may indicate compute-bound workloads with minimal memory traffic or idle shader engines.	Percent

GL0 request statistics#

RDNA 3.5 (gfx115x)

Metric	Description	Unit
Total Requests	Total number of requests to the GL0 vector cache, including reads, writes, and atomics. This represents the overall memory traffic generated by vector memory instructions from the shader cores.	Count per Normalization Unit
Read Requests	Number of read requests to the GL0 vector cache. High read counts indicate memory-intensive load operations. Compare with hit rate to assess cache effectiveness for read traffic.	Count per Normalization Unit
Write Requests	Number of write requests to the GL0 vector cache. Write traffic may include global memory stores and cache writebacks. High write counts indicate write-intensive workloads.	Count per Normalization Unit
Miss Requests	Number of GL0 cache requests that missed and required fetching from the GL1 cache. High miss counts increase memory access latency. Consider improving data locality or access patterns to reduce misses.	Count per Normalization Unit
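
Every metric in this table is reported per normalization unit, which amounts to a summed counter divided by a denominator (written `$denom` in the formulas elsewhere on this page). The following is a minimal sketch of that division. The function name and the sample values are hypothetical, and using the number of waves as the denominator is only an assumed example of a normalization unit.

```python
def per_normalization_unit(counter_sum: float, denom: float) -> float:
    """Divide a summed hardware counter by the chosen normalization denominator."""
    if denom <= 0:
        raise ValueError("normalization denominator must be positive")
    return counter_sum / denom


# Hypothetical sample: TCP_REQ_sum over a dispatch, normalized per wave.
tcp_req_sum = 1_200_000   # assumed raw counter total
waves = 4_096             # assumed denominator ($denom)

requests_per_wave = per_normalization_unit(tcp_req_sum, waves)
print(f"{requests_per_wave:.2f} requests per wave")
```

Changing the denominator (for example, to a kernel count or a duration) changes the scale of every count in the table, so values are only comparable between runs that use the same normalization unit.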

GL0 cache performance#

RDNA 3.5 (gfx115x)

Metric	Description	Unit
Hit Rate	Percentage of GL0 cache requests serviced from cache without accessing GL1 cache. Higher hit rates indicate better data locality and lower memory access latency. Low hit rates may indicate working sets exceeding GL0 capacity or poor access patterns.	Percent

GL0-GL1 interface#

RDNA 3.5 (gfx115x)

Metric	Description	Unit
GL1 Read Requests	Number of read requests forwarded from GL0 (TCP) to GL1 cache due to misses. This represents GL0 miss traffic that must be serviced by higher cache levels.	Count per Normalization Unit
GL1 Read 128B Requests	Number of 128-byte read requests forwarded from GL0 (TCP) to GL1 cache. This represents large cache line fetches for memory-intensive workloads.	Count per Normalization Unit
GL1 Write Requests	Number of write requests forwarded from GL0 (TCP) to GL1 cache. This includes writebacks and stores that missed in GL0.	Count per Normalization Unit

GL0 stalls#

RDNA 3.5 (gfx115x)

Metric	Description	Unit
TA Req Stall	Cycles the Texture Addresser was stalled waiting for the GL0 cache to accept requests. High stall counts indicate GL0 cache backpressure limiting memory request throughput.	Cycles per Normalization Unit
GL1 Back Pressure	Cycles the GL0 cache was stalled due to backpressure from the GL1 cache. High values indicate GL1 cache contention or bandwidth limitations impacting GL0 throughput.	Cycles per Normalization Unit
Data FIFO Stall	Cycles the GL0 cache data FIFO was stalled. High stall counts may indicate data path congestion or insufficient buffering for high-throughput workloads.	Cycles per Normalization Unit

Memory chart: path up to GL1#

The following Memory Chart tables align with the on-screen flow through instruction and scalar paths, GL0 (TCP), LDS, and the TCP-GL1 interface.

Memory chart - instruction cache#

RDNA 3.5 (gfx115x)

Metric	Description	Unit
ICache Utilization	Percentage of shader busy cycles spent actively servicing instruction fetch requests. High utilization indicates the instruction cache is keeping pace with shader execution. Low utilization may indicate instruction cache misses causing stalls, or idle shaders.	Percent
ICache Hit Rate	Percentage of instruction cache accesses that are serviced from cache without fetching from the GL1 cache. High hit rates indicate good instruction locality. Low hit rates may indicate large kernels exceeding instruction cache capacity or divergent branches.	Percent
ICache Miss Rate	Percentage of instruction cache accesses that miss and require fetching from the GL1 cache. High miss rates increase instruction fetch latency and may cause shader stalls. Consider reducing kernel code size or improving branch coherence.	Percent
ICache Requests	Count of instruction-cache (SQC ICache) requests issued, per normalization unit. Formula: SQC_ICACHE_REQ_sum / $denom.	Requests per Normalization Unit
ICache Request Stall Rate	Percent of shader busy cycles where the instruction cache input interface was stalled (valid without ready). Formula: 100 * SQC_ICACHE_INPUT_VALID_READYB_sum / SQ_BUSY_CYCLES_sum.	Percent
ICache-GL1 Read Bandwidth	Bytes per second of read traffic from the instruction path toward GL1 (texture-cache path instruction requests), using 128 B per SQC_TC_INST_REQ event.	Bytes/s
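
The ICache Request Stall Rate formula in the table above can be sketched as a small helper. Only the formula itself comes from the table; the function name, the sample counter values, and the choice to return 0.0 when no busy cycles were recorded are assumptions made for illustration.

```python
def icache_request_stall_rate(valid_readyb_sum: float, busy_cycles_sum: float) -> float:
    """100 * SQC_ICACHE_INPUT_VALID_READYB_sum / SQ_BUSY_CYCLES_sum, in percent."""
    if busy_cycles_sum == 0:
        # Assumption: report 0% rather than dividing by zero for an idle shader.
        return 0.0
    return 100.0 * valid_readyb_sum / busy_cycles_sum


# Hypothetical counter totals for one dispatch.
stall_pct = icache_request_stall_rate(valid_readyb_sum=250, busy_cycles_sum=10_000)
print(f"ICache request stall rate: {stall_pct:.1f}%")
```

Because the denominator is shader busy cycles rather than total cycles, the result describes how often the interface was stalled while the shader was active, not over the whole dispatch.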

Memory chart - scalar data cache#

RDNA 3.5 (gfx115x)

Metric	Description	Unit
Dcache Utilization	Percentage of shader busy cycles spent actively servicing scalar data cache requests. The scalar data cache holds uniform data accessed by scalar instructions. High utilization indicates active use of scalar memory operations.	Percent
Dcache Hit Rate	Percentage of scalar data cache accesses that hit in cache. High hit rates indicate efficient reuse of constant and uniform data. Low hit rates may indicate excessive unique constant data or poor temporal locality.	Percent
Dcache Requests	Count of scalar data-cache (SQC DCache) requests, per normalization unit. Formula: SQC_DCACHE_REQ_sum / $denom.	Requests per Normalization Unit
Dcache Request Stall Rate	Percent of shader busy cycles where the scalar data cache input interface was stalled. Formula: 100 * SQC_DCACHE_INPUT_VALID_READYB_sum / SQ_BUSY_CYCLES_sum.	Percent
Dcache-GL1 Read Bandwidth	Bytes per second of scalar read traffic from SQC toward GL1 (SQC_TC_DATA_READ_REQ at 128 B per request).	Bytes/s

Memory chart - TCP cache (GL0 vector cache)#

RDNA 3.5 (gfx115x)

Metric	Description	Unit
GL0 Cache Hit Rate (TCP Cache)	Percentage of GL0 vector cache (TCP Cache) requests serviced from cache. TCP Cache is the first-level cache for vector memory operations. Higher hit rates reduce traffic to the GL1 cache and lower memory access latency for vector operations.	Percent
TCP Total Requests	Total TCP (vector L0) requests per normalization unit (reads, writes and related traffic aggregated in TCP_REQ_sum).	Requests per Normalization Unit
TCP Read Requests	TCP read requests per normalization unit (TCP_REQ_READ_sum).	Requests per Normalization Unit
TCP Write Requests	TCP write requests per normalization unit (TCP_REQ_WRITE_sum).	Requests per Normalization Unit
TCP Miss Requests	TCP requests that missed in the L0 vector cache and required a fill from GL1C (TCP_REQ_MISS_sum / $denom).	Requests per Normalization Unit
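
The table names the total and miss counters but does not state how the hit rate is derived from them. A plausible sketch, assuming every request counted in TCP_REQ_sum either hits or is counted in TCP_REQ_MISS_sum, is shown below. The function name and sample values are hypothetical, and the profiler's own formula may differ.

```python
def gl0_hit_rate(tcp_req_sum: float, tcp_req_miss_sum: float) -> float:
    """Estimate GL0 (TCP) hit rate in percent from total and miss request counts."""
    if tcp_req_sum == 0:
        # Assumption: no requests means there is nothing to hit or miss.
        return 0.0
    hits = tcp_req_sum - tcp_req_miss_sum
    return 100.0 * hits / tcp_req_sum


# Hypothetical counter totals: 1,000 requests, of which 250 missed to GL1.
print(f"GL0 hit rate: {gl0_hit_rate(1_000, 250):.1f}%")
```

Since both counters appear in a ratio, the normalization denominator cancels out, so the estimate is the same whichever normalization unit is selected.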

Memory chart - LDS (local data share)#

RDNA 3.5 (gfx115x)

Metric	Description	Unit
LDS Atomic Instructions	Number of atomic operations executed on Local Data Share memory. Atomic operations provide thread-safe read-modify-write operations but may serialize if multiple work-items access the same address.	Instructions per Normalization Unit
LDS Bank Conflict Rate	Percentage of LDS accesses that experienced bank conflicts. Conflicts occur when multiple work-items in a wavefront access different addresses in the same LDS bank. High conflict rates reduce effective LDS bandwidth. Restructuring data layouts can help.	Percent
LDS Estimated Bandwidth	Estimated achieved bandwidth for Local Data Share operations. This reflects the effective throughput after accounting for bank conflicts. RDNA 3.5 provides 32 LDS banks per compute unit for high-bandwidth shared memory access.	Bytes/s
LDS Instructions	Total LDS (Local Data Share) instructions executed per normalization unit (SQ_INSTS_LDS_sum).	Instructions per Normalization Unit
LDS Instruction Cycles	Cycles spent executing LDS instructions per normalization unit (SQ_INST_CYCLES_LDS_sum).	Cycles per Normalization Unit
LDS Wait Cycles	Cycles waves spent waiting on LDS (wait-state attribution), per normalization unit (SQ_WAIT_INST_LDS_sum).	Cycles per Normalization Unit

Memory chart - TCP-GL1 interface#

RDNA 3.5 (gfx115x)

Metric	Description	Unit
TCP-GL1 Read Requests	Read requests from TCP to GL1C (miss/refill path), per normalization unit.	Requests per Normalization Unit
TCP-GL1 Write Requests	Write-related requests from TCP toward GL1C, per normalization unit.	Requests per Normalization Unit
TCP-GL1 Read Bandwidth	Bytes per second for TCP→GL1 read traffic (64 B per TCP_GL1_REQ_READ event).	Bytes/s
TCP-GL1 Write Bandwidth	Bytes per second for TCP→GL1 write traffic (64 B per TCP_GL1_REQ_WRITE event).	Bytes/s
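
The two bandwidth rows convert request counts to bytes per second using 64 B per event. The sketch below shows that conversion under the assumption that a dispatch duration in nanoseconds is available from the profiler's timestamps. The function name, the duration parameter, and the sample values are hypothetical.

```python
BYTES_PER_TCP_GL1_REQ = 64  # per TCP_GL1_REQ_READ / TCP_GL1_REQ_WRITE event, per the table


def tcp_gl1_bandwidth(req_count: int, duration_ns: float) -> float:
    """Convert a TCP->GL1 request count into bytes per second."""
    if duration_ns <= 0:
        raise ValueError("duration must be positive")
    total_bytes = BYTES_PER_TCP_GL1_REQ * req_count
    return total_bytes * 1e9 / duration_ns  # ns -> s


# Hypothetical sample: 1,000,000 read requests over a 2 ms dispatch.
read_bw = tcp_gl1_bandwidth(req_count=1_000_000, duration_ns=2_000_000)
print(f"TCP-GL1 read bandwidth: {read_bw / 1e9:.1f} GB/s")
```

The same helper applies to the write row by passing the write request count instead.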
