@Scheremo

This PR changes the latch-based snitch L0 cache implementation, which is used when CFG.EARLY_LATCH is enabled.
Up to now, the latch memory often required practically infeasible timing constraints. This PR rewrites the latch cache to achieve virtually the same setup timing constraints as the flip-flop cache, while using less area and causing significantly less leakage in all reasonable configurations.

The first issue with the current implementation is that it uses transparent-low latches to replace the posedge-triggered flip-flops used in the FF version of the L0 cache. Since the latches' gate pin is driven by a clock gate, this introduces a non-obvious critical timing condition: the latches' gate pin must be stable before the negative clock edge, so the clock gate's enable must be computed and latched within half a clock cycle. That enable (validate_strb[i]) depends on posedge-triggered flip-flops and module inputs (out_rsp_id_i), which in turn depend on checking for a prefetch hit. This can make the path critical for the entire cache.

The second issue with the current implementation is that the prefetch/refill path to the L0 storage elements (latches and flipflops) is oftentimes tight if not critical, especially in low-power implementation scenarios where the refilling memory (the L1 cache) has slow (e.g. more than half a clock period) CK -> Q timing. In contrast, the storage elements' read timing is usually much less critical as they are directly fed to a processor core. This forces the implementation to size all L0 storage elements accordingly, often leading to significant leakage and increased drive strength "creep" towards the cores.
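
For illustration, the structure the previous implementation corresponds to can be sketched behaviourally as below. This is not the actual snitch_icache source: the module name, port names, and parameters (l0_latch_line_old, refill_data_i, LineWidth, NumLines) are purely illustrative, with only validate_strb taken from the description above, and the clock gate and latches are behavioural models rather than technology cells.

```systemverilog
// Simplified behavioural sketch of the previous transparent-low latch storage.
module l0_latch_line_old #(
  parameter int unsigned LineWidth = 128,
  parameter int unsigned NumLines  = 8
) (
  input  logic                               clk_i,
  // Validate strobe, computed combinationally from posedge flip-flops and
  // module inputs (refill response ID, prefetch-hit check).
  input  logic [NumLines-1:0]                validate_strb,
  // Refill data arriving from L1, potentially with slow CK -> Q timing.
  input  logic [LineWidth-1:0]               refill_data_i,
  output logic [NumLines-1:0][LineWidth-1:0] data_q
);

  for (genvar i = 0; i < NumLines; i++) begin : gen_lines
    logic clk_line, en_latched;
    logic [LineWidth-1:0] line_q;

    // Clock-gate model for a transparent-low latch: the enable is captured
    // while the clock is high, so validate_strb[i] has only half a cycle
    // (posedge to negedge) to settle -- issue 1.
    always_latch begin
      if (clk_i) en_latched <= validate_strb[i];
    end
    // Gated clock: held high (latch closed) unless the line is enabled.
    assign clk_line = clk_i | ~en_latched;

    // Transparent-low storage latch, open during the enabled low phase.
    // The refill data from L1 drives the main storage elements directly,
    // so every line sits on the (often tight) refill path -- issue 2.
    always_latch begin
      if (!clk_line) line_q <= refill_data_i;
    end

    assign data_q[i] = line_q;
  end

endmodule
```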

Both of these issues are addressed in this pull request: instead of using transparent-low latches, the new implementation uses a posedge-triggered write-port flip-flop to capture the refill/prefetch line and selectively updates the L0 cache, which is now implemented using transparent-high latches. This style of implementation achieves fundamentally the same cycle latency and only adds the latches' setup and propagation delay to the cache's read path, which in practice is close to negligible.
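
Under the same illustrative assumptions as above (a behavioural sketch, not the actual implementation), the new structure looks roughly as follows:

```systemverilog
// Simplified behavioural sketch of the new write-port flip-flop plus
// transparent-high latch storage.
module l0_latch_line_new #(
  parameter int unsigned LineWidth = 128,
  parameter int unsigned NumLines  = 8
) (
  input  logic                               clk_i,
  // Validate strobe, as before computed from posedge flip-flops and module
  // inputs; it now has a full cycle to reach the per-line clock gates.
  input  logic [NumLines-1:0]                validate_strb,
  // Refill data from L1 with potentially slow CK -> Q timing.
  input  logic [LineWidth-1:0]               refill_data_i,
  output logic [NumLines-1:0][LineWidth-1:0] data_q
);

  // Posedge-triggered write-port flip-flop capturing the refill/prefetch
  // line: the slow refill path only has to reach this single register.
  logic [LineWidth-1:0] refill_line_q;
  always_ff @(posedge clk_i) begin
    refill_line_q <= refill_data_i;
  end

  for (genvar i = 0; i < NumLines; i++) begin : gen_lines
    logic clk_line, en_latched;
    logic [LineWidth-1:0] line_q;

    // Clock-gate model for a transparent-high latch: the enable is captured
    // while the clock is low and must be stable before the posedge, so
    // validate_strb[i] now has a whole cycle to settle (issue 1).
    always_latch begin
      if (!clk_i) en_latched <= validate_strb[i];
    end
    assign clk_line = clk_i & en_latched;

    // Transparent-high storage latch fed by the write-port flip-flop: the
    // refill path no longer reaches the main storage elements (issue 2),
    // and reads only see the latch's propagation delay.
    always_latch begin
      if (clk_line) line_q <= refill_line_q;
    end

    assign data_q[i] = line_q;
  end

endmodule
```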

The first issue of having the clock gate in a (sub-)critical path is fixed by the latches now being transparent-high; this directly implies that the whole clock cycle is available to set up the enable pin.
The second issue of high drive strength on the main storage elements is mitigated as well: since the refill/prefetch line is now stored in a write-port flip-flop, the latches (the main storage elements) are no longer on a timing-relevant path.

Since typical implementations use 8 (or more) L0 cache lines, these leakage savings are quite noticeable. In our experiments, this change alone reduced the leakage of an entire cluster by over 15% without performance degradation compared to the flip-flop variant. Hold constraints are also no more critical than in the previous implementation: the storage latches' hold condition is practically always met (the write-port flip-flop's output only changes on a posedge), and the next receiving storage element's hold edge is typically the posedge of the following cycle.
