@Scheremo

This PR changes the latch-based snitch L0 cache implementation, which is used when CFG.EARLY_LATCH is enabled.
Up to now, the latch memory often required practically infeasible timing constraints. This PR rewrites the latch cache to achieve virtually the same setup timing constraints as the flip-flop cache, while using less area and causing significantly less leakage in all reasonable configurations.

The first issue with the current implementation is that it uses transparent-low latches to replace the posedge-triggered flip-flops used in the FF version of the L0 cache. Since the latches' gate pin is driven by a clock gate, this introduces a non-obvious critical timing condition: the latches' gate pin must be stable before the negative clock edge, so the clock gate's enable must be computed and latched within half a clock cycle. That enable (validate_strb[i]) depends on posedge-triggered flip-flops and module inputs (out_rsp_id_i), which in turn depend on checking for a prefetch hit. This can make the path critical for the entire cache.

The second issue with the current implementation is that the prefetch/refill path to the L0 storage elements (latches and flipflops) is oftentimes tight if not critical, especially in low-power implementation scenarios where the refilling memory (the L1 cache) has slow (e.g. more than half a clock period) CK -> Q timing. In contrast, the storage elements' read timing is usually much less critical as they are directly fed to a processor core. This forces the implementation to size all L0 storage elements accordingly, often leading to significant leakage and increased drive strength "creep" towards the cores.
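
For illustration, the structure the previous implementation corresponds to can be sketched behaviourally as below. This is not the actual snitch_icache source: the module name, port names, and parameters (l0_latch_line_old, refill_data_i, LineWidth, NumLines) are purely illustrative, with only validate_strb taken from the description above, and the clock gate and latches are behavioural models rather than technology cells.

```systemverilog
// Simplified behavioural sketch of the previous transparent-low latch storage.
module l0_latch_line_old #(
  parameter int unsigned LineWidth = 128,
  parameter int unsigned NumLines  = 8
) (
  input  logic                               clk_i,
  // Validate strobe, computed combinationally from posedge flip-flops and
  // module inputs (refill response ID, prefetch-hit check).
  input  logic [NumLines-1:0]                validate_strb,
  // Refill data arriving from L1, potentially with slow CK -> Q timing.
  input  logic [LineWidth-1:0]               refill_data_i,
  output logic [NumLines-1:0][LineWidth-1:0] data_q
);

  for (genvar i = 0; i < NumLines; i++) begin : gen_lines
    logic clk_line, en_latched;
    logic [LineWidth-1:0] line_q;

    // Clock-gate model for a transparent-low latch: the enable is captured
    // while the clock is high, so validate_strb[i] has only half a cycle
    // (posedge to negedge) to settle -- issue 1.
    always_latch begin
      if (clk_i) en_latched <= validate_strb[i];
    end
    // Gated clock: held high (latch closed) unless the line is enabled.
    assign clk_line = clk_i | ~en_latched;

    // Transparent-low storage latch, open during the enabled low phase.
    // The refill data from L1 drives the main storage elements directly,
    // so every line sits on the (often tight) refill path -- issue 2.
    always_latch begin
      if (!clk_line) line_q <= refill_data_i;
    end

    assign data_q[i] = line_q;
  end

endmodule
```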

Both of these issues are addressed in this pull request: instead of using transparent-low latches, the new implementation uses a posedge-triggered write-port flip-flop to capture the refill/prefetch line and selectively updates the L0 cache, which is now implemented using transparent-high latches. This style of implementation achieves fundamentally the same cycle latency and only adds the latches' setup and propagation delay to the cache's read path, which in practice is close to negligible.
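
Under the same illustrative assumptions as above (a behavioural sketch, not the actual implementation), the new structure looks roughly as follows:

```systemverilog
// Simplified behavioural sketch of the new write-port flip-flop plus
// transparent-high latch storage.
module l0_latch_line_new #(
  parameter int unsigned LineWidth = 128,
  parameter int unsigned NumLines  = 8
) (
  input  logic                               clk_i,
  // Validate strobe, as before computed from posedge flip-flops and module
  // inputs; it now has a full cycle to reach the per-line clock gates.
  input  logic [NumLines-1:0]                validate_strb,
  // Refill data from L1 with potentially slow CK -> Q timing.
  input  logic [LineWidth-1:0]               refill_data_i,
  output logic [NumLines-1:0][LineWidth-1:0] data_q
);

  // Posedge-triggered write-port flip-flop capturing the refill/prefetch
  // line: the slow refill path only has to reach this single register.
  logic [LineWidth-1:0] refill_line_q;
  always_ff @(posedge clk_i) begin
    refill_line_q <= refill_data_i;
  end

  for (genvar i = 0; i < NumLines; i++) begin : gen_lines
    logic clk_line, en_latched;
    logic [LineWidth-1:0] line_q;

    // Clock-gate model for a transparent-high latch: the enable is captured
    // while the clock is low and must be stable before the posedge, so
    // validate_strb[i] now has a whole cycle to settle (issue 1).
    always_latch begin
      if (!clk_i) en_latched <= validate_strb[i];
    end
    assign clk_line = clk_i & en_latched;

    // Transparent-high storage latch fed by the write-port flip-flop: the
    // refill path no longer reaches the main storage elements (issue 2),
    // and reads only see the latch's propagation delay.
    always_latch begin
      if (clk_line) line_q <= refill_line_q;
    end

    assign data_q[i] = line_q;
  end

endmodule
```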

The first issue of having the clock gate in a (sub-)critical path is fixed by the latches now being transparent-high; this directly implies that the whole clock cycle is available to set up the enable pin.
The second issue of high drive strength on the main storage elements is mitigated as well: since the refill/prefetch line is now stored in a write-port flip-flop, the latches (the main storage elements) are no longer on a timing-relevant path.

Since typical implementations use 8 (or more) L0 cache lines, these leakage savings are quite noticeable. In our experiments, this change alone reduced the leakage of an entire cluster by over 15% without performance degradation compared to the flip-flop variant. Hold constraints are also no more critical than in the previous implementation: the storage latches' hold condition is practically always met (the write-port flip-flop's output only changes on a posedge), and the next receiving storage element's hold edge is typically the posedge of the following cycle.
