Improve latch-based L0 cache for synthesis & physical implementation #31
This PR changes the latch-based Snitch L0 cache implementation, which is used when `CFG.EARLY_LATCH` is enabled. Up to now, the latch memory often required practically infeasible timing constraints. This PR rewrites the latch cache so that its setup timing constraints are virtually the same as those of the flip-flop cache, while using less area and causing significantly less leakage in all reasonable configurations.
The first issue with the current implementation is that it uses transparent-low latches to replace the posedge-triggered flip-flops of the FF version of the L0 cache. Since the latches' gate pin is driven by a clock gate, this introduces a non-obvious critical timing condition: the gate pin must be stable before the negative clock phase, so the clock gate's output must be computed and latched within half a clock cycle. Its inputs (`validate_strb[i]`) depend on posedge-triggered flip-flops and module inputs (`out_rsp_id_i`), which in turn depend on checking for a prefetch hit. This can make the path critical for the entire cache; a simplified sketch of the arrangement follows below.

The second issue with the current implementation is that the prefetch/refill path to the L0 storage elements (latches and flip-flops) is often tight if not critical, especially in low-power implementation scenarios where the refilling memory (the L1 cache) has slow CK -> Q timing (e.g. more than half a clock period). In contrast, the storage elements' read timing is usually much less critical, as their outputs feed a processor core directly. This forces the implementation to size all L0 storage elements for the refill path, often leading to significant leakage and increased drive-strength "creep" towards the cores.
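To make the first issue concrete, here is a heavily simplified, purely illustrative sketch of the previous style of storage; it is not the actual `snitch_icache_l0` source, and the module, parameter, and signal names (`l0_latch_low_sketch`, `NrLines`, `refill_data_i`, `validate_strb_i`, `data_q`) as well as the behavioral OR-style gate are placeholders. The latch for line `i` is transparent while its gated clock is low, so `validate_strb_i[i]` has to be computed and latched within the high half-period of `clk_i`:

```systemverilog
// Illustrative only -- not the actual snitch_icache_l0 source.
module l0_latch_low_sketch #(
  parameter int unsigned NrLines   = 8,
  parameter int unsigned LineWidth = 128
) (
  input  logic                 clk_i,
  input  logic [NrLines-1:0]   validate_strb_i,  // per-line update enables
  input  logic [LineWidth-1:0] refill_data_i,    // refill/prefetch line
  output logic [LineWidth-1:0] data_q [NrLines]  // L0 storage (latches)
);
  for (genvar i = 0; i < NrLines; i++) begin : gen_lines
    // OR-style gating for an active-low latch (a real design would use a
    // glitch-free clock-gating cell): while clk_i is high the gate is forced
    // high and the latch is opaque, so validate_strb_i[i] must be stable
    // before the falling edge of clk_i -- i.e. within half a clock period.
    logic gate_n;
    assign gate_n = clk_i | ~validate_strb_i[i];

    // Transparent-low latch: follows refill_data_i while gate_n is low.
    always_latch begin
      if (!gate_n) data_q[i] <= refill_data_i;
    end
  end
endmodule
```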
Both of these issues are addressed in this pull request: instead of transparent-low latches, the new implementation uses a posedge-triggered write-port flip-flop to capture the refill/prefetch line and selectively updates the L0 cache, which is now implemented with transparent-high latches. This style of implementation achieves the same cycle latency and only adds the latches' setup and propagation delay to the cache's read path; in practice, this is close to negligible.
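The following minimal behavioral sketch shows the idea; it is not the actual pull-request code, and the names (`l0_latch_high_sketch`, `wport_q`, `refill_valid_i`, `validate_strb_i`) are illustrative. A posedge flip-flop captures the refill/prefetch line, and per-line transparent-high latches copy it during the high phase of the clock:

```systemverilog
// Illustrative only -- not the actual pull-request code. A real design would
// drive the latch enable through a glitch-free clock-gating cell; the plain
// "clk_i && enable" condition below only models the intended behavior.
module l0_latch_high_sketch #(
  parameter int unsigned NrLines   = 8,
  parameter int unsigned LineWidth = 128
) (
  input  logic                 clk_i,
  input  logic                 rst_ni,
  input  logic                 refill_valid_i,   // refill/prefetch line arrives
  input  logic [LineWidth-1:0] refill_data_i,    // line data from the L1 cache
  input  logic [NrLines-1:0]   validate_strb_i,  // line to update (cycle after capture)
  output logic [LineWidth-1:0] data_q [NrLines]  // L0 storage (latches)
);
  // Posedge-triggered write-port flip-flop: the only receiver of the (slow)
  // refill path, so only this register needs to be sized for it.
  logic [LineWidth-1:0] wport_q;
  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni)             wport_q <= '0;
    else if (refill_valid_i) wport_q <= refill_data_i;
  end

  // Main storage: transparent-high latches. The selected latch follows
  // wport_q during the high phase and closes at the falling edge, so
  // validate_strb_i may use the full clock period to settle before the
  // rising edge, and the read path only gains the latch D -> Q delay.
  for (genvar i = 0; i < NrLines; i++) begin : gen_lines
    always_latch begin
      if (clk_i && validate_strb_i[i]) data_q[i] <= wport_q;
    end
  end
endmodule
```

In this sketch, the (potentially slow) refill path terminates at `wport_q`, while the latches only ever see the flip-flop's output, which is what enables the down-sizing discussed below.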
The first issue, the clock gate sitting in a (sub-)critical path, is fixed by the latches now being transparent-high: the enable pin has the whole clock cycle to settle before the rising edge.
The second issue, high drive strength on the main storage elements, is mitigated as well. Since the refill/prefetch line is now captured in the write-port flip-flop, the latches (the main storage elements) are no longer in a timing-relevant path and can be sized for minimum drive strength and leakage.
Since typical implementations use 8 (or more) L0 cache lines, these leakage savings are quite noticeable. In our experiments, this change alone reduced the leakage of an entire cluster by over 15% without performance degradation compared to the flip-flop variant. Hold constraints are also no more critical than in the previous implementation: the storage latches' hold condition is practically always met, since the write-port flip-flop's output only changes on a posedge, and the next receiving storage element's hold edge is typically at the posedge of the following cycle.