…_pct When --prompt-cache-max-pct is set but --prompt-cache-max-len is not explicitly provided, auto-compute common_tokens as int(prompt_tokens * prompt_cache_max_pct / 100). This makes '--prompt-cache-max-pct 100 --prompt-tokens 8192' do the intuitive thing: build an 8192-token shared prefix so the server can actually cache it. Previously, --prompt-cache-max-len defaulted to 0, which meant TranslationDataset always built an empty shared prefix regardless of --prompt-cache-max-pct, defeating prompt caching. Changed --prompt-cache-max-len default from 0 to None so we can distinguish 'not provided' from 'explicitly set to 0'. Co-authored-by: Aidan Do <aidando73@gmail.com>
help="Maximum number of shared prefix tokens across requests. "
     "If not specified but --prompt-cache-max-pct is set, auto-computed as "
     "int(prompt_tokens * prompt_cache_max_pct / 100). Defaults to 0 (no shared prefix) "
     "when neither this nor --prompt-cache-max-pct is provided.",
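For context, a minimal sketch of how this help text might sit inside an argparse definition (the parser setup and flag `type`/`default` shown here are assumptions, not the exact benchmark code):

```python
import argparse

parser = argparse.ArgumentParser(description="Benchmark load test (sketch)")
parser.add_argument(
    "--prompt-cache-max-len",
    type=int,
    default=0,  # 0 means "no shared prefix" unless --prompt-cache-max-pct overrides it
    help="Maximum number of shared prefix tokens across requests. "
         "If not specified but --prompt-cache-max-pct is set, auto-computed as "
         "int(prompt_tokens * prompt_cache_max_pct / 100). Defaults to 0 (no shared prefix) "
         "when neither this nor --prompt-cache-max-pct is provided.",
)
```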
Done. The default for --prompt-cache-max-len is back to 0. The auto-computation logic now works as:

- If --prompt-cache-max-len is explicitly set to a non-zero value, use it directly.
- Otherwise, if --prompt-cache-max-pct is provided, auto-compute common_tokens = int(prompt_tokens * prompt_cache_max_pct / 100).
- If neither is meaningfully set, common_tokens stays 0.

So --prompt-cache-max-pct 100 --prompt-tokens 8192 works on its own, producing common_tokens = 8192.
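The precedence rules above can be sketched as a small helper (the function name and signature are assumptions for illustration; the actual benchmark code may arrange this differently):

```python
def resolve_common_tokens(prompt_tokens, prompt_cache_max_len=0,
                          prompt_cache_max_pct=None):
    """Resolve the shared-prefix length (common_tokens) for the dataset.

    Sketch of the precedence described in the review comment; with a
    default of 0, an explicit non-zero --prompt-cache-max-len wins.
    """
    # Explicit non-zero --prompt-cache-max-len takes precedence.
    if prompt_cache_max_len:
        return prompt_cache_max_len
    # Otherwise derive the length from --prompt-cache-max-pct if given.
    if prompt_cache_max_pct is not None:
        return int(prompt_tokens * prompt_cache_max_pct / 100)
    # Neither set meaningfully: no shared prefix.
    return 0
```

With these rules, `resolve_common_tokens(8192, 0, 100)` yields 8192, matching the intended behaviour of `--prompt-cache-max-pct 100 --prompt-tokens 8192`.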
Per review feedback, keep the default at 0 instead of None. The auto-computation from --prompt-cache-max-pct now triggers when prompt_cache_max_len is 0 (the default) and prompt_cache_max_pct is set. An explicit non-zero --prompt-cache-max-len still takes precedence. Co-authored-by: Aidan Do <aidando73@gmail.com>
Cleaner idiom: use None to mean 'not provided' rather than overloading 0. Behaviour is identical. Co-authored-by: Aidan Do <aidando73@gmail.com>
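A sketch of the None-based variant this commit describes (function name assumed for illustration): with `None` as the "not provided" sentinel, an explicit `--prompt-cache-max-len 0` is honoured as 0 instead of triggering the percentage-based auto-computation.

```python
def resolve_common_tokens(prompt_tokens, prompt_cache_max_len=None,
                          prompt_cache_max_pct=None):
    # None means "not provided", so an explicit 0 is respected as-is.
    if prompt_cache_max_len is not None:
        return prompt_cache_max_len
    # Not provided: derive from --prompt-cache-max-pct if given.
    if prompt_cache_max_pct is not None:
        return int(prompt_tokens * prompt_cache_max_pct / 100)
    return 0
```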


Auto-compute prompt cache length in benchmark load test to enable caching.
Previously, --prompt-cache-max-len defaulted to 0, causing TranslationDataset to build an empty shared prefix (common_tokens=0). This meant prompt caching was effectively disabled even when --prompt-cache-max-pct was set. The fix changes the default of --prompt-cache-max-len to None and, when it is not explicitly provided but --prompt-cache-max-pct is, calculates common_tokens as int(prompt_tokens * prompt_cache_max_pct / 100). This ensures that passing --prompt-cache-max-pct 100 --prompt-tokens 8192 now correctly generates a shared prefix for caching.