-
Notifications
You must be signed in to change notification settings - Fork 187
HMSL - Add a length filter client side #1170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1170 +/- ##
==========================================
- Coverage 92.05% 91.97% -0.09%
==========================================
Files 144 144
Lines 6255 6277 +22
==========================================
+ Hits 5758 5773 +15
- Misses 497 504 +7
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
473e171 to
8e78ee4
Compare
|
Looks good to me thanks ! I never heard about this hashicorp vault thing, I don't really understand what it does, but that's unrelated |
At some point we built a mini integration with Vault to pull your secrets hash and check them against HMSL IIRC. This could totally be removed as I think it is obsolete and not used. |
sevbch
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just a comment to simplify tests.
Also, as a more general remark: the PR description is way too verbose IMO, especially for such a small and scoped PR. I personally only read the summary part.
The PR template should be enough info to provide to the reviewer.
c2950e3 to
f272b71
Compare
Agree and disagree.
Anyway, this is probably another topic, and it's hard to know what the best practice is in this changing environment. Thanks for the feedback 🙂 |
I certainly want to read the diff. The PR description should help me do this, not replace the diff. It should add what I can't get from reading the diff:
I certainly don't want to read a wall of text. Before AI we would get one-line descriptions. Now with AI we get generated walls of text that we are not sure the author actually wrote or even read, but definitely didn't take the time to trim. The only conclusion is that people don't like the act of writing prose 😢. |
…) utility Consolidate duplicated filtering logic from collect() and collect_list() into a single utility function. This is a pure refactor with no behavior change, preparing for the addition of a minimum length filter.
Secrets shorter than 6 characters are now filtered out before being sent to the HMSL API. This reduces false positives from the HMSL database for obvious non-secrets. - Add MIN_SECRET_LENGTH = 6 constant - Add length check to should_process_secret() - Clean up EXCLUDED_VALUES (short values now filtered by length) - Add comprehensive tests for should_process_secret() - Update test fixtures to use secrets >= 6 chars
Skip API calls when all secrets have been filtered out. This avoids unnecessary network requests and provides a clear message to the user.
Add typical placeholder values that are obviously not real secrets: changeme, example, placeholder, redacted, secret, xxxxxx, your_secret_here. Note: 'password' is intentionally not excluded as it could be a legitimate weak secret that users may want to check for leaks. We don't want to jump too much into the detection logic with the excluded values.
Add functional tests to verify: - Secrets < 6 characters are filtered out - Excluded placeholder values are filtered out - Short-circuit behavior when all secrets are filtered - Excluded keys (HOST, PORT) are filtered in env files - hmsl check short-circuits when all secrets filtered
f272b71 to
1f174cf
Compare
Summary
Add a minimum length filter (secrets must be at least 6 characters) to exclude obvious non-secrets before sending them to the HMSL API. This improves UX by reducing false positive matches from the HMSL database.
Note 1: This commit from another PR will fix the failing test in the CI.
Note 2: in a perfect world, these secrets should never end up in HMSL DB in the first place. Unfortunately this happens. This PR is a client side fix to address this.
How to test
Architecture Overview
After exploring the codebase, here's how secrets flow through HMSL:
The key file is
ggshield/verticals/hmsl/collection.pywhich contains:collect()- Collects secrets from file/env inputscollect_list()- Collects secrets from list of tuples (used by Vault)prepare()- Hashes secrets and builds the payloadTwo Collection Functions
collect()handles two input types:FILE (e.g.,
secrets.txt): Plain text file with one secret per line. No keys, just values.→ yields
SecretWithKey(value="my-api-key-abc123", key=None)ENV (e.g.,
.envfile): Key-value pairs parsed withdotenv.→ yields
SecretWithKey(value="super-secret-123", key="DB_PASSWORD")collect_list()handles HashiCorp Vault input:List[Tuple[str, str]]like[("DB_PASSWORD", "secret123"), ...]HOST,PORT)For our length filter, we only care about the value in all cases.
Existing Filtering: EXCLUDED_KEYS and EXCLUDED_VALUES
The codebase already has filtering mechanisms in
ggshield/verticals/hmsl/utils.py. These are separate filters applied to different fields ofSecretWithKey:EXCLUDED_KEYSkey(variable name)HOST,PORTEXCLUDED_VALUESvalue(the secret)"true","null"The filtering happens during
collect()andcollect_list():Our new length filter applies to the value only (like
EXCLUDED_VALUES).Implementation Approach
Refactoring: Extract a Unified Validation Utility
Currently, the filtering logic is inline and duplicated in both
collect()andcollect_list():Proposal: Extract this into a single validation function in
utils.pythat handles all exclusion logic, including the new length check.Usage in
collect()andcollect_list():Decisions
Q1: Should the minimum length be configurable?
Decision: No. Hardcoded as a constant (
MIN_SECRET_LENGTH = 6) that maintainers can modify in code, but not exposed as a CLI option.Q2: What should the CLI message say?
Decision: No message unless verbose mode is enabled.
Q3: Should this apply to all HMSL commands?
Decision: Yes, apply to all HMSL commands (
check,fingerprint,hashicorp_vault).Q4: Should filtered secrets appear in JSON output?
Decision: No change to JSON output. Filtered secrets are simply not processed, so they won't appear. This is consistent with how
EXCLUDED_VALUESworks - those values are silently skipped.Q5: What about the edge case where ALL secrets are filtered?
Decision: Short-circuit. Don't call the API if there are no secrets to check.
Q6: Should we add the length to
EXCLUDED_VALUESlogic or keep it separate?Decision: Consolidate into a single
should_process_secret()utility function. WhileEXCLUDED_VALUES(exact match) and length filtering are different criteria, they serve the same purpose: filtering out non-secrets. A unified function:Q7: Should
should_process_secret()return detailed exclusion reasons?Decision: No, keep it simple with a boolean return.