Skip to content

Conversation

@pierrelalanne
Copy link
Collaborator

@pierrelalanne pierrelalanne commented Feb 3, 2026

Summary

Add a minimum length filter (secrets must be at least 6 characters) to exclude obvious non-secrets before sending them to the HMSL API. This improves UX by reducing false positive matches from the HMSL database.

Note 1: This commit from another PR will fix the failing test in the CI.
Note 2: in a perfect world, these secrets should never end up in HMSL DB in the first place. Unfortunately this happens. This PR is a client side fix to address this.

How to test

echo -e "abc\n123\nshort\sup3rs3cr3t" > /tmp/secrets.txt
# Both should only prepare or check only one secret
pdm run ggshield hmsl fingerprint /tmp/secrets.txt
pdm run ggshield hmsl check /tmp/secrets.txt

Architecture Overview

After exploring the codebase, here's how secrets flow through HMSL:

Input Sources:
├── check.py          → collect() → prepare() → check_secrets()
├── fingerprint.py    → collect() → prepare() → write_outputs()
└── hashicorp_vault.py → collect_list() → prepare() → check_secrets()

The key file is ggshield/verticals/hmsl/collection.py which contains:

  • collect() - Collects secrets from file/env inputs
  • collect_list() - Collects secrets from list of tuples (used by Vault)
  • prepare() - Hashes secrets and builds the payload

Two Collection Functions

collect() handles two input types:

  • FILE (e.g., secrets.txt): Plain text file with one secret per line. No keys, just values.

    my-api-key-abc123
    database-password-456
    

    → yields SecretWithKey(value="my-api-key-abc123", key=None)

  • ENV (e.g., .env file): Key-value pairs parsed with dotenv.

    DB_PASSWORD=super-secret-123
    API_KEY=sk-live-abc123
    

    → yields SecretWithKey(value="super-secret-123", key="DB_PASSWORD")

collect_list() handles HashiCorp Vault input:

  • Vault stores secrets as key-value pairs in its KV store
  • The Vault API client returns List[Tuple[str, str]] like [("DB_PASSWORD", "secret123"), ...]
  • Both key and value are used for filtering (e.g., skip keys like HOST, PORT)

For our length filter, we only care about the value in all cases.

Existing Filtering: EXCLUDED_KEYS and EXCLUDED_VALUES

The codebase already has filtering mechanisms in ggshield/verticals/hmsl/utils.py. These are separate filters applied to different fields of SecretWithKey:

Filter Applies to Purpose
EXCLUDED_KEYS key (variable name) Skip non-secret env vars like HOST, PORT
EXCLUDED_VALUES value (the secret) Skip placeholder/boolean values like "true", "null"

The filtering happens during collect() and collect_list():

# Both conditions checked (key OR value triggers exclusion)
if key.upper() in EXCLUDED_KEYS or value.lower() in EXCLUDED_VALUES:
    continue

Our new length filter applies to the value only (like EXCLUDED_VALUES).


Implementation Approach

Refactoring: Extract a Unified Validation Utility

Currently, the filtering logic is inline and duplicated in both collect() and collect_list():

Proposal: Extract this into a single validation function in utils.py that handles all exclusion logic, including the new length check.

Usage in collect() and collect_list():

if not should_process_secret(value, key):
    continue

Decisions

Q1: Should the minimum length be configurable?

Decision: No. Hardcoded as a constant (MIN_SECRET_LENGTH = 6) that maintainers can modify in code, but not exposed as a CLI option.

Q2: What should the CLI message say?

Decision: No message unless verbose mode is enabled.

Q3: Should this apply to all HMSL commands?

Decision: Yes, apply to all HMSL commands (check, fingerprint, hashicorp_vault).

Q4: Should filtered secrets appear in JSON output?

Decision: No change to JSON output. Filtered secrets are simply not processed, so they won't appear. This is consistent with how EXCLUDED_VALUES works - those values are silently skipped.

Q5: What about the edge case where ALL secrets are filtered?

Decision: Short-circuit. Don't call the API if there are no secrets to check.

Q6: Should we add the length to EXCLUDED_VALUES logic or keep it separate?

Decision: Consolidate into a single should_process_secret() utility function. While EXCLUDED_VALUES (exact match) and length filtering are different criteria, they serve the same purpose: filtering out non-secrets. A unified function:

Q7: Should should_process_secret() return detailed exclusion reasons?

Decision: No, keep it simple with a boolean return.

@pierrelalanne pierrelalanne requested a review from a team as a code owner February 3, 2026 18:59
@codecov
Copy link

codecov bot commented Feb 3, 2026

Codecov Report

❌ Patch coverage is 72.41379% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.97%. Comparing base (37820fd) to head (1f174cf).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
ggshield/cmd/hmsl/check.py 33.33% 2 Missing ⚠️
...d/cmd/hmsl/check_secret_manager/hashicorp_vault.py 33.33% 2 Missing ⚠️
ggshield/cmd/hmsl/fingerprint.py 33.33% 2 Missing ⚠️
ggshield/verticals/hmsl/collection.py 71.42% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1170      +/-   ##
==========================================
- Coverage   92.05%   91.97%   -0.09%     
==========================================
  Files         144      144              
  Lines        6255     6277      +22     
==========================================
+ Hits         5758     5773      +15     
- Misses        497      504       +7     
Flag Coverage Δ
unittests 91.97% <72.41%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@pierrelalanne pierrelalanne changed the title HMSL - Add a length filter client side Darft :: HMSL - Add a length filter client side Feb 3, 2026
@pierrelalanne pierrelalanne force-pushed the feat/hmsl-length-filter branch 2 times, most recently from 473e171 to 8e78ee4 Compare February 4, 2026 21:25
@pierrelalanne pierrelalanne self-assigned this Feb 5, 2026
@pierrelalanne pierrelalanne changed the title Darft :: HMSL - Add a length filter client side HMSL - Add a length filter client side Feb 5, 2026
@xblanchot-gg
Copy link
Contributor

Looks good to me thanks ! I never heard about this hashicorp vault thing, I don't really understand what it does, but that's unrelated

@pierrelalanne
Copy link
Collaborator Author

Looks good to me thanks ! I never heard about this hashicorp vault thing, I don't really understand what it does, but that's unrelated

At some point we built a mini integration with Vault to pull your secrets hash and check them against HMSL IIRC. This could totally be removed as I think it is obsolete and not used.

Copy link
Collaborator

@sevbch sevbch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a comment to simplify tests.

Also, as a more general remark: the PR description is way too verbose IMO, especially for such a small and scoped PR. I personally only read the summary part.

The PR template should be enough info to provide to the reviewer.

@pierrelalanne pierrelalanne force-pushed the feat/hmsl-length-filter branch 2 times, most recently from c2950e3 to f272b71 Compare February 9, 2026 18:08
@pierrelalanne
Copy link
Collaborator Author

LGTM, just a comment to simplify tests.

Also, as a more general remark: the PR description is way too verbose IMO, especially for such a small and scoped PR. I personally only read the summary part.

The PR template should be enough info to provide to the reviewer.

Agree and disagree.

  • Agree: I edited a bit to remove code snippets (this is contained in the diffs), remove the implementation plan (this is visible in the commit history) to make it shorter overall.
  • Disagree: With AI assisted coding: review becomes the bottleneck, and you may end up reviewing parts of the code you don't know very well. I think it is useful to present the decisions made to implement the feature, remind the reviewer about the code architecture and steps to implement. With this, you can almost not read the diff.

Anyway, this is probably another topic, and it's hard to know what the best practice is in this changing environment. Thanks for the feedback 🙂

@pierrelalanne pierrelalanne requested a review from sevbch February 9, 2026 18:13
@agateau-gg
Copy link
Collaborator

With this, you can almost not read the diff.

I certainly want to read the diff. The PR description should help me do this, not replace the diff. It should add what I can't get from reading the diff:

  • The motivation behind the change
  • The main architectural changes
  • Are the separate commits additive changes that make sense reviewing one by one or do I risk commenting on issues that are fixed in later commits?
  • Where to start: give me a guided tour of the changes. Maybe many changes are a reaction to an API change (the signature of a function changed for example). It would make the review easier to point out where the API change was made so that I can start reviewing this, and quickly go over where the changes are "boilerplaty".

I certainly don't want to read a wall of text. Before AI we would get one-line descriptions. Now with AI we get generated walls of text that we are not sure the author actually wrote or even read, but definitely didn't take the time to trim. The only conclusion is that people don't like the act of writing prose 😢.

…) utility

Consolidate duplicated filtering logic from collect() and collect_list()
into a single utility function. This is a pure refactor with no behavior
change, preparing for the addition of a minimum length filter.
Secrets shorter than 6 characters are now filtered out before being
sent to the HMSL API. This reduces false positives from the HMSL
database for obvious non-secrets.

- Add MIN_SECRET_LENGTH = 6 constant
- Add length check to should_process_secret()
- Clean up EXCLUDED_VALUES (short values now filtered by length)
- Add comprehensive tests for should_process_secret()
- Update test fixtures to use secrets >= 6 chars
Skip API calls when all secrets have been filtered out. This avoids
unnecessary network requests and provides a clear message to the user.
Add typical placeholder values that are obviously not real secrets:
changeme, example, placeholder, redacted, secret, xxxxxx, your_secret_here.

Note: 'password' is intentionally not excluded as it could be a legitimate
weak secret that users may want to check for leaks. We don't want to jump
too much into the detection logic with the excluded values.
Add functional tests to verify:
- Secrets < 6 characters are filtered out
- Excluded placeholder values are filtered out
- Short-circuit behavior when all secrets are filtered
- Excluded keys (HOST, PORT) are filtered in env files
- hmsl check short-circuits when all secrets filtered
@pierrelalanne pierrelalanne force-pushed the feat/hmsl-length-filter branch from f272b71 to 1f174cf Compare February 11, 2026 13:17
@pierrelalanne pierrelalanne merged commit c907f5b into main Feb 11, 2026
28 of 30 checks passed
@pierrelalanne pierrelalanne deleted the feat/hmsl-length-filter branch February 11, 2026 15:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants