HMSL - Add a length filter client side #1170

pierrelalanne · 2026-02-03T18:59:43Z

Summary

Add a minimum length filter (secrets must be at least 6 characters) to exclude obvious non-secrets before sending them to the HMSL API. This improves UX by reducing false positive matches from the HMSL database.

Note 1: This commit from another PR will fix the failing test in the CI.
Note 2: in a perfect world, these secrets should never end up in HMSL DB in the first place. Unfortunately this happens. This PR is a client side fix to address this.

How to test

echo -e "abc\n123\nshort\sup3rs3cr3t" > /tmp/secrets.txt
# Both should only prepare or check only one secret
pdm run ggshield hmsl fingerprint /tmp/secrets.txt
pdm run ggshield hmsl check /tmp/secrets.txt

Architecture Overview

After exploring the codebase, here's how secrets flow through HMSL:

Input Sources:
├── check.py          → collect() → prepare() → check_secrets()
├── fingerprint.py    → collect() → prepare() → write_outputs()
└── hashicorp_vault.py → collect_list() → prepare() → check_secrets()

The key file is ggshield/verticals/hmsl/collection.py which contains:

collect() - Collects secrets from file/env inputs
collect_list() - Collects secrets from list of tuples (used by Vault)
prepare() - Hashes secrets and builds the payload

Two Collection Functions

collect() handles two input types:

FILE (e.g., secrets.txt): Plain text file with one secret per line. No keys, just values.
```
my-api-key-abc123
database-password-456
```
→ yields SecretWithKey(value="my-api-key-abc123", key=None)
ENV (e.g., .env file): Key-value pairs parsed with dotenv.
```
DB_PASSWORD=super-secret-123
API_KEY=sk-live-abc123
```
→ yields SecretWithKey(value="super-secret-123", key="DB_PASSWORD")

collect_list() handles HashiCorp Vault input:

Vault stores secrets as key-value pairs in its KV store
The Vault API client returns List[Tuple[str, str]] like [("DB_PASSWORD", "secret123"), ...]
Both key and value are used for filtering (e.g., skip keys like HOST, PORT)

For our length filter, we only care about the value in all cases.

Existing Filtering: EXCLUDED_KEYS and EXCLUDED_VALUES

The codebase already has filtering mechanisms in ggshield/verticals/hmsl/utils.py. These are separate filters applied to different fields of SecretWithKey:

Filter	Applies to	Purpose
`EXCLUDED_KEYS`	`key` (variable name)	Skip non-secret env vars like `HOST`, `PORT`
`EXCLUDED_VALUES`	`value` (the secret)	Skip placeholder/boolean values like `"true"`, `"null"`

The filtering happens during collect() and collect_list():

# Both conditions checked (key OR value triggers exclusion)
if key.upper() in EXCLUDED_KEYS or value.lower() in EXCLUDED_VALUES:
    continue

Our new length filter applies to the value only (like EXCLUDED_VALUES).

Implementation Approach

Refactoring: Extract a Unified Validation Utility

Currently, the filtering logic is inline and duplicated in both collect() and collect_list():

Proposal: Extract this into a single validation function in utils.py that handles all exclusion logic, including the new length check.

Usage in collect() and collect_list():

if not should_process_secret(value, key):
    continue

Decisions

Q1: Should the minimum length be configurable?

Decision: No. Hardcoded as a constant (MIN_SECRET_LENGTH = 6) that maintainers can modify in code, but not exposed as a CLI option.

Q2: What should the CLI message say?

Decision: No message unless verbose mode is enabled.

Q3: Should this apply to all HMSL commands?

Decision: Yes, apply to all HMSL commands (check, fingerprint, hashicorp_vault).

Q4: Should filtered secrets appear in JSON output?

Decision: No change to JSON output. Filtered secrets are simply not processed, so they won't appear. This is consistent with how EXCLUDED_VALUES works - those values are silently skipped.

Q5: What about the edge case where ALL secrets are filtered?

Decision: Short-circuit. Don't call the API if there are no secrets to check.

Q6: Should we add the length to `EXCLUDED_VALUES` logic or keep it separate?

Decision: Consolidate into a single should_process_secret() utility function. While EXCLUDED_VALUES (exact match) and length filtering are different criteria, they serve the same purpose: filtering out non-secrets. A unified function:

Q7: Should `should_process_secret()` return detailed exclusion reasons?

Decision: No, keep it simple with a boolean return.

codecov · 2026-02-03T19:02:45Z

Codecov Report

❌ Patch coverage is 72.41379% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.97%. Comparing base (37820fd) to head (1f174cf).
⚠️ Report is 6 commits behind head on main.

Files with missing lines	Patch %	Lines
ggshield/cmd/hmsl/check.py	33.33%	2 Missing ⚠️
...d/cmd/hmsl/check_secret_manager/hashicorp_vault.py	33.33%	2 Missing ⚠️
ggshield/cmd/hmsl/fingerprint.py	33.33%	2 Missing ⚠️
ggshield/verticals/hmsl/collection.py	71.42%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1170      +/-   ##
==========================================
- Coverage   92.05%   91.97%   -0.09%     
==========================================
  Files         144      144              
  Lines        6255     6277      +22     
==========================================
+ Hits         5758     5773      +15     
- Misses        497      504       +7

Flag	Coverage Δ
unittests	`91.97% <72.41%> (-0.09%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

xblanchot-gg · 2026-02-06T09:52:39Z

Looks good to me thanks ! I never heard about this hashicorp vault thing, I don't really understand what it does, but that's unrelated

pierrelalanne · 2026-02-06T15:07:48Z

Looks good to me thanks ! I never heard about this hashicorp vault thing, I don't really understand what it does, but that's unrelated

At some point we built a mini integration with Vault to pull your secrets hash and check them against HMSL IIRC. This could totally be removed as I think it is obsolete and not used.

sevbch

LGTM, just a comment to simplify tests.

Also, as a more general remark: the PR description is way too verbose IMO, especially for such a small and scoped PR. I personally only read the summary part.

The PR template should be enough info to provide to the reviewer.

tests/unit/verticals/hmsl/test_hmsl_utils.py

pierrelalanne · 2026-02-09T18:13:34Z

LGTM, just a comment to simplify tests.

Also, as a more general remark: the PR description is way too verbose IMO, especially for such a small and scoped PR. I personally only read the summary part.

The PR template should be enough info to provide to the reviewer.

Agree and disagree.

Agree: I edited a bit to remove code snippets (this is contained in the diffs), remove the implementation plan (this is visible in the commit history) to make it shorter overall.
Disagree: With AI assisted coding: review becomes the bottleneck, and you may end up reviewing parts of the code you don't know very well. I think it is useful to present the decisions made to implement the feature, remind the reviewer about the code architecture and steps to implement. With this, you can almost not read the diff.

Anyway, this is probably another topic, and it's hard to know what the best practice is in this changing environment. Thanks for the feedback 🙂

agateau-gg · 2026-02-10T08:20:16Z

With this, you can almost not read the diff.

I certainly want to read the diff. The PR description should help me do this, not replace the diff. It should add what I can't get from reading the diff:

The motivation behind the change
The main architectural changes
Are the separate commits additive changes that make sense reviewing one by one or do I risk commenting on issues that are fixed in later commits?
Where to start: give me a guided tour of the changes. Maybe many changes are a reaction to an API change (the signature of a function changed for example). It would make the review easier to point out where the API change was made so that I can start reviewing this, and quickly go over where the changes are "boilerplaty".

I certainly don't want to read a wall of text. Before AI we would get one-line descriptions. Now with AI we get generated walls of text that we are not sure the author actually wrote or even read, but definitely didn't take the time to trim. The only conclusion is that people don't like the act of writing prose 😢.

…) utility Consolidate duplicated filtering logic from collect() and collect_list() into a single utility function. This is a pure refactor with no behavior change, preparing for the addition of a minimum length filter.

Secrets shorter than 6 characters are now filtered out before being sent to the HMSL API. This reduces false positives from the HMSL database for obvious non-secrets. - Add MIN_SECRET_LENGTH = 6 constant - Add length check to should_process_secret() - Clean up EXCLUDED_VALUES (short values now filtered by length) - Add comprehensive tests for should_process_secret() - Update test fixtures to use secrets >= 6 chars

Skip API calls when all secrets have been filtered out. This avoids unnecessary network requests and provides a clear message to the user.

Add typical placeholder values that are obviously not real secrets: changeme, example, placeholder, redacted, secret, xxxxxx, your_secret_here. Note: 'password' is intentionally not excluded as it could be a legitimate weak secret that users may want to check for leaks. We don't want to jump too much into the detection logic with the excluded values.

Add functional tests to verify: - Secrets < 6 characters are filtered out - Excluded placeholder values are filtered out - Short-circuit behavior when all secrets are filtered - Excluded keys (HOST, PORT) are filtered in env files - hmsl check short-circuits when all secrets filtered

pierrelalanne requested a review from a team as a code owner February 3, 2026 18:59

pierrelalanne changed the title ~~HMSL - Add a length filter client side~~ Darft :: HMSL - Add a length filter client side Feb 3, 2026

pierrelalanne force-pushed the feat/hmsl-length-filter branch 2 times, most recently from 473e171 to 8e78ee4 Compare February 4, 2026 21:25

pierrelalanne self-assigned this Feb 5, 2026

pierrelalanne requested review from amascia-gg and xblanchot-gg February 5, 2026 10:41

pierrelalanne changed the title ~~Darft :: HMSL - Add a length filter client side~~ HMSL - Add a length filter client side Feb 5, 2026

xblanchot-gg approved these changes Feb 6, 2026

View reviewed changes

sevbch reviewed Feb 9, 2026

View reviewed changes

tests/unit/verticals/hmsl/test_hmsl_utils.py Outdated Show resolved Hide resolved

pierrelalanne force-pushed the feat/hmsl-length-filter branch 2 times, most recently from c2950e3 to f272b71 Compare February 9, 2026 18:08

pierrelalanne requested a review from sevbch February 9, 2026 18:13

sevbch approved these changes Feb 10, 2026

View reviewed changes

pierrelalanne added 5 commits February 10, 2026 21:50

feat(hmsl): short-circuit when no secrets to process

14660b2

Skip API calls when all secrets have been filtered out. This avoids unnecessary network requests and provides a clear message to the user.

pierrelalanne force-pushed the feat/hmsl-length-filter branch from f272b71 to 1f174cf Compare February 11, 2026 13:17

pierrelalanne merged commit c907f5b into main Feb 11, 2026
28 of 30 checks passed

pierrelalanne deleted the feat/hmsl-length-filter branch February 11, 2026 15:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HMSL - Add a length filter client side #1170

HMSL - Add a length filter client side #1170

Uh oh!

pierrelalanne commented Feb 3, 2026 •

edited

Loading

Uh oh!

codecov bot commented Feb 3, 2026 •

edited

Loading

Uh oh!

xblanchot-gg commented Feb 6, 2026

Uh oh!

pierrelalanne commented Feb 6, 2026

Uh oh!

sevbch left a comment

Uh oh!

Uh oh!

pierrelalanne commented Feb 9, 2026

Uh oh!

agateau-gg commented Feb 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

HMSL - Add a length filter client side #1170

HMSL - Add a length filter client side #1170

Uh oh!

Conversation

pierrelalanne commented Feb 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

How to test

Architecture Overview

Two Collection Functions

Existing Filtering: EXCLUDED_KEYS and EXCLUDED_VALUES

Implementation Approach

Refactoring: Extract a Unified Validation Utility

Decisions

Q1: Should the minimum length be configurable?

Q2: What should the CLI message say?

Q3: Should this apply to all HMSL commands?

Q4: Should filtered secrets appear in JSON output?

Q5: What about the edge case where ALL secrets are filtered?

Q6: Should we add the length to EXCLUDED_VALUES logic or keep it separate?

Q7: Should should_process_secret() return detailed exclusion reasons?

Uh oh!

codecov bot commented Feb 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

xblanchot-gg commented Feb 6, 2026

Uh oh!

pierrelalanne commented Feb 6, 2026

Uh oh!

sevbch left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pierrelalanne commented Feb 9, 2026

Uh oh!

agateau-gg commented Feb 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

pierrelalanne commented Feb 3, 2026 •

edited

Loading

Q6: Should we add the length to `EXCLUDED_VALUES` logic or keep it separate?

Q7: Should `should_process_secret()` return detailed exclusion reasons?

codecov bot commented Feb 3, 2026 •

edited

Loading