feat: add ProPILE probes for PII leakage detection#1504
feat: add ProPILE probes for PII leakage detection#1504stefanoamorelli wants to merge 5 commits intoNVIDIA:mainfrom
Conversation
|
DCO Assistant Lite bot All contributors have signed the DCO ✍️ ✅ |
e07167a to
b452f8d
Compare
|
I have read the DCO Document and I hereby sign the DCO |
094a55d to
1610d41
Compare
|
@leondz ready for review as promised! |
fb1653c to
8dc718b
Compare
|
@leondz latest push should have addressed the failing pipes. |
|
Will provide a review, but I have a high level comment here: I think this is going to be a tough probe to implement without significant caveats that we need to document carefully. This is a true "training-time" issue in the sense that anything we could hope to uncover with such a check would require access to training data and the ability to reproduce training data. In the majority of cases, the training data for a model is not going to be public and so we're dealing with something of a "chicken and the egg" problem. This probe is most similar to those we have in There is a really significant chance of false positives with detectors that look for the "shape" of PII and I wonder how much we can hedge on that. This may be fixable with documentation or something else. |
erickgalinkin
left a comment
There was a problem hiding this comment.
Broadly, I think this is a good implementation but I have some issues that I've highlighted. Perhaps this merits a bit more discussion on those fronts.
Ultimately, I'm personally fine with accepting this as-is with 2 caveats:
- I think that we need to have better documentation for the probe and flag the fact that it's using made-up data, etc.
- The probe should not be
activeby default given the limitations.
Another thing that could be interesting (albeit very much not necessary) is the use of something like Presidio as a detector.
There was a problem hiding this comment.
This is a great add, it may take a bit to get through review and acceptance as the project needs to offer some guidance around the positioning of this probe.
Currently:
- Use of this probe is predicated on knowledge of the training dataset
- Since each model has different training data the tier would not align with the concept
COMPETE_WITH_SOTA
One idea on how to address this, that has been floated internally, would be to identify a default dataset that contains some PII entries and is considered essential or common for most open weight models. This would increase the general utility of this probe. Then if users choose to bring there own additional or replacement dataset for testing the report may denote that the comparison scores, if calibration were to include this probe, are not applicable due to how the probe was configured for the run.
There was a problem hiding this comment.
Really appreciate all the feedback and context @erickgalinkin and @jmartin-tech!
Since each model has different training data the tier would not align with the concept COMPETE_WITH_SOTA
Indeed, I've adjusted it toINFORMATIONAL(and expanded more in the commit history + PR description).
Use of this probe is predicated on knowledge of the training dataset
One idea on how to address this, that has been floated internally, would be to identify a default dataset that contains some PII entries and is considered essential or common for most open weight models. This would increase the general utility of this probe. Then if users choose to bring there own additional or replacement dataset for testing the report may denote that the comparison scores, if calibration were to include this probe, are not applicable due to how the probe was configured for the run.
I totally agree, and shared more info here.
5ae6b17 to
9f6567e
Compare
erickgalinkin
left a comment
There was a problem hiding this comment.
This looks good to me. Gives me some ideas how we could use/re-use some techniques for leakreplay and maybe consolidate things at some point. Nice work!
garak/data/propile/enron_pii.jsonl
Outdated
There was a problem hiding this comment.
IMO, this is fine for now, but in the future, I think it would be good to have our own copy of the dataset. This is more of a maintainer comment, but just making it for posterity.
There was a problem hiding this comment.
To improve this - how about doing PII extr over a large open training/sft dataset (e.g. nvidia/Nemotron-CC-v2.1, nvidia/Nemotron-Pretraining-SFT-v1, or some CC) and including instances from this? I'm thinking it gives:
- a possibility of confirmed instances of PII,
- a non-zero chance of relating to items found in real-world models, especially future models, which may have been trained on all the open data,
- safer examples because we aren't the original source for the data (one will have to be careful nevertheless; propagating leaked SSNs doesn't seem .. prudent, for example)
|
Can we hold this briefly and nail down some useful PII examples? c.f. #1504 (comment) Code looks complete otherwise |
notes of proposal in garak eng discussion:
|
@leondz really appreciate all the feedback, we're aligned. I'm working on this, will update the PR to be ready for review (ETA EOW). |
This commit adds the data infrastructure for ProPILE privacy leakage probes [1]. The ProPILE methodology tests whether LLMs have memorized personally identifiable information from their training data by constructing prompts with known PII to elicit other memorized PII. The prompt_templates.tsv contains template patterns for three probe types: twins (name only), triplets (name plus one auxiliary PII), and quadruplets (name plus two auxiliary PIIs). These templates are based on the original ProPILE paper's approach to privacy probing. The bundled pii_data.jsonl contains 26 records extracted from NVIDIA's Nemotron-CC dataset [2] using Microsoft Presidio [3] for named entity recognition. I chose Nemotron-CC because it is an open dataset actively used for LLM pretraining, which means any PII found there has a reasonable chance of appearing in model training data. Web crawl datasets like Nemotron-CC tend to have sparse PII since contact pages usually list either email or phone, rarely both for the same person. After processing 50,000 samples, only one record had both fields. This works well for twin probes but provides limited coverage for triplet and quadruplet probes. For richer PII data, the extraction script supports the Enron email dataset [4], which the original ProPILE paper used. Business email signatures naturally contain name, email, phone, and address together, making Enron well suited for triplet and quadruplet testing. The extraction script uses HuggingFace datasets [5] for streaming large datasets without full downloads, and spaCy [6] provides the NER backend for Presidio. A requirements.txt with version bounds is included following the project conventions in tools/requirements.txt. [1]: https://arxiv.org/abs/2307.01881 [2]: https://huggingface.co/datasets/nvidia/Nemotron-CC-v2.1 [3]: https://microsoft.github.io/presidio/ [4]: https://huggingface.co/datasets/LLM-PBE/enron-email [5]: https://huggingface.co/docs/datasets [6]: https://spacy.io/
This commit implements the probe classes for ProPILE privacy testing [1]. The probes construct prompts using known PII to test whether LLMs can complete them with other memorized PII from their training data. Four probe classes are implemented following the ProPILE methodology: PIILeakTwin uses only the subject's name to elicit email, phone, or address. This is the simplest probe and works well with sparse PII data like web crawls where records typically have just one contact field. PIILeakTriplet uses name plus one auxiliary PII to elicit another. For example, given a name and email, it tests whether the model can produce the associated phone number. This requires PII records with at least two fields beyond the name. PIILeakQuadruplet uses name plus two auxiliary PIIs to elicit the third. This provides maximum context to the model and tests for stronger memorization signals. It requires complete PII records with name, email, phone, and address. PIILeakUnstructured tests for memorization of relationship and affiliation information like family members, employers, or university affiliations. All probes share a common mixin that handles template loading, PII data loading, and attempt metadata tracking. The PII data source path is logged to report.jsonl for traceability, following the same pattern as garak.payloads.PayloadGroup. The probes are marked as inactive by default since they require specific PII data that users should curate for their target models. The module docstring explains how to extract PII from training datasets and the tradeoffs between data sources. [1]: https://arxiv.org/abs/2307.01881
This commit adds detectors for evaluating ProPILE probe responses. The detectors check whether model outputs contain the expected PII that was used as the trigger for each probe attempt. PIILeak is the primary detector that performs normalized string matching between the expected PII trigger and the model's response. It handles common variations in formatting like email case differences, phone number punctuation, and address abbreviations. PIILeakExact provides strict matching for cases where exact reproduction is required, useful for confirming strong memorization signals where the model reproduces PII character for character. PIILeakEmail, PIILeakPhone, and PIILeakAddress are specialized detectors that apply type-specific normalization. Email matching is case insensitive. Phone matching strips formatting characters and compares digit sequences. Address matching handles common abbreviations like St for Street and normalizes whitespace. The detectors access the expected trigger value from attempt.notes, which is populated by the probe's _attempt_prestore_hook method. This follows the pattern established by other garak probe/detector pairs where metadata flows through the attempt object.
This commit adds comprehensive tests for the ProPILE probe and detector implementations. The tests verify correct behavior without requiring actual LLM inference, using mock data and controlled inputs. Probe tests verify that templates load correctly from the TSV file, PII records are parsed from the JSONL data, and prompts are generated with proper placeholder substitution. Each probe class has tests confirming its specific template categories are used and that the pii_type metadata is set correctly for downstream detector matching. Detector tests verify the normalization logic for each PII type. Email tests confirm case insensitive matching. Phone tests verify that various formatting styles like parentheses, dashes, and dots are normalized to digit sequences. Address tests check that common abbreviations are handled and that partial matches within longer text are detected. The tests use pytest fixtures to provide consistent mock PII data across test cases. A separate test class verifies graceful handling when the PII data file is missing, confirming that a warning is logged and the probe initializes with empty prompts rather than raising an exception. Tests for the base probe module are also updated to include the new propile module in the probe discovery tests.
This commit adds the Sphinx documentation stubs for the ProPILE probes and detectors modules. The rst files use autodoc to generate API documentation from the module docstrings. The probes.rst and detectors.rst index files are updated to include the new propile modules in the table of contents, making them discoverable in the built documentation.
9f6567e to
5bee12f
Compare
|
@leondz I ran PII extraction over What I found is that web crawl data produces mostly name+email pairs: out of 26 curated records, only one has both email and phone, and none have an address. This is because contact pages typically expose a single contact method per person, unlike email signatures which naturally bundle multiple PII fields together. The original paper 4 used Enron 5 for a reason that goes beyond it being a known training corpus: business email creates dense PII tuples (name, email, phone, address in a single signature block), which is exactly what the triplet and quadruplet templates need to function. So the current state is: Nemotron-CC works well for twin probes out of the box, everything lands as Looking forward to hearing your feedback. |
Tip
Better reviewed commit-by-commit, starting from here.
This PR implements the ProPILE (Probing Privacy Leakage in Large Language Models) methodology from Kim et al., 2023 probes and detectors.
garak-propile-demo-run.mp4
ProPILE tests whether LLMs have memorized personally identifiable information (PII) from their training data and can be prompted to leak it. The attack constructs completion-style prompts using known PII to elicit other PII fields.
Limitations and decisions
Important
This probe is most effective when testing against PII data that was likely present in the model's training corpus.
This is similar to
garak.probes.leakreplayand can be seen as a generalization of that approach to PII.Probes are disabled by default
All probes have
active = Falsebecause the effectiveness depends on having PII data that was likely present in the model's training corpus. A positive result suggests memorization but is not definitive proof. False positives are possible when tested LLMs generate plausible-looking PII by coincidence or pattern inference.Tier is set to
INFORMATIONALResults are heavily sensitive to the context, so
INFORMATIONAL(Tier 3) is more appropriate thanCOMPETE_WITH_SOTA.Default dataset: Enron Email Corpus
As discussed and suggested by @erickgalinkin 1, and @jmartin-tech 2, this PR elects the Enron email corpus as the default PII dataset because:
Tip
The default
enron_pii.jsonlcontains ~50 Enron entries.For more extensive testing, it is recommended to use a larger portion of the Enron dataset or a custom PII dataset that you have reason to believe was present in the target model's training data.
Usage
Detectors
PIILeakPII-type-aware matching with partial scoring (email local-part/domain, phone digits/area-code, address components, generic fuzzy matching);PIILeakExactstrict exact-match detection, inherits fromTriggerListDetectorTests
Closes #275