Add dimensional explainability to KNN detector#652
Open · Powerscore wants to merge 1 commit into yzhao062:master
Summary
This PR adds dimensional explainability to the KNN detector, providing both visualization and programmatic access to per-sample, per-dimension outlier contributions. The implementation includes an `explain_outlier()` method for visualization and a `get_outlier_explainability_scores()` method for programmatic access to the dimensional scores.

Motivation
KNN is one of the most widely used outlier detection algorithms in PyOD, but it has lacked interpretability features. While the algorithm can identify outliers, it does not explain why a sample is anomalous. This PR addresses that gap.
Changes Made
Core Implementation (`pyod/models/knn.py`):

- Store training data (line ~194): `self.X_train_ = X` for explainability
- Helper method: `_compute_dimensional_scores()` (lines ~283-321), with a `columns` parameter
- Main method: `explain_outlier()` (lines ~323-475)
- Score access method: `get_outlier_explainability_scores()` (lines ~277-282), with a `columns` parameter
- Added import (line 10): `import matplotlib.pyplot as plt`

Example (`examples/knn_interpretability.py`):

- A clean, simple example following COPOD's interpretability example pattern
- Uses `cardio.mat` (21 features), demonstrating the value for high-dimensional data

API Design
The API mirrors COPOD's `explain_outlier()` for consistency:

- `explain_outlier()` matches COPOD's `explain_outlier()` signature (`ind, columns, cutoffs, feature_names, file_name, file_type`) ✓
- `get_outlier_explainability_scores()` additionally exposes the raw dimensional scores ✓
- Visualization code is marked `# pragma: no cover`, as in COPOD ✓

Usage Example:
Visualization:
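A minimal sketch of the intended call, assuming the `explain_outlier()` signature described above (the synthetic data and feature names here are illustrative, not part of the PR):

```python
# Sketch only: assumes the explain_outlier() API described in this PR.
from pyod.models.knn import KNN
from pyod.utils.data import generate_data

# Illustrative synthetic data; any (n_samples, n_features) array works.
X_train, _ = generate_data(n_train=200, train_only=True, n_features=5)

clf = KNN(n_neighbors=5)
clf.fit(X_train)

# Plot per-dimension outlier contributions for the first training sample.
clf.explain_outlier(ind=0, feature_names=[f"f{i}" for i in range(5)])
```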
Programmatic access to scores:
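Continuing the sketch above, the raw per-dimension scores can be retrieved with the new accessor described in this PR (again a hedged sketch, not the exact shipped code):

```python
# Sketch only: get_outlier_explainability_scores() is the accessor this PR adds.
import numpy as np

scores = clf.get_outlier_explainability_scores(ind=0)

# One score per feature dimension; higher means more outlier evidence.
print(scores)
print("most anomalous dimension:", int(np.argmax(scores)))
```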
Technical Details
Algorithm: for each sample, a k-NN distance is computed separately for every feature dimension, so each dimension receives its own outlier contribution (following Krenmayr & Goldstein, 2023; see Research Foundation below).
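For intuition, here is a standalone sketch of the per-dimension scoring idea. This illustrates the feature-ranking approach from the cited paper rather than the exact PyOD code; the function name and the use of the mean k-NN distance are assumptions:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dimensional_knn_scores(X_train, x, n_neighbors=5):
    """Illustrative per-dimension k-NN scores for one sample x (sketch).

    For each feature j, a 1-D k-NN search is run on that column alone,
    and the mean distance from x[j] to its k nearest training values is
    taken as that dimension's outlier contribution.
    """
    n_features = X_train.shape[1]
    scores = np.empty(n_features)
    for j in range(n_features):
        nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X_train[:, [j]])
        dist, _ = nn.kneighbors(np.array([[x[j]]]))
        scores[j] = dist.mean()  # mean k-NN distance in dimension j
    return scores
```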
Complexity:

- Dimensional scores are cached (`self._cached_dimensional_scores`), making interactive exploration nearly instant after the initial computation.
- Storing the training data (`self.X_train_`) enables explainability but increases the memory footprint by O(N×D). This aligns with COPOD's design and allows lazy feature-wise distance computation.

Design Decisions:

- Store `X_train_`, following COPOD's pattern
- Use `# pragma: no cover` to exclude visualization from test coverage

Testing
Following PyOD conventions (see COPOD's `test_copod.py`, lines 147-149), visualization methods use `# pragma: no cover` and are demonstrated via examples rather than unit tests.

Manual Validation:
Extensively tested with 2D synthetic data and real-world datasets (see Screenshots/Examples below).
Test Results:
- All existing tests pass (`pytest pyod/test/test_knn.py -v`)
- The example runs end to end (`python examples/knn_interpretability.py`)

Backwards Compatibility
No breaking changes to the existing API:

- New attributes (e.g. `X_train_`) are only created when needed

Use Cases
This feature enables users to understand not just that a sample is anomalous, but which dimensions make it so.
Research Foundation
This implementation is based on the method described in:
Krenmayr, Lucas and Goldstein, Markus (2023). "Explainable Outlier Detection Using Feature Ranking for k-Nearest Neighbors, Gaussian Mixture Model and Autoencoders." In 15th International Conference on Agents and Artificial Intelligence (ICAART), pp. 245-253. https://doi.org/10.5220/0011631900003411
BibTeX:
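The following entry is reconstructed from the citation above:

```bibtex
% Entry reconstructed from the citation given in this PR description.
@inproceedings{krenmayr2023explainable,
  author    = {Krenmayr, Lucas and Goldstein, Markus},
  title     = {Explainable Outlier Detection Using Feature Ranking for k-Nearest Neighbors, Gaussian Mixture Model and Autoencoders},
  booktitle = {Proceedings of the 15th International Conference on Agents and Artificial Intelligence (ICAART)},
  year      = {2023},
  pages     = {245--253},
  doi       = {10.5220/0011631900003411}
}
```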
This PR implements dimensional feature-ranking for KNN outlier interpretation per the method described in the paper above, and extends PyOD with both visualization and a returned explainability score vector (per-dimension evidence), addressing a gap noted in COPOD's implementation.
Screenshots/Examples
2D Validation Examples
These examples demonstrate the correctness of the dimensional explainability approach on 2D data where the results can be visually verified.
Figure 7.3: 2D k-NN Inlier

Demonstrates how an inlier point has low k-NN scores for both dimensions. The overall k-NN score is low, and both individual dimensions show low anomaly scores, correctly identifying this as a normal sample.
Figure 7.4: 2D k-NN X-Dimension Outlier

Demonstrates how a point outlying only in the X-dimension has a high k-NN score in the X-dimension and a low score in the Y-dimension. This shows the method's ability to isolate anomalies to specific dimensions, which the overall k-NN score alone cannot indicate.
Figure 7.5: 2D k-NN Y-Dimension Outlier

Demonstrates how a point outlying only in the Y-dimension has a high k-NN score in the Y-dimension and a low score in the X-dimension. Note that the overall k-NN score flags the point as an outlier but cannot by itself indicate in which dimension, as the previous figure also showed.
Figure 7.6: 2D k-NN Outlier (Both Dimensions)

Demonstrates how an outlier point has high k-NN scores for both dimensions. This shows the method correctly identifies multi-dimensional anomalies.
Real-World Dataset Examples
However, most real datasets have more than two features. We therefore demonstrate this explainability technique on the real-world Pima Indians Diabetes Dataset (Smith et al., 1988) after Min-Max scaling, which makes the per-dimension and overall k-NN score values more compact.
Figure 7.7: Pima k-NN Outlier 1

Demonstrates how the most outlying point is an anomaly mainly because of the Diabetes Pedigree Function and Insulin features.
Figure 7.8: Pima k-NN Outlier 2

Demonstrates how the second most outlying point is an anomaly mainly because of the Age and Skin Thickness features, a very different reason from the previous outlier. This shows how different outliers can have different dimensional contributions.
Figure 7.9: Pima k-NN Inlier

Demonstrates how a point that is an inlier overall also has low k-NN scores in all individual dimensions, confirming the method's consistency.
Related Work
This PR follows COPOD's API pattern for dimensional interpretability and is directly inspired by:

- COPOD's `explain_outlier()`: establishes API consistency for explainability in PyOD; this PR also completes COPOD's TODO of returning programmatic scores.

Impact
Benefits: KNN users gain per-dimension explanations, via both plots and raw scores, for why a sample is flagged.

Backward Compatibility: fully preserved; no breaking changes (see Backwards Compatibility above).
Checklist
All Submissions Basics:
All Submissions Cores:
- [x] Added a unit test for the `get_outlier_explainability_scores()` method in `test_knn.py` (tests the numerical logic)
- [x] Visualization methods use `# pragma: no cover` and are demonstrated via examples (see `test_copod.py` lines 147-149)
- [x] All existing tests pass (`pytest pyod/test/test_knn.py -v`)
- [x] The example runs successfully (`python examples/knn_interpretability.py`)
- [x] Visualization is excluded from coverage with `# pragma: no cover` (following the COPOD pattern)

Files Changed
- `pyod/models/knn.py` - added explainability methods
- `examples/knn_interpretability.py` - new example file
- `pyod/test/test_knn.py` - added unit test for the `get_outlier_explainability_scores()` method