Add dimensional explainability to KNN detector#652

Open
Powerscore wants to merge 1 commit into yzhao062:master from Powerscore:feature/knn-explainability

Conversation

@Powerscore

Summary

This PR adds dimensional explainability to the KNN detector, providing both visualization and programmatic access to per-sample, per-dimension outlier contributions. The implementation includes an explain_outlier() method for visualization and a get_outlier_explainability_scores() method for programmatic access to dimensional scores.

Motivation

KNN is one of the most widely used outlier detection algorithms in PyOD, but it has lacked interpretability features. While the algorithm can identify outliers, it doesn't explain why a sample is anomalous. This PR addresses that gap by:

  1. Visualizing per-feature contributions via horizontal bar charts
  2. Providing statistical context through percentile cutoff bands
  3. Enabling programmatic access to dimensional scores (completing a TODO from COPOD's implementation)

Changes Made

Core Implementation (pyod/models/knn.py)

  1. Store Training Data (Line ~194)

    • Added self.X_train_ = X for explainability
    • Follows COPOD's pattern
    • Stores a reference to training data (O(N×D) memory) to enable distance-based dimensional scoring. While this increases memory usage, it aligns with COPOD's design for consistency and is necessary for lazy feature-wise distance computation.
  2. Helper Method: _compute_dimensional_scores() (Lines ~283-321)

    • Calculates average absolute distance to k-neighbors for each feature
    • Supports feature subset selection via columns parameter
    • Returns dimensional score vector
  3. Main Method: explain_outlier() (Lines ~323-475)

    • Horizontal bar chart visualization
    • Color-coded bars (Blue: normal, Orange: warning, Red: extreme)
    • Cutoff bands for statistical context
    • Flexible parameters (feature selection, custom cutoffs, file export)
    • Comprehensive docstring
  4. Score Access Method: get_outlier_explainability_scores() (Lines ~277-282)

    • Returns per-dimension explainability scores as a numpy array
    • Enables programmatic access to dimensional contributions
    • Completes the explainability interface (addresses COPOD's TODO)
    • Supports feature subset selection via columns parameter
  5. Added Import (Line 10)

    • import matplotlib.pyplot as plt

Example (examples/knn_interpretability.py)

Created a clean, simple example following COPOD's interpretability example pattern:

  • Uses cardio.mat (21 features) - demonstrates value for high-dimensional data
  • Shows basic usage with default parameters
  • Demonstrates custom cutoffs
  • ~68 lines (consistent with other PyOD examples)

API Design

The API mirrors COPOD's explain_outlier() for consistency:

Feature                      | COPOD                                                      | KNN (New)
Method name                  | explain_outlier()                                          | explain_outlier()
Parameters                   | ind, columns, cutoffs, feature_names, file_name, file_type | Same ✓
Visualization                | Scatter plot                                               | Horizontal bars
Returns programmatic scores  | TODO                                                       | get_outlier_explainability_scores()
Pragma                       | # pragma: no cover                                         | # pragma: no cover

Usage Example:

Visualization:

from pyod.models.knn import KNN
from pyod.utils.data import generate_data

# Fit KNN detector
X_train, _, _, _ = generate_data(n_train=200, n_features=5)
knn = KNN(n_neighbors=10, method='mean', contamination=0.1)
knn.fit(X_train)

# Visualize outlier explanation
knn.explain_outlier(
    ind=42,
    feature_names=['Age', 'Income', 'Credit', 'Debt', 'Savings'],
    cutoffs=[0.90, 0.99],
    file_name='outlier_42',
    file_type='png'
)

Programmatic access to scores:

# Get dimensional explainability scores as numpy array
scores = knn.get_outlier_explainability_scores(ind=42)
print(f"Per-dimension scores: {scores}")
# Can be used for further analysis, custom visualizations, or integration with other tools
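Because the scores come back as a plain numpy array, downstream analysis such as ranking features by their anomaly contribution is straightforward. A small sketch (the score values and feature names here are illustrative, not taken from a real run):

```python
import numpy as np

# Hypothetical per-dimension scores, as returned by
# get_outlier_explainability_scores() for one sample
scores = np.array([0.12, 0.95, 0.07, 0.44, 0.31])
feature_names = ['Age', 'Income', 'Credit', 'Debt', 'Savings']

# Rank features from most to least anomalous contribution
order = np.argsort(scores)[::-1]
for rank, i in enumerate(order, start=1):
    print(f"{rank}. {feature_names[i]}: {scores[i]:.2f}")
```

This kind of ranking is also what the referenced ICAART paper uses to present per-feature evidence.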

Technical Details

Algorithm:

For sample at index ind:
  1. Query k-nearest neighbors from training data
  2. For each dimension d:
     - Compute |X[neighbors, d] - X[ind, d]|
     - Average across k neighbors
     → dim_score[d]
  3. Compute cutoff bands (percentiles across all samples)
  4. Create horizontal bar chart with color coding
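Steps 1-2 above can be sketched as a standalone function with scikit-learn (an approximation for illustration; the PR's actual `_compute_dimensional_scores()` lives inside the KNN class and may differ in detail):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dimensional_scores(X_train, ind, k=5):
    """Average absolute per-feature distance from sample `ind`
    to its k nearest neighbors in the training data."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_train)
    # Query k+1 neighbors: the sample itself comes back first (distance 0)
    _, idx = nn.kneighbors(X_train[ind:ind + 1])
    neighbors = idx[0][1:]  # drop the sample itself
    # |X[neighbors, d] - X[ind, d]|, averaged over the k neighbors
    return np.abs(X_train[neighbors] - X_train[ind]).mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[7, 1] = 10.0  # make sample 7 an outlier in dimension 1 only
scores = dimensional_scores(X, ind=7, k=5)
print(scores)  # dimension 1 dominates
```

On this toy data the score vector isolates the anomaly to dimension 1, mirroring the 2D validation figures below.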

Complexity:

  • Space: O(N×D) for storing X_train_
  • Time:
    • First call (with cutoffs): O(N×k×D) to compute statistical bands across the full training set.
    • Subsequent calls: O(k×D) per explanation. Results are cached (self._cached_dimensional_scores), making interactive exploration nearly instant after the initial computation.
  • Memory trade-off: Storing training data (self.X_train_) enables explainability but increases memory footprint (O(N×D)). This aligns with COPOD's design and allows for lazy feature-wise distance computation.
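The caching behavior described above can follow a simple memoization pattern; this is a minimal sketch using the attribute name mentioned in the PR, with a stand-in for the actual per-sample computation:

```python
import numpy as np

class CachedScores:
    """Minimal memoization sketch for per-sample dimensional scores."""

    def __init__(self):
        self._cached_dimensional_scores = {}

    def _compute(self, ind):
        # Stand-in for the O(k*D) per-sample score computation
        return np.full(3, float(ind))

    def scores_for(self, ind):
        if ind not in self._cached_dimensional_scores:
            self._cached_dimensional_scores[ind] = self._compute(ind)
        return self._cached_dimensional_scores[ind]

c = CachedScores()
first = c.scores_for(5)
second = c.scores_for(5)  # served from the cache, no recomputation
print(first is second)    # True: the same cached array object
```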

Design Decisions:

  1. On-demand computation - Don't pre-compute/store all dimensional scores
    • Reason: Explainability is used sparingly, saves memory
  2. Store X_train_ - Following COPOD's pattern
    • Reason: Required for dimensional analysis, consistent with PyOD
  3. Horizontal bars - Instead of COPOD's scatter plot
    • Reason: More intuitive for distance-based outliers
  4. # pragma: no cover - Exclude visualization from test coverage
    • Reason: Consistent with COPOD's approach
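The cutoff-band and color-coding behavior described above (steps 3-4 of the algorithm) could look roughly like the following; the thresholds and colors match the description, while the helper name and data are hypothetical:

```python
import numpy as np

def color_for(score, warn_cut, extreme_cut):
    """Map a dimensional score to a bar color: blue = normal,
    orange = above the warning percentile, red = above the extreme one."""
    if score >= extreme_cut:
        return 'red'
    if score >= warn_cut:
        return 'orange'
    return 'blue'

# Hypothetical dimensional scores for all N samples (rows) x D features (cols)
rng = np.random.default_rng(1)
all_scores = rng.random((200, 4))

# Percentile cutoff bands across all samples, per feature (cutoffs=[0.90, 0.99])
warn_band = np.quantile(all_scores, 0.90, axis=0)
extreme_band = np.quantile(all_scores, 0.99, axis=0)

sample_scores = all_scores[42]
colors = [color_for(s, w, e)
          for s, w, e in zip(sample_scores, warn_band, extreme_band)]
print(colors)
```

In the actual plot these colors would be passed to matplotlib's horizontal bar chart, with the two bands drawn as reference lines.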

Testing

Following PyOD conventions (see COPOD's test_copod.py lines 147-149), visualization methods use # pragma: no cover and are demonstrated via examples rather than unit tests.

Manual Validation:
Extensively tested with:

  • Multiple datasets (generated data, cardio.mat, Pima Indians Diabetes Dataset)
  • Various parameters (cutoffs, columns, feature_names)
  • 2D visualizable data for correctness verification (see screenshots below)

Test Results:

  • All 38 existing KNN tests pass (pytest pyod/test/test_knn.py -v)
  • Example scripts run successfully (python examples/knn_interpretability.py)

Backwards Compatibility

No breaking changes to existing API:

  • New attributes (X_train_) only created when needed
  • Optional feature (doesn't affect core functionality)
  • All existing tests pass

Use Cases

This feature enables:

  1. Fraud Detection - Identify which transaction features are suspicious
  2. Network Security - Understand which traffic patterns trigger alerts
  3. Quality Control - Pinpoint which product measurements are defective
  4. Healthcare - Understand patient outlier profiles
  5. IoT Monitoring - Detect which sensor readings are anomalous

Research Foundation

This implementation is based on the method described in:

Krenmayr, Lucas and Goldstein, Markus (2023). "Explainable Outlier Detection Using Feature Ranking for k-Nearest Neighbors, Gaussian Mixture Model and Autoencoders." In 15th International Conference on Agents and Artificial Intelligence (ICAART), pp. 245-253. https://doi.org/10.5220/0011631900003411

BibTeX:

@inproceedings{Lucas2023xodknn,
  author    = {Krenmayr, Lucas and Goldstein, Markus},
  title     = {Explainable Outlier Detection Using Feature Ranking for k-Nearest Neighbors, Gaussian Mixture Model and Autoencoders},
  booktitle = {Proceedings of the 15th International Conference on Agents and Artificial Intelligence (ICAART)},
  year      = {2023},
  month     = {02},
  pages     = {245--253},
  doi       = {10.5220/0011631900003411}
}

This PR implements dimensional feature-ranking for KNN outlier interpretation per the method described in the paper above, and extends PyOD with both visualization and a returned explainability score vector (per-dimension evidence), addressing a gap noted in COPOD's implementation.


Screenshots/Examples

2D Validation Examples

These examples demonstrate the correctness of the dimensional explainability approach on 2D data where the results can be visually verified.

Figure 7.3: 2D k-NN Inlier
2D k-NN Inlier

Demonstrates how an inlier point has low k-NN scores for both dimensions. The overall k-NN score is low, and both individual dimensions show low anomaly scores, correctly identifying this as a normal sample.

Figure 7.4: 2D k-NN X-Dimension Outlier
2D k-NN X-Dimension Outlier

Demonstrates how a point outlying only in the X-dimension has a high k-NN score in the X-dimension and a low score in the Y-dimension. This shows the method's ability to isolate anomalies to specific dimensions, which the overall k-NN score alone cannot indicate.

Figure 7.5: 2D k-NN Y-Dimension Outlier
2D k-NN Y-Dimension Outlier

Demonstrates how a point outlying only in the Y-dimension has a high k-NN score in the Y-dimension and a low score in the X-dimension. As in the previous figure, the overall k-NN score flags the point as an outlier but cannot by itself indicate in which dimension the anomaly lies.

Figure 7.6: 2D k-NN Outlier (Both Dimensions)
2D k-NN Outlier Both Dimensions

Demonstrates how an outlier point has high k-NN scores for both dimensions. This shows the method correctly identifies multi-dimensional anomalies.

Real-World Dataset Examples

However, most real datasets have more than 2 features. We therefore demonstrate the technique on the real-world Pima Indians Diabetes Dataset (Smith et al., 1988) after Min-Max scaling, which compresses both the per-dimension and the overall k-NN score values.

Figure 7.7: Pima k-NN Outlier 1
Pima k-NN Outlier 1

Demonstrates how the most outlying point is an anomaly mainly because of the Diabetes Pedigree Function and Insulin features.

Figure 7.8: Pima k-NN Outlier 2
Pima k-NN Outlier 2

Demonstrates how the second most outlying point is an anomaly mainly because of the Age and the Skin Thickness, which is a very different reason from the previous outlier. This shows how different outliers can have different dimensional contributions.

Figure 7.9: Pima k-NN Inlier
Pima k-NN Inlier

Demonstrates how a point that is an inlier overall also has low k-NN scores in all individual dimensions, confirming the method's consistency.


Related Work

This PR enhances COPOD's API pattern for dimensional interpretability, and is directly inspired by:

  • Krenmayr & Goldstein, 2023, ICAART (see Research Foundation above): Paper describing feature ranking based explainability for KNN, GMM, and Autoencoders.
  • COPOD's explain_outlier() — API consistency for explainability in PyOD; this PR now completes the TODO of returning programmatic scores.
  • Modern explainability tools (SHAP, LIME, EBM) — Visualization style.
  • PyOD's emphasis on interpretable outlier detection — Library philosophy.

Impact

Benefits:

  • Adds interpretability to KNN outlier detection
  • Provides both visualization and programmatic access to scores
  • Completes explainability interface (addresses COPOD's TODO)
  • Maintains consistency with existing PyOD patterns

Backward Compatibility:

  • No API changes to existing functionality
  • No breaking changes
  • Optional feature (doesn't affect core functionality)
  • All existing tests pass

Checklist

All Submissions Basics:

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?
  • Have you checked all Issues to tie the PR to a specific one?

All Submissions Cores:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
    • Added unit test for get_outlier_explainability_scores() method in test_knn.py (tests numerical logic)
    • Following COPOD's pattern, visualization methods use # pragma: no cover and are demonstrated via examples (see test_copod.py lines 147-149)
  • Have you successfully ran tests with your changes locally?
    • All 38 KNN tests pass (pytest pyod/test/test_knn.py -v)
    • Example scripts run successfully (python examples/knn_interpretability.py)
  • Does your submission pass tests, including CircleCI, Travis CI, and AppVeyor?
  • Does your submission have appropriate code coverage? The cutoff threshold is 95% by Coveralls.
    • Core functionality coverage unchanged
    • Visualization excluded with # pragma: no cover (following COPOD pattern)
    • Overall coverage remains ≥95%

Files Changed

  • pyod/models/knn.py - Added explainability methods
  • examples/knn_interpretability.py - New example file
  • pyod/test/test_knn.py - Added unit test for get_outlier_explainability_scores() method
