Course Project: CSN-371 Artificial Intelligence
Instructor: Prof. Pradumn K. Pandey
A biometric authentication system that identifies users based on their unique typing patterns using keystroke dynamics. This implementation uses digraph timing features and evaluates performance using the Equal Error Rate (EER) metric, based on research by Iapa & Cretu (2021).
This project implements a keystroke dynamics authentication system that:
- Extracts timing features from keystroke data (digraphs)
- Uses a modified Manhattan distance metric for improved accuracy
- Evaluates authentication performance using leave-one-out methodology
- Compares standard and modified distance metrics
- Analyzes the effect of feature selection on authentication accuracy
- Digraph Feature Extraction: Analyzes timing patterns between consecutive keystrokes
  - DU1: First key down to first key up (dwell time)
  - DU2: Second key down to second key up (dwell time)
  - DUtotal: First key down to second key up (total time)
- Modified Manhattan Distance: Implements weighted distance metric with reduced weight for DUtotal features (default: 1/3)
- Multiple Normalization Techniques:
  - Decimal scaling for standard Manhattan distance
  - Min-max scaling for modified Manhattan distance
- Comprehensive Evaluation:
  - Leave-one-out cross-validation
  - FAR (False Accept Rate) and FRR (False Reject Rate) calculations
  - EER (Equal Error Rate) computation
  - Visualization of error rates vs. thresholds
keystroke_AI/
├── data/
│ ├── raw/ # Raw keystroke data files (user0001.txt - user0080.txt)
│ └── processed/ # Processed feature vectors
│ ├── all_features.csv
│ ├── decimal_vectors.csv
│ └── minmax_vectors.csv
├── src/
│ ├── data_processing.py # Data loading and digraph extraction
│ ├── feature_extraction.py # Feature vector creation
│ ├── metrics.py # Distance metric implementations
│ └── authentication.py # Authentication logic and evaluation
├── main.py # Main execution script
└── README.md # This file
Install the dependencies:
  pip install numpy pandas matplotlib
- Clone the repository:
  git clone https://github.com/Haaziq386/keystroke_AI.git
  cd keystroke_AI
- Ensure your data is in the correct format in data/raw/:
  - Files named user####.txt (e.g., user0001.txt)
  - Format: key_code event_type timestamp
    - key_code: ASCII code of the key
    - event_type: 0 for press, 1 for release
    - timestamp: Milliseconds since epoch
Run the main script to process data and evaluate authentication:
  python main.py
This will:
- Process raw keystroke data files
- Extract digraph features
- Create and normalize feature vectors
- Compare standard vs. modified Manhattan distance metrics
- Evaluate the effect of different numbers of digraphs
- Generate visualization plots
Each user file contains keystroke events in the format:
key_code event_type timestamp
16 0 434889 # Key 16 pressed at time 434889
86 0 435006 # Key 86 pressed at time 435006
86 1 435146 # Key 86 released at time 435146
16 1 435221 # Key 16 released at time 435221
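As an illustration of how such a file could be read, the sketch below pairs each press with the next release of the same key. It is a minimal example under stated assumptions, not the repository's `read_raw_file` code: the names `Keystroke` and `match_events` are hypothetical, and the choice to ignore repeated presses of a key that is already held down is an assumption.

```python
from typing import NamedTuple


class Keystroke(NamedTuple):
    key_code: int
    down: int  # press timestamp
    up: int    # release timestamp


def match_events(lines):
    """Pair each press (event_type 0) with the next release (event_type 1) of the same key."""
    pending = {}      # key_code -> timestamp of a press still waiting for its release
    keystrokes = []
    for line in lines:
        parts = line.split("#", 1)[0].split()  # drop any trailing comment
        try:
            key_code, event_type, timestamp = (int(p) for p in parts)
        except ValueError:
            continue  # skip blank lines, headers, and malformed rows
        if event_type == 0:
            # Assumption: keep the first press and ignore auto-repeat presses.
            pending.setdefault(key_code, timestamp)
        elif event_type == 1 and key_code in pending:
            keystrokes.append(Keystroke(key_code, pending.pop(key_code), timestamp))
    # Releases can arrive out of order (overlapping keys), so order by press time.
    keystrokes.sort(key=lambda k: k.down)
    return keystrokes


sample = [
    "key_code event_type timestamp",
    "16 0 434889  # Key 16 pressed at time 434889",
    "86 0 435006  # Key 86 pressed at time 435006",
    "86 1 435146  # Key 86 released at time 435146",
    "16 1 435221  # Key 16 released at time 435221",
]
keystrokes = match_events(sample)
```

In the sample, key 16 is held while key 86 is pressed and released, so the two keystrokes overlap; sorting by press time keeps them in typing order.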
Each segment is represented by timing features for the most common digraphs:
segment_id, user_id, [digraph]_DU1, [digraph]_DU2, [digraph]_DUtotal, ...
Data processing and feature extraction:
- Load raw keystroke data
- Match press and release events
- Segment into chunks of ~1000 keystrokes
- Identify the most common digraphs (default: 12)
- Calculate DU1, DU2, and DUtotal for each digraph
- Create feature vectors using median values per segment
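The feature steps above can be sketched for a single segment as follows. This is a hedged illustration rather than the project's `extract_digraph_features` or `create_feature_vectors` code: the helper names, the `(key_code, down, up)` tuple layout, and the `84_72_DU1` column naming are assumptions made for the example. Segmentation into chunks of about 1000 keystrokes is left out; the input is treated as one segment.

```python
from collections import Counter, defaultdict
from statistics import median


def digraph_samples(keystrokes):
    """Collect (DU1, DU2, DUtotal) for every pair of consecutive keystrokes."""
    samples = defaultdict(list)
    for (k1, d1, u1), (k2, d2, u2) in zip(keystrokes, keystrokes[1:]):
        # DU1: first key dwell, DU2: second key dwell, DUtotal: first down to second up
        samples[(k1, k2)].append((u1 - d1, u2 - d2, u2 - d1))
    return samples


def most_common_digraphs(samples, n=12):
    """Return the n digraphs with the most observations."""
    counts = Counter({dg: len(obs) for dg, obs in samples.items()})
    return [dg for dg, _ in counts.most_common(n)]


def segment_vector(samples, digraphs):
    """Summarise one segment with the median of each timing feature per digraph."""
    vector = {}
    for dg in digraphs:
        if dg not in samples:
            continue  # digraph never typed in this segment
        du1, du2, du_total = zip(*samples[dg])
        name = f"{dg[0]}_{dg[1]}"
        vector[f"{name}_DU1"] = median(du1)
        vector[f"{name}_DU2"] = median(du2)
        vector[f"{name}_DUtotal"] = median(du_total)
    return vector


# Hypothetical segment: keys 84 and 72 typed alternately, as (key_code, down, up).
keystrokes = [(84, 0, 90), (72, 120, 200), (84, 300, 380), (72, 400, 500)]
samples = digraph_samples(keystrokes)
top = most_common_digraphs(samples, n=1)
vector = segment_vector(samples, top)
```

Using the median rather than the mean keeps a single long pause inside a segment from dominating the feature value.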
Normalization:
- Decimal Scaling: For standard Manhattan distance
- Min-Max Scaling: For modified Manhattan distance
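For reference, both scalings can be written in a few lines of NumPy. The sketch below assumes that each feature column is scaled independently across all segments; whether the project scales per column or globally is not stated here, and the function names are hypothetical.

```python
import numpy as np


def decimal_scale(X):
    """Divide each column by the smallest power of ten that brings every |value| below 1."""
    X = np.asarray(X, dtype=float)
    max_abs = np.abs(X).max(axis=0)
    j = np.zeros_like(max_abs)
    # Raise the exponent column by column until the largest magnitude is below 1.
    while np.any(max_abs / 10.0 ** j >= 1):
        j[max_abs / 10.0 ** j >= 1] += 1
    return X / 10.0 ** j


def min_max_scale(X):
    """Map each column linearly onto [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # constant columns map to 0 instead of dividing by zero
    return (X - lo) / span


# Hypothetical timings in milliseconds: three segments, two features.
X = [[120, 80], [90, 100], [150, 60]]
decimal = decimal_scale(X)
minmax = min_max_scale(X)
```

Decimal scaling preserves the ratios between values in a column, while min-max scaling stretches every column to the full unit range.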
Authentication and evaluation:
- Leave-one-out cross-validation
- Distance calculation between feature vectors
- Threshold-based classification
- FAR/FRR/EER computation
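One way to read the leave-one-out step is sketched below: each segment vector is held out in turn as a probe and compared against a template for every user built from the remaining vectors. Using the mean vector as the template is an assumption, as are the function names; the project's `leave_one_out_evaluation` may differ.

```python
import numpy as np


def manhattan(v1, v2):
    """Plain Manhattan (L1) distance, used here only to make the sketch runnable."""
    return float(np.abs(np.asarray(v1, dtype=float) - np.asarray(v2, dtype=float)).sum())


def leave_one_out_scores(vectors, user_ids, distance):
    """Return (genuine, impostor) distance lists from a leave-one-out pass."""
    vectors = np.asarray(vectors, dtype=float)
    user_ids = np.asarray(user_ids)
    genuine, impostor = [], []
    for i, probe in enumerate(vectors):
        rest = np.arange(len(vectors)) != i          # everything except the probe
        for user in np.unique(user_ids):
            enrolled = vectors[rest & (user_ids == user)]
            if len(enrolled) == 0:
                continue                             # no template left for this user
            score = distance(probe, enrolled.mean(axis=0))  # assumption: mean template
            (genuine if user == user_ids[i] else impostor).append(score)
    return genuine, impostor


def accept(score, threshold):
    """Threshold-based decision: a small distance means the probe matches the template."""
    return score <= threshold


# Hypothetical normalized vectors for two users with two segments each.
vectors = [[0.1, 0.1], [0.2, 0.2], [0.8, 0.8], [0.9, 0.9]]
user_ids = ["a", "a", "b", "b"]
genuine, impostor = leave_one_out_scores(vectors, user_ids, manhattan)
```

The two score lists are what a threshold sweep then turns into FAR and FRR curves.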
Evaluation metrics:
- FAR (False Accept Rate): Percentage of impostor attempts incorrectly accepted
- FRR (False Reject Rate): Percentage of genuine attempts incorrectly rejected
- EER (Equal Error Rate): Point where FAR equals FRR (lower is better)
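These three quantities can be computed from lists of genuine and impostor distances as in the sketch below. It assumes an attempt is accepted when its distance is at or below the threshold, and it approximates the EER by the threshold where FAR and FRR are closest; the function names are hypothetical and the rates are returned as fractions (multiply by 100 for percentages).

```python
import numpy as np


def error_rates(genuine, impostor, thresholds):
    """Return FAR and FRR (as fractions) for each threshold."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    far = np.array([(impostor <= t).mean() for t in thresholds])  # impostors accepted
    frr = np.array([(genuine > t).mean() for t in thresholds])    # genuine users rejected
    return far, frr


def equal_error_rate(far, frr, thresholds):
    """Approximate the EER at the threshold where |FAR - FRR| is smallest."""
    i = int(np.argmin(np.abs(far - frr)))
    return (far[i] + frr[i]) / 2, thresholds[i]


# Hypothetical distance scores.
genuine = [0.10, 0.20, 0.30, 0.60]
impostor = [0.40, 0.70, 0.80, 0.90]
thresholds = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75]

far, frr = error_rates(genuine, impostor, thresholds)
eer, eer_threshold = equal_error_rate(far, frr, thresholds)
```

With a coarse threshold grid the two curves may never cross exactly, which is why the sketch averages FAR and FRR at the closest point instead of requiring equality.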
The system generates several plots:
- Standard_Manhattan_Distance.png: FAR/FRR curves for standard metric
- Modified_Manhattan_Distance.png: FAR/FRR curves for modified metric
- Manhattan_Distance_Comparison.png: Side-by-side comparison
- Digraph_Count_Effect.png: EER vs. number of digraphs
In main.py:
- segment_size: Number of keystrokes per segment (default: 1000)
- num_digraphs: Number of most common digraphs to use (default: 12)
In authentication.py:
- du_total_weight: Weight for DUtotal in modified distance (default: 1/3)
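To show what the weight does, here is a minimal sketch of a standard and a weighted Manhattan distance. It assumes the feature vector is laid out as DU1, DU2, DUtotal repeated for each digraph, so every third entry is a DUtotal value; the repository's `modified_manhattan_distance` may locate those columns differently.

```python
import numpy as np


def manhattan_distance(v1, v2):
    """Standard Manhattan distance: sum of absolute differences."""
    return float(np.abs(np.asarray(v1, dtype=float) - np.asarray(v2, dtype=float)).sum())


def modified_manhattan_distance(v1, v2, du_total_weight=1 / 3):
    """Manhattan distance with DUtotal differences scaled by du_total_weight."""
    diff = np.abs(np.asarray(v1, dtype=float) - np.asarray(v2, dtype=float))
    weights = np.ones_like(diff)
    weights[2::3] = du_total_weight  # assumption: layout is DU1, DU2, DUtotal per digraph
    return float((weights * diff).sum())


# Hypothetical normalized vectors covering two digraphs.
a = [0.2, 0.4, 0.9, 0.1, 0.3, 0.6]
b = [0.1, 0.5, 0.6, 0.2, 0.1, 0.3]

standard = manhattan_distance(a, b)           # about 1.1
modified = modified_manhattan_distance(a, b)  # about 0.7
```

Setting `du_total_weight=1` makes the two functions agree, which is a quick sanity check when tuning the weight.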
src/data_processing.py handles raw data loading and preprocessing:
- read_raw_file(user_id): Load data for a specific user
- process_all_users(): Process all users and extract digraphs
- extract_digraph_features(events): Calculate timing features
src/feature_extraction.py creates feature vectors:
- identify_common_digraphs(all_features): Find most frequent digraphs
- create_feature_vectors(all_features): Build feature vectors
src/metrics.py implements distance metrics:
- manhattan_distance(v1, v2): Standard Manhattan distance
- modified_manhattan_distance(v1, v2): Weighted Manhattan distance
src/authentication.py performs authentication and evaluation:
- leave_one_out_evaluation(feature_vectors): Cross-validation
- calculate_error_rates(evaluation_results, thresholds): Compute FAR/FRR/EER
The modified Manhattan distance metric with reduced DUtotal weight typically achieves:
- Lower EER compared to standard Manhattan distance
- Better discrimination between genuine and impostor attempts
- Optimal performance with ~12 most common digraphs
Contributions are welcome! Please feel free to submit a Pull Request.
This project is open source and available under the MIT License.
This implementation is based on research in keystroke dynamics authentication, particularly the use of digraph timing features and modified distance metrics for improved accuracy.
For questions or feedback, please open an issue on GitHub.