A comprehensive machine learning pipeline for predicting Non-Alcoholic Fatty Liver Disease (NAFLD) using multiple classification algorithms. This project evaluates 27 different machine learning models.
This project implements a complete ML pipeline that:
- Preprocesses clinical data with proper handling of missing values
- Trains and evaluates 27 different classification models
- Performs stratified cross-validation for robust performance estimation
- Generates detailed performance reports in multiple formats (CSV, Excel, Markdown, LaTeX)
The pipeline includes the following model families:
- Decision Trees: Fine Tree, Medium Tree, Coarse Tree
- Linear Discriminant Analysis
- Logistic Regression
- Naive Bayes: Gaussian Naive Bayes and Kernel Naive Bayes (custom implementation)
- Support Vector Machines:
  - Linear, Quadratic, Cubic kernels
  - Gaussian kernels (Fine, Medium, Coarse)
- K-Nearest Neighbors (KNN):
  - Fine, Medium, Coarse variants
  - Cosine, Cubic, and Weighted distance metrics
- Ensemble Methods:
  - Bagged Trees
  - Boosted Trees (AdaBoost)
  - RUSBoosted Trees
  - Subspace Discriminant
  - Subspace KNN
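The exact hyperparameters are defined in `paper_table_pipeline.py`. As a rough illustration of how a few of these families map onto scikit-learn estimators, the snippet below is a hypothetical sketch; the settings (leaf counts, gamma values, neighbor counts) are assumptions, not the project's actual configuration.

```python
# Illustrative scikit-learn equivalents for a few of the model families above.
# Hyperparameters are assumptions, not the values used in paper_table_pipeline.py.
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

models = {
    # Decision trees: "fine" to "coarse" controlled via maximum leaf count
    "Fine Tree": DecisionTreeClassifier(max_leaf_nodes=100, random_state=0),
    "Medium Tree": DecisionTreeClassifier(max_leaf_nodes=20, random_state=0),
    "Coarse Tree": DecisionTreeClassifier(max_leaf_nodes=4, random_state=0),
    # SVMs: polynomial and RBF kernels of varying flexibility
    "Linear SVM": SVC(kernel="linear", probability=True, random_state=0),
    "Cubic SVM": SVC(kernel="poly", degree=3, probability=True, random_state=0),
    "Fine Gaussian SVM": SVC(kernel="rbf", gamma=4.0, probability=True, random_state=0),
    # KNN: neighborhood size, distance metric, and weighting scheme
    "Fine KNN": KNeighborsClassifier(n_neighbors=1),
    "Cosine KNN": KNeighborsClassifier(n_neighbors=10, metric="cosine"),
    "Weighted KNN": KNeighborsClassifier(n_neighbors=10, weights="distance"),
    # Ensembles
    "Bagged Trees": BaggingClassifier(DecisionTreeClassifier(random_state=0),
                                      n_estimators=30, random_state=0),
    "Boosted Trees": AdaBoostClassifier(n_estimators=30, random_state=0),
}
```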
.
├── paper_table_pipeline.py # Main training pipeline with all models
├── run_crossval.py # Cross-validation script
├── preprocess_data.py # Data preprocessing utilities
├── export_excel_report.py # Excel report generation with charts
├── feature_names.csv # Processed feature names
├── missing_counts.csv # Missing value analysis
├── y_train.csv / y_test.csv # Train/test target splits
├── model_performance_comparison_cv.md # Cross-validation results (Markdown)
├── model_performance_comparison_cv.tex # Cross-validation results (LaTeX)
├── model_performance_report.xlsx # Comprehensive Excel report
└── TEAM.md # Project team information
pip install numpy pandas scikit-learn openpyxl
pip install imbalanced-learn  # Optional, for RUSBoost
- Preprocess the data: `python preprocess_data.py`
- Train all models and generate the comparison: `python paper_table_pipeline.py`
- Run cross-validation: `python run_crossval.py`
- Generate the Excel report: `python export_excel_report.py`
The pipeline computes comprehensive metrics for each model:
- Accuracy (%): Overall classification accuracy
- AUC: Area Under the ROC Curve
- Precision (PPV %): Positive Predictive Value
- NPV (%): Negative Predictive Value
- Recall/Sensitivity (%): True Positive Rate
- Specificity (%): True Negative Rate
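A minimal sketch of how these six metrics can be computed for a fitted binary classifier is shown below; the `evaluate` helper and its signature are illustrative, not the pipeline's actual code.

```python
# Sketch: computing the reported metrics from the predictions of a binary classifier.
# Assumes y_test holds 0/1 labels and the model exposes predict_proba.
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(model, X_test, y_test):
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    return {
        "Accuracy (%)":      100 * (tp + tn) / (tp + tn + fp + fn),
        "AUC":               roc_auc_score(y_test, y_score),
        "Precision/PPV (%)": 100 * tp / (tp + fp) if (tp + fp) else 0.0,
        "NPV (%)":           100 * tn / (tn + fn) if (tn + fn) else 0.0,
        "Sensitivity (%)":   100 * tp / (tp + fn) if (tp + fn) else 0.0,
        "Specificity (%)":   100 * tn / (tn + fp) if (tn + fp) else 0.0,
    }
```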
- Automatic handling of missing values (median for numeric, mode for categorical)
- Standard scaling for numeric features
- One-hot encoding for categorical features
- Stratified train-test split (80/20)
- 5-fold stratified cross-validation
- Mean and standard deviation metrics across folds
- Robust performance estimation
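The sketch below illustrates the preprocessing and stratified evaluation steps described above, under assumed column roles and file paths; the real logic lives in `preprocess_data.py` and `run_crossval.py`, and `LogisticRegression` stands in for any of the 27 models.

```python
# Sketch of the preprocessing + stratified evaluation flow (assumed columns and paths).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("cirrhosis.csv")                 # path is an assumption
y = (df["Status"] == "D").astype(int)             # positive class: died
X = df.drop(columns=["Status"])

num_cols = X.select_dtypes(include="number").columns
cat_cols = X.columns.difference(num_cols)

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

# Stratified 80/20 split for the single train/test run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 5-fold stratified cross-validation, reporting mean and std per metric
clf = Pipeline([("prep", preprocess), ("model", LogisticRegression(max_iter=1000))])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(clf, X, y, cv=cv, scoring=["accuracy", "roc_auc"])
print(f"Accuracy: {scores['test_accuracy'].mean():.3f} ± {scores['test_accuracy'].std():.3f}")
print(f"AUC:      {scores['test_roc_auc'].mean():.3f} ± {scores['test_roc_auc'].std():.3f}")
```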
- CSV export for programmatic analysis
- Excel reports with conditional formatting and charts
- Markdown and LaTeX tables for publications
- Color-coded performance visualization
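As a minimal sketch of how a results table can be written to these formats (the DataFrame contents below are dummy values, and the actual Excel styling and charts are handled in `export_excel_report.py`):

```python
# Sketch: exporting a results DataFrame to the report formats listed above.
# to_markdown requires the `tabulate` package; to_excel requires `openpyxl`.
import pandas as pd

results = pd.DataFrame(
    {"Accuracy (%)": [84.2, 81.7], "AUC": [0.89, 0.86]},   # dummy values for illustration only
    index=["Bagged Trees", "Logistic Regression"],
)

results.to_csv("model_performance_comparison.csv")
results.to_excel("model_performance_report.xlsx", sheet_name="Results")
with open("model_performance_comparison_cv.md", "w") as f:
    f.write(results.to_markdown())
with open("model_performance_comparison_cv.tex", "w") as f:
    f.write(results.to_latex(float_format="%.2f"))
```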
- model_performance_comparison.csv: Single train/test run results
- model_performance_comparison_cv.csv: Cross-validation results with mean±std
- model_performance_report.xlsx: Formatted Excel report with charts
- X_train_processed.npy / X_test_processed.npy: Preprocessed feature matrices
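Assuming the target CSVs hold a single label column, the saved artifacts can be reloaded for further analysis along these lines:

```python
# Sketch: reloading the preprocessed matrices and target splits.
import numpy as np
import pandas as pd

X_train = np.load("X_train_processed.npy")
X_test = np.load("X_test_processed.npy")
y_train = pd.read_csv("y_train.csv").squeeze("columns")   # assumes one label column
y_test = pd.read_csv("y_test.csv").squeeze("columns")
```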
See TEAM.md for the list of contributors.
- The target variable is binary: `Status == 'D'` (died) is the positive class
- Missing values are handled during preprocessing
- The random state is set to 0 for reproducibility
- The dataset path must be configured in each script (currently set to `/Users/dikshadamahe/Downloads/cirrhosis.csv`)
Update the dataset path in the following files:
- `paper_table_pipeline.py` (line 203)
- `run_crossval.py` (line 118)
- `preprocess_data.py` (line 17)

In each file, point the source path at your local copy, for example: `src = Path('/path/to/your/cirrhosis.csv')`

The project includes a custom Kernel Naive Bayes classifier that uses kernel density estimation for continuous features, providing an alternative to the Gaussian assumption of standard Naive Bayes.
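The project's implementation lives in `paper_table_pipeline.py`; the class below is only a minimal sketch of the general idea (one univariate kernel density estimate per class and feature in place of the Gaussian likelihood), with a hypothetical class name and a fixed bandwidth.

```python
# Minimal sketch of a KDE-based Naive Bayes classifier (illustrative only; the
# project's actual implementation may differ in bandwidth handling and API).
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neighbors import KernelDensity

class KernelNaiveBayes(BaseEstimator, ClassifierMixin):
    def __init__(self, bandwidth=0.5):
        self.bandwidth = bandwidth

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])
        # One univariate KDE per (class, feature) pair
        self.kdes_ = [
            [KernelDensity(bandwidth=self.bandwidth).fit(X[y == c][:, [j]])
             for j in range(X.shape[1])]
            for c in self.classes_
        ]
        return self

    def predict_proba(self, X):
        X = np.asarray(X, dtype=float)
        log_probs = np.zeros((X.shape[0], len(self.classes_)))
        for i, _ in enumerate(self.classes_):
            # Naive Bayes: log prior plus sum of per-feature log-densities
            log_probs[:, i] = np.log(self.priors_[i]) + sum(
                self.kdes_[i][j].score_samples(X[:, [j]]) for j in range(X.shape[1])
            )
        # Normalize in log space for numerical stability
        log_probs -= log_probs.max(axis=1, keepdims=True)
        probs = np.exp(log_probs)
        return probs / probs.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
```

Because the log-densities are summed across features, this keeps the naive conditional-independence assumption and only relaxes the per-feature Gaussian form.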
This project is part of academic research on NAFLD detection.