Codebase for predicting diabetes from laboratory results and lifestyle factors.
```
diabetes-prediction/
├── data/
│   ├── raw/          # Original, immutable data
│   ├── processed/    # Cleaned and preprocessed data
│   └── external/     # External data sources
├── notebooks/        # Jupyter notebooks for exploration and analysis
├── src/              # Source code for the project
│   ├── data/         # Data processing scripts
│   ├── features/     # Feature engineering scripts
│   └── models/       # Model training and prediction scripts
├── models/           # Trained model files
├── reports/          # Generated analysis and reports
│   └── figures/      # Figures and visualizations
├── tests/            # Unit and integration tests
├── config/           # Configuration files
├── requirements.txt  # Python dependencies
├── setup.py          # Package installation configuration
└── README.md         # This file
```
Ensure you have the following installed:
- Python 3.13+
- Poetry
- Git
Clone the repository:
```bash
git clone https://github.com/Diabetic-Research-Group/diabetes-prediction.git
cd diabetes-prediction
```
Install dependencies using Poetry:
```bash
poetry install
```
- Place raw data in `data/raw/`
- Use notebooks in `notebooks/` for exploratory analysis
- Develop reusable code in `src/`
- Save trained models to `models/`
- Generate reports and figures in `reports/`
Run the data preparation script to create balanced training and test splits:
```bash
poetry run python -m src.data.prepare_diabetes_data
```
This will:
- Load the raw dataset from Hugging Face (or use local parquet if available)
- Create a balanced training set with ~4,320 samples per class (Not diabetic, T2D, Other)
- Create a held-out test set preserving the original class distribution
- Save datasets as parquet files to `data/diabetes_train.parquet` and `data/diabetes_test.parquet`
- Log class distributions and data info
Note: Update the Hugging Face URLs in src/data/prepare_diabetes_data.py after uploading datasets.
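The balancing logic above can be sketched roughly as follows. This is illustrative only; the authoritative implementation lives in `src/data/prepare_diabetes_data.py`, and the `label` column name and helper name here are assumptions:

```python
import pandas as pd

def balanced_train_test_split(df: pd.DataFrame, label_col: str = "label",
                              per_class: int = 4320, test_frac: float = 0.2,
                              seed: int = 42):
    """Hold out a stratified test set, then build a class-balanced training set."""
    # Stratified holdout: sample test_frac of each class so the test set
    # preserves the original class distribution.
    test = df.groupby(label_col, group_keys=False).sample(
        frac=test_frac, random_state=seed)
    remainder = df.drop(test.index)
    # Downsample every remaining class to per_class rows
    # (assumes each class still has at least per_class rows).
    train = remainder.groupby(label_col, group_keys=False).sample(
        n=per_class, random_state=seed)
    return train.reset_index(drop=True), test.reset_index(drop=True)
```

The test split is drawn first so the balanced training set never leaks rows into the held-out evaluation data.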
Execute the Optuna-based tuning script to optimize LightGBM hyperparameters:
```bash
poetry run python -m src.models.optuna_multiclass_tuning
```
This will:
- Use 10-fold stratified cross-validation on the balanced training set to evaluate each trial
- Run 20 Optuna trials, tuning 7 key hyperparameters:
  - `n_estimators` (200–800)
  - `learning_rate` (0.01–0.2, log scale)
  - `max_depth` (3–12)
  - `num_leaves` (16–128)
  - `min_child_samples` (5–50)
  - `reg_alpha` (L1, 1e-8–1.0, log scale)
  - `reg_lambda` (L2, 1e-8–1.0, log scale)
- Log CV metrics per trial to MLflow (under nested runs)
- Select the best model based on macro ROC-AUC from CV
- Train the best model on the full training set
- Evaluate on the held-out test set and log test metrics to MLflow (parent run)
- Log confusion matrices (PNG + JSON), classification reports, datasets, and the trained model
Metrics Logged:
- Per-trial CV metrics: `cv_accuracy`, `cv_roc_auc_macro`, `cv_pr_auc_macro`, `cv_recall_macro`, `cv_specificity_macro`, per-class precision/recall/F1
- Final test metrics: `test_*` variants (use `test_roc_auc_weighted` and `test_pr_auc_weighted` for imbalanced evaluation)
- Confusion matrices and classification reports as artifacts
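`cv_specificity_macro` is not a stock scikit-learn scorer; one way to derive it from the multiclass confusion matrix looks like this (an illustrative helper, not necessarily the project's exact implementation):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def specificity_macro(y_true, y_pred, labels=None):
    """Mean one-vs-rest specificity, TN / (TN + FP), over all classes."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    total = cm.sum()
    specs = []
    for k in range(cm.shape[0]):
        tp = cm[k, k]
        fp = cm[:, k].sum() - tp  # predicted class k, truly another class
        fn = cm[k, :].sum() - tp  # truly class k, predicted another class
        tn = total - tp - fp - fn
        specs.append(tn / (tn + fp))
    return float(np.mean(specs))
```

For a perfect prediction every class's specificity is 1.0, so the macro average is 1.0 as well.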
Launch the MLflow UI to visualize tuning results and metrics:
```bash
poetry run mlflow ui
```
Then open http://localhost:5000 in your browser.
Navigation:
- Experiment: "diabetes-optuna-multiclass"
- Parent Run: "optuna_tuning" – contains:
  - Dataset info (row counts, class distributions)
  - Best hyperparameters (`best_*` params)
  - Final test metrics (`test_*`, use weighted variants for imbalanced evaluation)
  - Confusion matrix visualization (`test_confusion.png`)
  - Classification report (`test_classification_report.txt`)
  - Datasets folder with train/test parquet files
  - Trained model (logged via MLflow)
- Nested Runs: each trial (trial_0, trial_1, ..., trial_19) contains:
  - 10-fold CV metrics (`cv_*`)
  - CV confusion matrix (`cv_cm_trial_*.json`)
  - Non-numeric value warnings, if any
The best model is automatically trained and logged to MLflow. Load it in Python:
```python
import mlflow.lightgbm
import pandas as pd

# Set tracking URI to match your MLflow setup
mlflow.set_tracking_uri("file:./mlruns")

# Find the experiment and run
experiment = mlflow.get_experiment_by_name("diabetes-optuna-multiclass")
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id],
                          filter_string="tags.mlflow.runName = 'optuna_tuning'")
best_run_id = runs.iloc[0].run_id  # Most recent optuna_tuning parent run

# Load the trained model
model = mlflow.lightgbm.load_model(f"runs:/{best_run_id}/model")

# Make predictions on new data
# (Assuming X is a DataFrame with the same feature columns used in training)
y_pred = model.predict(X)
y_proba = model.predict_proba(X)
```
Or, load from the latest run directly:
```python
import mlflow.lightgbm

mlflow.set_tracking_uri("file:./mlruns")
latest_run = mlflow.search_runs(experiment_names=["diabetes-optuna-multiclass"],
                                max_results=1).iloc[0]
model = mlflow.lightgbm.load_model(f"runs:/{latest_run.run_id}/model")
```
Run the test suite:
```bash
pytest tests/
```
Format code with Black:
```bash
black src/ tests/
```
Check code style:
```bash
flake8 src/ tests/
```
See LICENSE file for details.