This analysis aims to develop a machine learning model to predict whether a loan is healthy (Class 0) or risky (Class 1). By identifying high-risk loans, the bank can mitigate credit risk and improve loan approval decisions.
The dataset includes financial variables such as loan size, interest rate, borrower income, debt-to-income ratio, number of accounts, derogatory marks, and total debt. The target variable (loan_status) indicates loan health:
- Class 0 (Healthy Loans): 18,759 samples
- Class 1 (Risky Loans): 625 samples
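Key Observation: The dataset is highly imbalanced, with risky loans making up only about 3% of the samples (625 of 19,384). A low share of risky loans is desirable for a bank, but it leaves the model few risky examples to learn from and can hinder its ability to detect risky loans accurately.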
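To put the imbalance in numbers, the short calculation below uses only the class counts listed above. It also shows the accuracy a trivial model would reach by labeling every loan as healthy, which is a useful baseline when reading accuracy figures on data like this. The variable names are illustrative and not taken from the notebook.

```python
# Class counts as reported in this README.
healthy = 18_759
risky = 625
total = healthy + risky

# Share of risky loans among the listed samples.
risky_share = risky / total

# Accuracy of a trivial model that predicts "healthy" (Class 0) for every loan.
majority_baseline_accuracy = healthy / total

print(f"Total samples: {total}")
print(f"Risky share: {risky_share:.2%}")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.2%}")
```

Because the baseline already sits near 97%, per-class precision and recall say more about model quality here than overall accuracy does.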
- The graph below shows that loan amounts are approximately normally distributed, with most requests centered around $10,000 and fewer at the extremes.
Data Preprocessing:
- Features (X) and labels (y) were separated.
- Data was split into training and testing sets, stratified on loan_status so that both sets keep the same ratio of healthy to risky loans.
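The preprocessing steps above can be sketched as follows. This is a minimal example, not the notebook's code: the DataFrame is a small synthetic stand-in (only the `loan_status` column name comes from this README), and the default 75/25 split and `random_state=1` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the loan data: two features plus an imbalanced label.
rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "loan_size": rng.normal(10_000, 2_000, n),
    "interest_rate": rng.normal(7.5, 1.0, n),
    "loan_status": (rng.random(n) < 0.03).astype(int),  # roughly 3% risky
})

# Separate the labels (y) from the features (X).
y = df["loan_status"]
X = df.drop(columns="loan_status")

# stratify=y keeps the healthy/risky ratio nearly identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1
)

print("Risky share in train:", round(y_train.mean(), 4))
print("Risky share in test: ", round(y_test.mean(), 4))
```

Without `stratify`, a random split of data this imbalanced could leave the test set with noticeably more or fewer risky loans than the training set.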
Model Training:
- A Logistic Regression model was implemented using the `lbfgs` solver.
- The model was trained on the stratified training dataset to ensure consistent results across imbalanced classes.
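A minimal sketch of the training step, assuming scikit-learn's `LogisticRegression`. The data here is generated with `make_classification` as a stand-in for the loan dataset (seven features, about 3% positives), and `max_iter=500` and `random_state=1` are assumptions rather than the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 7 features, with Class 1 as a small minority.
X, y = make_classification(
    n_samples=5000,
    n_features=7,
    n_informative=4,
    weights=[0.97, 0.03],
    random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1
)

# Fit logistic regression with the lbfgs solver on the training split only.
model = LogisticRegression(solver="lbfgs", max_iter=500)
model.fit(X_train, y_train)

# Predict labels for the held-out test split.
predictions = model.predict(X_test)
print("Predicted risky loans in test set:", int(predictions.sum()))
```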
Evaluation:
- Performance was assessed using a confusion matrix and classification report, providing accuracy, precision, recall, and F1-score metrics.
- Overall Accuracy: 99%
- Precision:
- Class 0: 1.00 (Perfect precision)
- Class 1: 0.87
- Recall:
- Class 0: 1.00 (Perfect recall)
- Class 1: 0.89
- F1-Score:
- Class 0: 1.00
- Class 1: 0.88
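- The model excels at predicting healthy loans, with precision and recall that both round to 1.00.
- Performance for risky loans is strong but lower: a precision of 0.87 means about 13% of loans flagged as risky are actually healthy, and a recall of 0.89 means about 11% of risky loans are missed. The class imbalance likely contributes to this gap.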
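The evaluation described above can be reproduced in outline with scikit-learn's `confusion_matrix` and `classification_report`. The labels below are a tiny hand-made example, not the project's predictions. The last lines recompute the Class 1 F1-score from the reported precision and recall as a consistency check.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy labels for illustration only: 7 healthy (0) and 3 risky (1) loans.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class precision, recall, and F1-score, plus overall accuracy.
print(classification_report(y_true, y_pred, target_names=["healthy (0)", "risky (1)"]))

# F1 is the harmonic mean of precision and recall.
reported_precision = 0.87
reported_recall = 0.89
f1_check = 2 * reported_precision * reported_recall / (reported_precision + reported_recall)
print(f"F1 from reported Class 1 precision and recall: {f1_check:.2f}")
```

The recomputed value rounds to 0.88, which agrees with the Class 1 F1-score listed above.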
To improve the model's ability to predict risky loans:
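- Oversample Risky Loans: Increase the number of risky loan samples in the training set to balance the classes, leaving the test set untouched so that evaluation still reflects the real class ratio.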
- Adjust Class Weights: Penalize misclassification of risky loans more heavily.
- Precision-Recall Tradeoff: Prioritize recall for Class 1 (risky loans), accepting some loss of precision if needed, as missing high-risk cases is costlier for the bank.
- Ongoing Monitoring: Regularly validate the model with real-world loan data to ensure its performance remains robust over time.
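The first three recommendations could be tried along the following lines. This is a sketch under assumptions, not a tested improvement to this project: the data is synthetic, random oversampling via `sklearn.utils.resample` is just one of several oversampling approaches, and the 0.3 threshold is an arbitrary illustrative value.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the loan data (about 3% risky).
X, y = make_classification(
    n_samples=5000, n_features=7, n_informative=4,
    weights=[0.97, 0.03], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1
)

# Option 1: class weights. "balanced" weights errors inversely to class frequency.
weighted = LogisticRegression(solver="lbfgs", class_weight="balanced", max_iter=500)
weighted.fit(X_train, y_train)

# Option 2: random oversampling of the minority class, training set only.
majority = X_train[y_train == 0]
minority = X_train[y_train == 1]
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=1
)
X_balanced = np.vstack([majority, minority_upsampled])
y_balanced = np.concatenate([
    np.zeros(len(majority), dtype=int),
    np.ones(len(minority_upsampled), dtype=int),
])
oversampled = LogisticRegression(solver="lbfgs", max_iter=500)
oversampled.fit(X_balanced, y_balanced)

# Option 3: lower the decision threshold to trade precision for recall.
baseline = LogisticRegression(solver="lbfgs", max_iter=500).fit(X_train, y_train)
risky_probability = baseline.predict_proba(X_test)[:, 1]
flagged_default = (risky_probability >= 0.5).astype(int)
flagged_low = (risky_probability >= 0.3).astype(int)

recall_default = recall_score(y_test, flagged_default)
recall_low = recall_score(y_test, flagged_low)
print(f"Class 1 recall at threshold 0.5: {recall_default:.2f}")
print(f"Class 1 recall at threshold 0.3: {recall_low:.2f}")
print(f"Class 1 recall with class weights: {recall_score(y_test, weighted.predict(X_test)):.2f}")
print(f"Class 1 recall with oversampling: {recall_score(y_test, oversampled.predict(X_test)):.2f}")
```

Lowering the threshold can only keep or raise recall, since every loan flagged at 0.5 is also flagged at 0.3. Whether the extra false alarms are acceptable depends on the bank's review costs.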
The Logistic Regression model performs exceptionally well for healthy loans and reasonably well for risky loans. With targeted improvements, this model can provide reliable predictions, enabling better credit risk management and decision-making.
Notebook: credit_risk_classification.ipynb
Documentation: README.md (includes analysis summary)
License: Contains the public license information.
This dataset is fictional and is used to showcase my machine learning skills in predicting healthy and risky loans, as well as applying data transformation techniques in a real-world context.


