This project demonstrates the complete implementation of Logistic Regression from scratch using gradient descent, with comprehensive comparisons to industry-standard libraries. The journey includes two datasets to showcase both real-world challenges and model capabilities.
- Introduction
- Dataset 1: E-Commerce Customer Retention
- Dataset 2: Breast Cancer Classification
- Key Learnings
- Technical Implementation
- Results Summary
- Conclusions
Objective: Build a logistic regression classifier from scratch to deeply understand:
- Sigmoid activation function
- Binary cross-entropy loss
- Gradient descent optimization
- Classification metrics
- Real-world data challenges
Tools Used:
- Python, NumPy, Pandas, Matplotlib, Seaborn
- Scikit-learn (for comparison)
- SMOTE (for class imbalance)
- PCA (for dimensionality reduction)
Predict whether a customer will return for future purchases based on their transaction and behavioral data.
- Size: 22,049 transactions
- Features: 33 (after encoding)
- Demographics: Age, Gender, City
- Behavioral: Session_Duration_Minutes, Pages_Viewed, engagement_score
- Transaction: Unit_Price, Quantity, Discount_Amount, Payment_Method
- Post-purchase: Delivery_Time_Days, Customer_Rating
- Target: Is_Returning_Customer (Binary: 0/1)
- Class Distribution: 81% returning, 19% not returning (IMBALANCED)
1. Distribution Analysis:
- Age: Normal distribution (18-75 years, mean=35)
- Unit_Price: EXTREME right skew (outliers up to $7,900 - luxury items)
- Discount_Amount: 50% orders have NO discount (spike at zero)
- Customer_Rating: Left skew (most ratings 4-5 stars)
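The right skew noted for Unit_Price is exactly what the later log-transform step addresses. A minimal sketch of quantifying skew before and after `np.log1p`, using synthetic log-normal prices as a stand-in for the real column:

```python
import numpy as np

def skewness(x):
    """Sample skewness: the third standardized moment."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

rng = np.random.default_rng(0)
unit_price = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)  # heavy right tail

raw_skew = skewness(unit_price)            # strongly positive (right-skewed)
log_skew = skewness(np.log1p(unit_price))  # much closer to symmetric
```

The log-transform pulls the extreme prices toward the bulk of the distribution, which is why it appears later in the preprocessing pipeline.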
2. Correlation Analysis:
- ALL features showed WEAK correlation with target (< 0.07)
- Strongest: Quantity (0.069) - still very weak!
- Multicollinearity detected: Unit_Price ↔ Total_Amount (r = 0.85) → dropped Total_Amount
3. Outlier Detection:
- Unit_Price: 570 Z-score outliers
- Discount_Amount: 409 Z-score outliers
- Domain decision: KEPT outliers (real business patterns - bulk orders, luxury items, sales)
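A sketch of the Z-score outlier check described above, on synthetic prices; the helper name `zscore_outliers` and the 3-sigma threshold are illustrative assumptions, not the project's exact code:

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Boolean mask of points more than `threshold` standard deviations from the mean."""
    z = (x - np.mean(x)) / np.std(x)
    return np.abs(z) > threshold

rng = np.random.default_rng(0)
# Typical prices around $50, plus a few luxury-item outliers like those kept in the dataset
unit_price = np.concatenate([rng.normal(50, 15, 1000), [5200.0, 6100.0, 7900.0]])

outlier_mask = zscore_outliers(unit_price)
n_outliers = int(outlier_mask.sum())  # flags the 3 extreme prices
```

Note that flagging is separate from dropping: here, as in the project, the flagged rows can be kept once domain review confirms they are real purchases.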
```python
# 1. Drop identifier and redundant columns
df = df.drop(columns=["Order_ID", "Customer_ID", "Total_Amount"])

# 2. Feature engineering
df['engagement_score'] = df['Session_Duration_Minutes'] * df['Pages_Viewed']
df['fast_delivery'] = (df['Delivery_Time_Days'] <= 3).astype(int)
df['day_of_week'] = pd.to_datetime(df['Date']).dt.dayofweek
df['month'] = pd.to_datetime(df['Date']).dt.month
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)
```

3. Encoding:
- Binary encoding: Gender (Male=1, Female=0)
- One-hot encoding: City, Product_Category, Payment_Method, Device_Type (`drop_first=True`)

4. Variance filtering: features with variance below 0.05 removed

5. Multicollinearity removal: one feature dropped from each pair with correlation > 0.8

6. Train-test split: stratified 80-20

7. Scaling: log-transform followed by Z-score normalization

```python
logged = np.log1p(df)
scaled = (logged - logged.mean()) / logged.std()
```

8. SMOTE: minority class oversampled to a 50-50 balance

Original features: 33
PCA components: 26 (retaining 95% variance)
Top 5 components explain: [6.89%, 6.19%, 5.10%, 4.92%, 4.58%]
PCA Visualization Result:
- Complete overlap between returning and non-returning customers
- No clear cluster separation
- Conclusion: Features do not linearly separate classes
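The PCA diagnostic can be sketched along these lines; the random matrix below stands in for the 33 scaled features, so the component counts will not match the real numbers above:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X_scaled = rng.standard_normal((1000, 33))  # stand-in for the scaled feature matrix

# Keep the smallest number of components that retains 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
retained = pca.explained_variance_ratio_.sum()

# 2-D projection for the class-overlap scatter plot described above
pca2 = PCA(n_components=2)
X_2d = pca2.fit_transform(X_scaled)
```

Plotting `X_2d` colored by the target label is what revealed the complete class overlap: if the classes do not separate even approximately in the leading components, a linear decision boundary has little to work with.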
Manual model training log:

```
Epoch 900: Loss=0.6756, Accuracy=0.5813
```

Accuracy: 57.41%

Classification report:

```
              precision    recall  f1-score   support

           0       0.30      0.03      0.06       804
           1       0.82      0.98      0.89      3606

    accuracy                           0.81      4410
```

ROC AUC: 0.60
Despite applying industry-standard techniques:
- Log-transformation and scaling
- SMOTE for class imbalance
- PCA for dimensionality reduction
- Multiple algorithms (Logistic Regression, Random Forest, XGBoost)
Result: models achieved only ~58% accuracy, barely better than the 50% chance level once classes were balanced
Analysis: The dataset's transactional features showed weak separation between returning and non-returning customers:
- PCA visualization revealed complete class overlap
- All models resorted to majority-class prediction (81% baseline)
- Recall for minority class: only 3-5%
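The majority-class behavior is easy to verify from the class counts alone. A toy check using the e-commerce test-set counts (3,606 returning vs 804 not returning):

```python
import numpy as np

# Class counts from the e-commerce test set
y_test = np.array([1] * 3606 + [0] * 804)
majority_pred = np.ones_like(y_test)  # always predict the majority class "returning"

accuracy = (majority_pred == y_test).mean()                 # ~0.82 without learning anything
minority_recall = (majority_pred[y_test == 0] == 0).mean()  # 0.0: misses every non-returner
```

This is why the ~81% accuracy figure is misleading on its own: a model that never identifies a single at-risk customer matches it exactly.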
Business Interpretation: Customer return behavior is likely driven by factors NOT present in transactional data:
- Product quality and satisfaction
- Customer service interactions
- Competitor offerings
- Personal circumstances
- Brand perception
Recommendation: To improve predictions, collect additional features:
- Product review text (sentiment analysis)
- Customer service tickets
- Email engagement rates
- Social media sentiment
- Customer lifetime value (CLV)
To validate that our manual implementation works correctly, we tested on a well-established dataset with strong feature-target relationships.
- Source: Sklearn's breast cancer dataset
- Size: 569 samples
- Features: 30 (tumor measurements)
- Target: Malignant (0) vs Benign (1)
- Class Distribution: Balanced
# 1. Train-test split (80-20)
# 2. StandardScaler (Z-score normalization)

Manual model training log:

```
Epoch 900: Loss=0.1107, Accuracy=98.24%
```
Test Set Performance (Manual Model):
- Accuracy: 98.25%
- Precision: 98.60%
- Recall: 98.60%
- F1-score: 98.60%

Test Set Performance (Sklearn Model):
- Accuracy: 97.37%
- Precision: 97.22%
- Recall: 98.60%
- F1-score: 97.90%
| Metric | Manual Model | Sklearn Model | Difference |
|---|---|---|---|
| Accuracy | 98.25% | 97.37% | +0.88% |
| Precision | 98.60% | 97.22% | +1.38% |
| Recall | 98.60% | 98.60% | 0.00% |
| F1-Score | 98.60% | 97.90% | +0.70% |
Manual model OUTPERFORMED sklearn! 🔥
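The sklearn baseline can be reproduced along these lines (a sketch, assuming default `LogisticRegression` settings and a stratified 80-20 split; exact metrics depend on the split seed):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
scaler = StandardScaler().fit(X_train)  # fit on training data only to avoid leakage

clf = LogisticRegression(max_iter=1000)
clf.fit(scaler.transform(X_train), y_train)
y_hat = clf.predict(scaler.transform(X_test))

acc = accuracy_score(y_test, y_hat)
prec = precision_score(y_test, y_hat)
rec = recall_score(y_test, y_hat)
```

That the manual model lands within a point of this baseline (in either direction) is the expected outcome: both optimize the same convex loss, so small differences come down to optimizer, iterations, and regularization defaults.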
- E-commerce dataset taught us that feature quality matters MORE than model complexity
- Real-world data often lacks the signal needed for accurate predictions
- Knowing when to collect more data is a crucial skill
- Building from scratch revealed the math behind the "black box"
- Manual model matched and even exceeded sklearn performance on breast cancer data
- Proved our implementation is correct
- SMOTE helped balance training data
- But didn't fix fundamental feature weakness in e-commerce dataset
- Showed that balancing techniques can't create signal that doesn't exist
- PCA visualization clearly showed feature limitations
- When classes don't separate in 2D PCA, linear models will struggle
- Useful diagnostic tool, not a magic fix
```python
import numpy as np

class LogisticRegressionManual:
    def __init__(self, learning_rate=0.01, epoch=1000, random_state=42):
        self.learning_rate = learning_rate
        self.epoch = epoch
        self.random_state = random_state
        self.cost_history = []
        self.weight = None
        self.bias = None

    def sigmoid(self, z):
        # Clip z to avoid overflow in exp() for large |z|
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def fit(self, X, y):
        X = np.array(X)
        y = np.array(y).ravel()
        np.random.seed(self.random_state)
        n_samples, n_features = X.shape
        self.weight = np.zeros(n_features)
        self.bias = 0

        for i in range(self.epoch):
            # Forward pass
            z = X @ self.weight + self.bias
            y_pred = self.sigmoid(z)

            # Binary cross-entropy loss (predictions clipped to avoid log(0))
            eps = 1e-15
            y_pred_clip = np.clip(y_pred, eps, 1 - eps)
            cost = -(y * np.log(y_pred_clip) + (1 - y) * np.log(1 - y_pred_clip))
            self.cost_history.append(np.mean(cost))

            # Gradients of the mean BCE loss
            error = y_pred - y
            dw = (1 / n_samples) * (X.T @ error)
            db = (1 / n_samples) * np.sum(error)

            # Update weights
            self.weight -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):
        X = np.array(X)
        z = X @ self.weight + self.bias
        y_pred = self.sigmoid(z)
        return (y_pred >= 0.5).astype(int)
```

Key Components:
- Sigmoid: Maps linear output to a probability in (0, 1)
- Binary Cross-Entropy: Penalizes wrong probability predictions
- Gradient Descent: Iteratively optimizes weights using derivatives
- Threshold: 0.5 for binary classification
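The components above can be tied together in a tiny worked example: one forward pass, the clipped binary cross-entropy loss, and the same gradients used in `fit()` (toy numbers, not project data):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

# Toy batch: 3 samples, 2 features
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0]])
w = np.array([0.5, -0.25])
b = 0.1
y = np.array([1, 1, 0])

# Forward pass: linear score, then probability
z = X @ w + b
p = sigmoid(z)

# Binary cross-entropy, clipped so log(0) can never occur
eps = 1e-15
p_clip = np.clip(p, eps, 1 - eps)
bce = -np.mean(y * np.log(p_clip) + (1 - y) * np.log(1 - p_clip))

# Gradients: the same dw/db expressions as in fit()
dw = X.T @ (p - y) / len(y)
db = np.mean(p - y)
```

A convenient property of this pairing: the gradient of BCE through the sigmoid reduces to the simple `prediction - label` error term, which is why no derivative of the sigmoid appears explicitly in the update.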
Dataset 1: E-Commerce Customer Retention
- Challenge: Weak feature-target relationship
- Manual Model: 58.13% accuracy
- Sklearn Model: 57.41% accuracy
- Learning: Feature engineering and additional data needed

Dataset 2: Breast Cancer Classification
- Success: Strong feature-target relationship
- Manual Model: 98.25% accuracy (BETTER than sklearn!)
- Sklearn Model: 97.37% accuracy
- Learning: The manual implementation is correct and competitive with sklearn
- ✅ Manual logistic regression implementation is correct and competitive
- ✅ Comprehensive preprocessing pipeline (log-transform, SMOTE, PCA)
- ✅ Proper handling of imbalanced data
- ✅ Thorough EDA revealed dataset limitations early
- ✅ Domain knowledge applied to feature engineering
- ❌ E-commerce transactional features too weak for return prediction
- ❌ PCA and SMOTE couldn't compensate for missing signal
- ❌ Complex models (Random Forest, XGBoost) also failed, confirming the feature issue
"The best model in the world can't extract signal that doesn't exist in the data. Knowing when to go back and collect better features is as important as building better models."
- Collect customer service interaction data
- Add product review sentiment scores
- Track email engagement metrics
- Include competitor pricing data
- Consider time-series analysis (purchase patterns over time)
- Implement mini-batch gradient descent
- Add L1/L2 regularization
- Implement learning rate scheduling
- Add early stopping mechanism
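Two of these improvements, mini-batch updates and L2 regularization, are small changes to the update rule. A sketch under stated assumptions (the helper `minibatch_step` is illustrative, not part of the project code):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def minibatch_step(X, y, w, b, lr=0.1, batch_size=32, l2=0.01, rng=None):
    """One epoch of mini-batch gradient descent with an L2 penalty on the weights."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = rng.permutation(len(y))  # shuffle each epoch
    for start in range(0, len(y), batch_size):
        rows = idx[start:start + batch_size]
        Xb, yb = X[rows], y[rows]
        error = sigmoid(Xb @ w + b) - yb
        w = w - lr * (Xb.T @ error / len(yb) + l2 * w)  # L2 term shrinks weights toward 0
        b = b - lr * error.mean()
    return w, b

# Toy linearly separable problem to sanity-check the update rule
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b = np.zeros(2), 0.0
for _ in range(50):
    w, b = minibatch_step(X, y, w, b, rng=rng)

acc = ((sigmoid(X @ w + b) >= 0.5) == y).mean()
```

Compared with the full-batch loop in `fit()`, each epoch performs several noisier updates instead of one exact one, which usually speeds convergence on larger datasets; the L2 term additionally keeps weights bounded, which matters once regularization is added.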
```
Logistic-Regression-Project/
├── data/
│   ├── raw/
│   │   └── ecommerce_data.csv
│   └── processed/
│       ├── X_train.csv
│       ├── X_test.csv
│       ├── y_train.csv
│       └── y_test.csv
├── notebooks/
│   ├── 01_EDA_Ecommerce.ipynb
│   └── 02_Preprocessing.ipynb
├── src/
│   └── LogisticRegressionManual.py
└── README.md
```
```bash
# Clone repository
git clone https://github.com/Harimhs/Logistic-Regression-Project.git
cd Logistic-Regression-Project

# Install dependencies
pip install -r requirements.txt

# Run notebooks in order
jupyter notebook
```

requirements.txt:

```
numpy==1.24.3
pandas==2.0.3
matplotlib==3.7.2
seaborn==0.12.2
scikit-learn==1.3.0
imbalanced-learn==0.11.0
```
Hariharasudhan M
Data Science Enthusiast | Machine Learning Engineer
GitHub | LinkedIn | Kaggle
MIT License