# Titanic Survival Prediction

Hi there! This mini-project tackles the classic Titanic survival prediction problem. Using a clean version of the Titanic dataset, I built two models from scratch: a Decision Tree and a line-by-line Logistic Regression implementation, with no black-box scikit-learn models!

## Table of Contents

1. Motivation
2. Dataset Overview
3. Exploratory Data Analysis (EDA)
4. Feature Engineering & Encoding
5. Model Implementation
6. Results & Interpretability
7. Folder Structure
8. Next Steps

## Motivation

I wanted to:

- Deeply understand model behavior, not just use pre-built libraries.
- Practice building classifier logic from scratch, for interpretability and learning.
- Generate clear visualizations that show how features influence predictions.

## Dataset Overview

- Source: Kaggle Titanic dataset
- Train set: 891 passengers
- Test set: 418 passengers
- Target variable: `Survived`
- Key columns: `PassengerId`, `Name`, `Sex`, `Age`, `SibSp`, `Parch`, `Ticket`, `Fare`, `Cabin`, `Embarked`, plus engineered features such as `Sex_Title` and `FamilySize`.

## Exploratory Data Analysis (EDA)

1. **Class Distribution**: a count plot of survivors vs. non-survivors showed a reasonably balanced target, a good signal to begin with.
2. **Correlation Heatmap**: using only the numeric columns, I plotted pairwise correlations.

(A minimal plotting sketch appears at the end of this README.)

## Feature Engineering & Encoding

1. **Label Encoding**: I encoded rare categorical columns (e.g., `Title`, `Deck`) into numeric codes to reduce cardinality for tree-based modeling.
2. **One-Hot Encoding**: I applied full one-hot encoding to all object-typed columns (including `Embarked`, `Sex`, `Title`, etc.) before Logistic Regression, producing a consistent feature matrix for weight-based modeling.

(An encoding sketch appears at the end of this README.)

## Model Implementation

### A. Decision Tree (from scratch)

1. Built recursively using Gini impurity.
2. Stopping criteria: `max_depth=5`, `min_samples_split=10`.
3. Feature importances extracted and visualized.

### B. Logistic Regression (from scratch)

Implemented manual logistic regression with:

1. Adam optimizer
2. L2 regularization
3. Early stopping

Training output:

```
Iteration 0, Loss: 2.2590
Iteration 100, Loss: 0.4901
...
Early stopping triggered.
```

(Code sketches for both models appear at the end of this README.)

## Results & Interpretability

| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Decision Tree | ~0.77 | ~0.73 | ~0.70 | ~0.71 |
| Logistic Regression | ~0.79 | ~0.76 | ~0.72 | ~0.74 |

### Interpretation Visuals

1. Decision Tree feature importance: shows how much each feature contributed.
2. Logistic Regression coefficients: highlight features that increase or decrease survival probability.

What those weights tell us: "If all else is equal, does this feature raise or lower the probability of survival?" A positive coefficient (e.g., `Fare`) increases the chance of surviving; a negative one (e.g., `Sex_male`) decreases it.

## Next Steps

1. Hyperparameter tuning (e.g., varying `max_depth`, `lambda`, learning rate)
2. Add cross-validation
3. Try additional models (Random Forest, XGBoost)
4. Do proper train-test validation and improve the submission pipeline
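
## Appendix: Illustrative Code Sketches

The snippets below are hedged sketches of the steps described above, not the repository's actual code; file names, column names, and hyperparameters are assumptions.

First, the two EDA plots (class-distribution count plot and numeric-only correlation heatmap), assuming the Kaggle `train.csv` sits in the working directory:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")  # assumed path to the Kaggle train split

# 1. Class distribution: how many passengers survived vs. did not.
sns.countplot(data=train, x="Survived")
plt.title("Survival class distribution")
plt.show()

# 2. Correlation heatmap over numeric columns only.
corr = train.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Correlation heatmap (numeric columns)")
plt.show()
```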
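
Next, a minimal version of the encoding described in Feature Engineering & Encoding: integer codes for a high-cardinality column (for the tree) and `pd.get_dummies` for the remaining object-typed columns (for the regression). The `Title` extraction regex and the dropped columns are assumptions about the preprocessing.

```python
import pandas as pd

train = pd.read_csv("train.csv")  # assumed path to the Kaggle train split

# Label encoding: map a high-cardinality category (Title) to integer codes.
train["Title"] = train["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
train["Title_code"] = train["Title"].astype("category").cat.codes

# One-hot encoding: expand the remaining object-typed columns into 0/1 columns
# so the logistic regression sees a purely numeric feature matrix.
dropped = ["Name", "Ticket", "Cabin"]
object_cols = [c for c in train.select_dtypes(include="object").columns
               if c not in dropped]
encoded = pd.get_dummies(train.drop(columns=dropped), columns=object_cols)
print(encoded.filter(like="Sex").head())  # e.g. Sex_female, Sex_male
```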
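
A compact from-scratch decision tree in the spirit of the one described under Model Implementation: recursive splitting on Gini impurity with `max_depth` and `min_samples_split` stopping rules. This is an illustrative sketch over NumPy arrays with 0/1 labels, and it omits the feature-importance bookkeeping the repo visualizes.

```python
import numpy as np

def gini(y):
    """Gini impurity of a 1-D array of 0/1 labels."""
    if len(y) == 0:
        return 0.0
    p = np.bincount(y, minlength=2) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature index, threshold) minimising weighted Gini, or None."""
    best, best_score = None, gini(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.sum() == 0 or left.sum() == len(y):
                continue
            score = (left.sum() * gini(y[left])
                     + (~left).sum() * gini(y[~left])) / len(y)
            if score < best_score:
                best, best_score = (j, t), score
    return best

def build_tree(X, y, depth=0, max_depth=5, min_samples_split=10):
    """Recursively grow the tree; leaves store the majority class."""
    if depth >= max_depth or len(y) < min_samples_split or gini(y) == 0.0:
        return {"leaf": True, "pred": int(np.bincount(y).argmax())}
    split = best_split(X, y)
    if split is None:
        return {"leaf": True, "pred": int(np.bincount(y).argmax())}
    j, t = split
    mask = X[:, j] <= t
    return {"leaf": False, "feature": j, "threshold": t,
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth, min_samples_split),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples_split)}

def predict_one(node, x):
    """Walk a single sample down the tree to its leaf prediction."""
    while not node["leaf"]:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["pred"]
```

The exhaustive threshold search keeps the sketch short; a fuller implementation would also accumulate the impurity decrease per feature to drive the feature-importance plot.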
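
Finally, a hedged sketch of the from-scratch logistic regression loop: binary cross-entropy loss with an L2 penalty, Adam updates, and loss-plateau early stopping. The hyperparameter values and the patience/tolerance scheme are illustrative assumptions, so the printed losses will not match the README's numbers exactly.

```python
import numpy as np

def train_logreg(X, y, lr=0.01, lam=0.1, n_iter=2000, patience=5, tol=1e-4):
    """Binary logistic regression trained with Adam, L2, and early stopping."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    m_w, v_w = np.zeros(d), np.zeros(d)   # Adam first/second moments for w
    m_b, v_b = 0.0, 0.0                   # Adam moments for the bias
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    best_loss, wait = np.inf, 0

    for t in range(1, n_iter + 1):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))             # sigmoid
        loss = (-np.mean(y * np.log(p + 1e-12)
                         + (1 - y) * np.log(1 - p + 1e-12))
                + lam * np.sum(w ** 2) / (2 * n))           # BCE + L2 penalty

        # Gradients of the regularized loss.
        g_w = X.T @ (p - y) / n + lam * w / n
        g_b = np.mean(p - y)

        # Adam moment updates with bias correction.
        m_w, v_w = beta1 * m_w + (1 - beta1) * g_w, beta2 * v_w + (1 - beta2) * g_w ** 2
        m_b, v_b = beta1 * m_b + (1 - beta1) * g_b, beta2 * v_b + (1 - beta2) * g_b ** 2
        w -= lr * (m_w / (1 - beta1 ** t)) / (np.sqrt(v_w / (1 - beta2 ** t)) + eps)
        b -= lr * (m_b / (1 - beta1 ** t)) / (np.sqrt(v_b / (1 - beta2 ** t)) + eps)

        if t % 100 == 1:
            print(f"Iteration {t - 1}, Loss: {loss:.4f}")

        # Early stopping when the loss stops improving by more than `tol`.
        if loss < best_loss - tol:
            best_loss, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                print("Early stopping triggered.")
                break
    return w, b
```

The returned weight vector `w` is what the coefficient plot in Results & Interpretability reads off: positive entries push the predicted survival probability up, negative entries push it down.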