
- Titanic Survival Prediction -

Hi there! This mini-project tackles the classic Titanic survival prediction problem. Using a cleaned version of the Titanic dataset, I built two models from scratch, line by line: a Decision Tree and a Logistic Regression classifier, with no black-box scikit-learn models!


- Table of Contents -

1. Motivation  
2. Dataset Overview
3. Exploratory Data Analysis (EDA)
4. Feature Engineering & Encoding
5. Model Implementation
6. Results & Interpretability
7. Folder Structure
8. Next Steps

- Motivation - 

I wanted to:

- Deeply understand model behavior, not just use pre-built libraries.
- Practice building classifier logic from scratch — for interpretability and learning.
- Generate clear visualizations to show how features influence predictions.


- Dataset Overview -

- Source: Kaggle Titanic dataset  
- Train set: 891 passengers  
- Test set: 418 passengers  
- Target variable: `Survived`

- Key columns:

- `PassengerId`, `Name`, `Sex`, `Age`, `SibSp`, `Parch`, `Ticket`, `Fare`, `Cabin`, `Embarked`, plus engineered features such as `Sex_Title`, `FamilySize`, etc.

- Exploratory Data Analysis (EDA) -

1. Class Distribution  

A count plot showed that the split between survivors and non-survivors is reasonably balanced, a good starting point.

2. Correlation Heatmap  

Using only the numeric columns, I plotted a correlation heatmap (both plots are reproduced in the sketch below).
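Both plots take only a few lines of pandas/seaborn. A minimal sketch, assuming the Kaggle CSV lives at `data/train.csv` (the path is an assumption, not the repository layout):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Path is an assumption; point this at wherever train.csv lives in the repo.
train = pd.read_csv("data/train.csv")

# 1. Class distribution: count of survivors vs. non-survivors.
sns.countplot(x="Survived", data=train)
plt.title("Survival distribution")
plt.show()

# 2. Correlation heatmap over the numeric columns only.
corr = train.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation heatmap (numeric features)")
plt.show()
```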

- Feature Engineering & Encoding -

1. Label Encoding

I label-encoded high-cardinality categorical columns (e.g., Title, Deck) into integer codes, which keeps the feature count small for tree-based modeling.
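A minimal sketch of that step, assuming `Title` and `Deck` have already been derived from `Name` and `Cabin` (the notebook's exact encoding may differ):

```python
# Map each category to an integer code for the tree-based pipeline.
# Missing values come out as -1.
for col in ["Title", "Deck"]:
    train[col] = train[col].astype("category").cat.codes
```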

2. One-Hot Encoding

I applied full one-hot encoding on all object-typed columns (including Embarked, Sex, Title, etc.) before Logistic Regression to produce a consistent feature matrix for weight-based modeling.
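A sketch of the one-hot step using pandas `get_dummies`, assuming the logistic-regression pipeline starts from the raw string columns rather than the label-encoded ones; the columns dropped here are an assumption (identifiers like `Name` and `Ticket` are excluded so the expansion stays compact):

```python
# One-hot encode the remaining object-typed columns (Sex, Embarked, Title, ...)
# so the weight-based model gets one coefficient per category level.
features = train.drop(columns=["Survived", "PassengerId", "Name", "Ticket", "Cabin"])
X = pd.get_dummies(features, dtype=float)
y = train["Survived"].to_numpy()
```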


- Model Implementation -

A. Decision Tree (from scratch)

1. Built recursively using Gini impurity (a minimal split-search sketch follows this list).

2. Stopping criteria: `max_depth=5`, `min_samples_split=10`.

3. Feature importances extracted and visualized.
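To make the splitting criterion concrete, here is a minimal sketch of Gini impurity and an exhaustive split search under the `min_samples_split` rule; the helper names are illustrative, not the repository's actual code:

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array: 1 - sum over classes of p_k^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, min_samples_split=10):
    """Return the (feature index, threshold) that minimises weighted child impurity."""
    n, d = X.shape
    if n < min_samples_split:
        return None
    best, best_score = None, gini(y)  # only accept splits that beat the parent node
    for j in range(d):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / n
            if score < best_score:
                best_score, best = score, (j, t)
    return best
```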

B. Logistic Regression (from scratch)

Implemented logistic regression manually (a condensed sketch of the training loop follows the log below) with:

1. Adam optimizer

2. L2 regularization

3. Early stopping

4. Training output:

Iteration 0, Loss: 2.2590
Iteration 100, Loss: 0.4901
...
Early stopping triggered.
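A condensed sketch of that training loop, assuming common Adam defaults and an L2 strength `lam`; early stopping here watches the training loss for simplicity, and the notebook's actual loop will differ in details:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.01, lam=0.1, n_iters=2000, patience=20, tol=1e-5):
    """Logistic regression trained with Adam, L2 regularization, and early stopping.

    X: (n, d) NumPy feature matrix (e.g., X.to_numpy() from the one-hot step),
    y: (n,) array of 0/1 labels.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    mw, vw, mb, vb = np.zeros(d), np.zeros(d), 0.0, 0.0   # Adam moment estimates
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    best_loss, wait = np.inf, 0

    for t in range(1, n_iters + 1):
        p = sigmoid(X @ w + b)
        # Cross-entropy loss plus an L2 penalty on the weights (bias excluded).
        loss = (-np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
                + lam * np.sum(w ** 2) / (2 * n))
        grad_w = X.T @ (p - y) / n + lam * w / n
        grad_b = np.mean(p - y)
        # Adam updates with bias correction.
        mw = beta1 * mw + (1 - beta1) * grad_w
        vw = beta2 * vw + (1 - beta2) * grad_w ** 2
        mb = beta1 * mb + (1 - beta1) * grad_b
        vb = beta2 * vb + (1 - beta2) * grad_b ** 2
        w -= lr * (mw / (1 - beta1 ** t)) / (np.sqrt(vw / (1 - beta2 ** t)) + eps)
        b -= lr * (mb / (1 - beta1 ** t)) / (np.sqrt(vb / (1 - beta2 ** t)) + eps)
        # Early stopping: quit once the loss stops improving by at least tol.
        if loss < best_loss - tol:
            best_loss, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                print("Early stopping triggered.")
                break
        if t % 100 == 0:
            print(f"Iteration {t}, Loss: {loss:.4f}")
    return w, b
```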


- Results & Interpretability - 

| Model               | Accuracy | Precision | Recall | F1 Score |
|---------------------|----------|-----------|--------|----------|
| Decision Tree       | ~0.77    | ~0.73     | ~0.70  | ~0.71    |
| Logistic Regression | ~0.79    | ~0.76     | ~0.72  | ~0.74    |
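
These numbers all reduce to confusion-matrix counts; a minimal sketch of how they can be computed from binary predictions:

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary 0/1 predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = np.mean(y_pred == y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```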

Interpretation Visuals

1. Decision Tree Feature Importance

(Shows how much each feature contributed)

2. Logistic Regression Coefficients

(Highlights features that increase/decrease survival probability)

What those weights tell us:

“If all else is equal, does this feature raise or lower the probability of survival?”

A positive coefficient (e.g., Fare) increases the predicted chance of survival.

A negative coefficient (e.g., Sex_male) decreases it.
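
One way to read the fitted weights, assuming `w` comes from the training sketch above and the one-hot column names come from `X.columns` (both are illustrative names, not necessarily the notebook's):

```python
import numpy as np
import pandas as pd

def coefficient_report(w, feature_names, top=5):
    """Rank logistic-regression weights by signed value.

    Positive weights raise the predicted survival probability; negative weights lower it.
    """
    coefs = pd.Series(np.asarray(w), index=feature_names).sort_values()
    print("Strongest negative influences:\n", coefs.head(top))
    print("Strongest positive influences:\n", coefs.tail(top))

# e.g. coefficient_report(w, X.columns) with w and X from the sketches above
```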



- Next Steps -

1. Hyperparameter tuning (e.g., varying `max_depth`, `lambda`, learning rate)

2. Add cross-validation

3. Try additional models (Random Forest, XGBoost)

4. Do proper train-test validation and improve submission pipeline




