
- Titanic Survival Prediction -

Hi there! This mini-project tackles the classic Titanic survival prediction problem. Using a cleaned version of the Titanic dataset, I built two models from scratch, line by line: a Decision Tree and a Logistic Regression classifier, with no black-box scikit-learn models!


- Table of Contents -

1. Motivation  
2. Dataset Overview
3. Exploratory Data Analysis (EDA)
4. Feature Engineering & Encoding
5. Model Implementation
6. Results & Interpretability
7. Folder Structure
8. Next Steps

- Motivation - 

I wanted to:

- Deeply understand model behavior, not just use pre-built libraries.
- Practice building classifier logic from scratch — for interpretability and learning.
- Generate clear visualizations to show how features influence predictions.


- Dataset Overview -

- Source: Kaggle Titanic dataset  
- Train set: 891 passengers  
- Test set: 418 passengers  
- Target variable: `Survived`

- Key columns:

- `PassengerId`, `Name`, `Sex`, `Age`, `SibSp`, `Parch`, `Ticket`, `Fare`, `Cabin`, `Embarked`, plus engineered features such as `Sex_Title`, `FamilySize`, etc.

- Exploratory Data Analysis (EDA) -

1. Class Distribution  

A count plot showed that the split between survivors and non-survivors is reasonably balanced, a good starting point.

2. Correlation Heatmap  

Using only the numeric columns, I plotted a correlation heatmap (both plots are reproduced in the sketch below).
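Both plots take only a few lines of pandas/seaborn. A minimal sketch, assuming the Kaggle CSV lives at `data/train.csv` (the path is an assumption, not the repository layout):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Path is an assumption; point this at wherever train.csv lives in the repo.
train = pd.read_csv("data/train.csv")

# 1. Class distribution: count of survivors vs. non-survivors.
sns.countplot(x="Survived", data=train)
plt.title("Survival distribution")
plt.show()

# 2. Correlation heatmap over the numeric columns only.
corr = train.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation heatmap (numeric features)")
plt.show()
```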

- Feature Engineering & Encoding -

1. Label Encoding

I label-encoded high-cardinality categorical columns (e.g., Title, Deck) into integer codes, which keeps the feature count small for tree-based modeling.
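A minimal sketch of that step, assuming `Title` and `Deck` have already been derived from `Name` and `Cabin` (the notebook's exact encoding may differ):

```python
# Map each category to an integer code for the tree-based pipeline.
# Missing values come out as -1.
for col in ["Title", "Deck"]:
    train[col] = train[col].astype("category").cat.codes
```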

2. One-Hot Encoding

I applied full one-hot encoding on all object-typed columns (including Embarked, Sex, Title, etc.) before Logistic Regression to produce a consistent feature matrix for weight-based modeling.
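A sketch of the one-hot step using pandas `get_dummies`, assuming the logistic-regression pipeline starts from the raw string columns rather than the label-encoded ones; the columns dropped here are an assumption (identifiers like `Name` and `Ticket` are excluded so the expansion stays compact):

```python
# One-hot encode the remaining object-typed columns (Sex, Embarked, Title, ...)
# so the weight-based model gets one coefficient per category level.
features = train.drop(columns=["Survived", "PassengerId", "Name", "Ticket", "Cabin"])
X = pd.get_dummies(features, dtype=float)
y = train["Survived"].to_numpy()
```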


- Model Implementation -

A. Decision Tree (from scratch)

1. Built recursively using Gini impurity (a minimal split-search sketch follows this list).

2. Stopping criteria: `max_depth=5`, `min_samples_split=10`.

3. Feature importances extracted and visualized.
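To make the splitting criterion concrete, here is a minimal sketch of Gini impurity and an exhaustive split search under the `min_samples_split` rule; the helper names are illustrative, not the repository's actual code:

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array: 1 - sum over classes of p_k^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, min_samples_split=10):
    """Return the (feature index, threshold) that minimises weighted child impurity."""
    n, d = X.shape
    if n < min_samples_split:
        return None
    best, best_score = None, gini(y)  # only accept splits that beat the parent node
    for j in range(d):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / n
            if score < best_score:
                best_score, best = score, (j, t)
    return best
```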

B. Logistic Regression (from scratch)

Implemented logistic regression manually (a condensed sketch of the training loop follows the log below) with:

1. Adam optimizer

2. L2 regularization

3. Early stopping

4. Training output:

Iteration 0, Loss: 2.2590
Iteration 100, Loss: 0.4901
...
Early stopping triggered.
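A condensed sketch of that training loop, assuming common Adam defaults and an L2 strength `lam`; early stopping here watches the training loss for simplicity, and the notebook's actual loop will differ in details:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.01, lam=0.1, n_iters=2000, patience=20, tol=1e-5):
    """Logistic regression trained with Adam, L2 regularization, and early stopping.

    X: (n, d) NumPy feature matrix (e.g., X.to_numpy() from the one-hot step),
    y: (n,) array of 0/1 labels.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    mw, vw, mb, vb = np.zeros(d), np.zeros(d), 0.0, 0.0   # Adam moment estimates
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    best_loss, wait = np.inf, 0

    for t in range(1, n_iters + 1):
        p = sigmoid(X @ w + b)
        # Cross-entropy loss plus an L2 penalty on the weights (bias excluded).
        loss = (-np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
                + lam * np.sum(w ** 2) / (2 * n))
        grad_w = X.T @ (p - y) / n + lam * w / n
        grad_b = np.mean(p - y)
        # Adam updates with bias correction.
        mw = beta1 * mw + (1 - beta1) * grad_w
        vw = beta2 * vw + (1 - beta2) * grad_w ** 2
        mb = beta1 * mb + (1 - beta1) * grad_b
        vb = beta2 * vb + (1 - beta2) * grad_b ** 2
        w -= lr * (mw / (1 - beta1 ** t)) / (np.sqrt(vw / (1 - beta2 ** t)) + eps)
        b -= lr * (mb / (1 - beta1 ** t)) / (np.sqrt(vb / (1 - beta2 ** t)) + eps)
        # Early stopping: quit once the loss stops improving by at least tol.
        if loss < best_loss - tol:
            best_loss, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                print("Early stopping triggered.")
                break
        if t % 100 == 0:
            print(f"Iteration {t}, Loss: {loss:.4f}")
    return w, b
```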


- Results & Interpretability - 

| Model               | Accuracy | Precision | Recall | F1 Score |
|---------------------|----------|-----------|--------|----------|
| Decision Tree       | ~0.77    | ~0.73     | ~0.70  | ~0.71    |
| Logistic Regression | ~0.79    | ~0.76     | ~0.72  | ~0.74    |
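
These numbers all reduce to confusion-matrix counts; a minimal sketch of how they can be computed from binary predictions:

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary 0/1 predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = np.mean(y_pred == y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```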

Interpretation Visuals

1. Decision Tree Feature Importance

(Shows how much each feature contributed)

2. Logistic Regression Coefficients

(Highlights features that increase/decrease survival probability)

What those weights tell us:

“If all else is equal, does this feature raise or lower the probability of survival?”

A positive coefficient (e.g., Fare) increases the predicted chance of survival.

A negative coefficient (e.g., Sex_male) decreases it.
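
One way to read the fitted weights, assuming `w` comes from the training sketch above and the one-hot column names come from `X.columns` (both are illustrative names, not necessarily the notebook's):

```python
import numpy as np
import pandas as pd

def coefficient_report(w, feature_names, top=5):
    """Rank logistic-regression weights by signed value.

    Positive weights raise the predicted survival probability; negative weights lower it.
    """
    coefs = pd.Series(np.asarray(w), index=feature_names).sort_values()
    print("Strongest negative influences:\n", coefs.head(top))
    print("Strongest positive influences:\n", coefs.tail(top))

# e.g. coefficient_report(w, X.columns) with w and X from the sketches above
```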



- Next Steps -

1. Hyperparameter tuning (e.g., varying `max_depth`, `lambda`, learning rate)

2. Add cross-validation

3. Try additional models (Random Forest, XGBoost)

4. Do proper train-test validation and improve submission pipeline




