Water Quality Classification

Overview

This project builds a multiclass classification system to predict use-based water quality classes from physico-chemical, nutrient, and biological parameters. The focus is on handling real-world environmental data, which is noisy, incomplete, and heavily imbalanced.

Data Source

The dataset is compiled from official data.gov.in listings for the following sources:

National Water Quality Monitoring Programme (NWMP)
Maharashtra Pollution Control Board (MPCB)

Key characteristics

Real field measurements from rivers and coastal water bodies
Multiple districts across Maharashtra
Significant noise, missing values, and class imbalance
Not a benchmark or synthetic dataset

Data Cleaning and Preprocessing

Removal of physically implausible and inconsistent values
Detection and correction of hidden nulls (blanks, placeholders, malformed entries)
Standardization of categorical labels
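
The hidden-null step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual cleaning code: the placeholder list, the column names (ph, district) and the helper names are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical placeholder strings; the real list depends on the raw files.
PLACEHOLDERS = ["", "na", "n/a", "nil", "null", "-", "--"]


def _clean_cell(value):
    """Strip whitespace from strings and turn placeholder strings into NaN."""
    if isinstance(value, str):
        value = value.strip()
        if value.lower() in PLACEHOLDERS:
            return np.nan
    return value


def normalize_hidden_nulls(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where blank or placeholder text cells become NaN."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue  # numeric columns cannot hold text placeholders
        out[col] = out[col].map(_clean_cell)
    return out


# Toy frame with a blank, a placeholder, and a malformed number.
raw = pd.DataFrame({
    "ph": ["7.2", " ", "NA", "7..4", "8.1"],
    "district": ["Pune", "pune ", "-", "Nashik", "Thane"],
})

clean = normalize_hidden_nulls(raw)
# errors="coerce" maps malformed entries such as "7..4" to NaN.
clean["ph"] = pd.to_numeric(clean["ph"], errors="coerce")
print(clean)
```

After this pass, ordinary tools such as isna() see every missing value, so the later imputation step does not silently skip cells that only looked populated.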

Missing Value Handling

Rows with a missing target variable (use_based_class) were dropped
Missing feature values were imputed within each district:
- Median for numerical features
- Mode for categorical features
Grouping by district preserves spatial and environmental context better than global imputation
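
A minimal sketch of district-wise imputation with pandas is shown below. The column names (district, bod, water_body) and the function names are hypothetical. For brevity the statistics are computed on the same frame they are applied to; in a leakage-safe setup they would be learned from the training split only.

```python
import numpy as np
import pandas as pd


def _first_mode(series: pd.Series):
    """Most frequent non-null value, or NaN if the series has none."""
    modes = series.mode(dropna=True)
    return modes.iloc[0] if not modes.empty else np.nan


def impute_by_group(df, group_col, num_cols, cat_cols):
    """Fill numeric gaps with the group median and categorical gaps with the group mode."""
    out = df.copy()
    for col in num_cols:
        global_median = out[col].median()
        group_median = out.groupby(group_col)[col].transform("median")
        out[col] = out[col].fillna(group_median)
        # Fallback for groups where the column has no observed value at all.
        out[col] = out[col].fillna(global_median)
    for col in cat_cols:
        group_mode = out.groupby(group_col)[col].agg(_first_mode)
        out[col] = out[col].fillna(out[group_col].map(group_mode))
    return out


# Toy data: one numeric and one categorical gap per district.
df = pd.DataFrame({
    "district": ["Pune", "Pune", "Pune", "Nashik", "Nashik", "Nashik"],
    "bod": [2.0, np.nan, 4.0, 10.0, 12.0, np.nan],
    "water_body": ["river", None, "river", "coastal", None, "coastal"],
})

imputed = impute_by_group(df, "district", ["bod"], ["water_body"])
print(imputed)
```

In the toy frame the missing Pune value is filled from Pune's own readings and the missing Nashik value from Nashik's, which is the point of grouping: a global median would pull both districts toward the same number.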

Exploratory Data Analysis and Feature Selection

EDA focused on how strongly each feature relates to the target class.

Numerical Features

Kruskal-Wallis test used to identify features differing significantly across classes
Mutual Information used to capture non-linear relationships with the target
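
The two numerical scores can be computed along these lines with SciPy and scikit-learn. This is a sketch on synthetic data: the feature names, the helper score_numeric_features, and the random seed are assumptions, and the inputs are assumed to be already imputed, since mutual_info_classif does not accept NaN.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal
from sklearn.feature_selection import mutual_info_classif


def score_numeric_features(X: pd.DataFrame, y: pd.Series, random_state: int = 0) -> pd.DataFrame:
    """Kruskal-Wallis H, its p-value, and mutual information for each numeric column."""
    rows = []
    for col in X.columns:
        # One sample of values per class label.
        groups = [X.loc[y == label, col].to_numpy() for label in y.unique()]
        stat, p_value = kruskal(*groups)
        rows.append({"feature": col, "kruskal_H": stat, "p_value": p_value})
    scores = pd.DataFrame(rows).set_index("feature")
    scores["mutual_info"] = mutual_info_classif(X, y, random_state=random_state)
    return scores


# Synthetic example: one feature shifts with the class, one is pure noise.
rng = np.random.default_rng(0)
y = pd.Series(np.repeat(["A", "B", "C"], 60))
shift = y.map({"A": 0.0, "B": 2.0, "C": 4.0}).to_numpy()
X = pd.DataFrame({
    "informative": rng.normal(size=180) + shift,
    "noise": rng.normal(size=180),
})

scores = score_numeric_features(X, y)
print(scores)
```

Kruskal-Wallis is rank-based, so it does not assume normally distributed measurements, while mutual information can pick up dependence that is not monotonic. Using both gives two complementary views of the same column.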

Categorical Features

Chi-square test for dependence with the target variable
Cramér's V to measure strength of association
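
For the categorical checks, a minimal sketch using scipy.stats.chi2_contingency is given below. The helper name cramers_v and the toy labels are assumptions. It computes the plain (uncorrected) Cramér's V, which may differ from whatever variant the notebooks use.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(feature: pd.Series, target: pd.Series):
    """Return (chi2, p_value, V) for two categorical series."""
    table = pd.crosstab(feature, target)
    # correction=False disables Yates' continuity correction for 2x2 tables.
    chi2, p_value, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    v = float(np.sqrt(chi2 / (n * min_dim))) if min_dim > 0 else 0.0
    return chi2, p_value, v


target = pd.Series(["A"] * 20 + ["C"] * 20)

# Feature that determines the class exactly: V equals 1.
dependent = pd.Series(["river"] * 20 + ["coastal"] * 20)
_, p_dep, v_dep = cramers_v(dependent, target)

# Feature spread evenly over both classes: V equals 0.
independent = pd.Series(["river", "coastal"] * 20)
_, p_ind, v_ind = cramers_v(independent, target)

print(v_dep, v_ind)
```

The chi-square p-value answers whether a dependence exists; V rescales the statistic to the range 0 to 1, so features with different numbers of categories can be compared on strength.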

Only statistically informative features were retained for model training.

[Figure: statistical scores for all columns]

Model Training

Problem Setup

Task: Multiclass classification (A, B, C, E)
Inputs: Selected numerical and categorical features
Major challenge: Severe class imbalance with very small minority classes

Models Used

Logistic Regression (baseline, class-weighted)
Random Forest (class-weighted, randomized search)
XGBoost (randomized search with stratified CV)
CatBoost (native categorical handling)
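
As an illustration of the class-weighted randomized search listed for Random Forest, the sketch below uses scikit-learn on synthetic imbalanced data. The search space, n_iter, and the dataset are assumptions chosen to keep the example fast; they are not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic four-class problem with a skewed class distribution.
X, y = make_classification(
    n_samples=400,
    n_features=8,
    n_informative=5,
    n_classes=4,
    n_clusters_per_class=1,
    weights=[0.7, 0.15, 0.1, 0.05],
    random_state=0,
)

# Hypothetical search space, kept small on purpose.
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [None, 4, 8],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    scoring="f1_macro",  # select on macro F1 rather than accuracy
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Setting scoring="f1_macro" means the search ranks configurations by how well they serve every class, so a setting that only improves the majority class is not rewarded.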

Training Strategy

Stratified train-test split
Stratified 5-fold cross-validation
Cost-sensitive learning using class weights
Leakage-safe preprocessing via pipelines
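
The four points above fit together roughly as in the following sketch. It uses the class-weighted Logistic Regression baseline on synthetic data; the column names (bod, ph, water_body), class sizes, and seeds are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic imbalanced data with hypothetical columns.
rng = np.random.default_rng(42)
labels = np.repeat(["A", "B", "C", "E"], [200, 60, 30, 10])
centers = {"A": 0.0, "B": 1.5, "C": 3.0, "E": 4.5}
X = pd.DataFrame({
    "bod": rng.normal(size=labels.size) + np.array([centers[l] for l in labels]),
    "ph": rng.normal(7.5, 0.5, size=labels.size),
    "water_body": rng.choice(["river", "coastal"], size=labels.size),
})
y = pd.Series(labels, name="use_based_class")

# Stratified hold-out split keeps every class present in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Preprocessing lives inside the pipeline, so it is refit on each training fold.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["bod", "ph"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["water_body"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_macro_f1 = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1_macro")
print(cv_macro_f1.mean())
```

Because the imputer and scaler are pipeline steps, cross_val_score fits them on the training part of each fold only, and the validation part never influences the statistics. That is what makes the preprocessing leakage-safe.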

Evaluation Metrics

Accuracy was not used for model selection due to imbalance.

Primary metrics:

Macro F1 score
Macro Recall

These metrics treat all classes equally and penalize models that ignore rare pollution categories.
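
A small worked example shows why accuracy misleads here. The class counts are invented for illustration: a model that always predicts the majority class A looks strong on accuracy but collapses on both macro metrics.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical test set: 90 A, 6 B, 3 C, 1 E.
y_true = ["A"] * 90 + ["B"] * 6 + ["C"] * 3 + ["E"] * 1
# Degenerate model that ignores the rare classes entirely.
y_pred = ["A"] * 100

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 scores classes with no predictions as 0 without warnings.
macro_recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={accuracy:.2f} macro_recall={macro_recall:.2f} macro_f1={macro_f1:.3f}")
# prints: accuracy=0.90 macro_recall=0.25 macro_f1=0.237
```

Macro recall is 0.25 because only one of the four classes is ever recovered, and each class contributes equally to the average regardless of its size.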

Results Summary

Model	Macro F1	Macro Recall
Logistic Regression	~0.61	~0.73
Random Forest	~0.53	~0.50
XGBoost	~0.62	~0.62
CatBoost	~0.70	~0.74

CatBoost achieved the highest macro F1 (~0.70) and macro recall (~0.74), giving the best balance between majority and minority class performance.

[Figure: evaluation scores across models]

Why Scores Do Not Exceed ~0.75

Extremely small sample sizes for minority classes
High variance of macro metrics with low per-class support
Overlapping physico-chemical ranges across classes
Measurement noise inherent to environmental field data
Evaluation performed on a strict, unseen test set

The observed ceiling is largely data-driven rather than a limitation of the models themselves.
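
The effect of low per-class support on macro metrics can be made concrete with simple arithmetic. The support values below are hypothetical, since the per-class test counts are not listed here. With four classes, a single test sample flipping between correct and incorrect moves macro recall by (1 / support) / 4.

```python
def macro_recall_shift(support: int, n_classes: int) -> float:
    """Change in macro recall when one sample of a class flips between right and wrong."""
    per_class_change = 1.0 / support  # recall of that class moves by one sample's share
    return per_class_change / n_classes  # macro recall averages over all classes


# Hypothetical per-class test supports, from very rare to well represented.
for support in (2, 5, 50):
    shift = macro_recall_shift(support, n_classes=4)
    print(f"support={support:>2}  shift per flipped sample={shift:.3f}")
```

With only two test samples in a class, one changed prediction moves macro recall by 0.125, which is larger than several of the gaps between models in the results table. Differences of that size should therefore be read with caution.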

Limitations

Very limited samples for some classes
No temporal modeling or seasonal aggregation
No external hydrological or land-use features
No class-specific threshold tuning

Key Takeaway

This project demonstrates a realistic and statistically rigorous approach to multiclass water quality classification using noisy, imbalanced, real-world environmental data, prioritizing minority class sensitivity over misleading accuracy metrics.
