This project builds a multiclass classification system to predict use-based water quality classes from physico-chemical, nutrient, and biological parameters. The focus is on handling real-world environmental data, which is noisy, incomplete, and heavily imbalanced.
The dataset is compiled from the official data.gov.in pages of the following sources:
- National Water Monitoring Programme (NWMP)
- Maharashtra Pollution Control Board (MPCB)
Key characteristics
- Real field measurements from rivers and coastal water bodies
- Multiple districts across Maharashtra
- Significant noise, missing values, and class imbalance
- Not a benchmark or synthetic dataset
- Removal of physically implausible and inconsistent values
- Detection and correction of hidden nulls (blanks, placeholders, malformed entries)
- Standardization of categorical labels
- Rows with a missing target variable (`use_based_class`) were dropped
- Feature imputation was performed using group-based imputation by `district`
  - Median for numerical features
  - Mode for categorical features
- This preserves spatial and environmental context better than global imputation
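The cleaning and imputation steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the placeholder strings in `PLACEHOLDERS`, the helper names, and the `ph` and `water_body` columns are assumptions made for the example; only `district` comes from the description above.

```python
import pandas as pd

# Hypothetical placeholder strings; the real list depends on the raw files.
PLACEHOLDERS = ["", "NA", "N/A", "-", "null"]


def clean_hidden_nulls(df, num_cols, cat_cols):
    """Convert blanks, placeholders, and malformed entries into real missing values."""
    out = df.copy()
    for col in num_cols:
        # Anything that cannot be parsed as a number becomes NaN.
        out[col] = pd.to_numeric(out[col], errors="coerce")
    for col in cat_cols:
        text = out[col].astype("string").str.strip()
        out[col] = text.mask(text.isin(PLACEHOLDERS)).astype(object)
    return out


def first_mode(series):
    """Most frequent non-missing value, or None if the group has none."""
    modes = series.dropna().mode()
    return modes.iloc[0] if len(modes) else None


def impute_by_group(df, group_col, num_cols, cat_cols):
    """Fill gaps with the per-group median (numeric) or mode (categorical).

    Groups with no observed value for a column are left missing.
    """
    out = df.copy()
    for col in num_cols:
        group_median = out.groupby(group_col)[col].transform("median")
        out[col] = out[col].fillna(group_median)
    for col in cat_cols:
        group_mode = out.groupby(group_col)[col].agg(first_mode)
        out[col] = out[col].fillna(out[group_col].map(group_mode))
    return out


# Toy data with hidden nulls ("NA", "", "-", " ").
raw = pd.DataFrame({
    "district": ["Pune", "Pune", "Pune", "Thane", "Thane", "Thane"],
    "ph": ["7.0", "NA", "8.0", "6.5", "", "7.5"],
    "water_body": ["river", "river", "-", "coastal", " ", "coastal"],
})

clean = clean_hidden_nulls(raw, num_cols=["ph"], cat_cols=["water_body"])
imputed = impute_by_group(clean, "district", ["ph"], ["water_body"])
print(imputed)
```

In this toy frame each missing `ph` value is filled with the median of its own district rather than the global median, which is the point of grouping by `district`.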
EDA focused on statistical relevance
- Kruskal-Wallis test used to identify features differing significantly across classes
- Mutual Information used to capture non-linear relationships with the target
- Chi-square test used to check dependence between categorical features and the target variable
- Cramér's V used to measure the strength of that association
Only statistically informative features were retained for model training.
Figure: Statistical scores of all columns.
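A minimal sketch of how the four screening statistics could be computed is shown below. It runs on synthetic data, and the column names `informative_num`, `noise_num`, and `informative_cat`, as well as the helper functions, are illustrative assumptions rather than the project's own code. The selection thresholds actually used are not stated here, so none are applied.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, kruskal
from sklearn.feature_selection import mutual_info_classif


def kruskal_pvalue(df, feature, target):
    """Kruskal-Wallis p-value for a numeric feature across target classes."""
    groups = [g[feature].dropna().to_numpy() for _, g in df.groupby(target)]
    return kruskal(*groups).pvalue


def chi2_and_cramers_v(x, y):
    """Chi-square p-value and Cramér's V for two categorical series.

    Assumes both variables take at least two distinct values.
    """
    table = pd.crosstab(x, y)
    chi2, p_value, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return p_value, float(np.sqrt(chi2 / (n * k)))


# Synthetic stand-in data: one feature shifts with the class, one is pure noise.
rng = np.random.default_rng(0)
n = 300
target = rng.choice(["A", "B", "C"], size=n)
shift = pd.Series(target).map({"A": 0.0, "B": 1.0, "C": 2.0}).to_numpy()
demo = pd.DataFrame({
    "use_based_class": target,
    "informative_num": rng.normal(size=n) + shift,
    "noise_num": rng.normal(size=n),
    "informative_cat": np.where(shift > 0.5, "high", "low"),
})

num_cols = ["informative_num", "noise_num"]
kw_p = {c: kruskal_pvalue(demo, c, "use_based_class") for c in num_cols}

# Mutual information expects numeric class codes here.
y_codes = pd.factorize(demo["use_based_class"])[0]
mi = mutual_info_classif(demo[num_cols], y_codes, random_state=0)

chi_p, v = chi2_and_cramers_v(demo["informative_cat"], demo["use_based_class"])

print("Kruskal-Wallis p-values:", kw_p)
print("Mutual information:", dict(zip(num_cols, mi)))
print("Chi-square p-value:", chi_p, "Cramér's V:", v)
```

Kruskal-Wallis and mutual information are applied to the numeric columns, while chi-square and Cramér's V are applied to the categorical one; a feature would be kept when these scores indicate a relationship with the target.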
- Task: Multiclass classification (A, B, C, E)
- Inputs: Selected numerical and categorical features
- Major challenge: Severe class imbalance with very small minority classes
- Logistic Regression (baseline, class-weighted)
- Random Forest (class-weighted, randomized search)
- XGBoost (randomized search with stratified CV)
- CatBoost (native categorical handling)
- Stratified train-test split
- Stratified 5-fold cross-validation
- Cost-sensitive learning using class weights
- Leakage-safe preprocessing via pipelines
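The training strategy above can be illustrated with a short scikit-learn sketch. It uses synthetic data from `make_classification` and only the class-weighted logistic regression baseline; the split size, fold seed, and scaler are assumptions chosen for the example, not the project's actual settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced 4-class data standing in for the real features.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=4,
    n_clusters_per_class=1, weights=[0.6, 0.25, 0.1, 0.05], random_state=0,
)

# Stratified split keeps class proportions similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Preprocessing lives inside the pipeline, so in each CV fold the scaler is
# fitted on the training part only (no leakage from the validation part).
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    pipe, X_train, y_train, cv=cv, scoring=["f1_macro", "recall_macro"]
)
print("CV macro F1:", scores["test_f1_macro"].mean())
print("CV macro recall:", scores["test_recall_macro"].mean())

# Final check on the held-out test set, touched only once.
pipe.fit(X_train, y_train)
test_macro_f1 = f1_score(y_test, pipe.predict(X_test), average="macro")
print("Test macro F1:", test_macro_f1)
```

`class_weight="balanced"` reweights each class inversely to its frequency, which is one common way to implement the cost-sensitive learning mentioned above.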
Accuracy was not used for model selection because, under heavy class imbalance, a model can score well by predicting only the majority classes.
Primary metrics:
- Macro F1 score
- Macro Recall
These metrics treat all classes equally and penalize models that ignore rare pollution categories.
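To make the contrast with accuracy concrete, the toy example below scores a classifier that always predicts the majority class. The class counts (90, 6, 4) are invented for illustration and are not taken from the dataset.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical imbalanced test set: 90 samples of A, 6 of B, 4 of C.
y_true = ["A"] * 90 + ["B"] * 6 + ["C"] * 4
# A degenerate model that ignores the rare classes entirely.
y_pred = ["A"] * 100

accuracy = accuracy_score(y_true, y_pred)
macro_recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

# accuracy = 0.90, macro recall ≈ 0.33, macro F1 ≈ 0.32
print(f"accuracy={accuracy:.2f} macro_recall={macro_recall:.2f} macro_f1={macro_f1:.2f}")
```

Accuracy looks strong at 0.90, while the macro metrics fall to about one third because classes B and C each contribute a recall and F1 of zero to the unweighted average.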
| Model | Macro F1 | Macro Recall |
|---|---|---|
| Logistic Regression | ~0.61 | ~0.73 |
| Random Forest | ~0.53 | ~0.50 |
| XGBoost | ~0.62 | ~0.62 |
| CatBoost | ~0.70 | ~0.74 |
CatBoost achieved the highest macro F1 (~0.70) and macro recall (~0.74), giving the best balance between majority and minority class performance.
Figure: Evaluation scores across models.
- Extremely small sample sizes for minority classes
- High variance of macro metrics with low per-class support
- Overlapping physico-chemical ranges across classes
- Measurement noise inherent to environmental field data
- Evaluation performed on a strict, unseen test set
The observed ceiling is data-driven, not a modeling limitation.
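A small worked calculation shows why macro metrics are so volatile when a class has very few test samples. The per-class support and correct counts below are hypothetical, chosen only to illustrate the arithmetic.

```python
def macro_recall(correct, support):
    """Unweighted mean of per-class recall (correct / support)."""
    recalls = [c / s for c, s in zip(correct, support)]
    return sum(recalls) / len(recalls)


# Hypothetical test-set support for four classes, rarest last.
support = [200, 40, 5, 3]

baseline = macro_recall([180, 30, 4, 3], support)
# One additional mistake on the rarest class (3 samples).
one_rare_miss = macro_recall([180, 30, 4, 2], support)
# One additional mistake on the largest class (200 samples).
one_major_miss = macro_recall([179, 30, 4, 3], support)

rare_drop = baseline - one_rare_miss      # 1 / (3 * 4) ≈ 0.0833
major_drop = baseline - one_major_miss    # 1 / (200 * 4) = 0.00125
print(f"baseline={baseline:.4f} rare_drop={rare_drop:.4f} major_drop={major_drop:.5f}")
```

With these counts, a single misclassified sample in the three-sample class moves macro recall by about 0.083, while the same mistake in the 200-sample class moves it by about 0.001, so differences of a few points between models can come down to one or two rare samples.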
- Very limited samples for some classes
- No temporal modeling or seasonal aggregation
- No external hydrological or land-use features
- No class-specific threshold tuning
This project demonstrates a realistic and statistically rigorous approach to multiclass water quality classification using noisy, imbalanced, real-world environmental data, prioritizing minority class sensitivity over misleading accuracy metrics.