This project uses four ML models to predict the usability of a water sample based on its environmental conditions.

gunjitsinha/water-sample-prediction

Water Quality Classification

Overview

This project builds a multiclass classification system to predict use-based water quality classes from physico-chemical, nutrient, and biological parameters. The focus is on handling real-world environmental data, which is noisy, incomplete, and heavily imbalanced.


Data Source

The dataset was compiled from the official data.gov.in pages of the following sources:

  • National Water Monitoring Programme (NWMP)
  • Maharashtra Pollution Control Board (MPCB)

Key characteristics

  • Real field measurements from rivers and coastal water bodies
  • Multiple districts across Maharashtra
  • Significant noise, missing values, and class imbalance
  • Not a benchmark or synthetic dataset

Data Cleaning and Preprocessing

  • Removal of physically implausible and inconsistent values
  • Detection and correction of hidden nulls (blanks, placeholders, malformed entries)
  • Standardization of categorical labels
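
The cleaning steps above can be sketched as follows. This is a minimal, dependency-free illustration rather than the project's actual code: the placeholder spellings, the column names (`ph`, `dissolved_oxygen_mg_l`), and the plausibility bounds are all assumptions chosen for the example.

```python
import math

# Hypothetical placeholder spellings; the real list depends on the raw files.
PLACEHOLDERS = {"", "na", "n/a", "nan", "null", "none", "-", "--", "?"}

# Assumed plausibility bounds, for illustration only.
PLAUSIBLE_RANGE = {"ph": (0.0, 14.0), "dissolved_oxygen_mg_l": (0.0, 20.0)}


def to_number(raw):
    """Return a float, or None when the entry is a hidden null or malformed."""
    if raw is None:
        return None
    text = str(raw).strip()
    if text.lower() in PLACEHOLDERS:
        return None
    try:
        value = float(text)
    except ValueError:
        return None  # malformed entry such as "7..1" or "abc"
    if math.isnan(value) or math.isinf(value):
        return None
    return value


def clean_value(column, raw):
    """Parse one raw cell and discard values outside the assumed plausible range."""
    value = to_number(raw)
    if value is None:
        return None
    low, high = PLAUSIBLE_RANGE.get(column, (-math.inf, math.inf))
    return value if low <= value <= high else None


raw_ph = ["7.2", " ", "NA", "-", "71", "6.8", "7..1"]
cleaned = [clean_value("ph", v) for v in raw_ph]
print(cleaned)  # [7.2, None, None, None, None, 6.8, None]
```

Converting every hidden null to a single explicit marker (`None` here) before imputation keeps placeholder strings from being counted as valid categories or parsed as numbers.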

Missing Value Handling

  • Rows with a missing target variable (use_based_class) were dropped

  • Missing feature values were imputed per district (group-based imputation)

    • Median for numerical features
    • Mode for categorical features
  • This preserves spatial and environmental context better than global imputation
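
A minimal sketch of district-level imputation, using only the standard library. The `district` grouping comes from the description above; the other column names (`ph`, `water_body`) and the sample rows are hypothetical. In pandas the same idea is usually written with `groupby(...).transform(...)`.

```python
from collections import Counter, defaultdict
from statistics import median


def impute_by_group(rows, group_key, numeric_cols, categorical_cols):
    """Fill None values with the per-group median (numeric) or mode (categorical)."""
    columns = list(numeric_cols) + list(categorical_cols)

    # Collect the observed (non-missing) values per group and column.
    observed = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for col in columns:
            if row[col] is not None:
                observed[row[group_key]][col].append(row[col])

    filled = []
    for row in rows:
        new_row = dict(row)
        for col in columns:
            values = observed[row[group_key]][col]
            # A group with no observed values is left as None in this sketch.
            if new_row[col] is None and values:
                if col in numeric_cols:
                    new_row[col] = median(values)
                else:
                    new_row[col] = Counter(values).most_common(1)[0][0]
        filled.append(new_row)
    return filled


rows = [
    {"district": "Pune", "ph": 7.0, "water_body": "river"},
    {"district": "Pune", "ph": None, "water_body": "river"},
    {"district": "Pune", "ph": 8.0, "water_body": None},
    {"district": "Thane", "ph": 6.0, "water_body": "creek"},
    {"district": "Thane", "ph": None, "water_body": "creek"},
]
filled = impute_by_group(rows, "district", ["ph"], ["water_body"])
print(filled[1]["ph"], filled[2]["water_body"], filled[4]["ph"])  # 7.5 river 6.0
```

To stay leakage-safe, the per-district statistics would be computed on the training portion only and then applied to validation and test rows.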


Exploratory Data Analysis and Feature Selection

EDA focused on the statistical relevance of each feature to the target class.

Numerical Features

  • Kruskal-Wallis test used to identify features differing significantly across classes
  • Mutual Information used to capture non-linear relationships with the target
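
For illustration, the Kruskal-Wallis H statistic can be computed directly from pooled ranks. The sketch below uses only the standard library and made-up measurements. It returns the statistic only; a p-value would come from a chi-square distribution with k − 1 degrees of freedom (for example via `scipy.stats.kruskal`).

```python
from collections import Counter


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the standard tie correction."""
    pooled = [value for group in groups for value in group]
    n_total = len(pooled)

    # Average rank for each distinct value (tied values share the mean rank).
    counts = Counter(pooled)
    rank_of = {}
    next_rank = 1
    for value in sorted(counts):
        tie_size = counts[value]
        rank_of[value] = next_rank + (tie_size - 1) / 2
        next_rank += tie_size

    # H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1)
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    tie_term = sum(t ** 3 - t for t in counts.values())
    correction = 1 - tie_term / (n_total ** 3 - n_total)
    return h / correction if correction > 0 else 0.0


# Hypothetical values of one numerical feature, grouped by water quality class.
feature_by_class = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}
h_stat = kruskal_wallis_h(list(feature_by_class.values()))
print(round(h_stat, 3))  # 7.2
```

With three groups (2 degrees of freedom) the 5% chi-square critical value is 5.991, so a feature like this one would be flagged as differing across classes, although the chi-square approximation is rough for groups this small.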

Categorical Features

  • Chi-square test for dependence with the target variable
  • Cramér's V to measure strength of association
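
A small sketch of both statistics for one categorical feature, written without external libraries. It computes the uncorrected Pearson chi-square statistic and the plain (not bias-corrected) Cramér's V; the feature values and labels are invented for the example.

```python
from collections import Counter
from math import sqrt


def chi_square_and_cramers_v(feature, target):
    """Return (chi-square statistic, Cramér's V) for two categorical sequences."""
    n = len(feature)
    joint = Counter(zip(feature, target))
    row_totals = Counter(feature)
    col_totals = Counter(target)

    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            chi2 += (joint[(r, c)] - expected) ** 2 / expected

    # V = sqrt(chi2 / (n * (min(rows, cols) - 1))), ranging from 0 to 1.
    min_dim = min(len(row_totals), len(col_totals)) - 1
    v = sqrt(chi2 / (n * min_dim)) if min_dim > 0 else 0.0
    return chi2, v


# Hypothetical feature that perfectly separates two classes.
water_body = ["river"] * 10 + ["coastal"] * 10
use_class = ["A"] * 10 + ["C"] * 10
chi2, v = chi_square_and_cramers_v(water_body, use_class)
print(chi2, v)  # 20.0 1.0
```

The chi-square test answers whether a dependence exists, while V expresses how strong it is on a scale that is comparable across features with different numbers of categories.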

Only statistically informative features were retained for model training.

Figure: statistical scores for all the columns


Model Training

Problem Setup

  • Task: Multiclass classification (A, B, C, E)
  • Inputs: Selected numerical and categorical features
  • Major challenge: Severe class imbalance with very small minority classes

Models Used

  • Logistic Regression (baseline, class-weighted)
  • Random Forest (class-weighted, randomized search)
  • XGBoost (randomized search with stratified CV)
  • CatBoost (native categorical handling)

Training Strategy

  • Stratified train-test split
  • Stratified 5-fold cross-validation
  • Cost-sensitive learning using class weights
  • Leakage-safe preprocessing via pipelines
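
The interaction between stratified splitting and class weighting can be sketched as below. The class counts are hypothetical, and the weighting uses the common "balanced" heuristic `n_samples / (n_classes * class_count)`, which is the formula behind scikit-learn's `class_weight="balanced"`; whether the project used exactly this weighting is an assumption.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split indices so each class keeps roughly the same share in train and test."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Put at least one sample of each class in the test set when possible.
        n_test = max(1, round(len(indices) * test_fraction)) if len(indices) > 1 else 0
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical imbalanced label distribution over the four classes.
labels = ["A"] * 50 + ["B"] * 30 + ["C"] * 15 + ["E"] * 5
train_idx, test_idx = stratified_split(labels)

# Weights are derived from the training labels only, so the test set stays unseen.
train_labels = [labels[i] for i in train_idx]
weights = balanced_class_weights(train_labels)
print(Counter(labels[i] for i in test_idx))
print(weights)
```

In this example the rarest class receives a weight ten times that of the most common one, which is what pushes a cost-sensitive model to pay attention to minority classes.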

Evaluation Metrics

Accuracy was not used for model selection because of the class imbalance.

Primary metrics:

  • Macro F1 score
  • Macro Recall

These metrics treat all classes equally and penalize models that ignore rare pollution categories.
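
To make the contrast with accuracy concrete, here is a small hand-rolled computation of both macro metrics on invented labels. It is a sketch for illustration; in practice `sklearn.metrics.f1_score` and `recall_score` with `average="macro"` serve the same purpose.

```python
def macro_recall_and_f1(y_true, y_pred):
    """Unweighted mean of per-class recall and per-class F1."""
    classes = sorted(set(y_true) | set(y_pred))
    recalls, f1_scores = [], []
    for cls in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + recall
        recalls.append(recall)
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(recalls) / len(classes), sum(f1_scores) / len(classes)


# Hypothetical predictions from a model that never predicts the rare class C.
y_true = ["A", "A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "A", "A", "A", "B", "A"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_recall, macro_f1 = macro_recall_and_f1(y_true, y_pred)
print(round(accuracy, 3), round(macro_recall, 3), round(macro_f1, 3))  # 0.714 0.5 0.489
```

Accuracy looks acceptable because the majority class dominates, while both macro metrics drop sharply because class C contributes a zero to each average.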


Results Summary

Model                 Macro F1   Macro Recall
Logistic Regression   ~0.61      ~0.73
Random Forest         ~0.53      ~0.50
XGBoost               ~0.62      ~0.62
CatBoost              ~0.70      ~0.74

CatBoost achieved the best balance between majority and minority class performance.

Figure: evaluation scores across models


Why Scores Do Not Exceed ~0.75

  • Extremely small sample sizes for minority classes
  • High variance of macro metrics with low per-class support
  • Overlapping physico-chemical ranges across classes
  • Measurement noise inherent to environmental field data
  • Evaluation performed on a strict, unseen test set

The observed ceiling therefore appears to be driven mainly by the data rather than by the choice of model.
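
The variance point can be illustrated with simple arithmetic: in a class with n test samples, one additional misclassification lowers that class's recall by 1/n and the macro recall by 1/(n × K) for K classes. The support values below are hypothetical; K = 4 matches the classes A, B, C, and E.

```python
def recall_step(support):
    """Drop in per-class recall caused by one extra error in that class."""
    return 1 / support


def macro_recall_step(support, n_classes):
    """Drop in macro recall caused by one extra error in a class of this size."""
    return 1 / (support * n_classes)


N_CLASSES = 4  # classes A, B, C, E

# Hypothetical numbers of test samples in a minority class.
for support in (3, 5, 10, 100):
    print(
        f"support={support:>3}  "
        f"class recall step={recall_step(support):.4f}  "
        f"macro recall step={macro_recall_step(support, N_CLASSES):.4f}"
    )
```

With only five test samples in a class, a single prediction moves macro recall by 0.05, so differences of a few hundredths between models are within the noise of the test set.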


Limitations

  • Very limited samples for some classes
  • No temporal modeling or seasonal aggregation
  • No external hydrological or land-use features
  • No class-specific threshold tuning

Key Takeaway

This project demonstrates a realistic and statistically rigorous approach to multiclass water quality classification using noisy, imbalanced, real-world environmental data, prioritizing minority class sensitivity over misleading accuracy metrics.