This project implements a U-Net based deep learning model for pixel-level semantic segmentation on the Oxford-IIIT Pet Dataset. The goal is to segment pets from the background by classifying each pixel into three categories: background, pet, and pet boundary.
The project demonstrates the complete segmentation pipeline including data preprocessing, model training, quantitative evaluation, and qualitative visualization of predictions.
The Oxford-IIIT Pet Dataset contains roughly 7,400 images of cats and dogs with corresponding pixel-level segmentation masks. Each mask assigns every pixel to one of three classes:
- Background
- Pet
- Pet boundary
Images and masks were resized to 128×128 for efficient training and evaluation.
A U-Net convolutional neural network was used for segmentation. U-Net follows an encoder–decoder structure with skip connections, allowing it to capture both high-level context and fine-grained spatial details.
This architecture is well suited for medical and object segmentation tasks where precise localization is required.
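The notebook's exact U-Net configuration is not reproduced in this README; the sketch below is a minimal Keras version of the architecture described above (encoder, bottleneck, decoder, and skip connections) for 128×128 inputs and three output classes. The filter counts and depth are illustrative assumptions, not the trained model's settings.

```python
import tensorflow as tf
from tensorflow.keras import layers


def conv_block(x, filters):
    """Two 3x3 convolutions with ReLU, the basic U-Net building block."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x


def build_unet(input_shape=(128, 128, 3), num_classes=3):
    inputs = tf.keras.Input(shape=input_shape)

    # Encoder: downsample spatially while widening the feature maps.
    c1 = conv_block(inputs, 32)
    p1 = layers.MaxPooling2D()(c1)
    c2 = conv_block(p1, 64)
    p2 = layers.MaxPooling2D()(c2)
    c3 = conv_block(p2, 128)
    p3 = layers.MaxPooling2D()(c3)

    # Bottleneck.
    b = conv_block(p3, 256)

    # Decoder: upsample and concatenate the matching encoder features (skip connections).
    u3 = layers.Conv2DTranspose(128, 2, strides=2, padding="same")(b)
    c4 = conv_block(layers.Concatenate()([u3, c3]), 128)
    u2 = layers.Conv2DTranspose(64, 2, strides=2, padding="same")(c4)
    c5 = conv_block(layers.Concatenate()([u2, c2]), 64)
    u1 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(c5)
    c6 = conv_block(layers.Concatenate()([u1, c1]), 32)

    # Per-pixel class logits for background / pet / boundary.
    outputs = layers.Conv2D(num_classes, 1)(c6)
    return tf.keras.Model(inputs, outputs)
```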
The following preprocessing and input-pipeline steps were applied (a tf.data sketch follows the list):
- Images resized to 128×128
- Pixel values normalized to [0, 1]
- Segmentation masks converted to zero-based class labels
- Random horizontal flipping used for data augmentation
- Training dataset shuffled, cached, and prefetched for efficiency
- Test dataset batched without augmentation
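The sketch below shows one way to build this pipeline with `tensorflow_datasets`; the batch size, shuffle buffer, and exact ordering of the `tf.data` transformations are assumptions rather than the notebook's verbatim code.

```python
import tensorflow as tf
import tensorflow_datasets as tfds

IMG_SIZE = 128


def load_sample(sample):
    # Resize image and mask to 128x128; nearest-neighbour keeps mask labels discrete.
    image = tf.image.resize(sample["image"], (IMG_SIZE, IMG_SIZE))
    mask = tf.image.resize(sample["segmentation_mask"], (IMG_SIZE, IMG_SIZE), method="nearest")
    # Normalize pixels to [0, 1] and shift trimap labels {1, 2, 3} to zero-based {0, 1, 2}.
    image = tf.cast(image, tf.float32) / 255.0
    mask = tf.cast(mask, tf.int32) - 1
    return image, mask


def augment(image, mask):
    # Random horizontal flip applied jointly to image and mask with probability 0.5.
    flip = tf.random.uniform(()) > 0.5
    image = tf.cond(flip, lambda: tf.image.flip_left_right(image), lambda: image)
    mask = tf.cond(flip, lambda: tf.image.flip_left_right(mask), lambda: mask)
    return image, mask


dataset, info = tfds.load("oxford_iiit_pet", with_info=True)

# Deterministic preprocessing is cached; the random flip runs after the cache
# so augmentation differs across epochs.
train_ds = (dataset["train"]
            .map(load_sample, num_parallel_calls=tf.data.AUTOTUNE)
            .cache()
            .shuffle(1000)
            .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
            .batch(64)
            .prefetch(tf.data.AUTOTUNE))

test_ds = (dataset["test"]
           .map(load_sample, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(64)
           .prefetch(tf.data.AUTOTUNE))
```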
The model was trained with the following configuration (a compile/fit sketch follows the list):
- Optimizer: Adam
- Loss: Sparse Categorical Cross-Entropy
- Epochs: 10
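A minimal training sketch, reusing `build_unet`, `train_ds`, and `test_ds` from the snippets above; validating on the test split and tracking pixel accuracy during training are assumptions, not necessarily the notebook's exact setup.

```python
import tensorflow as tf

# The model outputs raw logits, hence from_logits=True.
model = build_unet()
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
history = model.fit(train_ds, validation_data=test_ds, epochs=10)
```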
Training and validation curves showed stable convergence with minimal overfitting.
The trained model was evaluated on the test dataset using standard segmentation metrics.
| Metric | Value |
|---|---|
| Mean IoU | 0.6800 |
| Mean Dice Coefficient | 0.7858 |
| Pixel Accuracy | 0.8739 |
These metrics were computed using a custom evaluation function that calculates class-wise Intersection-over-Union and Dice scores, then averages them across the dataset.
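The notebook's evaluation function is not reproduced here; the sketch below shows one common way to compute these metrics by accumulating a global confusion matrix and deriving per-class IoU and Dice from it. The function name is hypothetical, and the notebook's averaging scheme (per-image vs. global) may differ.

```python
import numpy as np
import tensorflow as tf


def evaluate_segmentation(model, dataset, num_classes=3):
    """Accumulate a global confusion matrix, then derive per-class IoU and Dice."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for images, masks in dataset:
        logits = model.predict(images, verbose=0)
        preds = np.argmax(logits, axis=-1).reshape(-1)
        labels = masks.numpy().reshape(-1)
        cm += tf.math.confusion_matrix(labels, preds, num_classes=num_classes).numpy()

    tp = np.diag(cm).astype(np.float64)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / (tp + fp + fn + 1e-7)           # per-class Intersection-over-Union
    dice = 2 * tp / (2 * tp + fp + fn + 1e-7)  # per-class Dice coefficient
    return {
        "mean_iou": float(iou.mean()),
        "mean_dice": float(dice.mean()),
        "pixel_accuracy": float(tp.sum() / cm.sum()),
    }
```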
The model produces visually accurate segmentation masks for pets. Predicted masks align well with ground-truth annotations, with most errors occurring near object boundaries and in images with complex poses or occlusions.
Sample predictions are shown directly in the notebook.
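For readers running the model outside the notebook, a minimal visualization helper could look like the following; `show_prediction` is a hypothetical name, and matplotlib is assumed to be available.

```python
import matplotlib.pyplot as plt
import tensorflow as tf


def show_prediction(model, image, mask):
    # Predict per-pixel classes for one preprocessed 128x128 image and display
    # the input, ground-truth mask, and predicted mask side by side.
    pred = tf.argmax(model.predict(image[tf.newaxis, ...], verbose=0)[0], axis=-1)
    panels = [("Input", image), ("Ground truth", tf.squeeze(mask)), ("Prediction", pred)]
    for i, (title, panel) in enumerate(panels):
        plt.subplot(1, 3, i + 1)
        plt.title(title)
        plt.imshow(panel)
        plt.axis("off")
    plt.show()


# Example: visualize the first test sample (uses test_ds from the pipeline sketch above).
images, masks = next(iter(test_ds))
show_prediction(model, images[0], masks[0])
```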
- Inference Speed: ~134 FPS on GPU
- Model Size: ~355 MB
The model is capable of real-time inference on GPU-based systems, making it suitable for fast segmentation pipelines and academic experimentation.
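The throughput figure above was measured in the notebook; a rough way to reproduce such an estimate is sketched below. `measure_throughput` is a hypothetical helper, and the result depends heavily on the GPU, batch size, and input pipeline.

```python
import time
import tensorflow as tf


def measure_throughput(model, dataset, warmup_batches=5, timed_batches=50):
    """Rough images-per-second estimate over a fixed number of batches."""
    it = iter(dataset.repeat())
    for _ in range(warmup_batches):   # warm-up to exclude graph tracing / kernel setup
        model.predict(next(it)[0], verbose=0)
    images = 0
    start = time.perf_counter()
    for _ in range(timed_batches):
        batch = next(it)[0]
        model.predict(batch, verbose=0)
        images += int(batch.shape[0])
    return images / (time.perf_counter() - start)


# Example: print(f"{measure_throughput(model, test_ds):.1f} images/sec")
```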
This project demonstrates that a U-Net based architecture can achieve strong performance on the Oxford-IIIT Pet segmentation task using a relatively compact input resolution. The combination of quantitative metrics (IoU, Dice, Pixel Accuracy) and qualitative visualizations provides a comprehensive evaluation of model performance.
- Install dependencies:

  ```bash
  pip install tensorflow tensorflow-datasets
  ```

- Open the notebook:

  ```bash
  jupyter notebook pets_unet.ipynb
  ```

- Run all cells to train the model, evaluate performance, and visualize predictions.