A deep learning-based video classification system for detecting shoplifting behavior in surveillance footage. The project includes both a custom CNN-LSTM architecture built from scratch and a fine-tuned pretrained R3D-18 (3D ResNet) model, along with a production-ready Django web application for deployment.
- Overview
- Features
- Architecture
- Dataset
- Repository Structure
- Model Training
- Model Performance
- Deployment
- Requirements
- License
This project implements a video classification model that can automatically detect shoplifting behavior from surveillance camera footage. The model combines Convolutional Neural Networks (CNN) for spatial feature extraction and Long Short-Term Memory (LSTM) networks for temporal pattern recognition.
- Custom CNN-LSTM Architecture: Built from scratch without pretrained models
- Pretrained Model Fine-tuning: R3D-18 (3D ResNet) implementation for enhanced performance
- Temporal Analysis: Processes 16 uniformly sampled frames per video
- Class Imbalance Handling: Implements class weighting and balanced loss functions
- Data Augmentation: Random flips, rotations, and color jittering for better generalization
- GPU Acceleration: Full CUDA support for faster training
- Django Deployment: Production-ready web application for video upload and prediction
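The "16 uniformly sampled frames" feature can be illustrated with a small index calculation. This is a minimal sketch, not the project's code: the function name `uniform_frame_indices` and the choice to round to the nearest frame are assumptions made for illustration.

```python
from typing import List

import numpy as np


def uniform_frame_indices(total_frames: int, num_samples: int = 16) -> List[int]:
    """Return num_samples frame indices spread evenly across a video."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    # linspace includes both endpoints, so the first and last frames are always picked.
    positions = np.linspace(0, total_frames - 1, num=num_samples)
    # Videos shorter than num_samples simply repeat some indices.
    return [int(round(float(p))) for p in positions]


indices = uniform_frame_indices(160)
print(indices)
```

The resulting indices can then be used to seek into the video with whichever decoder the project uses (the requirements list OpenCV).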
The custom CNN-LSTM model consists of three main components:
CNN Feature Extractor:
- 4 convolutional blocks with batch normalization
- Progressive channel expansion (64 → 128 → 256 → 512)
- Max pooling for spatial dimension reduction
- Adaptive average pooling for flexible input sizes
LSTM Temporal Model:
- 2-layer bidirectional LSTM
- Hidden size: 256
- Captures temporal dependencies across video frames
Classification Head:
- Fully connected layers with dropout (0.5)
- ReLU activation
- Binary classification output (Shoplifting vs Normal)
Model Parameters: ~15-20 million trainable parameters
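The components listed above could be wired together roughly as follows. This is a sketch under stated assumptions, not the notebook's actual model: the class name `CNNLSTM`, the 3×3 kernels, one convolution per block, the 128-unit hidden layer in the classifier, and mean-pooling the LSTM outputs over time are all illustrative choices, so the parameter count of this sketch will not match the figure quoted above.

```python
import torch
import torch.nn as nn


class CNNLSTM(nn.Module):
    """Per-frame CNN features -> bidirectional LSTM -> binary classifier."""

    def __init__(self, num_classes: int = 2, hidden_size: int = 256):
        super().__init__()
        channels = [3, 64, 128, 256, 512]
        blocks = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            blocks += [
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),  # assumed kernel size
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            ]
        # Adaptive pooling collapses any spatial size to 1x1.
        self.cnn = nn.Sequential(*blocks, nn.AdaptiveAvgPool2d(1))
        self.lstm = nn.LSTM(512, hidden_size, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_size, 128),  # assumed hidden width
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes),
        )

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, frames, 3, H, W)
        b, t, c, h, w = clips.shape
        feats = self.cnn(clips.reshape(b * t, c, h, w)).flatten(1)  # (b*t, 512)
        seq, _ = self.lstm(feats.reshape(b, t, -1))                 # (b, t, 2*hidden)
        return self.classifier(seq.mean(dim=1))                     # logits (b, classes)


model = CNNLSTM().eval()
with torch.no_grad():
    logits = model(torch.randn(1, 16, 3, 112, 112))
print(logits.shape)
```

Folding the frame dimension into the batch lets the 2D CNN process every frame in one pass before the LSTM sees the sequence.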
Pretrained R3D-18 Model:
- 3D ResNet-18 architecture pretrained on the Kinetics-400 dataset
- Modified final layer for binary classification
- Dropout (0.5) for regularization
- Optimized for video understanding tasks
- Classes: 2 (Non-Shoplifter, Shoplifter)
- Train/Val/Test Split: 80/10/10
- Class Distribution:
  - Non-Shoplifter: ~62%
  - Shoplifter: ~38%
- Video Processing: 16 frames per video, resized to 112×112 pixels
- Stratified Splitting: Maintains class balance across all splits
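An 80/10/10 stratified split can be produced with two calls to scikit-learn's `train_test_split`. The sketch below uses synthetic file names and a 62/38 label mix purely for illustration; the helper name `stratified_split` and the seed are assumptions.

```python
from sklearn.model_selection import train_test_split


def stratified_split(paths, labels, seed=42):
    """Split into 80% train, 10% validation, 10% test, preserving class ratios."""
    # Step 1: hold out 20% of the data for validation + test.
    train_p, rest_p, train_y, rest_y = train_test_split(
        paths, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    # Step 2: split the held-out part in half.
    val_p, test_p, val_y, test_y = train_test_split(
        rest_p, rest_y, test_size=0.5, stratify=rest_y, random_state=seed
    )
    return (train_p, train_y), (val_p, val_y), (test_p, test_y)


# Synthetic data: 62 non-shoplifter (0) and 38 shoplifter (1) clips.
labels = [0] * 62 + [1] * 38
paths = [f"video_{i:03d}.mp4" for i in range(100)]
train, val, test = stratified_split(paths, labels)
print(len(train[0]), len(val[0]), len(test[0]))
```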
shoplifting-detection/
├── deployment/
│   └── shoplifting_detection/
│       ├── manage.py
│       ├── models/
│       │   └── best_3D_CNN_model.pth      # Trained model weights
│       ├── detector/
│       │   ├── forms.py                   # Video upload form
│       │   ├── model_utils.py             # Model loading & inference
│       │   ├── views.py                   # Django views
│       │   ├── urls.py                    # URL routing
│       │   └── templates/
│       │       └── detector/
│       │           └── index.html         # Web interface
│       ├── shoplifting_detection/
│       │   ├── settings.py
│       │   ├── urls.py
│       │   └── wsgi.py
│       └── requirements.txt
│
├── assets/
│   ├── confusion_matrix.png
│   └── training_curves.png
│
├── shoplifting-FromScratch.ipynb          # Train CNN-LSTM from scratch
├── shoplifting-PreTrained.ipynb           # Fine-tune pretrained R3D-18
└── README.md
The custom CNN-LSTM training notebook is shoplifting-FromScratch.ipynb (repository root).
Training Configuration:
- Optimizer: Adam (lr=1e-3)
- Loss: Weighted Cross-Entropy
- Batch Size: 16
- Epochs: 10
- Data Augmentation: Random horizontal flip, rotation, color jitter
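The weighted cross-entropy loss in this configuration can be set up as sketched below. The inverse-frequency weighting formula and the class counts (taken from the approximate 62/38 distribution) are assumptions for illustration; the notebook may compute its weights differently.

```python
import torch
import torch.nn as nn


def inverse_frequency_weights(class_counts):
    """weight_c = N / (num_classes * count_c), so rarer classes weigh more."""
    counts = torch.tensor(class_counts, dtype=torch.float32)
    return counts.sum() / (len(class_counts) * counts)


# Illustrative counts: index 0 = Non-Shoplifter, index 1 = Shoplifter.
weights = inverse_frequency_weights([62, 38])
criterion = nn.CrossEntropyLoss(weight=weights)

# Dummy logits and targets for a batch of two clips.
logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])
targets = torch.tensor([0, 1])
loss = criterion(logits, targets)
print(weights.tolist(), loss.item())
```

With these weights, a misclassified shoplifting clip contributes more to the loss than a misclassified normal clip, which counteracts the imbalance.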
Usage:
# Load and run the notebook
# Trains a custom CNN-LSTM architecture from scratch
# Saves model as 'best_cnn_lstm_model.pth'
The R3D-18 fine-tuning notebook is shoplifting-PreTrained.ipynb (repository root).
Key Features:
- Uses pretrained weights from Kinetics-400 dataset
- Transfer learning for faster convergence
- Modified classification head for binary output
Training Configuration:
- Base Model: R3D-18 (3D ResNet)
- Optimizer: Adam (lr=1e-3)
- Batch Size: 16
- Epochs: 10
- Preprocessing: Uniform frame sampling, resize to 112×112
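The preprocessing step could turn sampled frames into a model-ready tensor along these lines. This is a sketch: the function name `frames_to_clip` is hypothetical, bilinear resizing is an assumption, and the normalization constants are the Kinetics-400 statistics used by torchvision's video models, which the notebook may or may not apply.

```python
import numpy as np
import torch
import torch.nn.functional as F

# Kinetics-400 channel statistics from torchvision's video presets (assumed here).
MEAN = torch.tensor([0.43216, 0.394666, 0.37645]).view(3, 1, 1, 1)
STD = torch.tensor([0.22803, 0.22145, 0.216989]).view(3, 1, 1, 1)


def frames_to_clip(frames: np.ndarray, size: int = 112) -> torch.Tensor:
    """Convert uint8 RGB frames (T, H, W, 3) into a float clip (3, T, size, size)."""
    clip = torch.from_numpy(frames).permute(0, 3, 1, 2).float() / 255.0  # (T, 3, H, W)
    clip = F.interpolate(clip, size=(size, size), mode="bilinear", align_corners=False)
    clip = clip.permute(1, 0, 2, 3)  # (3, T, size, size), the layout R3D-18 expects
    return (clip - MEAN) / STD


# Random stand-in for 16 decoded frames at 320x240.
frames = np.random.randint(0, 256, size=(16, 240, 320, 3), dtype=np.uint8)
clip = frames_to_clip(frames)
print(clip.shape)
```

Frames decoded with OpenCV arrive in BGR order, so they would need converting to RGB before this step.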
The project includes a Django-based web application for easy model deployment and inference.
Features:
- Upload surveillance videos through web interface
- Real-time prediction with confidence scores
- Displays probability distribution for both classes
- Video playback with prediction results
- Navigate to the deployment folder:
    cd deployment/shoplifting_detection
- Install dependencies:
    pip install -r requirements.txt
- Place your trained model (copy your .pth file to this path):
    models/best_3D_CNN_model.pth
- Run migrations:
    python manage.py migrate
- Start the server:
    python manage.py runserver
- Open the application in a browser:
    http://127.0.0.1:8000/
Model Loading:
- Model loaded once at startup (global variable)
- Avoids redundant loading for each request
- Set to evaluation mode for inference
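A load-once pattern like the one described can be sketched with a module-level variable. This variant loads lazily on first use rather than strictly at startup, and it assumes the .pth file holds a state dict; `get_model` and `build_fn` are hypothetical names, and the tiny linear layer stands in for the real network.

```python
import threading

import torch
import torch.nn as nn

_MODEL = None
_LOCK = threading.Lock()


def get_model(build_fn, weights_path=None, device="cpu"):
    """Build the model on the first call and reuse the same instance afterwards."""
    global _MODEL
    if _MODEL is None:
        with _LOCK:  # guard against two requests loading at the same time
            if _MODEL is None:
                model = build_fn()
                if weights_path is not None:
                    # Assumes the checkpoint is a state dict, not a pickled model.
                    state = torch.load(weights_path, map_location=device)
                    model.load_state_dict(state)
                model.to(device).eval()  # evaluation mode disables dropout
                _MODEL = model
    return _MODEL


# Stand-in architecture; a real app would pass the actual model constructor.
model = get_model(lambda: nn.Linear(4, 2))
same = get_model(lambda: nn.Linear(4, 2))
print(model is same)
```

In a Django project, a function like this would typically live in a helper module such as model_utils.py and be called from the view.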
Video Processing Pipeline:
- User uploads video via web form
- Video saved temporarily to media folder
- Extract 16 uniformly sampled frames
- Preprocess frames (resize, normalize)
- Run inference with loaded model
- Return prediction with confidence scores
- Display results on web interface
Prediction Output ("prediction" is either "Shoplifting" or "Normal"; numeric values are percentages):
{
  "prediction": "Shoplifting",
  "confidence": 95.67,
  "probabilities": {
    "normal": 4.33,
    "shoplifting": 95.67
  }
}
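A response of this shape can be derived from the model's two output logits with a softmax. The sketch below is illustrative: the function name `format_prediction` and the class order (index 0 = normal, index 1 = shoplifting) are assumptions that must match how the model was trained.

```python
import math


def format_prediction(logits, labels=("Normal", "Shoplifting")):
    """Turn raw logits into a prediction label plus percentage scores."""
    # Subtracting the max keeps exp() numerically stable.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "prediction": labels[best],
        "confidence": round(100 * probs[best], 2),
        "probabilities": {
            label.lower(): round(100 * p, 2) for label, p in zip(labels, probs)
        },
    }


result = format_prediction([0.0, 3.1])
print(result)
```

Because each percentage is rounded independently, the two probabilities can differ from a sum of exactly 100 by a hundredth.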
Training (notebooks):
torch>=1.9.0
torchvision>=0.10.0
opencv-python-headless>=4.5.0
numpy>=1.19.0
scikit-learn>=0.24.0
matplotlib>=3.3.0
seaborn>=0.11.0
tqdm>=4.60.0
Deployment (Django application):
Django==4.2.0
torch==2.0.0
torchvision==0.15.0
opencv-python==4.8.0.76
numpy==1.24.3
Pillow==10.0.0
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments:
- R3D-18 architecture from torchvision.models.video
- Dataset: Shoplifting Videos Dataset
- Built with PyTorch and Django frameworks

