📌 A knowledge distillation framework for creating compact vision-language models, inspired by TinyCLIP. TinySigLIP distills knowledge from large SigLIP teacher models to smaller, efficient student models while maintaining competitive performance.
- 🎯 Efficient Distillation: Multiple distillation strategies for effective knowledge transfer
- 📉 Parameter Efficient: Support for smaller vocabularies, saving ~86M parameters compared to full models
- ⚡ Fast Inference: Compact models with faster inference speed
- 🔄 Easy to Use: Simple training and evaluation pipeline
- 📊 Experiment Tracking: Real-time training metrics with Weights & Biases integration
- Coming Soon: Model Zoo with pre-trained checkpoints
- In Progress: Training and evaluation on large-scale datasets
Pre-trained models will be released here. Checkpoints are currently under development.
| Model | Teacher | Student Vision | Student Text (Params) | ImageNet-1K Zero-shot | Total Params | Status |
|---|---|---|---|---|---|---|
| [TinySigLIP-ViT-Tiny] | SigLIP-Base | ViT-Tiny/16 | 19M | [To be filled] | ~39M | 🚧 Training |
| [TinySigLIP-ViT-Small] | SigLIP-Base | ViT-Small/16 | 19M | [To be filled] | ~60M | 📋 Planned |
| [TinySigLIP-ResNet] | SigLIP-Base | ResNet-50 | 19M | [To be filled] | ~45M | 📋 Planned |
Note: Model checkpoints will be available soon. Training progress can be monitored on W&B Dashboard.
Training progress and metrics are tracked using Weights & Biases. View the experiment dashboard:
The dashboard includes real-time training metrics, loss curves, model checkpoints, and hyperparameter configurations for all experiments.
🔰 The sections below cover setup, training, and evaluation.
Install dependencies:
# Using uv (recommended)
uv sync
# Or using pip
pip install -r requirements.txt
COCO 2017 Caption Dataset
For training, you need to download the COCO 2017 Caption dataset. We provide convenient download scripts:
Option 1: Using Bash Script (Recommended - with defaults)
The easiest way is to use the bash script with default parameters. By default, it downloads all splits (train, val, test) and annotations:
# Download with all defaults (downloads ALL splits: train, val, test to ./data/coco)
./download_coco.sh
# Download only specific splits (for quick testing or limited storage)
./download_coco.sh --split val
./download_coco.sh --split train val # Skip test split
# Download to custom directory
./download_coco.sh --data-dir /path/to/coco
# Download only annotations (if images are already downloaded)
./download_coco.sh --annotations-only
# Download and cleanup zip files to save space
./download_coco.sh --cleanup
Option 2: Using Python Script Directly
You can also use the Python script directly:
# Download all splits (train, val, test) and annotations (default behavior)
python download_coco.py --data-dir ./data/coco
# Download only specific splits (for quick testing or limited storage)
python download_coco.py --data-dir ./data/coco --split val
python download_coco.py --data-dir ./data/coco --split train val # Skip test split
# Download only annotations (if images are already downloaded)
python download_coco.py --data-dir ./data/coco --annotations-only
# Download specific splits
python download_coco.py --data-dir ./data/coco --split train val
# Clean up zip files after extraction to save space
python download_coco.py --data-dir ./data/coco --cleanup
The script will:
- Download all splits by default: training images (~19GB), validation images (~1GB), test images (~6GB), and annotations (~241MB)
- Total size: ~26GB (all splits) or ~20GB (train + val only)
- Extract files to organized directories
- Show download progress with progress bars
- Zero configuration needed - just run ./download_coco.sh and start training!
Directory Structure After Download:
./data/coco/
├── images/
│ ├── train2017/ # Training images (~118K images, ~19GB)
│ ├── val2017/ # Validation images (~5K images, ~1GB)
│ └── test2017/ # Test images (~40K images, ~6GB)
├── annotations/
│ ├── captions_train2017.json
│ └── captions_val2017.json
└── downloads/ # Zip files (can be removed with --cleanup)
Configuration:
✅ No configuration needed! The default configuration in config/config.yaml is already set up:
dataset:
coco_root: "data/coco/images" # Relative to project root
coco_ann_file: "data/coco/annotations/captions_train2017.json"
split: "train" # or "val" or "test"If you use the default download location (./data/coco), just run ./download_coco.sh and start training - everything is pre-configured!
The paths are relative to the project root and will be automatically resolved. If you downloaded to a custom location, you can update the paths in config/config.yaml using either:
- Relative paths (relative to project root): "data/coco/images"
- Absolute paths: "/path/to/coco/images"
1. Training
Start training with the universal training script:
./train.sh
The script automatically detects your hardware and training setup:
- Multiple GPUs: Automatically uses distributed training (DDP) with all available GPUs
- Single GPU: Runs single GPU training
- CPU/MacBook: Falls back to CPU training
You can also run directly:
python train.py
Manual Multi-GPU Training:
If you want to manually specify the number of GPUs:
# Use torchrun directly
torchrun --nproc_per_node=4 train.py
# Or specify via environment variable
NUM_GPUS=4 ./train.sh
How it works:
The training script automatically detects the distributed training environment and will:
- Initialize distributed training using NCCL backend (when multiple GPUs detected)
- Wrap the model with DistributedDataParallel (DDP) for multi-GPU training
- Use DistributedSampler for data distribution across GPUs
- Only save checkpoints and logs on the main process (rank 0)
Note: The batch size in config/config.yaml is the batch size per GPU. The effective batch size will be batch_size × num_gpus when using multiple GPUs.
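For illustration, here is a minimal sketch of the kind of DDP setup described above. The real logic lives in train.py; the function name and exact structure below are assumptions, not the project's actual code.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def setup_training(model, dataset, batch_size_per_gpu):
    """Sketch of the auto-detection behavior described above (illustrative only)."""
    world_size = int(os.environ.get("WORLD_SIZE", "1"))        # set by torchrun
    if world_size > 1:
        dist.init_process_group(backend="nccl")                # NCCL backend for multi-GPU
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        model = DDP(model.cuda(local_rank), device_ids=[local_rank])
        sampler = DistributedSampler(dataset)                  # shard the data across ranks
    else:
        sampler = None                                         # single GPU / CPU fallback
    loader = DataLoader(dataset, batch_size=batch_size_per_gpu,
                        sampler=sampler, shuffle=(sampler is None))
    is_main_process = world_size == 1 or dist.get_rank() == 0  # only rank 0 saves/logs
    return model, loader, is_main_process
```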
The training script uses Hydra for configuration management. Modify config/config.yaml to adjust hyperparameters.
Note: Training metrics are automatically logged to Weights & Biases. Make sure you have configured your W&B API key if you want to track experiments:
wandb login
2. Evaluation
Evaluate your trained model on ImageNet-1k zero-shot classification:
python eval_imagenet1k.py \
--imagenet-val /path/to/imagenet/val \
--resume /path/to/checkpoint.pt \
--batch-size 32 \
--num-workers 4
Arguments:
- --imagenet-val: Path to the ImageNet validation set directory
- --resume: Path to a checkpoint file (.pt file saved during training)
- --batch-size: Batch size for evaluation (default: 32)
- --num-workers: Number of data loading workers (default: 4)
- --device: Device to use (default: cuda)
- --logit-scale: Optional logit scale (temperature). If not specified, the value from the checkpoint is used
Example:
# Evaluate a trained model
python eval_imagenet1k.py \
--imagenet-val ./ImageNet \
--resume ./outputs/2025-11-29_20-15-10/checkpoint.pt \
--batch-size 64
The evaluation script will:
- Load the checkpoint and restore model configuration
- Load or create the processor from checkpoint directory
- Generate text prompts for all 1000 ImageNet classes (e.g., "a photo of a {class_name}")
- Compute text features for all classes
- Evaluate on ImageNet validation set and report Top-1 and Top-5 accuracy
Note: The checkpoint directory should contain a processor/ subdirectory (saved automatically during training) for proper text tokenization. If not available, the script will attempt to create a processor from the checkpoint configuration.
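To make the evaluation flow concrete, here is a simplified sketch of prompt-based zero-shot classification. It is not the eval_imagenet1k.py implementation; in particular, model.encode_image, model.encode_text, and the processor call are assumed interfaces.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_accuracy(model, processor, class_names, image_loader, device="cuda"):
    # 1) Build one prompt per ImageNet class and encode all prompts once.
    prompts = [f"a photo of a {name}" for name in class_names]
    text_inputs = processor(text=prompts, padding=True, return_tensors="pt").to(device)
    text_feats = F.normalize(model.encode_text(**text_inputs), dim=-1)   # (1000, D), assumed API

    top1 = top5 = total = 0
    for images, labels in image_loader:
        image_feats = F.normalize(model.encode_image(images.to(device)), dim=-1)  # (B, D), assumed API
        logits = image_feats @ text_feats.t()                  # cosine similarities (B, 1000)
        top5_pred = logits.topk(5, dim=-1).indices
        labels = labels.to(device).unsqueeze(1)
        top1 += (top5_pred[:, :1] == labels).any(dim=1).sum().item()
        top5 += (top5_pred == labels).any(dim=1).sum().item()
        total += labels.size(0)
    return top1 / total, top5 / total
```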
3. Image-Text Retrieval Evaluation (COCO)
Evaluate image-text retrieval performance on COCO dataset:
python eval_retrieval.py \
--resume /path/to/checkpoint.pt \
--split val \
--coco-root /path/to/coco/images \
--coco-ann-file /path/to/coco/annotations/captions_val2017.json \
--batch-size 32 \
--num-workers 4
Arguments:
- --resume: Path to a checkpoint file (.pt file saved during training)
- --split: Dataset split to evaluate on (val or test, default: val)
- --coco-root: Root directory where COCO images are stored
- --coco-ann-file: Path to the COCO annotation JSON file
- --batch-size: Batch size for evaluation (default: 32)
- --num-workers: Number of data loading workers (default: 4)
- --device: Device to use (default: cuda)
- --max-samples: Maximum number of samples to evaluate (default: None for all)
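For reference, retrieval is scored with Recall@K over the caption-image similarity matrix. Below is a minimal sketch of the metric, not the exact eval_retrieval.py code; it assumes pre-computed, L2-normalized features.

```python
import torch

def recall_at_k(image_feats, text_feats, text_to_image, ks=(1, 5, 10)):
    """image_feats: (N_img, D), text_feats: (N_txt, D), both L2-normalized.
    text_to_image[j] is the index of the image that caption j describes
    (COCO has ~5 captions per image)."""
    sims = text_feats @ image_feats.t()                    # (N_txt, N_img)
    ranked = sims.argsort(dim=-1, descending=True)         # best-matching images per caption
    gt = torch.as_tensor(text_to_image).unsqueeze(1)
    return {f"text->image R@{k}": (ranked[:, :k] == gt).any(dim=1).float().mean().item()
            for k in ks}
```

Image-to-text recall is computed symmetrically by ranking captions for each image.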
The student model consists of a vision encoder (from timm) and a lightweight text encoder:
graph TB
subgraph Vision["Vision Path"]
direction TB
I[Input Images<br/>B×3×H×W] --> VE[Vision Encoder<br/>timm backbone]
VE --> VP[Vision Projection<br/>Linear]
VP --> NF1[L2 Normalize]
end
subgraph Text["Text Path"]
direction TB
T[Input Text IDs<br/>B×seq_len] --> TE[Text Embedding<br/>vocab_size×dim]
TE --> TP[Positional Embedding]
TP --> TT[Text Transformer<br/>L layers]
TT --> TP2[Text Projection<br/>Linear]
TP2 --> NF2[L2 Normalize]
end
NF1 --> SIM[Similarity Matrix<br/>B×B]
NF2 --> SIM
style Vision fill:#f8fafc,stroke:#dc2626,stroke-width:3px
style Text fill:#f8fafc,stroke:#16a34a,stroke-width:3px
style I fill:#2563eb,stroke:#1e40af,stroke-width:2px,color:#fff
style T fill:#2563eb,stroke:#1e40af,stroke-width:2px,color:#fff
style VE fill:#dc2626,stroke:#b91c1c,stroke-width:2px,color:#fff
style TE fill:#ea580c,stroke:#c2410c,stroke-width:2px,color:#fff
style TP fill:#ca8a04,stroke:#a16207,stroke-width:2px,color:#fff
style TT fill:#16a34a,stroke:#15803d,stroke-width:2px,color:#fff
style VP fill:#0891b2,stroke:#0e7490,stroke-width:2px,color:#fff
style TP2 fill:#0891b2,stroke:#0e7490,stroke-width:2px,color:#fff
style NF1 fill:#7c3aed,stroke:#6d28d9,stroke-width:2px,color:#fff
style NF2 fill:#7c3aed,stroke:#6d28d9,stroke-width:2px,color:#fff
style SIM fill:#059669,stroke:#047857,stroke-width:2px,color:#fff
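A condensed sketch of the architecture in the diagram above, assuming mean pooling over text tokens and illustrative layer sizes; the actual definition lives in tinysiglip/model.py.

```python
import timm
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentSketch(nn.Module):
    def __init__(self, vision_model_name="vit_tiny_patch16_224",
                 vocab_size=32000, dim=384, max_len=64, layers=6, heads=6):
        super().__init__()
        # Vision path: timm backbone -> linear projection -> L2 normalize
        self.vision = timm.create_model(vision_model_name, pretrained=True, num_classes=0)
        self.vision_proj = nn.Linear(self.vision.num_features, dim)
        # Text path: token + positional embeddings -> transformer -> projection -> L2 normalize
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.text_proj = nn.Linear(dim, dim)

    def forward(self, images, input_ids):
        img = F.normalize(self.vision_proj(self.vision(images)), dim=-1)        # (B, dim)
        txt = self.tok_emb(input_ids) + self.pos_emb[:, : input_ids.size(1)]
        txt = self.text_encoder(txt).mean(dim=1)                                # simple pooling
        txt = F.normalize(self.text_proj(txt), dim=-1)                          # (B, dim)
        return img, txt, img @ txt.t()                                          # (B, B) similarity
```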
The distillation process uses multiple loss components to transfer knowledge from teacher to student:
graph TB
subgraph T["Teacher Model"]
direction TB
TI[Teacher Images] --> TF[Teacher Features]
TT[Teacher Text] --> TF
end
subgraph S["Student Model"]
direction TB
SI[Student Images] --> SF[Student Features]
ST[Student Text] --> SF
end
TF --> L1[SigLIP Loss<br/>Binary Cross-Entropy]
SF --> L1
TF --> L2[CMD Loss<br/>KL Divergence]
SF --> L2
TF --> L3[UMD Loss<br/>MSE on Features]
SF --> L3
L1 --> TL[Total Loss]
L2 --> TL
L3 --> TL
TL --> OPT[Optimizer]
style T fill:#f8fafc,stroke:#7c3aed,stroke-width:3px
style S fill:#f8fafc,stroke:#2563eb,stroke-width:3px
style TI fill:#2563eb,stroke:#1e40af,stroke-width:2px,color:#fff
style TT fill:#2563eb,stroke:#1e40af,stroke-width:2px,color:#fff
style TF fill:#7c3aed,stroke:#6d28d9,stroke-width:2px,color:#fff
style SI fill:#16a34a,stroke:#15803d,stroke-width:2px,color:#fff
style ST fill:#16a34a,stroke:#15803d,stroke-width:2px,color:#fff
style SF fill:#0891b2,stroke:#0e7490,stroke-width:2px,color:#fff
style L1 fill:#ea580c,stroke:#c2410c,stroke-width:2px,color:#fff
style L2 fill:#ca8a04,stroke:#a16207,stroke-width:2px,color:#fff
style L3 fill:#dc2626,stroke:#b91c1c,stroke-width:2px,color:#fff
style TL fill:#991b1b,stroke:#7f1d1d,stroke-width:3px,color:#fff
style OPT fill:#059669,stroke:#047857,stroke-width:2px,color:#fff
- SigLIP Loss (L_SigLIP): Binary cross-entropy with sigmoid activation for contrastive learning
  L_SigLIP = (1/2) × [BCE(σ(S_I2T), Y) + BCE(σ(S_T2I), Y)]
  Where:
  - σ is the sigmoid function
  - S_I2T and S_T2I are the image-to-text and text-to-image similarity matrices
  - Y is the ground-truth label matrix
  - BCE is binary cross-entropy
- Cross-Modal Distillation (CMD) (L_CMD): KL divergence between teacher and student similarity distributions
  L_CMD = KL(P_T(S_T) || P_S(S_S))
  Where:
  - P_T and P_S are probability distributions over teacher and student similarities
  - S_T and S_S are the similarity matrices from the teacher and student models
- Uni-Modal Distillation (UMD) (L_UMD): MSE loss on normalized features from the vision and text encoders
  L_UMD = (1/2) × [MSE(f_V^T, f_V^S) + MSE(f_T^T, f_T^S)]
  Where:
  - f_V^T and f_V^S are normalized vision features from the teacher and student
  - f_T^T and f_T^S are normalized text features from the teacher and student
- Total Loss:
L_total = λ_SigLIP × L_SigLIP + λ_CMD × L_CMD + λ_UMD × L_UMD
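A compact sketch of how these terms could be combined in code. The actual implementation is in tinysiglip/loss.py; the fixed temperature, the identity-matrix labels, and the assumption that teacher and student features share a dimension are simplifications.

```python
import torch
import torch.nn.functional as F

def distillation_losses(s_img, s_txt, t_img, t_txt, logit_scale=10.0,
                        lambda_siglip=1.0, lambda_cmd=1.0, lambda_umd=1.0):
    """All inputs are L2-normalized (B, D) features; s_* from the student, t_* from the teacher."""
    B = s_img.size(0)
    s_sim = logit_scale * s_img @ s_txt.t()                 # student similarity matrix (B, B)
    t_sim = logit_scale * t_img @ t_txt.t()                 # teacher similarity matrix (B, B)
    targets = torch.eye(B, device=s_sim.device)             # matched pairs on the diagonal

    # SigLIP loss: BCE with sigmoid, image->text and text->image
    l_siglip = 0.5 * (F.binary_cross_entropy_with_logits(s_sim, targets) +
                      F.binary_cross_entropy_with_logits(s_sim.t(), targets))

    # CMD: KL(P_T || P_S) between teacher and student similarity distributions
    l_cmd = F.kl_div(F.log_softmax(s_sim, dim=-1),
                     F.softmax(t_sim, dim=-1), reduction="batchmean")

    # UMD: MSE between normalized per-modality features
    l_umd = 0.5 * (F.mse_loss(s_img, t_img) + F.mse_loss(s_txt, t_txt))

    total = lambda_siglip * l_siglip + lambda_cmd * l_cmd + lambda_umd * l_umd
    return total, {"siglip": l_siglip, "cmd": l_cmd, "umd": l_umd}
```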
- tinysiglip/model.py: Student model definition (uses timm for the vision encoder)
- tinysiglip/loss.py: Distillation loss functions (SigLIP loss + CMD + UMD + embedding mimicking)
- tinysiglip/embedding_distillation.py: Token embedding layer distillation utilities (token mapping and weight transfer)
- tinysiglip/coco_dataset.py: COCO dataset implementation
- tinysiglip/fake_dataset.py: Dummy dataset for testing
- tinysiglip/processor.py: Data preprocessing utilities
- tinysiglip/metrics.py: Evaluation metrics
- train.py: Training script with Hydra configuration
- eval_imagenet1k.py: ImageNet-1K zero-shot classification evaluation
- eval_retrieval.py: Image-text retrieval evaluation
Modify config/config.yaml to customize training:
- Teacher Model (teacher.model_name):
  - Default: google/siglip-base-patch16-224
  - Any SigLIP model from HuggingFace can be used
- Student Vision Model (student.vision_model_name):
  - Any timm model name (e.g., vit_tiny_patch16_224, resnet50, efficientnet_b0)
  - Pre-trained weights are loaded automatically
- Student Vocabulary Size (student.vocab_size):
  - Default: 32000 (for English-only models)
  - Can be set smaller than the teacher's vocabulary to save parameters (e.g., 32K vs. 256K saves ~86M params)
  - Common sizes: 32000 (English BPE), 50257 (GPT-2), 49152 (English CLIP)
  - Set to null to use the same vocabulary size as the teacher model
- Training Hyperparameters:
  - Batch size, learning rate, warmup steps, etc.
  - Loss weights: lambda_siglip, lambda_cmd, lambda_umd
See config/config.yaml for full configuration options.
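For context, here is a minimal example of how a Hydra-managed script reads such options; the nesting of the loss weights under cfg.loss is an assumption, so check config/config.yaml for the actual keys.

```python
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="config", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    print("teacher:", cfg.teacher.model_name)
    print("student vision:", cfg.student.vision_model_name)
    print("student vocab size:", cfg.student.vocab_size)
    print("loss weights:", cfg.loss.lambda_siglip, cfg.loss.lambda_cmd, cfg.loss.lambda_umd)  # assumed nesting

if __name__ == "__main__":
    main()
```

Values can also be overridden from the command line in the usual Hydra dot-notation style (e.g. python train.py student.vocab_size=49152).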
The student model can use a different vocabulary size than the teacher model, which is useful for creating smaller English-specific models.
How it works:
- When using real data (USE_REAL_DATA=True), the code automatically:
  - Loads the teacher and student tokenizers
  - Finds shared tokens between the vocabularies using create_token_mapping()
  - Transfers embedding weights for the shared tokens using transfer_embedding_weights()
- When using dummy data (USE_REAL_DATA=False), the code uses create_dummy_token_mapping() for testing purposes
For real applications:
- Set USE_REAL_DATA=True in the configuration
- Specify student.tokenizer_name if using a different tokenizer than the teacher
- The training script automatically handles tokenizer loading and token mapping
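To make the shared-token idea concrete, this is roughly what a token mapping boils down to. The real create_token_mapping() in tinysiglip/embedding_distillation.py may differ, e.g. in how it handles tokenizer-specific normalization and special tokens.

```python
def sketch_token_mapping(teacher_tokenizer, student_tokenizer):
    """Simplified view: intersect the vocabularies and record each shared token's id in both."""
    teacher_vocab = teacher_tokenizer.get_vocab()   # token string -> teacher id
    student_vocab = student_tokenizer.get_vocab()   # token string -> student id
    shared_tokens = sorted(set(teacher_vocab) & set(student_vocab))
    student_indices = [student_vocab[tok] for tok in shared_tokens]
    teacher_indices = [teacher_vocab[tok] for tok in shared_tokens]
    return student_indices, teacher_indices
```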
When the student model uses a smaller vocabulary than the teacher model (e.g., 32K vs 256K), there are two methods to transfer knowledge:
Method 1: Weight Transfer (default, recommended)
Core Advantage: Zero runtime overhead, one-time initialization
- Before training starts, the weights of shared tokens are copied directly from the teacher model's embedding layer into the student model
- No additional loss terms need to be computed in the training loop
- Achieves maximum parameter compression
Implementation:
- Set USE_WEIGHT_TRANSFER = True (default)
- Weights are automatically transferred after model initialization
- Remaining non-shared tokens are randomly initialized and learned during training
Parameter Savings: Using a 32K vocabulary compared to 256K can save approximately 86M parameters (in the embedding layer)
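For a rough sense of where that figure comes from: assuming a text embedding width of 384 (an illustrative value, not necessarily the configured one), shrinking the vocabulary from 256,000 to 32,000 tokens removes (256,000 − 32,000) × 384 ≈ 86M parameters from the embedding table.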
Method 2: Embedding Mimicking Loss
Core Idea:
- Find the shared tokens in the student and teacher vocabularies
- During training, continuously push the student model's embeddings for these shared tokens toward the teacher model's embeddings
- Implemented via MSE loss:
L_Emb = MSE(Emb_S(shared tokens), Emb_T(shared tokens))
Implementation:
- Set USE_WEIGHT_TRANSFER = False
- Weight controlled by LAMBDA_EMBEDDING (default 0.0, as weight transfer is recommended)
- Continuously computed during training
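A minimal sketch of the mimicking loss itself, assuming the shared-token index lists produced by create_token_mapping() and matching embedding dimensions:

```python
import torch.nn.functional as F

def embedding_mimicking_loss(student_embedding, teacher_embedding,
                             shared_student_indices, shared_teacher_indices):
    # MSE between the embedding rows of shared tokens; the teacher side is detached
    # because the teacher is frozen during distillation.
    student_rows = student_embedding.weight[shared_student_indices]
    teacher_rows = teacher_embedding.weight[shared_teacher_indices].detach()
    return F.mse_loss(student_rows, teacher_rows)
```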
In real applications, you can use actual tokenizers to find shared tokens and transfer weights:
from tinysiglip.embedding_distillation import create_token_mapping, transfer_embedding_weights
from transformers import AutoTokenizer
# Load tokenizers
student_tokenizer = AutoTokenizer.from_pretrained("your-student-tokenizer")
teacher_tokenizer = AutoTokenizer.from_pretrained(TEACHER_MODEL_NAME)
# Step 1: Find shared tokens between vocabularies
shared_student_indices, shared_teacher_indices = create_token_mapping(
teacher_tokenizer=teacher_tokenizer,
student_tokenizer=student_tokenizer,
verbose=True,
)
# Step 2: Transfer weights for shared tokens
transferred_count = transfer_embedding_weights(
student_embedding_layer=student_model.text_embedding,
teacher_embedding_layer=teacher_model.text_model.embeddings.token_embedding,
shared_student_indices=shared_student_indices,
shared_teacher_indices=shared_teacher_indices,
verbose=True,
)
Note: The training script (train.py) automatically uses real tokenizers when USE_REAL_DATA=True and tokenizers are available. Otherwise, it falls back to dummy token mapping for testing purposes.
- Teacher Model: [To be filled]
- Student Model: [To be filled]
- Dataset: [To be filled]
- Evaluation Metrics: [To be filled]
| Configuration | SigLIP Loss | CMD Loss | UMD Loss | Embedding Transfer | Image-Text Retrieval (R@1) | Text-Image Retrieval (R@1) | Parameters |
|---|---|---|---|---|---|---|---|
| Baseline (SigLIP only) | ✓ | ✗ | ✗ | ✗ | [To be filled] | [To be filled] | [To be filled] |
| + CMD | ✓ | ✓ | ✗ | ✗ | [To be filled] | [To be filled] | [To be filled] |
| + UMD | ✓ | ✓ | ✓ | ✗ | [To be filled] | [To be filled] | [To be filled] |
| + Embedding Transfer | ✓ | ✓ | ✓ | ✓ | [To be filled] | [To be filled] | [To be filled] |
[To be filled: Analysis of ablation study results, including:
- Impact of each loss component
- Effectiveness of embedding weight transfer vs. mimicking loss
- Trade-offs between model size and performance
- Comparison with other distillation methods]
- SigLIP: Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. (2023). Sigmoid Loss for Language Image Pre-Training. arXiv preprint arXiv:2303.15343
- TinyCLIP: Wu, K., et al. (2023). TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). arXiv:2309.12314
- CLIP: Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. International Conference on Machine Learning (ICML). arXiv:2103.00020
- Knowledge Distillation: Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531
- Feature Distillation: Romero, A., et al. (2014). FitNets: Hints for Thin Deep Nets. arXiv preprint arXiv:1412.6550
- ALIGN: Jia, C., et al. (2021). Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. International Conference on Machine Learning (ICML). arXiv:2102.05918
- BLIP: Li, J., et al. (2022). BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. International Conference on Machine Learning (ICML). arXiv:2201.12086
- Model Compression Survey: Choudhary, T., et al. (2020). A Comprehensive Survey on Model Compression and Acceleration. Artificial Intelligence Review, 53(7), 5113-5155.
See LICENSE file for details.
If you use this code in your research, please cite:
@misc{tinysiglip2024,
title={TinySigLIP: SigLIP Model Distillation with Timm-based Student Architecture},
author={[Your Name]},
year={2024},
howpublished={\url{https://github.com/yourusername/TinySigLIP}}
}