📌 A knowledge distillation framework for creating compact vision-language models, inspired by TinyCLIP. TinySigLIP distills knowledge from large SigLIP teacher models to smaller, efficient student models while maintaining competitive performance.
- 🎯 Efficient Distillation: Multiple distillation strategies for effective knowledge transfer
- 📉 Parameter Efficient: Support for smaller vocabularies, saving ~86M parameters compared to full models
- ⚡ Fast Inference: Compact models with faster inference speed
- 🔄 Easy to Use: Simple training and evaluation pipeline
- 📊 Experiment Tracking: Real-time training metrics with Weights & Biases integration
- Coming Soon: Model Zoo with pre-trained checkpoints
- In Progress: Training and evaluation on large-scale datasets
Pre-trained models will be released here. Checkpoints are currently under development.
| Model | Teacher | Student Vision | Student Text (Params) | ImageNet-1K Zero-shot | Total Params | Status |
|---|---|---|---|---|---|---|
| [TinySigLIP-ViT-Tiny] | SigLIP-Base | ViT-Tiny/16 | 19M | [To be filled] | ~39M | 🚧 Training |
| [TinySigLIP-ViT-Small] | SigLIP-Base | ViT-Small/16 | 19M | [To be filled] | ~60M | 📋 Planned |
| [TinySigLIP-ResNet] | SigLIP-Base | ResNet-50 | 19M | [To be filled] | ~45M | 📋 Planned |
Note: Model checkpoints will be available soon. Training progress can be monitored on W&B Dashboard.
Training progress and metrics are tracked using Weights & Biases. View the experiment dashboard:
The dashboard includes real-time training metrics, loss curves, model checkpoints, and hyperparameter configurations for all experiments.
🔰 The sections below cover setup, training, and evaluation.
Install dependencies:
# Using uv (recommended)
uv sync
# Or using pip
pip install -r requirements.txt
COCO 2017 Caption Dataset
For training, you need to download the COCO 2017 Caption dataset. We provide convenient download scripts:
Option 1: Using Bash Script (Recommended - with defaults)
The easiest way is to use the bash script with default parameters. By default, it downloads all splits (train, val, test) and annotations:
# Download with all defaults (downloads ALL splits: train, val, test to ./data/coco)
./download_coco.sh
# Download only specific splits (for quick testing or limited storage)
./download_coco.sh --split val
./download_coco.sh --split train val # Skip test split
# Download to custom directory
./download_coco.sh --data-dir /path/to/coco
# Download only annotations (if images are already downloaded)
./download_coco.sh --annotations-only
# Download and cleanup zip files to save space
./download_coco.sh --cleanup
Option 2: Using Python Script Directly
You can also use the Python script directly:
# Download all splits (train, val, test) and annotations (default behavior)
python download_coco.py --data-dir ./data/coco
# Download only specific splits (for quick testing or limited storage)
python download_coco.py --data-dir ./data/coco --split val
python download_coco.py --data-dir ./data/coco --split train val # Skip test split
# Download only annotations (if images are already downloaded)
python download_coco.py --data-dir ./data/coco --annotations-only
# Download specific splits
python download_coco.py --data-dir ./data/coco --split train val
# Clean up zip files after extraction to save space
python download_coco.py --data-dir ./data/coco --cleanup
The script will:
- Download all splits by default: training images (~19GB), validation images (~1GB), test images (~6GB), and annotations (~241MB)
- Total size: ~26GB (all splits) or ~20GB (train + val only)
- Extract files to organized directories
- Show download progress with progress bars
- Zero configuration needed - just run ./download_coco.sh and start training!
Directory Structure After Download:
./data/coco/
├── images/
│ ├── train2017/ # Training images (~118K images, ~19GB)
│ ├── val2017/ # Validation images (~5K images, ~1GB)
│ └── test2017/ # Test images (~40K images, ~6GB)
├── annotations/
│ ├── captions_train2017.json
│ └── captions_val2017.json
└── downloads/ # Zip files (can be removed with --cleanup)
Configuration:
✅ No configuration needed! The default configuration in config/config.yaml is already set up:
dataset:
coco_root: "data/coco/images" # Relative to project root
coco_ann_file: "data/coco/annotations/captions_train2017.json"
split: "train" # or "val" or "test"If you use the default download location (./data/coco), just run ./download_coco.sh and start training - everything is pre-configured!
The paths are relative to the project root and will be automatically resolved. If you downloaded to a custom location, you can update the paths in config/config.yaml using either:
- Relative paths (relative to project root): "data/coco/images"
- Absolute paths: "/path/to/coco/images"
1. Training
Start training with the universal training script:
./train.sh
The script automatically detects your hardware and training setup:
- Multiple GPUs: Automatically uses distributed training (DDP) with all available GPUs
- Single GPU: Runs single GPU training
- CPU/MacBook: Falls back to CPU training
You can also run directly:
python train.py
Manual Multi-GPU Training:
If you want to manually specify the number of GPUs:
# Use torchrun directly
torchrun --nproc_per_node=4 train.py
# Or specify via environment variable
NUM_GPUS=4 ./train.sh
How it works:
The training script automatically detects the distributed training environment and will:
- Initialize distributed training using NCCL backend (when multiple GPUs detected)
- Wrap the model with DistributedDataParallel (DDP) for multi-GPU training
- Use DistributedSampler for data distribution across GPUs
- Only save checkpoints and logs on the main process (rank 0)
Note: The batch size in config/config.yaml is the batch size per GPU. The effective batch size will be batch_size × num_gpus when using multiple GPUs.
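For illustration, here is a minimal sketch of the kind of DDP setup described above. The real logic lives in train.py; the function name and exact structure below are assumptions, not the project's actual code.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def setup_training(model, dataset, batch_size_per_gpu):
    """Sketch of the auto-detection behavior described above (illustrative only)."""
    world_size = int(os.environ.get("WORLD_SIZE", "1"))        # set by torchrun
    if world_size > 1:
        dist.init_process_group(backend="nccl")                # NCCL backend for multi-GPU
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        model = DDP(model.cuda(local_rank), device_ids=[local_rank])
        sampler = DistributedSampler(dataset)                  # shard the data across ranks
    else:
        sampler = None                                         # single GPU / CPU fallback
    loader = DataLoader(dataset, batch_size=batch_size_per_gpu,
                        sampler=sampler, shuffle=(sampler is None))
    is_main_process = world_size == 1 or dist.get_rank() == 0  # only rank 0 saves/logs
    return model, loader, is_main_process
```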
The training script uses Hydra for configuration management. Modify config/config.yaml to adjust hyperparameters.
Note: Training metrics are automatically logged to Weights & Biases. Make sure you have configured your W&B API key if you want to track experiments:
wandb login
2. Evaluation
Evaluate your trained model on ImageNet-1k zero-shot classification:
python eval_imagenet1k.py \
--imagenet-val /path/to/imagenet/val \
--resume /path/to/checkpoint.pt \
--batch-size 32 \
--num-workers 4
Arguments:
- --imagenet-val: Path to the ImageNet validation set directory
- --resume: Path to a checkpoint file (.pt file saved during training)
- --batch-size: Batch size for evaluation (default: 32)
- --num-workers: Number of data loading workers (default: 4)
- --device: Device to use (default: cuda)
- --logit-scale: Optional logit scale (temperature). If not specified, the value from the checkpoint is used
Example:
# Evaluate a trained model
python eval_imagenet1k.py \
--imagenet-val ./ImageNet \
--resume ./outputs/2025-11-29_20-15-10/checkpoint.pt \
--batch-size 64
The evaluation script will:
- Load the checkpoint and restore model configuration
- Load or create the processor from checkpoint directory
- Generate text prompts for all 1000 ImageNet classes (e.g., "a photo of a {class_name}")
- Compute text features for all classes
- Evaluate on ImageNet validation set and report Top-1 and Top-5 accuracy
Note: The checkpoint directory should contain a processor/ subdirectory (saved automatically during training) for proper text tokenization. If not available, the script will attempt to create a processor from the checkpoint configuration.
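To make the evaluation flow concrete, here is a simplified sketch of prompt-based zero-shot classification. It is not the eval_imagenet1k.py implementation; in particular, model.encode_image, model.encode_text, and the processor call are assumed interfaces.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_accuracy(model, processor, class_names, image_loader, device="cuda"):
    # 1) Build one prompt per ImageNet class and encode all prompts once.
    prompts = [f"a photo of a {name}" for name in class_names]
    text_inputs = processor(text=prompts, padding=True, return_tensors="pt").to(device)
    text_feats = F.normalize(model.encode_text(**text_inputs), dim=-1)   # (1000, D), assumed API

    top1 = top5 = total = 0
    for images, labels in image_loader:
        image_feats = F.normalize(model.encode_image(images.to(device)), dim=-1)  # (B, D), assumed API
        logits = image_feats @ text_feats.t()                  # cosine similarities (B, 1000)
        top5_pred = logits.topk(5, dim=-1).indices
        labels = labels.to(device).unsqueeze(1)
        top1 += (top5_pred[:, :1] == labels).any(dim=1).sum().item()
        top5 += (top5_pred == labels).any(dim=1).sum().item()
        total += labels.size(0)
    return top1 / total, top5 / total
```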
3. Image-Text Retrieval Evaluation (COCO)
Evaluate image-text retrieval performance on COCO dataset:
python eval_retrieval.py \
--resume /path/to/checkpoint.pt \
--split val \
--coco-root /path/to/coco/images \
--coco-ann-file /path/to/coco/annotations/captions_val2017.json \
--batch-size 32 \
--num-workers 4
Arguments:
- --resume: Path to a checkpoint file (.pt file saved during training)
- --split: Dataset split to evaluate on (val or test, default: val)
- --coco-root: Root directory where COCO images are stored
- --coco-ann-file: Path to the COCO annotation JSON file
- --batch-size: Batch size for evaluation (default: 32)
- --num-workers: Number of data loading workers (default: 4)
- --device: Device to use (default: cuda)
- --max-samples: Maximum number of samples to evaluate (default: None for all)
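For reference, retrieval is scored with Recall@K over the caption-image similarity matrix. Below is a minimal sketch of the metric, not the exact eval_retrieval.py code; it assumes pre-computed, L2-normalized features.

```python
import torch

def recall_at_k(image_feats, text_feats, text_to_image, ks=(1, 5, 10)):
    """image_feats: (N_img, D), text_feats: (N_txt, D), both L2-normalized.
    text_to_image[j] is the index of the image that caption j describes
    (COCO has ~5 captions per image)."""
    sims = text_feats @ image_feats.t()                    # (N_txt, N_img)
    ranked = sims.argsort(dim=-1, descending=True)         # best-matching images per caption
    gt = torch.as_tensor(text_to_image).unsqueeze(1)
    return {f"text->image R@{k}": (ranked[:, :k] == gt).any(dim=1).float().mean().item()
            for k in ks}
```

Image-to-text recall is computed symmetrically by ranking captions for each image.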
The student model consists of a vision encoder (from timm) and a lightweight text encoder:
graph TB
subgraph Vision["Vision Path"]
direction TB
I[Input Images<br/>B×3×H×W] --> VE[Vision Encoder<br/>timm backbone]
VE --> VP[Vision Projection<br/>Linear]
VP --> NF1[L2 Normalize]
end
subgraph Text["Text Path"]
direction TB
T[Input Text IDs<br/>B×seq_len] --> TE[Text Embedding<br/>vocab_size×dim]
TE --> TP[Positional Embedding]
TP --> TT[Text Transformer<br/>L layers]
TT --> TP2[Text Projection<br/>Linear]
TP2 --> NF2[L2 Normalize]
end
NF1 --> SIM[Similarity Matrix<br/>B×B]
NF2 --> SIM
style Vision fill:#f8fafc,stroke:#dc2626,stroke-width:3px
style Text fill:#f8fafc,stroke:#16a34a,stroke-width:3px
style I fill:#2563eb,stroke:#1e40af,stroke-width:2px,color:#fff
style T fill:#2563eb,stroke:#1e40af,stroke-width:2px,color:#fff
style VE fill:#dc2626,stroke:#b91c1c,stroke-width:2px,color:#fff
style TE fill:#ea580c,stroke:#c2410c,stroke-width:2px,color:#fff
style TP fill:#ca8a04,stroke:#a16207,stroke-width:2px,color:#fff
style TT fill:#16a34a,stroke:#15803d,stroke-width:2px,color:#fff
style VP fill:#0891b2,stroke:#0e7490,stroke-width:2px,color:#fff
style TP2 fill:#0891b2,stroke:#0e7490,stroke-width:2px,color:#fff
style NF1 fill:#7c3aed,stroke:#6d28d9,stroke-width:2px,color:#fff
style NF2 fill:#7c3aed,stroke:#6d28d9,stroke-width:2px,color:#fff
style SIM fill:#059669,stroke:#047857,stroke-width:2px,color:#fff
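A condensed sketch of the architecture in the diagram above, assuming mean pooling over text tokens and illustrative layer sizes; the actual definition lives in tinysiglip/model.py.

```python
import timm
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentSketch(nn.Module):
    def __init__(self, vision_model_name="vit_tiny_patch16_224",
                 vocab_size=32000, dim=384, max_len=64, layers=6, heads=6):
        super().__init__()
        # Vision path: timm backbone -> linear projection -> L2 normalize
        self.vision = timm.create_model(vision_model_name, pretrained=True, num_classes=0)
        self.vision_proj = nn.Linear(self.vision.num_features, dim)
        # Text path: token + positional embeddings -> transformer -> projection -> L2 normalize
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.text_proj = nn.Linear(dim, dim)

    def forward(self, images, input_ids):
        img = F.normalize(self.vision_proj(self.vision(images)), dim=-1)        # (B, dim)
        txt = self.tok_emb(input_ids) + self.pos_emb[:, : input_ids.size(1)]
        txt = self.text_encoder(txt).mean(dim=1)                                # simple pooling
        txt = F.normalize(self.text_proj(txt), dim=-1)                          # (B, dim)
        return img, txt, img @ txt.t()                                          # (B, B) similarity
```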
The distillation process uses multiple loss components to transfer knowledge from teacher to student:
graph TB
subgraph T["Teacher Model"]
direction TB
TI[Teacher Images] --> TF[Teacher Features]
TT[Teacher Text] --> TF
end
subgraph S["Student Model"]
direction TB
SI[Student Images] --> SF[Student Features]
ST[Student Text] --> SF
end
TF --> L1[SigLIP Loss<br/>Binary Cross-Entropy]
SF --> L1
TF --> L2[CMD Loss<br/>KL Divergence]
SF --> L2
TF --> L3[UMD Loss<br/>MSE on Features]
SF --> L3
L1 --> TL[Total Loss]
L2 --> TL
L3 --> TL
TL --> OPT[Optimizer]
style T fill:#f8fafc,stroke:#7c3aed,stroke-width:3px
style S fill:#f8fafc,stroke:#2563eb,stroke-width:3px
style TI fill:#2563eb,stroke:#1e40af,stroke-width:2px,color:#fff
style TT fill:#2563eb,stroke:#1e40af,stroke-width:2px,color:#fff
style TF fill:#7c3aed,stroke:#6d28d9,stroke-width:2px,color:#fff
style SI fill:#16a34a,stroke:#15803d,stroke-width:2px,color:#fff
style ST fill:#16a34a,stroke:#15803d,stroke-width:2px,color:#fff
style SF fill:#0891b2,stroke:#0e7490,stroke-width:2px,color:#fff
style L1 fill:#ea580c,stroke:#c2410c,stroke-width:2px,color:#fff
style L2 fill:#ca8a04,stroke:#a16207,stroke-width:2px,color:#fff
style L3 fill:#dc2626,stroke:#b91c1c,stroke-width:2px,color:#fff
style TL fill:#991b1b,stroke:#7f1d1d,stroke-width:3px,color:#fff
style OPT fill:#059669,stroke:#047857,stroke-width:2px,color:#fff
- SigLIP Loss (L_SigLIP): Binary cross-entropy with sigmoid activation for contrastive learning
  L_SigLIP = (1/2) × [BCE(σ(S_I2T), Y) + BCE(σ(S_T2I), Y)]
  Where:
  - σ is the sigmoid function
  - S_I2T and S_T2I are the image-to-text and text-to-image similarity matrices
  - Y is the ground-truth label matrix
  - BCE is binary cross-entropy
- Cross-Modal Distillation (CMD) (L_CMD): KL divergence between teacher and student similarity distributions
  L_CMD = KL(P_T(S_T) || P_S(S_S))
  Where:
  - P_T and P_S are probability distributions over teacher and student similarities
  - S_T and S_S are the similarity matrices from the teacher and student models
- Uni-Modal Distillation (UMD) (L_UMD): MSE loss on normalized features from the vision and text encoders
  L_UMD = (1/2) × [MSE(f_V^T, f_V^S) + MSE(f_T^T, f_T^S)]
  Where:
  - f_V^T and f_V^S are normalized vision features from the teacher and student
  - f_T^T and f_T^S are normalized text features from the teacher and student
- Total Loss:
L_total = λ_SigLIP × L_SigLIP + λ_CMD × L_CMD + λ_UMD × L_UMD
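A compact sketch of how these terms could be combined in code. The actual implementation is in tinysiglip/loss.py; the fixed temperature, the identity-matrix labels, and the assumption that teacher and student features share a dimension are simplifications.

```python
import torch
import torch.nn.functional as F

def distillation_losses(s_img, s_txt, t_img, t_txt, logit_scale=10.0,
                        lambda_siglip=1.0, lambda_cmd=1.0, lambda_umd=1.0):
    """All inputs are L2-normalized (B, D) features; s_* from the student, t_* from the teacher."""
    B = s_img.size(0)
    s_sim = logit_scale * s_img @ s_txt.t()                 # student similarity matrix (B, B)
    t_sim = logit_scale * t_img @ t_txt.t()                 # teacher similarity matrix (B, B)
    targets = torch.eye(B, device=s_sim.device)             # matched pairs on the diagonal

    # SigLIP loss: BCE with sigmoid, image->text and text->image
    l_siglip = 0.5 * (F.binary_cross_entropy_with_logits(s_sim, targets) +
                      F.binary_cross_entropy_with_logits(s_sim.t(), targets))

    # CMD: KL(P_T || P_S) between teacher and student similarity distributions
    l_cmd = F.kl_div(F.log_softmax(s_sim, dim=-1),
                     F.softmax(t_sim, dim=-1), reduction="batchmean")

    # UMD: MSE between normalized per-modality features
    l_umd = 0.5 * (F.mse_loss(s_img, t_img) + F.mse_loss(s_txt, t_txt))

    total = lambda_siglip * l_siglip + lambda_cmd * l_cmd + lambda_umd * l_umd
    return total, {"siglip": l_siglip, "cmd": l_cmd, "umd": l_umd}
```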
- tinysiglip/model.py: Student model definition (uses timm for the vision encoder)
- tinysiglip/loss.py: Distillation loss functions (SigLIP loss + CMD + UMD + embedding mimicking)
- tinysiglip/embedding_distillation.py: Token embedding layer distillation utilities (token mapping and weight transfer)
- tinysiglip/coco_dataset.py: COCO dataset implementation
- tinysiglip/fake_dataset.py: Dummy dataset for testing
- tinysiglip/processor.py: Data preprocessing utilities
- tinysiglip/metrics.py: Evaluation metrics
- train.py: Training script with Hydra configuration
- eval_imagenet1k.py: ImageNet-1K zero-shot classification evaluation
- eval_retrieval.py: Image-text retrieval evaluation
Modify config/config.yaml to customize training:
- Teacher Model (teacher.model_name):
  - Default: google/siglip-base-patch16-224
  - Any SigLIP model from HuggingFace can be used
- Student Vision Model (student.vision_model_name):
  - Any timm model name (e.g., vit_tiny_patch16_224, resnet50, efficientnet_b0)
  - Pre-trained weights are loaded automatically
- Student Vocabulary Size (student.vocab_size):
  - Default: 32000 (for English-only models)
  - Can be set smaller than the teacher's vocabulary to save parameters (e.g., 32K vs. 256K saves ~86M params)
  - Common sizes: 32000 (English BPE), 50257 (GPT-2), 49152 (English CLIP)
  - Set to null to use the same vocabulary size as the teacher model
- Training Hyperparameters:
  - Batch size, learning rate, warmup steps, etc.
  - Loss weights: lambda_siglip, lambda_cmd, lambda_umd
See config/config.yaml for full configuration options.
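For context, here is a minimal example of how a Hydra-managed script reads such options; the nesting of the loss weights under cfg.loss is an assumption, so check config/config.yaml for the actual keys.

```python
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="config", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    print("teacher:", cfg.teacher.model_name)
    print("student vision:", cfg.student.vision_model_name)
    print("student vocab size:", cfg.student.vocab_size)
    print("loss weights:", cfg.loss.lambda_siglip, cfg.loss.lambda_cmd, cfg.loss.lambda_umd)  # assumed nesting

if __name__ == "__main__":
    main()
```

Values can also be overridden from the command line in the usual Hydra dot-notation style (e.g. python train.py student.vocab_size=49152).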
The student model can use a different vocabulary size than the teacher model, which is useful for creating smaller English-specific models.
How it works:
- When using real data (USE_REAL_DATA=True), the code automatically:
  - Loads the teacher and student tokenizers
  - Finds shared tokens between the vocabularies using create_token_mapping()
  - Transfers embedding weights for the shared tokens using transfer_embedding_weights()
- When using dummy data (USE_REAL_DATA=False), the code uses create_dummy_token_mapping() for testing purposes
For real applications:
- Set USE_REAL_DATA=True in the configuration
- Specify student.tokenizer_name if using a different tokenizer than the teacher
- The training script automatically handles tokenizer loading and token mapping
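To make the shared-token idea concrete, this is roughly what a token mapping boils down to. The real create_token_mapping() in tinysiglip/embedding_distillation.py may differ, e.g. in how it handles tokenizer-specific normalization and special tokens.

```python
def sketch_token_mapping(teacher_tokenizer, student_tokenizer):
    """Simplified view: intersect the vocabularies and record each shared token's id in both."""
    teacher_vocab = teacher_tokenizer.get_vocab()   # token string -> teacher id
    student_vocab = student_tokenizer.get_vocab()   # token string -> student id
    shared_tokens = sorted(set(teacher_vocab) & set(student_vocab))
    student_indices = [student_vocab[tok] for tok in shared_tokens]
    teacher_indices = [teacher_vocab[tok] for tok in shared_tokens]
    return student_indices, teacher_indices
```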
When the student model uses a smaller vocabulary than the teacher model (e.g., 32K vs 256K), there are two methods to transfer knowledge:
Method 1: Weight Transfer (default, recommended)
Core Advantage: Zero runtime overhead, one-time initialization
- Before training starts, the weights of shared tokens are copied directly from the teacher model's embedding layer into the student model
- No additional loss terms need to be computed in the training loop
- Achieves maximum parameter compression
Implementation:
- Set USE_WEIGHT_TRANSFER = True (default)
- Weights are automatically transferred after model initialization
- Remaining non-shared tokens are randomly initialized and learned during training
Parameter Savings: Using a 32K vocabulary compared to 256K can save approximately 86M parameters (in the embedding layer)
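For a rough sense of where that figure comes from: assuming a text embedding width of 384 (an illustrative value, not necessarily the configured one), shrinking the vocabulary from 256,000 to 32,000 tokens removes (256,000 − 32,000) × 384 ≈ 86M parameters from the embedding table.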
Method 2: Embedding Mimicking Loss
Core Idea:
- Find the shared tokens in the student and teacher vocabularies
- During training, continuously push the student model's embeddings for these shared tokens toward the teacher model's embeddings
- Implemented via MSE loss:
L_Emb = MSE(Emb_S(shared tokens), Emb_T(shared tokens))
Implementation:
- Set USE_WEIGHT_TRANSFER = False
- Weight controlled by LAMBDA_EMBEDDING (default 0.0, as weight transfer is recommended)
- Continuously computed during training
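A minimal sketch of the mimicking loss itself, assuming the shared-token index lists produced by create_token_mapping() and matching embedding dimensions:

```python
import torch.nn.functional as F

def embedding_mimicking_loss(student_embedding, teacher_embedding,
                             shared_student_indices, shared_teacher_indices):
    # MSE between the embedding rows of shared tokens; the teacher side is detached
    # because the teacher is frozen during distillation.
    student_rows = student_embedding.weight[shared_student_indices]
    teacher_rows = teacher_embedding.weight[shared_teacher_indices].detach()
    return F.mse_loss(student_rows, teacher_rows)
```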
In real applications, you can use actual tokenizers to find shared tokens and transfer weights:
from tinysiglip.embedding_distillation import create_token_mapping, transfer_embedding_weights
from transformers import AutoTokenizer
# Load tokenizers
student_tokenizer = AutoTokenizer.from_pretrained("your-student-tokenizer")
teacher_tokenizer = AutoTokenizer.from_pretrained(TEACHER_MODEL_NAME)
# Step 1: Find shared tokens between vocabularies
shared_student_indices, shared_teacher_indices = create_token_mapping(
teacher_tokenizer=teacher_tokenizer,
student_tokenizer=student_tokenizer,
verbose=True,
)
# Step 2: Transfer weights for shared tokens
transferred_count = transfer_embedding_weights(
student_embedding_layer=student_model.text_embedding,
teacher_embedding_layer=teacher_model.text_model.embeddings.token_embedding,
shared_student_indices=shared_student_indices,
shared_teacher_indices=shared_teacher_indices,
verbose=True,
)
Note: The training script (train.py) automatically uses real tokenizers when USE_REAL_DATA=True and tokenizers are available. Otherwise, it falls back to dummy token mapping for testing purposes.
- Teacher Model: [To be filled]
- Student Model: [To be filled]
- Dataset: [To be filled]
- Evaluation Metrics: [To be filled]
| Configuration | SigLIP Loss | CMD Loss | UMD Loss | Embedding Transfer | Image-Text Retrieval (R@1) | Text-Image Retrieval (R@1) | Parameters |
|---|---|---|---|---|---|---|---|
| Baseline (SigLIP only) | ✓ | ✗ | ✗ | ✗ | [To be filled] | [To be filled] | [To be filled] |
| + CMD | ✓ | ✓ | ✗ | ✗ | [To be filled] | [To be filled] | [To be filled] |
| + UMD | ✓ | ✓ | ✓ | ✗ | [To be filled] | [To be filled] | [To be filled] |
| + Embedding Transfer | ✓ | ✓ | ✓ | ✓ | [To be filled] | [To be filled] | [To be filled] |
[To be filled: Analysis of ablation study results, including:
- Impact of each loss component
- Effectiveness of embedding weight transfer vs. mimicking loss
- Trade-offs between model size and performance
- Comparison with other distillation methods]
- SigLIP: Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. (2023). Sigmoid Loss for Language Image Pre-Training. arXiv preprint arXiv:2303.15343
- TinyCLIP: Wu, K., et al. (2023). TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). arXiv:2309.12314
- CLIP: Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. International Conference on Machine Learning (ICML). arXiv:2103.00020
- Knowledge Distillation: Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531
- Feature Distillation: Romero, A., et al. (2014). FitNets: Hints for Thin Deep Nets. arXiv preprint arXiv:1412.6550
- ALIGN: Jia, C., et al. (2021). Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. International Conference on Machine Learning (ICML). arXiv:2102.05918
- BLIP: Li, J., et al. (2022). BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. International Conference on Machine Learning (ICML). arXiv:2201.12086
- Model Compression Survey: Choudhary, T., et al. (2020). A Comprehensive Survey on Model Compression and Acceleration. Artificial Intelligence Review, 53(7), 5113-5155.
See LICENSE file for details.
If you use this code in your research, please cite:
@misc{tinysiglip2024,
title={TinySigLIP: SigLIP Model Distillation with Timm-based Student Architecture},
author={[Your Name]},
year={2024},
howpublished={\url{https://github.com/yourusername/TinySigLIP}}
}