TinySigLIP

📌 A knowledge distillation framework for creating compact vision-language models, inspired by TinyCLIP. TinySigLIP distills knowledge from large SigLIP teacher models to smaller, efficient student models while maintaining competitive performance.

Highlights

  • 🎯 Efficient Distillation: Multiple distillation strategies for effective knowledge transfer
  • 📉 Parameter Efficient: Supports smaller student vocabularies, saving ~86M parameters compared to the teacher's full vocabulary
  • ⚡ Fast Inference: Compact student models with faster inference than the teacher
  • 🔄 Easy to Use: Simple training and evaluation pipeline
  • 📊 Experiment Tracking: Real-time training metrics with Weights & Biases integration

News

  • Coming Soon: Model Zoo with pre-trained checkpoints
  • In Progress: Training and evaluation on large-scale datasets

Model Zoo

Pre-trained models will be released here. Checkpoints are currently under development.

| Model | Teacher | Student Vision | Student Text | ImageNet-1K Zero-shot | Parameters | Status |
|---|---|---|---|---|---|---|
| [TinySigLIP-ViT-Tiny] | SigLIP-Base | ViT-Tiny/16 | 19M | [To be filled] | ~39M | 🚧 Training |
| [TinySigLIP-ViT-Small] | SigLIP-Base | ViT-Small/16 | 19M | [To be filled] | ~60M | 📋 Planned |
| [TinySigLIP-ResNet] | SigLIP-Base | ResNet-50 | 19M | [To be filled] | ~45M | 📋 Planned |

Note: Model checkpoints will be available soon. Training progress can be monitored on W&B Dashboard.

Experiment Tracking

Training progress and metrics are tracked using Weights & Biases. View the experiment dashboard:

🔗 View Experiments on W&B

The dashboard includes real-time training metrics, loss curves, model checkpoints, and hyperparameter configurations for all experiments.
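
For reference, logging a few metrics with the wandb Python API boils down to the pattern below; the project and run names here are illustrative placeholders, and in this repo the run is set up by the training script.

import wandb

# Hypothetical project/run names; the real values come from the training configuration.
run = wandb.init(project="tinysiglip", name="distill-demo")

for step in range(100):
    loss = 1.0 / (step + 1)  # stand-in for the actual distillation loss
    run.log({"train/loss": loss}, step=step)

run.finish()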

Getting Started

🔰 This section covers installation, dataset preparation, and the training and evaluation scripts.

Installation

Install dependencies:

# Using uv (recommended)
uv sync

# Or using pip
pip install -r requirements.txt

Dataset Preparation

COCO 2017 Caption Dataset

For training, you need to download the COCO 2017 Caption dataset. We provide convenient download scripts:

Option 1: Using Bash Script (Recommended - with defaults)

The easiest way is to use the bash script with default parameters. By default, it downloads all splits (train, val, test) and annotations:

# Download with all defaults (downloads ALL splits: train, val, test to ./data/coco)
./download_coco.sh

# Download only specific splits (for quick testing or limited storage)
./download_coco.sh --split val
./download_coco.sh --split train val  # Skip test split

# Download to custom directory
./download_coco.sh --data-dir /path/to/coco

# Download only annotations (if images are already downloaded)
./download_coco.sh --annotations-only

# Download and cleanup zip files to save space
./download_coco.sh --cleanup

Option 2: Using Python Script Directly

You can also use the Python script directly:

# Download all splits (train, val, test) and annotations (default behavior)
python download_coco.py --data-dir ./data/coco

# Download only specific splits (for quick testing or limited storage)
python download_coco.py --data-dir ./data/coco --split val
python download_coco.py --data-dir ./data/coco --split train val  # Skip test split

# Download only annotations (if images are already downloaded)
python download_coco.py --data-dir ./data/coco --annotations-only

# Clean up zip files after extraction to save space
python download_coco.py --data-dir ./data/coco --cleanup

The script will:

  • Download all splits by default: training images (~19GB), validation images (~1GB), test images (~6GB), and annotations (~241MB)
  • Total size: ~26GB (all splits) or ~20GB (train + val only)
  • Extract files to organized directories
  • Show download progress with progress bars
  • Zero configuration needed - just run ./download_coco.sh and start training!

Directory Structure After Download:

./data/coco/
├── images/
│   ├── train2017/    # Training images (~118K images, ~19GB)
│   ├── val2017/      # Validation images (~5K images, ~1GB)
│   └── test2017/     # Test images (~40K images, ~6GB)
├── annotations/
│   ├── captions_train2017.json
│   └── captions_val2017.json
└── downloads/        # Zip files (can be removed with --cleanup)

Configuration:

No configuration needed! The default configuration in config/config.yaml is already set up:

dataset:
  coco_root: "data/coco/images"  # Relative to project root
  coco_ann_file: "data/coco/annotations/captions_train2017.json"
  split: "train"  # or "val" or "test"

If you use the default download location (./data/coco), just run ./download_coco.sh and start training - everything is pre-configured!

The paths are relative to the project root and will be automatically resolved. If you downloaded to a custom location, you can update the paths in config/config.yaml using either:

  • Relative paths (relative to project root): "data/coco/images"
  • Absolute paths: "/path/to/coco/images"

Quick Start

1. Training

Start training with the universal training script:

./train.sh

The script automatically detects your hardware and training setup:

  • Multiple GPUs: Automatically uses distributed training (DDP) with all available GPUs
  • Single GPU: Runs single GPU training
  • CPU/MacBook: Falls back to CPU training

You can also run directly:

python train.py

Manual Multi-GPU Training:

If you want to manually specify the number of GPUs:

# Use torchrun directly
torchrun --nproc_per_node=4 train.py

# Or specify via environment variable
NUM_GPUS=4 ./train.sh

How it works:

The training script automatically detects the distributed training environment and will:

  • Initialize distributed training using NCCL backend (when multiple GPUs detected)
  • Wrap the model with DistributedDataParallel (DDP) for multi-GPU training
  • Use DistributedSampler for data distribution across GPUs
  • Only save checkpoints and logs on the main process (rank 0)
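
The snippet below is a rough sketch of this detection logic under torchrun (the environment variables are those set by the launcher); the actual implementation in train.py may differ in detail.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed(model):
    """Initialize DDP when launched via torchrun, otherwise run single-process."""
    if int(os.environ.get("WORLD_SIZE", "1")) > 1:
        dist.init_process_group(backend="nccl")           # multi-GPU: NCCL backend
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        model = DDP(model.cuda(local_rank), device_ids=[local_rank])
        is_main = dist.get_rank() == 0                     # rank 0 saves checkpoints/logs
    else:
        is_main = True                                     # single GPU or CPU
    return model, is_main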

Note: The batch size in config/config.yaml is the batch size per GPU. The effective batch size will be batch_size × num_gpus when using multiple GPUs.

The training script uses Hydra for configuration management. Modify config/config.yaml to adjust hyperparameters.

Note: Training metrics are automatically logged to Weights & Biases. Make sure you have configured your W&B API key if you want to track experiments:

wandb login

2. Evaluation

Evaluate your trained model on ImageNet-1k zero-shot classification:

python eval_imagenet1k.py \
    --imagenet-val /path/to/imagenet/val \
    --resume /path/to/checkpoint.pt \
    --batch-size 32 \
    --num-workers 4

Arguments:

  • --imagenet-val: Path to ImageNet validation set directory
  • --resume: Path to checkpoint file (.pt file saved during training)
  • --batch-size: Batch size for evaluation (default: 32)
  • --num-workers: Number of data loading workers (default: 4)
  • --device: Device to use (default: cuda)
  • --logit-scale: Optional logit scale (temperature). If not specified, uses value from checkpoint.

Example:

# Evaluate a trained model
python eval_imagenet1k.py \
    --imagenet-val ./ImageNet \
    --resume ./outputs/2025-11-29_20-15-10/checkpoint.pt \
    --batch-size 64

The evaluation script will:

  1. Load the checkpoint and restore model configuration
  2. Load or create the processor from checkpoint directory
  3. Generate text prompts for all 1000 ImageNet classes (e.g., "a photo of a {class_name}")
  4. Compute text features for all classes
  5. Evaluate on ImageNet validation set and report Top-1 and Top-5 accuracy

Note: The checkpoint directory should contain a processor/ subdirectory (saved automatically during training) for proper text tokenization. If not available, the script will attempt to create a processor from the checkpoint configuration.
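
Conceptually, the evaluation follows the standard CLIP/SigLIP zero-shot recipe sketched below; encode_text and encode_image are hypothetical method names used for illustration, not the exact model API in this repo.

import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_top1_top5(model, processor, class_names, images, labels, device="cuda"):
    # 1) One prompt per class, encoded once and L2-normalized.
    prompts = [f"a photo of a {name}" for name in class_names]
    text_inputs = processor(text=prompts, return_tensors="pt", padding=True).to(device)
    text_feats = F.normalize(model.encode_text(**text_inputs), dim=-1)   # hypothetical API

    # 2) Encode images and score against all class embeddings.
    image_feats = F.normalize(model.encode_image(images.to(device)), dim=-1)
    logits = image_feats @ text_feats.t()                                # (B, num_classes)

    # 3) Top-1 / Top-5 accuracy for this batch.
    top5 = logits.topk(5, dim=-1).indices
    labels = labels.to(device).unsqueeze(-1)
    top1_acc = (top5[:, :1] == labels).any(dim=-1).float().mean().item()
    top5_acc = (top5 == labels).any(dim=-1).float().mean().item()
    return top1_acc, top5_acc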

3. Image-Text Retrieval Evaluation (COCO)

Evaluate image-text retrieval performance on COCO dataset:

python eval_retrieval.py \
    --resume /path/to/checkpoint.pt \
    --split val \
    --coco-root /path/to/coco/images \
    --coco-ann-file /path/to/coco/annotations/captions_val2017.json \
    --batch-size 32 \
    --num-workers 4

Arguments:

  • --resume: Path to checkpoint file (.pt file saved during training)
  • --split: Dataset split to evaluate on (val or test, default: val)
  • --coco-root: Root directory where COCO images are stored
  • --coco-ann-file: Path to COCO annotation JSON file
  • --batch-size: Batch size for evaluation (default: 32)
  • --num-workers: Number of data loading workers (default: 4)
  • --device: Device to use (default: cuda)
  • --max-samples: Maximum number of samples to evaluate (default: None for all)
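
For reference, recall@K over precomputed, L2-normalized embeddings can be measured roughly as below; the sketch assumes one matching caption per image (COCO actually provides ~5 captions per image, which the real script accounts for). Swapping the arguments gives text-to-image retrieval.

import torch

def recall_at_k(image_feats: torch.Tensor, text_feats: torch.Tensor, k: int = 1) -> float:
    """Image-to-text R@K, assuming image_feats[i] pairs with text_feats[i]."""
    sims = image_feats @ text_feats.t()                        # (N, N) similarity matrix
    topk = sims.topk(k, dim=-1).indices                        # top-k text indices per image
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()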

Model Architecture

The student model consists of a vision encoder (from timm) and a lightweight text encoder:

graph TB
    subgraph Vision["Vision Path"]
        direction TB
        I[Input Images<br/>B×3×H×W] --> VE[Vision Encoder<br/>timm backbone]
        VE --> VP[Vision Projection<br/>Linear]
        VP --> NF1[L2 Normalize]
    end

    subgraph Text["Text Path"]
        direction TB
        T[Input Text IDs<br/>B×seq_len] --> TE[Text Embedding<br/>vocab_size×dim]
        TE --> TP[Positional Embedding]
        TP --> TT[Text Transformer<br/>L layers]
        TT --> TP2[Text Projection<br/>Linear]
        TP2 --> NF2[L2 Normalize]
    end

    NF1 --> SIM[Similarity Matrix<br/>B×B]
    NF2 --> SIM

    style Vision fill:#f8fafc,stroke:#dc2626,stroke-width:3px
    style Text fill:#f8fafc,stroke:#16a34a,stroke-width:3px
    style I fill:#2563eb,stroke:#1e40af,stroke-width:2px,color:#fff
    style T fill:#2563eb,stroke:#1e40af,stroke-width:2px,color:#fff
    style VE fill:#dc2626,stroke:#b91c1c,stroke-width:2px,color:#fff
    style TE fill:#ea580c,stroke:#c2410c,stroke-width:2px,color:#fff
    style TP fill:#ca8a04,stroke:#a16207,stroke-width:2px,color:#fff
    style TT fill:#16a34a,stroke:#15803d,stroke-width:2px,color:#fff
    style VP fill:#0891b2,stroke:#0e7490,stroke-width:2px,color:#fff
    style TP2 fill:#0891b2,stroke:#0e7490,stroke-width:2px,color:#fff
    style NF1 fill:#7c3aed,stroke:#6d28d9,stroke-width:2px,color:#fff
    style NF2 fill:#7c3aed,stroke:#6d28d9,stroke-width:2px,color:#fff
    style SIM fill:#059669,stroke:#047857,stroke-width:2px,color:#fff
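
In code, the same structure can be sketched roughly as below; layer names, dimensions, and the pooling choice are illustrative and do not necessarily match tinysiglip/model.py.

import torch
import torch.nn as nn
import timm

class TinyStudent(nn.Module):
    def __init__(self, vision_name="vit_tiny_patch16_224", vocab_size=32000,
                 text_dim=384, embed_dim=512, seq_len=64, n_layers=6):
        super().__init__()
        # Vision path: timm backbone -> linear projection -> L2 normalize
        self.vision = timm.create_model(vision_name, pretrained=True, num_classes=0)
        self.vision_proj = nn.Linear(self.vision.num_features, embed_dim)
        # Text path: token + positional embedding -> transformer -> projection -> L2 normalize
        self.token_emb = nn.Embedding(vocab_size, text_dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, seq_len, text_dim))
        layer = nn.TransformerEncoderLayer(text_dim, nhead=6, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, images, input_ids):
        v = nn.functional.normalize(self.vision_proj(self.vision(images)), dim=-1)
        t = self.token_emb(input_ids) + self.pos_emb[:, : input_ids.size(1)]
        t = self.text_encoder(t).mean(dim=1)           # simple mean pooling for the sketch
        t = nn.functional.normalize(self.text_proj(t), dim=-1)
        return v @ t.t()                               # (B, B) similarity matrix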

Training Approach

The distillation process uses multiple loss components to transfer knowledge from teacher to student:

graph TB
    subgraph T["Teacher Model"]
        direction TB
        TI[Teacher Images] --> TF[Teacher Features]
        TT[Teacher Text] --> TF
    end

    subgraph S["Student Model"]
        direction TB
        SI[Student Images] --> SF[Student Features]
        ST[Student Text] --> SF
    end

    TF --> L1[SigLIP Loss<br/>Binary Cross-Entropy]
    SF --> L1

    TF --> L2[CMD Loss<br/>KL Divergence]
    SF --> L2

    TF --> L3[UMD Loss<br/>MSE on Features]
    SF --> L3

    L1 --> TL[Total Loss]
    L2 --> TL
    L3 --> TL

    TL --> OPT[Optimizer]

    style T fill:#f8fafc,stroke:#7c3aed,stroke-width:3px
    style S fill:#f8fafc,stroke:#2563eb,stroke-width:3px
    style TI fill:#2563eb,stroke:#1e40af,stroke-width:2px,color:#fff
    style TT fill:#2563eb,stroke:#1e40af,stroke-width:2px,color:#fff
    style TF fill:#7c3aed,stroke:#6d28d9,stroke-width:2px,color:#fff
    style SI fill:#16a34a,stroke:#15803d,stroke-width:2px,color:#fff
    style ST fill:#16a34a,stroke:#15803d,stroke-width:2px,color:#fff
    style SF fill:#0891b2,stroke:#0e7490,stroke-width:2px,color:#fff
    style L1 fill:#ea580c,stroke:#c2410c,stroke-width:2px,color:#fff
    style L2 fill:#ca8a04,stroke:#a16207,stroke-width:2px,color:#fff
    style L3 fill:#dc2626,stroke:#b91c1c,stroke-width:2px,color:#fff
    style TL fill:#991b1b,stroke:#7f1d1d,stroke-width:3px,color:#fff
    style OPT fill:#059669,stroke:#047857,stroke-width:2px,color:#fff

Loss Components

  1. SigLIP Loss (L_SigLIP): Binary cross-entropy with sigmoid activation for contrastive learning

    L_SigLIP = (1/2) × [BCE(σ(S_I2T), Y) + BCE(σ(S_T2I), Y)]
    

    Where:

    • σ is the sigmoid function
    • S_I2T and S_T2I are similarity matrices (Image-to-Text and Text-to-Image)
    • Y is the ground truth label matrix
    • BCE is Binary Cross-Entropy
  2. Cross-Modal Distillation (CMD) (L_CMD): KL divergence between teacher and student similarity distributions

    L_CMD = KL(P_T(S_T) || P_S(S_S))
    

    Where:

    • P_T and P_S are probability distributions over teacher and student similarities
    • S_T and S_S are similarity matrices from teacher and student models
  3. Uni-Modal Distillation (UMD) (L_UMD): MSE loss on normalized features from vision and text encoders

    L_UMD = (1/2) × [MSE(f_V^T, f_V^S) + MSE(f_T^T, f_T^S)]
    

    Where:

    • f_V^T and f_V^S are normalized vision features from teacher and student
    • f_T^T and f_T^S are normalized text features from teacher and student
  4. Total Loss:

    L_total = λ_SigLIP × L_SigLIP + λ_CMD × L_CMD + λ_UMD × L_UMD
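
Putting these together, a condensed PyTorch sketch of the total loss looks like the following; shapes, reductions, and temperature handling in tinysiglip/loss.py may differ, and the sketch assumes teacher and student features are projected to the same dimensionality.

import torch
import torch.nn.functional as F

def distillation_losses(s_sim, t_sim, s_img, s_txt, t_img, t_txt,
                        lambda_siglip=1.0, lambda_cmd=1.0, lambda_umd=1.0):
    n = s_sim.size(0)
    # SigLIP loss: BCE with sigmoid over the similarity matrix; positives on the diagonal.
    labels = torch.eye(n, device=s_sim.device)
    l_siglip = 0.5 * (F.binary_cross_entropy_with_logits(s_sim, labels)
                      + F.binary_cross_entropy_with_logits(s_sim.t(), labels))
    # CMD: KL(P_T || P_S) between teacher and student similarity distributions.
    l_cmd = F.kl_div(F.log_softmax(s_sim, dim=-1), F.softmax(t_sim, dim=-1),
                     reduction="batchmean")
    # UMD: MSE on L2-normalized per-modality features (same embedding dim assumed).
    l_umd = 0.5 * (F.mse_loss(s_img, t_img) + F.mse_loss(s_txt, t_txt))
    return lambda_siglip * l_siglip + lambda_cmd * l_cmd + lambda_umd * l_umd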
    

Project Structure

  • tinysiglip/model.py: Student model definition (uses timm for vision encoder)
  • tinysiglip/loss.py: Distillation loss functions (SigLIP loss + CMD + UMD + Embedding Mimicking)
  • tinysiglip/embedding_distillation.py: Token embedding layer distillation utilities (token mapping and weight transfer)
  • tinysiglip/coco_dataset.py: COCO dataset implementation
  • tinysiglip/fake_dataset.py: Dummy dataset for testing
  • tinysiglip/processor.py: Data preprocessing utilities
  • tinysiglip/metrics.py: Evaluation metrics
  • train.py: Training script with Hydra configuration
  • eval_imagenet1k.py: ImageNet-1K zero-shot classification evaluation
  • eval_retrieval.py: Image-text retrieval evaluation

Configuration

Modify config/config.yaml to customize training:

Key Configuration Options

  • Teacher Model (teacher.model_name):

    • Default: google/siglip-base-patch16-224
    • Any SigLIP model from HuggingFace can be used
  • Student Vision Model (student.vision_model_name):

    • Any timm model name (e.g., vit_tiny_patch16_224, resnet50, efficientnet_b0)
    • Pre-trained weights will be loaded automatically
  • Student Vocabulary Size (student.vocab_size):

    • Default: 32000 (for English-only models)
    • Can be set smaller than teacher model to save parameters (e.g., 32K vs 256K saves ~86M params)
    • Common sizes: 32000 (English BPE), 50257 (GPT-2), 49408 (CLIP BPE)
    • Set to null to use the same vocabulary size as teacher model
  • Training Hyperparameters:

    • Batch size, learning rate, warmup steps, etc.
    • Loss weights: lambda_siglip, lambda_cmd, lambda_umd

See config/config.yaml for full configuration options.

Different Vocabulary Sizes

The student model can use a different vocabulary size than the teacher model, which is useful for creating smaller English-specific models.

How it works:

  1. When using real data (USE_REAL_DATA=True), the code automatically:
    • Loads teacher and student tokenizers
    • Finds shared tokens between vocabularies using create_token_mapping()
    • Transfers embedding weights for shared tokens using transfer_embedding_weights()
  2. When using dummy data (USE_REAL_DATA=False), the code uses create_dummy_token_mapping() for testing purposes

For real applications:

  • Set USE_REAL_DATA=True in the configuration
  • Specify student.tokenizer_name if using a different tokenizer than the teacher
  • The training script automatically handles tokenizer loading and token mapping

Token Embedding Layer Distillation / Weight Transfer

When the student model uses a smaller vocabulary than the teacher model (e.g., 32K vs 256K), there are two methods to transfer knowledge:

Method 1: Weight Transfer (Recommended) ⭐

Core Advantage: Zero runtime overhead, one-time initialization

  • Before training starts, directly copy weights of shared tokens from teacher model's embedding layer to student model
  • No need to compute additional loss terms in the training loop
  • Achieves maximum parameter compression

Implementation:

  • Set USE_WEIGHT_TRANSFER = True (default)
  • Weights are automatically transferred after model initialization
  • Remaining non-shared tokens are randomly initialized and learned during training

Parameter Savings: Using a 32K vocabulary compared to 256K can save approximately 86M parameters (in the embedding layer)
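
As a back-of-the-envelope check, the saving comes directly from the smaller embedding matrix; the embedding width of 384 used below is an assumption for illustration, not a documented value.

# Rough arithmetic behind the "~86M parameters" figure.
embed_dim = 384                          # assumed text embedding width (illustrative)
teacher_vocab, student_vocab = 256_000, 32_000

saved = (teacher_vocab - student_vocab) * embed_dim
print(f"~{saved / 1e6:.0f}M embedding parameters saved")   # ~86M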

Method 2: Embedding Mimicking Loss

Core Idea:

  • Find shared tokens in student and teacher vocabularies
  • During training, continuously make student model's embeddings for these shared tokens mimic teacher model's embeddings
  • Implemented via MSE loss: L_Emb = MSE(Emb_S(shared tokens), Emb_T(shared tokens))

Implementation:

  • Set USE_WEIGHT_TRANSFER = False
  • Weight controlled by LAMBDA_EMBEDDING (default 0.0, as weight transfer is recommended)
  • Continuously computed during training
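
A minimal sketch of this mimicking term, assuming the shared-token index tensors produced by create_token_mapping() and matching embedding widths, could look like the following; the result would be scaled by LAMBDA_EMBEDDING and added to the total loss.

import torch.nn.functional as F

def embedding_mimicking_loss(student_emb, teacher_emb,
                             shared_student_indices, shared_teacher_indices):
    # Pull the student's embeddings for shared tokens toward the (frozen) teacher's.
    s = student_emb.weight[shared_student_indices]
    t = teacher_emb.weight[shared_teacher_indices].detach()
    return F.mse_loss(s, t)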

Using Real Tokenizers

In real applications, you can use actual tokenizers to find shared tokens and transfer weights:

from tinysiglip.embedding_distillation import create_token_mapping, transfer_embedding_weights
from transformers import AutoTokenizer

# Load tokenizers
student_tokenizer = AutoTokenizer.from_pretrained("your-student-tokenizer")
teacher_tokenizer = AutoTokenizer.from_pretrained(TEACHER_MODEL_NAME)

# Step 1: Find shared tokens between vocabularies
shared_student_indices, shared_teacher_indices = create_token_mapping(
    teacher_tokenizer=teacher_tokenizer,
    student_tokenizer=student_tokenizer,
    verbose=True,
)

# Step 2: Transfer weights for shared tokens
transferred_count = transfer_embedding_weights(
    student_embedding_layer=student_model.text_embedding,
    teacher_embedding_layer=teacher_model.text_model.embeddings.token_embedding,
    shared_student_indices=shared_student_indices,
    shared_teacher_indices=shared_teacher_indices,
    verbose=True,
)

Note: The training script (train.py) automatically uses real tokenizers when USE_REAL_DATA=True and tokenizers are available. Otherwise, it falls back to dummy token mapping for testing purposes.

Ablation Study

Experimental Setup

  • Teacher Model: [To be filled]
  • Student Model: [To be filled]
  • Dataset: [To be filled]
  • Evaluation Metrics: [To be filled]

Results

| Configuration | SigLIP Loss | CMD Loss | UMD Loss | Embedding Transfer | Image-Text Retrieval (R@1) | Text-Image Retrieval (R@1) | Parameters |
|---|---|---|---|---|---|---|---|
| Baseline (SigLIP only) | ✓ | | | | [To be filled] | [To be filled] | [To be filled] |
| + CMD | ✓ | ✓ | | | [To be filled] | [To be filled] | [To be filled] |
| + UMD | ✓ | ✓ | ✓ | | [To be filled] | [To be filled] | [To be filled] |
| + Embedding Transfer | ✓ | ✓ | ✓ | ✓ | [To be filled] | [To be filled] | [To be filled] |

Analysis

[To be filled: Analysis of ablation study results, including:

  • Impact of each loss component
  • Effectiveness of embedding weight transfer vs. mimicking loss
  • Trade-offs between model size and performance
  • Comparison with other distillation methods]

References

Core Papers

  1. SigLIP: Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. (2023). Sigmoid Loss for Language Image Pre-Training. arXiv:2303.15343.

  2. TinyCLIP: Wu, H., et al. (2023). TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). arXiv:2309.12314.

  3. CLIP: Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. International Conference on Machine Learning (ICML). arXiv:2103.00020

Knowledge Distillation

  1. Knowledge Distillation: Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531.

  2. Feature Distillation: Romero, A., et al. (2014). FitNets: Hints for Thin Deep Nets. arXiv:1412.6550.

Vision-Language Models

  1. ALIGN: Jia, C., et al. (2021). Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. International Conference on Machine Learning (ICML). arXiv:2102.05918

  2. BLIP: Li, J., et al. (2022). BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. International Conference on Machine Learning (ICML). arXiv:2201.12086

Model Compression

  1. Model Compression Survey: Choudhary, T., et al. (2020). A Comprehensive Survey on Model Compression and Acceleration. Artificial Intelligence Review, 53(7), 5113-5155.

License

See LICENSE file for details.

Citation

If you use this code in your research, please cite:

@misc{tinysiglip2024,
  title={TinySigLIP: SigLIP Model Distillation with Timm-based Student Architecture},
  author={[Your Name]},
  year={2024},
  howpublished={\url{https://github.com/yourusername/TinySigLIP}}
}
