This repository provides a finetuning implementation of Qwen/Qwen3-VL-4B-Instruct, optimized for the medical image analysis tasks of the FLARE 2025 2D Medical Multimodal Dataset challenge.
The fine-tuned baseline model is available at:
🤗 Model: leoyinn/qwen3vl-flare25
This project adapts the official Qwen3-VL finetuning framework to the medical multimodal vision-language tasks of the FLARE 2025 challenge.
The pipeline supports all 19 datasets across 8 medical imaging modalities:
- Retinography: retino, fundus
- Ultrasound: BUSI-det, BUS-UCLM-det, BUSI, BUS-UCLM, iugc
- X-ray: boneresorption, dental, periapical, IU_XRay, chestdr
- Clinical: neojaundice
- Microscopy: chromosome, neurips22cell, bone_marrow
- Endoscopy: endo
- Dermatology: bcn20000
- Mammography: CMMD
The challenge covers 7 task types, each scored with its primary metric:
- Classification (Balanced Accuracy)
- Multi-label Classification (F1 Score)
- Detection (F1 Score @ IoU 0.5; see the sketch after this list)
- Instance Detection (F1 Score @ IoU 0.3/0.5)
- Cell Counting (Mean Absolute Error)
- Regression (Mean Absolute Error)
- Report Generation (Comprehensive GREEN Score)
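As an illustration of the detection metric, a detection F1 at IoU 0.5 is typically computed by greedily matching predicted boxes to unmatched ground-truth boxes and counting matches as true positives. A minimal sketch of that computation (an illustration of the metric under the usual greedy-matching convention, not the challenge's official scoring code):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def detection_f1(preds, gts, thr=0.5):
    """Greedy one-to-one matching at IoU >= thr, then F1 over TP/FP/FN."""
    unmatched = list(gts)
    tp = 0
    for p in preds:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thr:
            unmatched.remove(best)  # each ground-truth box matches at most once
            tp += 1
    fp, fn = len(preds) - tp, len(unmatched)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

print(detection_f1([(0, 0, 10, 10)], [(1, 1, 10, 10)]))  # 1.0 (boxes match at IoU ~0.81)
```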
- Qwen3-VL/: Official Qwen3-VL codebase with finetuning framework
- data_conversion/: Scripts to convert FLARE dataset to Qwen3-VL format
- organized_dataset/: FLARE 2025 medical imaging datasets (not included, see Dataset Access)
- Python 3.10+
- CUDA 11.8+ with GPU (minimum 24GB VRAM recommended)
- 100GB+ free disk space for datasets and models
- uv package manager (recommended)
```bash
# Clone the repository
git clone https://github.com/medfm-flare/FLARE25-QWen3VL-4B.git
cd FLARE25-QWen3VL-4B

# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies
uv sync
```

Download the FLARE 2025 2D MLLM dataset and organize it as:
```
organized_dataset/
├── training/
│   ├── Retinography/
│   │   ├── retino/
│   │   │   ├── imagesTr/
│   │   │   └── retino_questions_train.json
│   │   └── fundus/
│   └── ...
├── validation-public/
└── validation-hidden/
```
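As a sanity check before conversion, a minimal sketch (assuming the layout above; adjust the root path and filename pattern to your setup) that flags dataset folders missing an imagesTr/ directory or a *_questions_train.json file:

```python
from pathlib import Path

root = Path("organized_dataset/training")  # assumed location

# Walk modality/dataset folders and report anything missing the
# expected images directory or question file.
for dataset_dir in sorted(p for p in root.glob("*/*") if p.is_dir()):
    has_images = (dataset_dir / "imagesTr").is_dir()
    has_questions = any(dataset_dir.glob("*_questions_train.json"))
    if not (has_images and has_questions):
        print(f"{dataset_dir}: imagesTr={has_images}, questions={has_questions}")
```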
```bash
# Convert all datasets
uv run python data_conversion/convert_flare_to_qwen3vl.py

# Or convert specific datasets
uv run python data_conversion/convert_flare_to_qwen3vl.py --datasets neojaundice retino

# Validate conversion
uv run python data_conversion/validate_conversion.py
```

This creates JSON files in Qwen3-VL format with proper `<image>` tags:
```json
{
  "image": ["/path/to/img1.jpg", "/path/to/img2.jpg"],
  "conversations": [
    {
      "from": "human",
      "value": "<image>\n<image>\nDoes this newborn require phototherapy? A. No, B. Yes"
    },
    {
      "from": "gpt",
      "value": "A"
    }
  ]
}
```

Edit `Qwen3-VL/qwen-vl-finetune/qwenvl/data/__init__.py` to register your converted datasets:
```python
# Example dataset registration
FLARE_NEOJAUNDICE = {
    "annotation_path": "/absolute/path/to/converted_neojaundice.json",
    "data_path": "",  # Empty if using absolute paths
}

data_dict = {
    "flare_neojaundice": FLARE_NEOJAUNDICE,
    # Add all 19 FLARE datasets...
}
```

```bash
cd Qwen3-VL/qwen-vl-finetune

# Edit the training script to specify your datasets
# vim scripts/sft_qwen3_4b_flare.sh

# Run training
bash scripts/sft_qwen3_4b_flare.sh
```

Use the `%` syntax to control per-dataset sampling rates:

```
--dataset_use flare_neojaundice%100,flare_retino%50,flare_fundus%100
```

This trains on 100% of the neojaundice and fundus samples, but only 50% of the retino samples.
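A minimal sketch of how such a spec string could be parsed and applied (hypothetical helpers for illustration; the actual parsing lives inside the Qwen3-VL finetuning framework):

```python
import random

def parse_dataset_spec(spec: str) -> dict[str, float]:
    """Parse "name%rate,name%rate" into {name: fraction}; rate defaults to 100."""
    rates = {}
    for entry in spec.split(","):
        name, _, rate = entry.partition("%")
        rates[name] = (float(rate) if rate else 100.0) / 100.0
    return rates

def subsample(samples: list, fraction: float, seed: int = 42) -> list:
    """Deterministically keep the requested fraction of a dataset's samples."""
    k = round(len(samples) * fraction)
    return random.Random(seed).sample(samples, k)

print(parse_dataset_spec("flare_neojaundice%100,flare_retino%50"))
# {'flare_neojaundice': 1.0, 'flare_retino': 0.5}
```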
```
FLARE25-QWen3VL-4B/
├── Qwen3-VL/                          # Official Qwen3-VL codebase
│   └── qwen-vl-finetune/
│       ├── qwenvl/
│       │   ├── train/
│       │   │   ├── train_qwen.py      # Main training entry point
│       │   │   ├── trainer.py         # Custom trainer with flash attention
│       │   │   └── argument.py        # Training configuration
│       │   ├── data/
│       │   │   ├── __init__.py        # Dataset registry (EDIT THIS)
│       │   │   ├── data_processor.py  # Data loading and preprocessing
│       │   │   └── rope2d.py          # Position encoding for Qwen3-VL
│       │   └── tools/
│       │       ├── pack_data.py       # Data packing for efficiency
│       │       └── check_image.py     # Image validation
│       └── scripts/
│           ├── sft_qwen3_4b.sh        # Base training script
│           ├── sft_qwen3_4b_flare.sh  # FLARE-specific training
│           ├── zero3.json             # DeepSpeed ZeRO-3 config
│           └── zero2.json             # DeepSpeed ZeRO-2 config
├── data_conversion/                   # FLARE to Qwen3-VL conversion
│   ├── convert_flare_to_qwen3vl.py    # Main conversion script
│   ├── dataset_configs.py             # Dataset configuration
│   └── validate_conversion.py         # Validation utilities
├── organized_dataset/                 # FLARE 2025 data (not included)
├── pyproject.toml                     # Python project config
└── README.md                          # This file
```
The conversion script handles:
- Multi-image samples: Automatically handles both single- and multi-image questions
- Image validation: Checks image existence and integrity
- Path resolution: Converts relative paths to absolute paths
- Tag insertion: Properly inserts `<image>` tags in prompts
- Format verification: Ensures the number of `<image>` tags matches the number of images
For each dataset, the conversion proceeds in these steps (sketched below):
1. Load FLARE annotations from `*_questions_train.json`
2. Validate images for each sample
3. Build absolute paths relative to the dataset directory
4. Insert `<image>` tags at the beginning of questions
5. Convert to Qwen3-VL format with proper conversation structure
6. Save converted data to the output directory
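A minimal sketch of the tag-insertion and format-verification steps (hypothetical function and field names for illustration; the real script in data_conversion/ additionally handles image validation, path resolution, and I/O):

```python
def to_qwen3vl_sample(question: str, answer: str, image_paths: list[str]) -> dict:
    """Convert one FLARE QA pair into a Qwen3-VL conversation record."""
    # Tag insertion: one <image> tag per image at the beginning of the question.
    tags = "\n".join("<image>" for _ in image_paths)
    prompt = f"{tags}\n{question}"

    # Format verification: tag count must match the number of images.
    assert prompt.count("<image>") == len(image_paths)

    return {
        "image": image_paths,
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": answer},
        ],
    }

sample = to_qwen3vl_sample(
    "Does this newborn require phototherapy? A. No, B. Yes",
    "A",
    ["/abs/path/img1.jpg", "/abs/path/img2.jpg"],
)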
The default finetuning strategy is:
- Vision Encoder (ViT): frozen (`--tune_mm_vision False`)
- Vision-Language Projection: trainable (`--tune_mm_mlp True`)
- Language Model: trainable (`--tune_mm_llm True`)

This strategy preserves the strong visual representations learned in pretraining while adapting the model to the medical domain; a sketch of the corresponding parameter freezing follows.
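In HuggingFace terms, these flags amount to toggling `requires_grad` on the matching parameter groups. A hedged sketch (the parameter-name prefixes are assumptions and vary across Qwen-VL releases; inspect `model.named_parameters()` to confirm them for your checkpoint):

```python
# Sketch of the freezing strategy: freeze the ViT, train the projector
# and the LLM. Name checks are assumptions, not the framework's code.
def set_trainable(model, tune_vision=False, tune_mlp=True, tune_llm=True):
    for name, param in model.named_parameters():
        if ".merger" in name:          # vision-language projection (assumed name)
            param.requires_grad = tune_mlp
        elif "visual" in name:         # ViT encoder (assumed name)
            param.requires_grad = tune_vision
        else:                          # everything else: language model
            param.requires_grad = tune_llm

# Usage after loading the model, matching
# --tune_mm_vision False --tune_mm_mlp True --tune_mm_llm True:
#   set_trainable(model)
```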
- DeepSpeed ZeRO-3 to shard parameters, gradients, and optimizer states across GPUs (illustrated below)
- Gradient checkpointing to reduce memory
- Flash Attention 2 for efficient attention computation
- Data packing for better GPU utilization
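A hedged sketch of how these options are typically wired up through HuggingFace `TrainingArguments` (illustrative values only; the repository's actual settings live in scripts/sft_qwen3_4b_flare.sh and the zero*.json configs):

```python
from transformers import TrainingArguments

# Illustrative memory-related settings; values here are assumptions.
args = TrainingArguments(
    output_dir="./output",
    deepspeed="scripts/zero3.json",   # ZeRO-3: shard model states across GPUs
    gradient_checkpointing=True,      # trade recompute for activation memory
    bf16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)

# Flash Attention 2 is selected when the model is loaded, e.g.:
#   from_pretrained(..., attn_implementation="flash_attention_2")
```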
- Label masking: Only assistant responses contribute to the loss (see the sketch after this list)
- Multi-image support: Handles 1-3 images per sample
- Dynamic resolution: Adapts to image sizes within pixel constraints
- Retry logic: Handles corrupted samples gracefully
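As an illustration of label masking: tokens belonging to assistant responses keep their ids as labels, while everything else is set to -100, which PyTorch's cross-entropy loss ignores. A minimal sketch (toy token ids; the real processor derives the assistant mask from the chat template's role boundaries):

```python
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss

def mask_labels(input_ids: torch.Tensor, assistant_mask: torch.Tensor) -> torch.Tensor:
    """Keep labels only where assistant_mask is True; mask the rest."""
    labels = input_ids.clone()
    labels[~assistant_mask] = IGNORE_INDEX
    return labels

ids = torch.tensor([101, 7, 8, 9, 102, 42, 43])               # toy token ids
mask = torch.tensor([0, 0, 0, 0, 0, 1, 1], dtype=torch.bool)  # last two = answer tokens
print(mask_labels(ids, mask))
# tensor([-100, -100, -100, -100, -100,   42,   43])
```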
```bash
# Enable WandB logging in the training script
--report_to wandb
--run_name qwen3vl_flare_experiment

# View training progress at wandb.ai
```

After training, load your model for inference:
```python
from PIL import Image
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# Load fine-tuned model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "./output", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("./output")

# Run inference
image = Image.open("path/to/image.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": "What is the diagnosis?"},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the echoed prompt
response = processor.batch_decode(
    output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
```

We evaluated the finetuned Qwen3-VL-4B model against the pretrained baseline across all 7 task types in the FLARE 2025 challenge, with substantial improvements on every task:
| Task | Primary Metric | Baseline | Finetuned | Improvement |
|---|---|---|---|---|
| Classification | Balanced Accuracy | 2.2% | 53.5% | +2,309% |
| Detection | F1 @ IoU 0.5 | 0.0% | 80.3% | ∞ (new capability) |
| Instance Detection | F1 @ IoU 0.5 | 0.01% | 1.0% | +9,900% |
| Regression | MAE (lower is better) | 35.8 | 22.4 | +37.3% |
| Multi-label Classification | F1 Macro | 28.3% | 50.3% | +77.7% |
| Counting | MAE (lower is better) | 417.7 | 244.4 | +41.5% |
| Report Generation | GREEN Score | 67.7% | 80.8% | +19.4% |

For the MAE rows, "Improvement" is the relative reduction in error.
Classification (Disease Diagnosis)
- Balanced accuracy improved from 2.2% to 53.5%
- Strong performance on retinography, dermatology, and clinical imaging
Detection (Lesion Localization)
- Baseline model showed near-zero detection capability
- Finetuned model achieved 80.3% F1@0.5 on ultrasound and X-ray lesions
Report Generation (Radiology Reports)
- GREEN Score: 67.7% → 80.8%
- Clinical Efficacy: 32.7% → 95.3% (+191.7%)
- Location Accuracy: 22.3% → 70.7% (+217.4%)
The finetuning successfully adapted the general-purpose vision-language model to specialized medical imaging tasks, demonstrating the effectiveness of domain adaptation for medical AI applications.
If you use this implementation in your research, please cite:
```bibtex
@misc{qwen3vl-flare2025,
  title={Qwen3-VL Fine-tuned for FLARE 2025 Medical Image Analysis},
  author={Shuolin Yin},
  year={2025},
  publisher={GitHub},
  url={https://github.com/medfm-flare/FLARE25-QWen3VL-4B}
}

@misc{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2505.09388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388}
}
```

This project is licensed under the Apache-2.0 License; see the LICENSE file for details.
The FLARE 2025 datasets can be accessed at:
- Main Dataset: FLARE-MedFM/FLARE-Task5-MLLM-2D
- Challenge Info: FLARE 2025 Official Website
- FLARE25-QWen2.5VL: Previous implementation using Qwen2.5-VL-7B (leoyinn/qwen2.5vl-flare2025)
- Official Qwen3-VL: QwenLM/Qwen3-VL
- Qwen team for the Qwen3-VL model and finetuning framework
- FLARE 2025 organizers for the dataset and challenge
- HuggingFace for the transformers library and model hosting
- Medical imaging communities for the public datasets
For issues and questions:
- GitHub Issues: Report bugs or request features
- FLARE Challenge: Official challenge forum
