This framework implements the D-FINE architecture for object detection and adds a segmentation head, so you can train either an object detection or an instance segmentation task. The detection architecture and loss are taken from the original repo; everything else was developed from scratch, this is not a fork.
Check out the video tutorial to get familiar with this framework.
This goes beyond the original paper and was developed specifically for this framework. The segmentation task is still an early feature and there are no pretrained weights for the segmentation head yet. Mosaic augmentation is not recommended for segmentation at the moment.
To run the scripts, use the following commands:
make split # Creates train, validation, and test CSVs with image paths
make train # Runs the training pipeline, including DDP version
make export # Exports weights in various formats after training
make bench # Runs all exported models on the test set
make infer # Runs model on test folder, saves visualisations and txt preds
make check_errors # Runs model on train and val sets, saves only boxes mismatched with GT
make test_batching # Gets stats to find the optimal batch size for your model and GPU
make ov_int8 # Runs int8 accuracy aware quantization for OpenVINO. Can take several hours

Note: if you want to pass parameters, you can run any of these scripts with python -m src.dl.<script_name> (use etl instead of dl for preprocess and split). You can also just run make to run the preprocess, split, train, export, bench scripts as one sequence.
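For example, the preprocess, split and train steps map to (module names follow the pattern above; the actual scripts live under src/etl and src/dl):

python -m src.etl.preprocess
python -m src.etl.split
python -m src.dl.train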
For DDP training just set train.ddp.enabled to True, pick the number of GPUs and run make train as usual.
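If the CLI overrides also accept nested keys (an assumption; otherwise edit config.yaml directly), enabling DDP from the command line would look something like:

python -m src.dl.train train.ddp.enabled=True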
- Clone the repo: `git clone https://github.com/ArgoHA/D-FINE-seg.git`
- For bigger models (l, x) download from gdrive and put into the `pretrained` folder
- Prepare your data: an `images` folder and a `labels` folder with one txt file per image in YOLO format
- Customize `config.yaml`, minimal example:
  - `task`. Set to `segment` to enable the segmentation head.
  - `exp_name`. The experiment name, used in the model's output folder. After you train a model, you can run export/bench/infer and they will use the model under this name + the current date.
  - `root`. Path to the directory where you store your dataset and where model outputs will be saved.
  - `data_path`. Path to the folder with `images` and `labels`.
  - `label_to_name`. Your custom dataset classes.
  - `model_name`. Choose from the n/s/m/l/x model sizes.
  - And the usual things like epochs, batch_size, num_workers. Check out `config.yaml` for all configs.
- Run the `preprocess` and `split` scripts from the d_fine_seg repo.
- Run the `train` script, changing configurations and iterating until you get the desired results.
- Run the `export` script to create ONNX, TensorRT, OpenVINO models.
If you run the train script passing args on the command line instead of changing them in the config file, you should also pass the changed args to other scripts like export or infer. Example:
python -m src.dl.train exp_name=my_experiment
python -m src.dl.export exp_name=my_experiment

We use the YOLO label format. One txt file per image (with the same stem). One row = one object.
data/dataset
├── images
└── labels
Detection: `[class_id, xc, yc, w, h]`, coords normalized
Segmentation: `[class_id, x1, y1, x2, y2, ...]`, coords normalized; one row is the class_id followed by the polygon's point coordinates
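As a quick reference for the detection rows, here is how a normalized line maps back to pixel corner coordinates (an illustrative helper, not part of the repo):

```python
import numpy as np

def load_yolo_detections(txt_path: str, img_w: int, img_h: int) -> np.ndarray:
    """Read one YOLO label file and return rows of [class_id, x1, y1, x2, y2] in pixels."""
    rows = []
    with open(txt_path) as f:
        for line in f:
            if not line.strip():
                continue
            class_id, xc, yc, w, h = map(float, line.split())
            rows.append([
                int(class_id),
                (xc - w / 2) * img_w,  # left
                (yc - h / 2) * img_h,  # top
                (xc + w / 2) * img_w,  # right
                (yc + h / 2) * img_h,  # bottom
            ])
    return np.asarray(rows, dtype=np.float32)
```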
TensorRT export must be done on the GPU that you are going to use for inference.
Half precision:
- usually makes inference faster with minimal accuracy loss
- works best with TensorRT and OpenVINO (when running on GPU cores). OpenVINO can be exported once and then inferenced in either fp32 or fp16 (see the sketch after this list). Note: on Apple Silicon, the OpenVINO version of D-FINE currently works only in full precision.
- not used for ONNX and Torch at the moment
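To illustrate the "export once, pick precision at runtime" point for OpenVINO, here is a minimal sketch with the OpenVINO Python API (the model path and target device are assumptions, adapt them to your export output):

```python
import openvino as ov
import openvino.properties.hint as hints

core = ov.Core()
model = core.read_model("model.xml")  # the IR exported once by the export script

# Same IR, different runtime precision; the hint mainly matters on GPU devices,
# on CPU OpenVINO typically keeps fp32 regardless (matching the benchmarks below).
compiled_fp16 = core.compile_model(model, "GPU", {hints.inference_precision: ov.Type.f16})
compiled_fp32 = core.compile_model(model, "GPU", {hints.inference_precision: ov.Type.f32})
```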
Dynamic input means that during inference we cut the black paddings added by letterbox. I don't recommend using it with D-FINE, as accuracy degrades too much (probably because of the absolute positional encoding of patches).
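For context, letterbox preprocessing and the paddings that dynamic input removes look roughly like this (an illustrative sketch, not the repo's exact implementation):

```python
import cv2
import numpy as np

def letterbox(img: np.ndarray, size: int = 640, pad_value: int = 0):
    """Resize keeping aspect ratio, then pad to a square size x size canvas."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    resized = cv2.resize(img, (new_w, new_h))
    canvas = np.full((size, size, 3), pad_value, dtype=img.dtype)
    canvas[:new_h, :new_w] = resized  # paddings end up at the bottom / right
    return canvas, (new_h, new_w)

# "Dynamic input" feeds only the content region (canvas[:new_h, :new_w]) to the model,
# i.e. the paddings are cut, so the input resolution changes from image to image.
```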
Use inference classes in src/infer. Currently available:
- Torch
- TensorRT
- OpenVINO
- ONNX
You can run inference on a folder (path_to_test_data) of images or on a folder of videos. Crops will be created automatically; you can control this and the paddings from config.yaml in the infer section.
All benchmarks below are on the same custom dataset with D-FINEm at 640×640. Latency numbers include image preprocessing -> model inference -> postprocessing.
Desktop:
+----------------------+--------------+--------------+
| Format | F1 score | Latency (ms) |
+----------------------+--------------+--------------+
| Torch, FP32, GPU | 0.9161 | 16.6 |
| TensorRT, FP32, GPU | 0.9166 | 7.5 |
| TensorRT, FP16, GPU | 0.9167 | 5.5 |
| OpenVINO, FP32, CPU | 0.9165 | 115.4 |
| OpenVINO, FP16, CPU | 0.9165 | 115.4 |
| OpenVINO, INT8, CPU | 0.9139 | 44.1 |
| ONNX, FP32, CPU | 0.9165 | 150.6 |
+----------------------+--------------+--------------+
Notes (desktop):
- TensorRT FP16 gives ~3x speedup vs Torch FP32 GPU with no meaningful F1 drop.
- On the CPU, OpenVINO seems to ignore FP16 - it's identical to FP32.
- OpenVINO INT8 on CPU gives ~2.6x speedup vs FP32 with a small F1 drop on this particular dataset.
Edge (N150):
+----------------------+--------------+--------------+
| Format | F1 score | Latency (ms) |
+----------------------+--------------+--------------+
| OpenVINO, FP32, iGPU | 0.9165 | 350.8 |
| OpenVINO, FP16, iGPU | 0.9157 | 209.6 |
| OpenVINO, INT8, iGPU | 0.9116 | 123.1 |
| OpenVINO, FP32, CPU | 0.9165 | 505.2 |
| OpenVINO, FP16, CPU | 0.9165 | 505.2 |
| OpenVINO, INT8, CPU | 0.9139 | 252.7 |
+----------------------+--------------+--------------+
Notes (edge / N150):
- On the iGPU, FP16 and INT8 both give significant latency reductions with minor F1 degradation.
- On the CPU, FP16 again seems to be ignored, while INT8 still gives a solid speedup.
- FP16 is often a great sweet spot on GPUs: same accuracy, noticeably faster inference.
- On CPUs, FP16 may or may not be accelerated, depending on the hardware.
- INT8 can give big speedups on both CPU and GPU, but the accuracy drop is highly data- and model-dependent.
I recommend always benchmarking on your own hardware and dataset.
Another thing to check on your hardware and model is the batch size when you run batched inference (to get higher throughput at the cost of overall service latency). For that you can simply run make test_batching; it will run the torch model with different batch sizes and calculate throughput (processed images per second) and average latency per image (a rough sketch of the measurement follows the table below). For example, with an Intel i5-12400F + RTX 5070 Ti and D-FINEm, ~4 is the optimal batch size for inference with Torch.
+------+--------------------+------------------------+
|  bs  | throughput (img/s) | latency per image (ms) |
+------+--------------------+------------------------+
|    1 |               76.4 |                   13.1 |
|    2 |              113.4 |                    8.8 |
|    4 |              138.1 |                    7.2 |
|    8 |              122.7 |                    8.1 |
|   16 |              119.7 |                    8.4 |
|   32 |              117.8 |                    8.5 |
+------+--------------------+------------------------+
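The numbers above are essentially produced by timing a fixed number of forward passes per batch size and converting the total time into throughput and per-image latency, roughly like this sketch (illustrative only, not the repo's exact script; assumes a CUDA device):

```python
import time
import torch

@torch.no_grad()
def measure(model: torch.nn.Module, batch_size: int, img_size: int = 640, iters: int = 50) -> tuple[float, float]:
    """Return (throughput in img/s, average latency per image in ms) for one batch size."""
    x = torch.randn(batch_size, 3, img_size, img_size, device="cuda")
    for _ in range(5):  # warmup iterations
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    total = time.perf_counter() - start
    throughput = batch_size * iters / total
    return throughput, 1000.0 / throughput
```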
- Models: saved during training and export at `output/models/exp_name_date`. Includes training logs, a table with the main metrics, confusion matrix, f1-score_vs_threshold and precision_recall_vs_threshold plots. In `extended_metrics` you can find per-class metrics (saved during the final eval after all epochs).
- Debug images: preprocessed images (including augmentations) are saved at `output/debug_images/split` exactly as they are fed into the model (except for normalization).
- Evaluation predictions: visualised model predictions on the val set, with GT in green and preds in blue.
- Bench images: visualised model predictions from the inference classes, using all exported models.
- Infer: visualised model predictions and predicted annotations in YOLO txt format.
- Check errors: creates a `check_errors` folder with FP and FN bboxes only. Used to check the model's errors on the train and val sets and to find mislabelled samples (see the sketch after this list).
- Test batching: CSV file with all tested batch sizes and latencies.
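Conceptually, keeping only the FP and FN boxes boils down to IoU matching of predictions against GT, something along these lines (an illustrative sketch that ignores class labels; not the repo's exact logic):

```python
import numpy as np

def box_iou(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise IoU between boxes a (N, 4) and b (M, 4) given as [x1, y1, x2, y2]."""
    lt = np.maximum(a[:, None, :2], b[None, :, :2])
    rb = np.minimum(a[:, None, 2:], b[None, :, 2:])
    wh = np.clip(rb - lt, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def split_errors(preds: np.ndarray, gts: np.ndarray, iou_thr: float = 0.5):
    """Return (false positive preds, false negative GTs) under a simple IoU match."""
    if len(preds) == 0 or len(gts) == 0:
        return preds, gts  # every prediction is a FP, every GT is a FN
    ious = box_iou(preds, gts)
    fp = preds[ious.max(axis=1) < iou_thr]  # predictions that overlap no GT box
    fn = gts[ious.max(axis=0) < iou_thr]    # GT boxes missed by every prediction
    return fp, fn
```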
Train
Benchmarking
WandB
Infer
- Training pipeline from SoTA D-FINE model
- Instance Segmentation task.
- Export to ONNX, OpenVINO, TensorRT.
- Inference class for Torch, TensorRT, OpenVINO on images or videos
- Label smoothing in Focal loss
- Augs based on the albumentations lib
- Mosaic augmentation, multiscale aug
- Metrics: mAPs, Precision, Recall, F1-score, Confusion matrix, IoU, plots
- Distributed Data Parallel (DDP) training
- After training is done, runs a test to calculate the optimal conf threshold
- Exponential moving average model
- Batch accumulation
- Automatic mixed precision (40% less vRAM used and 15% faster training)
- Gradient clipping
- Keep ratio of the image and use paddings or use simple resize
- When ratio is kept, inference can be sped up with removal of grey paddings
- Visualisation of preprocessed images, model predictions and ground truth
- Warmup epochs that ignore background images for an easier start of convergence
- OneCycleLR used as scheduler, AdamW as optimizer
- Unified configuration file for all scripts
- Annotations in YOLO format, splits in csv format
- ETA displayed during training, precise starting from epoch 2
- Logging file with training process
- WandB integration
- Batch inference
- Early stopping
- Gradio UI demo
- Finetune with layers freeze
- Add support for caching in dataset
- Smart dataset preprocessing. Detect small objects. Detect near duplicates (remove from val/test)
This project is built upon the original D-FINE repo. Thank you to the D-FINE team for an awesome model!
@misc{peng2024dfine,
title={D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement},
author={Yansong Peng and Hebei Li and Peixi Wu and Yueyi Zhang and Xiaoyan Sun and Feng Wu},
year={2024},
eprint={2410.13842},
archivePrefix={arXiv},
primaryClass={cs.CV}
}



