ccccwei/awesome-cv-synthetic-data-methods
Awesome Synthetic Data Collection (2022–2025+)

Teach your models to do anything with synthetic data

Chinese version

A carefully curated, practical collection of tools, engines, and papers covering the two major approaches to synthetic data in computer vision: simulation-based and model-based generation.
Pull requests are welcome! Star the repo to track updates ✨

Why synthetic data? PBR rendering and physics simulation yield photo-realistic images; synthesis scales to millions of samples; ground truth (segmentation/depth/normals/optical flow/6D pose) comes for free; randomization is fully controllable (lighting, materials, cameras, backgrounds, occlusion); and runs are reproducible in batch and cheap to expand. We hope this list helps your vision projects.
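The "automatic ground truth" point deserves a concrete illustration: because a renderer knows which object produced every pixel, dense labels fall out of an instance-ID pass with no human annotation. A minimal sketch in plain NumPy — the function name and buffer layout are illustrative, not taken from any specific engine:

```python
import numpy as np

def masks_and_boxes(instance_ids: np.ndarray) -> dict:
    """Derive per-object masks and [x, y, w, h] boxes from an instance-ID render.

    `instance_ids` is an (H, W) integer buffer where 0 = background and each
    positive value tags the pixels of one object -- exactly what a renderer's
    ID pass produces.
    """
    annotations = {}
    for obj_id in np.unique(instance_ids):
        if obj_id == 0:  # skip background
            continue
        mask = instance_ids == obj_id
        ys, xs = np.nonzero(mask)
        x, y = int(xs.min()), int(ys.min())
        annotations[int(obj_id)] = {
            "mask": mask,
            "bbox": [x, y, int(xs.max()) - x + 1, int(ys.max()) - y + 1],
            "area": int(mask.sum()),
        }
    return annotations

# Tiny 4x4 "render": object 1 occupies a 2x2 patch.
ids = np.zeros((4, 4), dtype=np.int32)
ids[1:3, 1:3] = 1
print(masks_and_boxes(ids)[1]["bbox"])  # [1, 1, 2, 2]
```

The same ID pass, paired with the depth and normal buffers the renderer already computes, is what lets the engines below export segmentation, depth, and pose labels in one pass.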


Contents


Selection Quick Reference (Strongly Recommended First)

| Target Scenario | Preferred Tool/Platform | Output Annotations | Experience Tips | Official Links |
| --- | --- | --- | --- | --- |
| Robot perception (multi-view, full annotations) | Omniverse Replicator + Isaac Sim / Isaac Lab | RGB, depth, normals, segmentation, optical flow, pose, contact | Complete PBR + physics + randomization, cloud-scalable | Replicator · Isaac Sim |
| 2D vision datasets (detection/segmentation) | Unity Perception / SynthDet / BlenderProc2 | COCO detection, instance/semantic segmentation, keypoints | Rich asset library, mature randomization modules | Unity Perception · SynthDet · BlenderProc2 |
| Indoor embodied navigation/rearrangement | Habitat 2.0 | Trajectories, states, depth, semantics | Large indoor scenes + interactive tasks | Habitat 2.0 |
| Autonomous driving (sensor combinations) | CARLA / ℛ-CARLA extensions | Multi-sensor, temporal, segmentation, depth | Highly controllable sensors, weather, and traffic | CARLA · ℛ-CARLA |
| Image editing & expansion | SDXL + ControlNet / IP-Adapter / ComfyUI | Text→image, image→image, mask editing | Controllable guidance, easy batching | SDXL · ControlNet · ComfyUI |
| Single/multi-view → 3D | 3D Gaussian Splatting / Nerfstudio / Zero123++ / TripoSR | Mesh / volume / point-Gaussian | Fast acquisition of high-quality 3D assets | 3DGS · Nerfstudio · Zero123++ |

Simulation-based

General Engines & Platforms

Robotics/Embodied AI Suites

Domain-Specific Simulation Platforms

Annotation & Pipeline Tools

Simulation-based Papers

Focus on simulation-based synthetic data generation, emphasizing Sim→Real transfer effectiveness.

| Paper Title | Year | Conference/Journal | Main Contribution | Tech Stack | Links |
| --- | --- | --- | --- | --- | --- |
| Kubric: A Scalable Dataset Generator | 2022 | CVPR | Blender + PyBullet, high-quality annotations across tasks | Blender, PyBullet | ArXiv · PDF |
| Large-Scale Synthetic Data for Robot Perception | 2024 | — | Isaac Sim + Replicator generate 2.7 million images, with validated real-world benefits | Isaac, Replicator | ArXiv |
| ORBIT-Surgical | 2023–2024 | — | Synthetic surgical-robotics tasks and evaluation, sim-to-real | Isaac, surgery | GitHub |
| ℛ-CARLA: Digital Twins & High-Fidelity Sensors | 2025 | — | Higher-fidelity autonomous-driving simulation and sensor models | CARLA extension | ArXiv |
| PCLA: CARLA Agent Testing Framework | 2025 | — | Pre-trained agents and systematic scenario-level testing | CARLA | ArXiv |
| BlenderProc2 | 2023+ | — | Reproducible CV experiments and reality-gap reduction | Blender | Docs |
| Habitat 2.0 | 2021 | NeurIPS | Rearrangement tasks; embodied learning in simulation validated in the real world | Habitat | ArXiv |

Simulation-based reading points: focus on randomization knobs (lighting/materials/pose/occlusion/background), annotation types, real-world validation protocols (zero-shot vs. fine-tuning), and deployment costs (asset preparation, compute, render throughput).


Model-based Generation

Image-level Generation & Editing

Video/Temporal Consistency

3D Generation & Neural Rendering

Model-based Generation Papers

Focus on model-based synthetic data generation, emphasizing generation quality and downstream task benefits.

| Paper Title | Year | Conference/Journal | Main Contribution | Tech Stack | Links |
| --- | --- | --- | --- | --- | --- |
| InstaGen: Synthetic Data Boosting Detection | 2024 | CVPR | Diffusion-generated diverse training samples significantly improve detection | Diffusion, detection | ArXiv |
| Stable Diffusion | 2022 | CVPR | Latent-diffusion text-to-image generation, open ecosystem | Diffusion | ArXiv |
| ControlNet | 2023 | ICCV | Conditionally controllable image generation with edge/pose/depth guidance | Diffusion | ArXiv |
| IP-Adapter | 2023 | — | Reference-image style transfer without fine-tuning | Diffusion | ArXiv |
| 3D Gaussian Splatting | 2023 | SIGGRAPH | Real-time neural rendering, high-quality 3D reconstruction | Neural rendering | ArXiv |
| Zero123++ | 2023 | — | Single-view-to-3D generation, zero-shot 3D understanding | 3D generation | ArXiv |
| GET3D | 2022 | NeurIPS | High-quality textured mesh generation | 3D generation | ArXiv |

Model-based reading points: focus on generation quality, control precision, diversity, downstream-task adaptability, and computational efficiency.


Recent Papers (2022–2025, Focus)

Focus on papers that use simulation-based or model-based generation and report synthetic→real benefits with quantitative evaluation (zero-shot/few-shot/fine-tuning).

| Paper Title | Year | Conference/Journal | Main Contribution | Tech Stack | Links |
| --- | --- | --- | --- | --- | --- |
| Kubric: A Scalable Dataset Generator | 2022 | CVPR | Blender + PyBullet, high-quality annotations across tasks | Blender, PyBullet | ArXiv · PDF |
| InstaGen: Synthetic Data Boosting Detection | 2024 | CVPR | Diffusion-generated diverse training samples significantly improve detection | Diffusion, detection | ArXiv |
| Large-Scale Synthetic Data for Robot Perception | 2024 | — | Isaac Sim + Replicator generate 2.7 million images, with validated real-world benefits | Isaac, Replicator | ArXiv |
| ORBIT-Surgical | 2023–2024 | — | Synthetic surgical-robotics tasks and evaluation, sim-to-real | Isaac, surgery | GitHub |
| ℛ-CARLA: Digital Twins & High-Fidelity Sensors | 2025 | — | Higher-fidelity autonomous-driving simulation and sensor models | CARLA extension | ArXiv |
| PCLA: CARLA Agent Testing Framework | 2025 | — | Pre-trained agents and systematic scenario-level testing | CARLA | ArXiv |
| BlenderProc2 | 2023+ | — | Reproducible CV experiments and reality-gap reduction | Blender | Docs |
| Habitat 2.0 | 2021 | NeurIPS | Rearrangement tasks; embodied learning in simulation validated in the real world | Habitat | ArXiv |

Reading points: understand the randomization knobs (lighting/materials/pose/occlusion/background), annotation types, real-world validation protocols (zero-shot vs. fine-tuning), and deployment costs (asset preparation, compute, render throughput).


Survey/Overview Papers

| Paper Title | Year | Main Content | Links |
| --- | --- | --- | --- |
| A Survey of Synthetic Data Augmentation Methods in Computer Vision | 2024 | 3D graphics, neural rendering, GAN/diffusion, task coverage | ArXiv · PDF |
| A Survey of Data Synthesis Approaches | 2024 | Multi-domain methods and objectives: diversity, balance, long tails/boundaries | ArXiv |
| Synthetic Data Generation and Machine Learning: A Review | 2023 | Comprehensive multi-modal, multi-task overview | ArXiv |
| Sim2Real in robotics (recommended) | 2022–2024 | Surveys of simulation-to-reality transfer in robotics | Search these keywords to locate several papers |

Classic Synthetic Datasets & Benchmarks (Optional)

Useful for benchmarking and sanity-checking (mostly "foundational" works from 2016–2021).

  • GTA5 / SYNTHIA / Virtual KITTI 2 — Semantic segmentation/driving scenes.
  • SceneNet RGB-D / SunCG — Indoor synthesis and depth.
  • FlyingChairs / FlyingThings3D / MPI-Sintel — Classic synthetic optical flow data.
  • BOP Challenge — 6D pose evaluation and format standardization. https://bop.felk.cvut.cz/

How to Use This List

  1. Choose an engine: match by domain and resources (Omniverse/Isaac, Unity, Habitat, BlenderProc).
  2. Start from templates: run the official minimal examples first (Replicator/SynthDet/Kubric), then parameterize assets/cameras/lighting/materials.
  3. Export rich annotations: export depth/normals/segmentation/optical flow/pose together and unify them into standard formats such as COCO, BOP, and KITTI.
  4. Evaluate Sim→Real: fix a real test set; try zero-shot first, then few-shot fine-tuning; compare against real-only baselines.
  5. Document randomization: write the "randomization knobs" and their sampling distributions into the README so experiments can be reproduced and distribution shift can be studied.
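Unifying exports into a standard format can be sketched for COCO detection. The JSON field names below follow the public COCO format; the `samples` record layout is an assumption standing in for whatever your exporter actually emits:

```python
import json

def to_coco(samples, categories):
    """Pack per-image simulator output into a COCO-detection dict.

    `samples` is a list of {"file_name", "width", "height", "boxes"} records,
    where each box is (category_id, (x, y, w, h)) -- an illustrative structure,
    adapt it to your exporter.
    """
    coco = {
        "images": [], "annotations": [],
        "categories": [{"id": i, "name": n} for i, n in enumerate(categories, 1)],
    }
    ann_id = 1
    for img_id, s in enumerate(samples, 1):
        coco["images"].append({"id": img_id, "file_name": s["file_name"],
                               "width": s["width"], "height": s["height"]})
        for cat_id, (x, y, w, h) in s["boxes"]:
            coco["annotations"].append({
                "id": ann_id, "image_id": img_id, "category_id": cat_id,
                "bbox": [x, y, w, h], "area": w * h, "iscrowd": 0,
            })
            ann_id += 1
    return coco

sample = {"file_name": "0001.png", "width": 640, "height": 480,
          "boxes": [(1, (10, 20, 50, 40))]}
dataset = to_coco([sample], ["widget"])
coco_json = json.dumps(dataset)  # write to annotations.json for pycocotools
```

Keeping IDs sequential and `area`/`iscrowd` populated matters: COCO evaluators assume these fields exist even for synthetic data.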

Common Pitfalls & Practical Tips

  • Assets set the quality ceiling: PBR materials, normal/roughness maps, and HDRI lighting determine realism; reject low-quality models.
  • Randomization ≠ random: purposefully cover backgrounds/shadows/occlusion/long tails; use Latin hypercube or stratified sampling to improve coverage.
  • Cameras & noise: simulate real intrinsics/distortion/exposure/noise/motion blur as closely as possible; otherwise the domain gap is obvious.
  • Batch & cloud rendering: prefer headless rendering and distributed support; measure throughput (fps or it/s) and unit cost.
  • Annotation consistency: align training/evaluation formats (categories, occlusion definitions, IoU metrics); keep data versions and configurations traceable.
  • Licensing & compliance: confirm the usage terms of third-party assets and of data generated by commercial models.
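The Latin hypercube sampling mentioned under "Randomization ≠ random" is easy to roll by hand: each knob's range is cut into n strata, every stratum is hit exactly once, and strata are shuffled independently across knobs. A sketch in plain Python; the knob names and ranges are purely illustrative:

```python
import random

def latin_hypercube(knobs: dict, n: int, seed: int = 0) -> list:
    """Draw n scene configurations via Latin hypercube sampling.

    Each knob's (lo, hi) range is split into n equal strata; every stratum
    is sampled exactly once per knob, and the stratum order is shuffled
    independently per knob, giving far more even coverage than i.i.d.
    uniform sampling with the same budget.
    """
    rng = random.Random(seed)
    columns = {}
    for name, (lo, hi) in knobs.items():
        strata = list(range(n))
        rng.shuffle(strata)  # decorrelate this knob from the others
        columns[name] = [lo + (hi - lo) * (s + rng.random()) / n for s in strata]
    return [{name: col[i] for name, col in columns.items()} for i in range(n)]

# Illustrative knobs -- replace with whatever your renderer exposes.
knobs = {"light_intensity": (100.0, 1000.0),
         "camera_height_m": (0.5, 2.0),
         "hue_shift_deg": (-30.0, 30.0)}
scenes = latin_hypercube(knobs, n=8)
print(len(scenes))  # 8 configurations, each assigning all three knobs
```

Fixing the seed makes the sampled configurations reproducible, which directly supports the "document randomization" advice above.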

Curated Lists & Topic Hubs


Contributing

  • New entries should explain, in one sentence, the applicable scenarios and exportable annotations; a minimal getting-started link is a plus.
  • Priority goes to actively maintained repositories and papers from 2022 onward.
  • Where possible, provide reproduction scripts/configurations or data samples.

License

MIT
