Official implementation for the paper: TransNormal: Dense Visual Semantics for Diffusion-based Transparent Object Normal Estimation.
Mingwei Li<sup>1,2</sup>, Hehe Fan<sup>1</sup>, Yi Yang<sup>1</sup>
1Zhejiang University, 2Zhongguancun Academy
- [2026-02-06]: TransNormal-Synthetic dataset released on HuggingFace. [Dataset]
- [2026-02-03]: arXiv paper released. [arXiv]
- [2026-01-30]: Project page updated. Code and dataset will be released soon.
Qualitative comparisons on transparent object normal estimation with multiple baselines.
Overview of TransNormal: dense visual semantics guide diffusion-based single-step normal prediction with wavelet regularization.
- Python >= 3.8
- PyTorch >= 2.0.0
- CUDA >= 11.8 (recommended for GPU inference)
Tested Environment:
- NVIDIA Driver: 580.65.06
- CUDA: 13.0
- PyTorch: 2.4.0+cu121
- Python: 3.10
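To compare a local setup against this configuration, a quick check like the following can be used (the versions in the comments mirror the tested environment above; they are not hard requirements):

```python
# Print local versions for comparison with the tested environment listed above.
import sys
import torch

print("Python :", sys.version.split()[0])      # tested: 3.10
print("PyTorch:", torch.__version__)           # tested: 2.4.0+cu121
print("CUDA   :", torch.version.cuda)          # CUDA version PyTorch was built with
print("GPU    :", torch.cuda.is_available())   # GPU strongly recommended for inference
```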
# Clone the repository
git clone https://github.com/longxiang-ai/TransNormal.git
cd TransNormal
# Create and activate conda environment
conda create -n TransNormal python=3.10 -y
conda activate TransNormal
# Install dependencies
pip install -r requirements.txt
pip install huggingface_hub
# Download TransNormal model
python -c "from huggingface_hub import snapshot_download; snapshot_download('Longxiang-ai/TransNormal', local_dir='./weights/transnormal')"
⚠️ Important: DINOv3 weights require access approval from Meta AI.
- Visit Meta AI DINOv3 Downloads to request access
- After approval, download the ViT-H+/16 distilled model
- Or use HuggingFace Transformers (version >= 4.56.0):
python -c "from huggingface_hub import snapshot_download; snapshot_download('facebook/dinov3-vith16plus-pretrain-lvd1689m', local_dir='./weights/dinov3_vith16plus')"See weights/README.md for detailed instructions.
from transnormal import TransNormalPipeline, create_dino_encoder
import torch
# Create DINO encoder
# Note: Use bfloat16 instead of float16 to avoid NaN issues with DINOv3
dino_encoder = create_dino_encoder(
    model_name="dinov3_vith16plus",
    weights_path="./weights/dinov3_vith16plus",
    projector_path="./weights/transnormal/cross_attention_projector.pt",
    device="cuda",
    dtype=torch.bfloat16,
)
# Load pipeline
pipe = TransNormalPipeline.from_pretrained(
    "./weights/transnormal",
    dino_encoder=dino_encoder,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
# Run inference
normal_map = pipe(
    image="path/to/image.jpg",
    output_type="np",  # "np", "pil", or "pt"
)
# Save result
from transnormal import save_normal_map
save_normal_map(normal_map, "output_normal.png")
Single Image:
python inference.py \
    --image path/to/image.jpg \
    --output normal.png \
    --model_path ./weights/transnormal \
    --dino_path ./weights/dinov3_vith16plus \
    --projector_path ./weights/cross_attention_projector.pt
Batch Processing:
python inference_batch.py \
    --input_dir ./examples/input \
    --output_dir ./examples/output \
    --model_path ./weights/transnormal \
    --dino_path ./weights/dinov3_vith16plus
Launch an interactive web interface:
python gradio_app.py --port 7860
Then open http://localhost:7860 in your browser. Use --share for a public link.
The output normal map represents surface normals in the camera coordinate system:
- X (Red channel): Left direction (positive = left)
- Y (Green channel): Up direction (positive = up)
- Z (Blue channel): Out of screen (positive = towards viewer)
Output values are in the range [0, 1], where 0.5 corresponds to a zero component along each axis.
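For downstream use, a stored map can be converted back to unit normal vectors with the conventions above (a minimal sketch; it assumes an 8-bit H x W x 3 PNG such as the one written by save_normal_map in the Quick Start):

```python
# Convert a saved normal map (values in [0, 1], 0.5 == zero) back to unit vectors.
# Assumes an 8-bit H x W x 3 PNG; the exact file layout is an assumption for illustration.
import numpy as np
from PIL import Image

img = np.asarray(Image.open("output_normal.png"), dtype=np.float32) / 255.0  # [0, 1]
normals = img * 2.0 - 1.0                                  # map [0, 1] -> [-1, 1] per axis
normals /= np.linalg.norm(normals, axis=-1, keepdims=True) + 1e-8  # re-normalize to unit length
print(normals.shape, normals.min(), normals.max())
```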
Benchmark results on a single GPU (averaged over multiple runs):
| Precision | Time (ms) | FPS | Peak Mem (MB) | Model Load (MB) |
|---|---|---|---|---|
| BF16 | 248 | 4.0 | 11098 | 7447 |
| FP16 | 248 | 4.0 | 11098 | 7447 |
| FP32 | 615 | 1.6 | 10468 | 8256 |
Note: BF16 is recommended over FP16 to avoid potential NaN issues with DINOv3.
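To obtain comparable measurements on another GPU, a rough timing loop like the following can be used (a sketch; it assumes `pipe` is loaded on CUDA as in the Quick Start, and warm-up/averaging details may differ from the original benchmark):

```python
# Rough latency / peak-memory measurement; assumes `pipe` is loaded on CUDA
# as in the Quick Start and that an example image path is available.
import time
import torch

_ = pipe(image="path/to/image.jpg", output_type="np")  # warm-up run
torch.cuda.reset_peak_memory_stats()

runs = 10
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(runs):
    _ = pipe(image="path/to/image.jpg", output_type="np")
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / runs

print(f"avg time: {elapsed * 1000:.1f} ms  ({1.0 / elapsed:.1f} FPS)")
print(f"peak mem: {torch.cuda.max_memory_allocated() / 1024**2:.0f} MB")
```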
We introduce TransNormal-Synthetic, a physics-based dataset of transparent labware with rich annotations.
Download: HuggingFace
| Property | Value |
|---|---|
| Total views | 4,000 |
| Scenes | 10 |
| Image resolution | 800 x 800 |
| Format | WebDataset (.tar shards) |
| Total size | ~7.5 GB |
| License | CC BY-NC 4.0 |
Each sample contains paired RGB images (with/without transparent objects), surface normal maps, depth maps, object masks (all / transparent-only), material-changed RGB, and camera metadata (intrinsics).
import webdataset as wds
dataset = wds.WebDataset(
    "hf://datasets/Longxiang-ai/TransNormal-Synthetic/transnormal-{000000..000007}.tar"
).decode("pil")
for sample in dataset:
    rgb = sample["with_rgb.png"]
    normal = sample["with_normal.png"]
    mask = sample["with_mask_transparent.png"]
    break
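For repeated training runs it may be preferable to download the shards once instead of streaming them; a minimal sketch using huggingface_hub (the target directory ./data/transnormal-synthetic is just an example path):

```python
# Download all shards locally once, then point WebDataset at the local files.
# The target directory ./data/transnormal-synthetic is an arbitrary example path.
from huggingface_hub import snapshot_download
import webdataset as wds

local_dir = snapshot_download(
    "Longxiang-ai/TransNormal-Synthetic",
    repo_type="dataset",
    local_dir="./data/transnormal-synthetic",
)
dataset = wds.WebDataset(
    f"{local_dir}/transnormal-{{000000..000007}}.tar"
).decode("pil")
```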
If you find our work useful, please consider citing:
@misc{li2026transnormal,
    title={TransNormal: Dense Visual Semantics for Diffusion-based Transparent Object Normal Estimation},
    author={Mingwei Li and Hehe Fan and Yi Yang},
    year={2026},
    eprint={2602.00839},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2602.00839},
}
This work builds upon:
- Lotus - Diffusion-based depth and normal estimation
- DINOv3 - Self-supervised vision transformer from Meta AI
- Stable Diffusion 2 - Base diffusion model
This project is licensed under CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0). See the LICENSE file for details.
For commercial licensing inquiries, please contact the authors.
For questions or issues, please open a GitHub issue or contact the authors.
