[Bug] MLflowVisBackend.add_image fails when image name has no extension #1681

@kpysanyi-raylytic

Environment

OrderedDict([
  ('sys.platform', 'linux'),
  ('Python', '3.11.14'),
  ('CUDA available', True),
  ('GPU 0,1', 'NVIDIA GeForce RTX 3090'),
  ('CUDA_HOME', '/usr/local/cuda'),
  ('NVCC', 'Cuda compilation tools, release 12.6, V12.6.85'),
  ('GCC', '9.4.0'),
  ('PyTorch', '2.1.0+cu121'),
  ('TorchVision', '0.16.0+cu121'),
  ('OpenCV', '4.11.0'),
  ('MMEngine', '0.10.7'),
])

Reproduces the problem - code sample

🧩 Minimal config to reproduce

vis_backends = [
    dict(type='LocalVisBackend'),
    dict(
        type='MLflowVisBackend',
        tracking_uri="http://localhost:2222",
        exp_name='vitdet_coco',
        run_name=None,
    ),
]

visualizer = dict(
    type='DetLocalVisualizer',
    vis_backends=vis_backends,
    name='visualizer',
)

default_hooks = dict(
    logger=dict(type='LoggerHook', interval=50),
    visualization=dict(
        type='DetVisualizationHook',
        draw=True,
        interval=1,
    ),
)

During validation, DetVisualizationHook calls:

visualizer.add_image('val_img', image, step)
  • LocalVisBackend → builds its own filename and saves val_img_{step}.png
  • MLflowVisBackend → passes the bare name val_img to mlflow.log_image(), so no image format can be inferred from it

Below is a minimal end-to-end reproduction. It uses a tiny COCO subset so it runs quickly, but it still triggers a validation visualization step and crashes in MLflowVisBackend.add_image() because the image name has no extension.

1) Prerequisites

  • MMDetection checkout with COCO available at data/coco/:

    • data/coco/annotations/instances_train2017.json
    • data/coco/annotations/instances_val2017.json
    • data/coco/train2017/*.jpg
    • data/coco/val2017/*.jpg
  • MLflow tracking server running locally (example):

    mlflow server --host 0.0.0.0 --port 2222

2) Create a minimal debug config

Save this as configs/_debug/mlflow_visbackend_no_ext_repro.py:

# Repro config: triggers MLflowVisBackend crash when name has no extension

_base_ = ['./vitdet_mask-rcnn_vit-b-dinov3.py']  # any base detector config that runs on COCO

# Make it fast: 10 train iters then run val once.
train_cfg = dict(type="IterBasedTrainLoop", max_iters=10, val_interval=10)

# Keep validation very small (2 samples) while still calling the visualization hook.
train_dataloader = dict(
    dataset=dict(
        indices=64,
        # avoid empty dataset when taking first N
        filter_cfg=dict(filter_empty_gt=False, min_size=0),
    )
)
val_dataloader = dict(dataset=dict(indices=2))
test_dataloader = val_dataloader

# Enable visualization every iter.
default_hooks = dict(
    logger=dict(type='LoggerHook', interval=1),
    checkpoint=dict(type='CheckpointHook', by_epoch=False, interval=10, save_last=True),
    visualization=dict(
        type='DetVisualizationHook',
        draw=True,
        interval=1,
        # show=False is default; when show=False, DetVisualizationHook uses an extension-less name
    ),
)

# Use both LocalVisBackend (works) and MLflowVisBackend (crashes).
vis_backends = [
    dict(type='LocalVisBackend'),
    dict(
        type='MLflowVisBackend',
        tracking_uri="http://localhost:2222",
        exp_name='vitdet_coco',
        run_name=None,
    ),
]
visualizer = dict(type='DetLocalVisualizer', vis_backends=vis_backends, name='visualizer')

Reproduces the problem - command or script

3) Run training

python tools/train.py configs/_debug/mlflow_visbackend_no_ext_repro.py

Reproduces the problem - error message

Traceback (most recent call last):
  File "/home/kpysanyi/xraivision-backbone/.venv/lib/python3.11/site-packages/PIL/Image.py", line 2526, in save
    format = EXTENSION[ext]
             ~~~~~~~~~^^^^^
KeyError: ''

The above exception was the direct cause of the following exception:
...
  File "/home/kpysanyi/xraivision-backbone/.venv/lib/python3.11/site-packages/mmengine/visualization/vis_backend.py", line 784, in add_image
    self._mlflow.log_image(image, name)
  File "/home/kpysanyi/xraivision-backbone/.venv/lib/python3.11/site-packages/mlflow/tracking/fluent.py", line 1473, in log_image
    MlflowClient().log_image(run_id, image, artifact_file, key, step, timestamp, synchronous)
  File "/home/kpysanyi/xraivision-backbone/.venv/lib/python3.11/site-packages/mlflow/tracking/client.py", line 2797, in log_image
    image.save(tmp_path)
  File "/home/kpysanyi/xraivision-backbone/.venv/lib/python3.11/site-packages/PIL/Image.py", line 2529, in save
    raise ValueError(msg) from e
ValueError: unknown file extension:
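
The empty key in KeyError: '' comes from the save format being looked up by file extension. Here is a minimal stdlib-only sketch of that lookup. The EXTENSION dict and infer_format helper below are simplified stand-ins written for illustration, not Pillow's actual code:

```python
import os

# Illustrative subset of an extension -> format registry (an assumption,
# standing in for the much larger table Pillow maintains).
EXTENSION = {".png": "PNG", ".jpg": "JPEG", ".jpeg": "JPEG"}


def infer_format(path: str) -> str:
    """Look up an image format from the file extension of ``path``."""
    ext = os.path.splitext(path)[1].lower()
    try:
        return EXTENSION[ext]
    except KeyError as e:
        # Same exception chaining shape as in the traceback above.
        raise ValueError(f"unknown file extension: {ext}") from e


print(infer_format("val_img_10.png"))  # PNG
try:
    infer_format("val_img")  # extension-less name, as passed by the hook
except ValueError as err:
    print(repr(err.__cause__))  # KeyError('')
```

With a name like val_img, os.path.splitext returns an empty extension, so the lookup key is '' and the error message ends right after the colon.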

Additional information

🔍 Root cause

Inconsistency between backends:

  • LocalVisBackend.add_image() always builds a valid filename ({name}_{step}.png)
  • MLflowVisBackend.add_image() assumes name already includes an extension and forwards it unchanged
  • The step argument is ignored in the MLflow backend

As a result, MLflowVisBackend crashes with hooks such as DetVisualizationHook that pass extension-less names.


✅ Proposed solutions

Either of the following would fix the issue cleanly:

Option A (configurable)

Add an argument to MLflowVisBackend, e.g.

auto_append_ext=True

which would automatically transform names like val_img → val_img_{step}.png.
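
A rough sketch of how such a flag could behave, to make the proposal concrete. MLflowBackendSketch and RecordingClient are hypothetical stand-ins (not mmengine or MLflow classes), so this runs without a tracking server:

```python
import os


class RecordingClient:
    """Hypothetical stand-in for the mlflow module; records log_image calls."""

    def __init__(self):
        self.logged = []

    def log_image(self, image, artifact_file):
        self.logged.append(artifact_file)


class MLflowBackendSketch:
    """Illustrative only: not mmengine's MLflowVisBackend."""

    def __init__(self, client, auto_append_ext=True):
        self._mlflow = client
        self.auto_append_ext = auto_append_ext

    def add_image(self, name, image, step=0):
        # Only rewrite names that lack an extension, and only when opted in.
        if self.auto_append_ext and not os.path.splitext(name)[1]:
            name = f"{name}_{step}.png"
        self._mlflow.log_image(image, name)


client = RecordingClient()
MLflowBackendSketch(client).add_image("val_img", image=None, step=10)
MLflowBackendSketch(client, auto_append_ext=False).add_image("val_img", image=None, step=10)
print(client.logged)  # ['val_img_10.png', 'val_img']
```

With the flag off, the name is forwarded as-is, which keeps today's behavior for anyone who relies on it.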

Option B (default behavior)

Make MLflowVisBackend.add_image() mirror LocalVisBackend behavior:

  • If name has no extension, append _{step}.png by default.

Example logic:

if '.' not in os.path.basename(name):
    name = f'{name}_{step}.png'
self._mlflow.log_image(image, name)

This would:

  • Make behavior consistent across backends
  • Respect the existing step argument
  • Avoid hard-to-debug runtime crashes
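
To show how the Option B check treats a few kinds of names, here is the same logic pulled into a standalone helper. resolve_artifact_name is a hypothetical name used only for this sketch:

```python
import os


def resolve_artifact_name(name: str, step: int = 0) -> str:
    """Return ``name`` unchanged if its basename contains a dot,
    otherwise append ``_{step}.png``."""
    if "." not in os.path.basename(name):
        return f"{name}_{step}.png"
    return name


for raw in ("val_img", "val_img.jpg", "runs.v2/val_img"):
    print(raw, "->", resolve_artifact_name(raw, step=10))
# val_img -> val_img_10.png
# val_img.jpg -> val_img.jpg
# runs.v2/val_img -> runs.v2/val_img_10.png
```

Checking only the basename means a dot in a directory component does not count as an extension. Names that already carry an extension are left untouched, so callers that pass full filenames would see no change.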

I’d be happy to submit a PR implementing this fix (either as a default behavior or a configurable option), if that aligns with the maintainers’ preferences.
