[Bug] Hardcoded slicing in forward method causes data loss for SigLIP vision towers #203

@jhjangjh

Description

Context: The forward method in tinyllava/model/vision_tower/base.py assumes a [CLS] token is always present at index 0.

Code Reference: https://github.com/TinyLLaVA/TinyLLaVA_Factory/blob/main/tinyllava/model/vision_tower/base.py

In tinyllava/model/vision_tower/base.py, the forward method (lines 51-54) contains the following logic:

if kwargs.get('vision_feature_select_strategy', 'patch') == 'patch':
    image_features = image_features[:, 1:]
elif kwargs.get('vision_feature_select_strategy', 'patch') == 'cls_patch':
    image_features = image_features

Since SigLIP models (e.g., siglip-so400m-patch14-384) do not use a [CLS] token, I believe this slicing silently discards the first actual image patch.
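
For reference, a quick check against the Hugging Face transformers SigLIP implementation (just a sketch; it assumes SiglipVisionModel can load the vision tower from the full siglip-so400m-patch14-384 checkpoint) suggests that every output token is a spatial patch, with no extra [CLS] slot:

import torch
from transformers import SiglipVisionModel

# Load only the vision tower from the SigLIP checkpoint.
model = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")
model.eval()

# Dummy 384x384 image batch.
pixel_values = torch.zeros(1, 3, 384, 384)
with torch.no_grad():
    out = model(pixel_values=pixel_values)

num_patches = (384 // 14) ** 2  # 27 * 27 = 729 patches
seq_len = out.last_hidden_state.shape[1]
print(seq_len, num_patches)  # expected to be equal (729), i.e. no [CLS] token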

When using SigLIP, the current code removes the first spatial patch (top-left corner) of the image. For a 384x384 input, this reduces the token count from 729 to 728, leading to a loss of visual information and potential dimension mismatches.
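
One possible direction for a fix, offered only as a sketch rather than a tested patch: gate the slicing on whether the backbone actually prepends a [CLS] token. Here has_cls_token is a hypothetical per-backbone flag (e.g. True for CLIP ViTs, False for SigLIP); it does not exist in TinyLLaVA_Factory today and would need to be set when each vision tower is constructed.

strategy = kwargs.get('vision_feature_select_strategy', 'patch')
# Hypothetical per-backbone flag; defaulting to True keeps current CLIP behavior.
has_cls_token = getattr(self, 'has_cls_token', True)

if strategy == 'patch' and has_cls_token:
    # Drop the leading [CLS] token only when the backbone actually prepends one;
    # for SigLIP every token is a spatial patch, so nothing should be removed.
    image_features = image_features[:, 1:]

An alternative would be to branch on the vision tower class instead of introducing a new flag; either way, the key point is that the 'patch' strategy should be a no-op for backbones without a [CLS] token.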
