Description
Context: The forward method in tinyllava/model/vision_tower/base.py assumes a [CLS] token is always present at index 0.
Code Reference: https://github.com/TinyLLaVA/TinyLLaVA_Factory/blob/main/tinyllava/model/vision_tower/base.py
In tinyllava/model/vision_tower/base.py, the forward method (lines 51-54) contains the following logic:
```python
if kwargs.get('vision_feature_select_strategy', 'patch') == 'patch':
    image_features = image_features[:, 1:]
elif kwargs.get('vision_feature_select_strategy', 'patch') == 'cls_patch':
    image_features = image_features
```

As SigLIP models (e.g., siglip-so400m-patch14-384) do not use a [CLS] token, I think this slicing silently discards data by removing the first actual image patch.
When using SigLIP, the current code removes the first spatial patch (top-left corner) of the image. For a 384x384 input, this reduces the token count from 729 to 728, leading to a loss of visual information and potential dimension mismatches.
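A possible fix would be to gate the slice on whether the backbone actually prepends a [CLS] token. Below is a minimal, hedged sketch of that idea; the `has_cls_token` flag and the `select_features` helper are hypothetical names for illustration, not part of the TinyLLaVA_Factory codebase, and the real code would slice a tensor with `image_features[:, 1:]` rather than lists.

```python
def select_features(image_features, strategy="patch", has_cls_token=True):
    """Drop the leading [CLS] token only when the backbone emits one.

    image_features: (batch, num_tokens, dim)-shaped sequence; modeled
    here as a list of per-image token lists for a dependency-free sketch.
    has_cls_token: hypothetical flag; would be False for SigLIP-style
    backbones that produce patch tokens only.
    """
    if strategy == "patch" and has_cls_token:
        # CLIP-style ViT: token 0 is [CLS], so dropping it is correct.
        return [tokens[1:] for tokens in image_features]
    # 'cls_patch', or a CLS-free backbone such as SigLIP: keep everything.
    return image_features
```

With this guard, a SigLIP tower keeps all 729 patch tokens for a 384x384 input instead of losing the top-left patch, while CLIP-style towers behave exactly as before.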