Description
Context: The forward method in tinyllava/model/vision_tower/base.py assumes a [CLS] token is always present at index 0.
Code Reference: https://github.com/TinyLLaVA/TinyLLaVA_Factory/blob/main/tinyllava/model/vision_tower/base.py
In tinyllava/model/vision_tower/base.py, the forward method (lines 51-54) contains the following logic:
```python
if kwargs.get('vision_feature_select_strategy', 'patch') == 'patch':
    image_features = image_features[:, 1:]
elif kwargs.get('vision_feature_select_strategy', 'patch') == 'cls_patch':
    image_features = image_features
```

As SigLIP models (e.g., siglip-so400m-patch14-384) do not use a [CLS] token, I think this slicing silently discards data by removing the first actual image patch.
When using SigLIP, the current code removes the first spatial patch (top-left corner) of the image. For a 384x384 input, this reduces the token count from 729 to 728, leading to a loss of visual information and potential dimension mismatches.
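A possible fix would be to gate the slice on whether the backbone actually prepends a [CLS] token. Below is a minimal, hedged sketch of that idea; the `has_cls_token` flag and the `select_features` helper are hypothetical names for illustration, not part of the TinyLLaVA_Factory codebase, and the real code would slice a tensor with `image_features[:, 1:]` rather than lists.

```python
def select_features(image_features, strategy="patch", has_cls_token=True):
    """Drop the leading [CLS] token only when the backbone emits one.

    image_features: (batch, num_tokens, dim)-shaped sequence; modeled
    here as a list of per-image token lists for a dependency-free sketch.
    has_cls_token: hypothetical flag; would be False for SigLIP-style
    backbones that produce patch tokens only.
    """
    if strategy == "patch" and has_cls_token:
        # CLIP-style ViT: token 0 is [CLS], so dropping it is correct.
        return [tokens[1:] for tokens in image_features]
    # 'cls_patch', or a CLS-free backbone such as SigLIP: keep everything.
    return image_features
```

With this guard, a SigLIP tower keeps all 729 patch tokens for a 384x384 input instead of losing the top-left patch, while CLIP-style towers behave exactly as before.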