
Conversation

@ruiheng123 ruiheng123 commented Nov 9, 2025

🚀 Feature(wrh): Add implicit 3D feature injection to the action head (evaluated on LIBERO-Long tasks)

This PR implements my modification to VLA-Adapter: injecting implicit 3D features into the action head.

📚 Method:

Inspired by PointVLA and Evo-0, I use π3, a foundation model for 3D reconstruction (similar to VGGT), and inject its 3D hidden features into the action head to fuse them with the action features. Specifically, the fusion is formulated as follows:

As described in the original paper, given the learnable action tokens (the input of the action head, denoted $$\mathbf{h_{\text{AT}}}$$), three attention operations are executed: $$\text{SA}(\mathbf{h_{\text{AT}}})$$, $$\text{CA}(\mathbf{h_{\text{AT}}}, \mathbf{h_{\text{AQ}}})$$, and $$\text{CA}(\mathbf{h_{\text{AT}}}, \mathbf{h_{\text{vis}}})$$. Written compactly:

$$\text{BridgeAttention}(\mathbf{h_{\text{AT}}}, \mathbf{h_{\text{AT}}} \odot \mathbf{h_{\text{AQ}}} \odot \mathbf{h_{\text{vis}}})$$

where $$\odot$$ means concatenation.

Remark: $$\mathbf{h_{\text{vis}}}$$ is the hidden state of the VLM and $$\mathbf{h_{\text{AQ}}}$$ is the hidden state of the Action Query.

In my implementation, I add a 4th cross-attention, denoted $$\text{CA}(\mathbf{h_{\text{AT}}}, \mathbf{h_{\text{3D}}})$$, where $$\mathbf{h_{\text{3D}}}$$ is the hidden state of π3. The bridge attention then becomes

$$\text{BridgeAttention}(\mathbf{h_{\text{AT}}}, \mathbf{h_{\text{AT}}} \odot \mathbf{h_{\text{AQ}}} \odot \mathbf{h_{\text{vis}}} \odot \mathbf{h_{\text{3D}}})$$
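As a minimal sketch of the extra branch (module and variable names here are illustrative, not the repository's actual code), the action tokens attend to the π3 hidden state through one additional cross-attention alongside the existing self-attention and the two cross-attentions to the Action Query and VLM hidden states:

```python
# Illustrative sketch only; the PR's real module names and fusion details may differ.
import torch
import torch.nn as nn

class BridgeAttention3D(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # SA(h_AT)
        self.ca_aq  = nn.MultiheadAttention(dim, num_heads, batch_first=True)     # CA(h_AT, h_AQ)
        self.ca_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)     # CA(h_AT, h_vis)
        self.ca_3d  = nn.MultiheadAttention(dim, num_heads, batch_first=True)     # CA(h_AT, h_3D), added branch

    def forward(self, h_at, h_aq, h_vis, h_3d):
        # Self-attention over the learnable action tokens
        x_sa, _ = self.self_attn(h_at, h_at, h_at)
        # Cross-attention branches; the 3D branch attends to the π3 hidden state
        x_aq,  _ = self.ca_aq(h_at, h_aq, h_aq)
        x_vis, _ = self.ca_vis(h_at, h_vis, h_vis)
        x_3d,  _ = self.ca_3d(h_at, h_3d, h_3d)
        # Fuse the branches with a residual sum (the exact fusion in the PR may differ)
        return h_at + x_sa + x_aq + x_vis + x_3d
```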

I ran experiments on my 4090D GPUs, varying the injection layer (13 is the middle layer, 23 is the last) to explore whether π3 enhances task execution and spatial understanding.

Remark: the training loss (logged on wandb) when injecting at the first layer is higher than when injecting at the middle or last layer, so the first-layer result is not shown here.

You can also view this diagram (the left is the default architecture, the right is my modification):

[Diagram: default bridge attention (left) vs. bridge attention with 3D injection (right)]

🧪 Experimental Setup

  • Device: 2× NVIDIA RTX 4090D (48GB) (author's setting: 4× H100)
  • Batch size: 8
  • Gradient accumulation steps: 2
  • Number of action-head layers: 24 (default)
  • Settings compared: inject at layer 13 (middle) vs. layer 23 (last)
  • Evaluation benchmark: LIBERO-Long
  • Comparison protocol: train for the same number of steps and evaluate at the same steps

📊 Performance Results

The results are shown in the following image, together with a comparison against the author's checkpoint.

[Per-task success rates on LIBERO-Long, compared with the author's checkpoint]

As a result, my runs show some improvement on task 9 (a task on which it is commonly difficult to achieve a high success rate in LIBERO-Long), while other tasks such as tasks 1 and 2 show a slight decrease.

Moreover, the overall result is shown in the following image.

[Overall success rate comparison]

Injecting the 3D features yields a modest overall improvement.

🔧 Where to set it

The added run.sh shows where to enable the 3D injection: set use_3d to True and choose inject_layers; passing 'all' injects the 3D features into all action-head layers.

  --use_3d True \ 
  --inject_layers all \

For evaluation (eval.sh for LIBERO and eval2.sh for CALVIN), add the same flags.
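For reference, here is a minimal sketch of how the inject_layers value could map to action-head layer indices. The flag names are the ones shown above, but this helper itself is hypothetical and may not match the repository's actual parsing:

```python
# Hedged sketch; the actual handling of --inject_layers in the training code may differ.
def resolve_inject_layers(inject_layers: str, num_layers: int = 24) -> list[int]:
    """Map the --inject_layers value to action-head layer indices."""
    if inject_layers == "all":
        return list(range(num_layers))                 # inject into every action-head layer
    return [int(i) for i in inject_layers.split(",")]  # e.g. "13" (middle) or "23" (last)

print(resolve_inject_layers("13"))   # [13]
print(resolve_inject_layers("all"))  # [0, 1, ..., 23]
```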

@WangYH-BUPT
Contributor

wow!!! Perfect!! I finished processing over a dozen meetings in the last two days and then worked on this! Thank you so much for your improvements to the adapter!
