Feature(wrh): Adding Implicit 3D feature injection to action head (evaluated on Libero Long Tasks) #20
This PR implements my modification to VLA-Adapter, injecting implicit 3D features into the action head.
📚 Method:
Inspired by PointVLA and Evo-0, I utilize π3, a foundation model for 3D reconstruction (similar to VGGT), and inject its 3D hidden features into the action head to fuse action features and 3D features. Specifically, the fusion process is formulated as follows:
As mentioned in the original paper, given the learnable action tokens $$\mathbf{h_{\text{AT}}}$$ (the input of the action head), three attention operations are executed: a self-attention $$\text{SA}(\mathbf{h_{\text{AT}}})$$, a cross-attention $$\text{CA}(\mathbf{h_{\text{AT}}}, \mathbf{h_{\text{AQ}}})$$, and another cross-attention $$\text{CA}(\mathbf{h_{\text{AT}}}, \mathbf{h_{\text{vis}}})$$. Notably:

$$\mathbf{h_{\text{AT}}'} = \text{SA}(\mathbf{h_{\text{AT}}}) \odot \text{CA}(\mathbf{h_{\text{AT}}}, \mathbf{h_{\text{AQ}}}) \odot \text{CA}(\mathbf{h_{\text{AT}}}, \mathbf{h_{\text{vis}}})$$

where $$\odot$$ denotes concatenation.
Remark: $$\text{vis}$$ represents the hidden state of the VLM, and $$\text{AQ}$$ represents the hidden state of the Action Query.
In my implementation, I add a 4th cross-attention, noted as $$\text{CA}(\mathbf{h_{\text{AT}}}, \mathbf{h_{\text{3D}}})$$, where $$\mathbf{h_{\text{3D}}}$$ is the hidden state of π3. Thus the bridge attention can be noted as

$$\mathbf{h_{\text{AT}}'} = \text{SA}(\mathbf{h_{\text{AT}}}) \odot \text{CA}(\mathbf{h_{\text{AT}}}, \mathbf{h_{\text{AQ}}}) \odot \text{CA}(\mathbf{h_{\text{AT}}}, \mathbf{h_{\text{vis}}}) \odot \text{CA}(\mathbf{h_{\text{AT}}}, \mathbf{h_{\text{3D}}})$$
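To make the modified bridge attention concrete, here is a minimal PyTorch sketch of one action-head layer with the added 3D branch. This is an illustrative sketch, not the repo's actual code: the class name `BridgeBlock`, the head count, and the linear projection that fuses the concatenated outputs back to `dim` are all my assumptions.

```python
import torch
import torch.nn as nn

class BridgeBlock(nn.Module):
    """One action-head layer: self-attention on the action tokens, plus
    cross-attentions to the action query, the VLM visual features, and
    the added pi3 3D features, fused by concatenation (the '⊙' above)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.sa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ca_aq = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ca_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ca_3d = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # the added 4th branch
        # Assumed fusion: project the concatenated branch outputs back to dim
        self.proj = nn.Linear(4 * dim, dim)

    def forward(self, h_at, h_aq, h_vis, h_3d):
        sa_out, _ = self.sa(h_at, h_at, h_at)
        aq_out, _ = self.ca_aq(h_at, h_aq, h_aq)
        vis_out, _ = self.ca_vis(h_at, h_vis, h_vis)
        d3_out, _ = self.ca_3d(h_at, h_3d, h_3d)
        # Concatenate along the feature dimension, then project
        return self.proj(torch.cat([sa_out, aq_out, vis_out, d3_out], dim=-1))
```

When the 3D branch is disabled (`use_3d` off), the block would simply skip `ca_3d` and concatenate three outputs, matching the default architecture.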
I experimented on my 4090D GPUs, varying the injection layer (13 denotes the middle layer and 23 the last) to explore whether π3 enhances task execution and spatial understanding.
Remark: my training loss (logged on wandb) when injecting at the start layer (the 1st layer) is higher than when injecting at the middle or last layer, so the start-layer result is not shown here.
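A small sketch of how the `inject_layers` setting could be interpreted. The function name and the comma-separated format for multiple layers are my assumptions; only the `'all'`, `13`, and `23` values come from this PR.

```python
def parse_inject_layers(spec: str, num_layers: int = 24) -> list:
    """Resolve the --inject_layers setting: 'all' means inject 3D
    features into every action-head layer; otherwise a comma-separated
    list of layer indices, e.g. '13' (middle) or '23' (last)."""
    if spec == "all":
        return list(range(1, num_layers + 1))
    return [int(s) for s in spec.split(",")]
```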
Or you can view this diagram (left: the default architecture; right: my modification):
🧪 Experimental Setup
📊 Performance Results
The results are shown in the following image, alongside a comparison with the author's checkpoint.
My results on task 9 (a task on which a high success rate is commonly difficult to achieve in Libero Long) show some improvement, while other tasks such as tasks 1 and 2 show a slight decrease.
Moreover, the overall results are shown in the following image.
The injection of 3D features yields a modest overall improvement.
Where to see my changes
My added `run.sh` points out where to enable this 3D injection: set `use_3d` to `True` and choose `inject_layers`; passing `'all'` injects 3D features into all action-head layers.

```shell
--use_3d True \
--inject_layers all \
```

For evaluation (`eval.sh` for Libero and `eval2.sh` for Calvin), add the same settings.