Hi, great work on VLA-0! The simplicity of representing actions as text tokens is really elegant.
I made a minimal reimplementation using Hugging Face TRL's SFTTrainer for my own learning:
https://github.com/MilkClouds/vla0-trl
A few notes:
- ~1,200 lines (relies heavily on TRL abstractions)
- Flash Attention enabled by default
- Tested on LIBERO, reaches ~90% on average (slightly below the paper's results, likely due to hyperparameter differences)
It's not intended as a replacement, just a simpler entry point for those already familiar with the HF/TRL ecosystem. All credit goes to the original authors.
If there's any concern about this, please let me know.
Thanks again for open-sourcing the code and weights!