A tiny Vision-Language-Action (VLA) model for robotic manipulation tasks, combining Mamba2 state-space blocks with QLoRA-based parameter-efficient fine-tuning.
- Efficient vision encoding with spatial attention mechanisms
- Language understanding through transformer encoders
- Temporal modeling using optimized Mamba2 blocks
- QLoRA adaptation for parameter-efficient fine-tuning
- Support for multimodal inputs including contact forces
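The temporal-modeling bullet above can be illustrated with a minimal diagonal state-space recurrence in the Mamba style, where the decay, input, and output gates are all input-dependent. This is a hedged sketch, not the repository's implementation: the function name `selective_scan` and all shapes here are illustrative assumptions.

```python
import numpy as np

def selective_scan(x, a, b, c):
    """Toy diagonal selective state-space recurrence (Mamba-style, illustrative).

    h_t = a_t * h_{t-1} + b_t * x_t
    y_t = c_t * h_t
    x, a, b, c: arrays of shape (T, D) -- per-timestep, per-channel parameters.
    """
    T, D = x.shape
    h = np.zeros(D)
    ys = np.empty((T, D))
    for t in range(T):
        h = a[t] * h + b[t] * x[t]  # input-dependent decay and input gate
        ys[t] = c[t] * h            # input-dependent output projection
    return ys

rng = np.random.default_rng(0)
T, D = 8, 4
x = rng.standard_normal((T, D))
a = 1.0 / (1.0 + np.exp(-rng.standard_normal((T, D))))  # decays in (0, 1)
b = rng.standard_normal((T, D))
c = rng.standard_normal((T, D))
y = selective_scan(x, a, b, c)
print(y.shape)  # (8, 4)
```

Real Mamba2 blocks use a hardware-aware parallel scan and learned projections to produce `a`, `b`, and `c` from the input; the sequential loop above only shows the recurrence they compute.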
[Installation instructions]
Best model (43.4% accuracy @2%) available at: https://www.dropbox.com/scl/fi/36i9j8nx54uqbpqwycock/best_model.pth?rlkey=bhfq2grswkuc9iu6r8w5bcefh&st=x4hgvyly&dl=0
[Basic usage examples]
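The QLoRA adaptation listed in the features can be summarized as: freeze a 4-bit-quantized base weight and train only a small low-rank update on top of it. Below is a hedged numpy sketch of that idea; the toy symmetric 4-bit quantizer, the names (`quantize_4bit`, `A`, `B`, `alpha`), and all dimensions are illustrative assumptions, not this repository's code (real QLoRA uses NF4 quantization with blockwise scales).

```python
import numpy as np

def quantize_4bit(w):
    """Toy symmetric 4-bit quantization (illustrative, not NF4)."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
d_out, d_in, r = 16, 32, 4
W = rng.standard_normal((d_out, d_in)).astype(np.float32)
Wq, s = quantize_4bit(W)  # frozen 4-bit base weight

# Trainable low-rank adapter: B zero-initialized so training starts from
# the base model's behavior.
A = (rng.standard_normal((r, d_in)) * 0.01).astype(np.float32)
B = np.zeros((d_out, r), dtype=np.float32)
alpha = 16.0

x = rng.standard_normal(d_in).astype(np.float32)
y = dequantize(Wq, s) @ x + (alpha / r) * (B @ (A @ x))
# With B zero-initialized, the adapter is a no-op at step 0:
assert np.allclose(y, dequantize(Wq, s) @ x)
```

During fine-tuning only `A` and `B` receive gradients, so the trainable parameter count is `r * (d_in + d_out)` per adapted layer instead of `d_in * d_out`.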
If you use this code in your research, please cite: [Your preferred citation format]
Licensed under the Apache License 2.0.