Vision-Language-Action models
Introduction to VLA
Vision-Language-Action models (VLA)
Arxiv: Vision-Language-Action Models: Concepts, Progress, Applications and Challenges
Open X-Embodiment
Arxiv: Open X-Embodiment: Robotic Learning Datasets and RT-X Models
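The Open X-Embodiment data is distributed in RLDS (episode/steps) format and can be streamed with tensorflow_datasets. Below is a minimal sketch; the GCS bucket, dataset name/version, and feature keys are assumptions based on the public release and differ from dataset to dataset.
# Minimal sketch: stream one Open X-Embodiment dataset in RLDS format.
# Requires `tensorflow` and `tensorflow_datasets`; the bucket path, version,
# and feature keys below are assumptions and vary per dataset.
import tensorflow_datasets as tfds
builder = tfds.builder_from_directory(
    builder_dir="gs://gresearch/robotics/fractal20220817_data/0.1.0"  # assumed path/version
)
ds = builder.as_dataset(split="train[:10]")  # a few episodes for inspection
for episode in ds:
    for step in episode["steps"]:
        image = step["observation"]["image"]  # camera frame for this timestep
        action = step["action"]               # robot action (structure varies per dataset)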
Robotics Transformer (RT-1)
Arxiv: RT-1: Robotics Transformer for Real-World Control at Scale
BridgeData V2
Arxiv: BridgeData V2: A Dataset for Robot Learning at Scale
Github: https://github.com/rail-berkeley/bridge_data_v2
OpenVLA
Arxiv: OpenVLA: An Open-Source Vision-Language-Action Model
Installation
!git clone https://github.com/openvla/openvla
%cd openvla
!pip install -r requirements-min.txt
For example, to load openvla-7b for zero-shot instruction following in the BridgeData V2 environments with a WidowX robot:
# Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, ...)
# > pip install -r https://raw.githubusercontent.com/openvla/openvla/main/requirements-min.txt
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch
# Load Processor & VLA
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
"openvla/openvla-7b",
attn_implementation="flash_attention_2", # [Optional] Requires `flash_attn`
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True
).to("cuda:0")
# Grab image input & format prompt
image: Image.Image = get_from_camera(...)
prompt = "In: What action should the robot take to {<INSTRUCTION>}?\nOut:"
# Predict Action (7-DoF; un-normalize for BridgeData V2)
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
# Execute...
robot.act(action, ...)
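The returned action is a 7-dimensional continuous vector (end-effector position and rotation deltas plus a gripper command); unnorm_key="bridge_orig" selects the BridgeData V2 statistics used to un-normalize it before execution on the WidowX robot.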
OpenVLA-OFT
Arxiv: Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Github: https://github.com/moojink/openvla-oft
SmolVLA
HuggingFace: lerobot/smolvla_base
Arxiv: SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
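A minimal sketch for loading the pretrained checkpoint with the LeRobot library is shown below. The import path, install extra, and training-script arguments follow the lerobot/smolvla_base model card but change between lerobot releases, so treat them as assumptions rather than a fixed API.
# Minimal sketch (assumes lerobot installed with SmolVLA extras, e.g.
#   pip install -e ".[smolvla]"   from a clone of https://github.com/huggingface/lerobot).
# The module path below may differ in newer lerobot releases.
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy
policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
# Fine-tuning on your own LeRobot-format dataset (repo id is a placeholder):
# python lerobot/scripts/train.py \
#   --policy.path=lerobot/smolvla_base \
#   --dataset.repo_id=<your_dataset_repo_id> \
#   --batch_size=64 --steps=20000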
SO-101 robot
SO-ARM101 AI Robotic Arm PRO Kit for LeRobot