EdgeVLA: An open-source edge vision-language-action model for robotics.

Primary language: Python. License: MIT.

K-Scale Open Source Robotics

EdgeVLA: An Open-Source Vision-Language-Action Model

Introduction

We propose training efficient VLA models based on small language models (SLMs) such as Qwen2 with a non-autoregressive objective. Our early results show that these models have training characteristics similar to those of much larger counterparts. This repository is a direct fork of Prismatic VLMs and OpenVLA. You can train from scratch, finetune, or test our pre-trained models. See our blog or our report for more details about the architecture.
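To make the term "non-autoregressive" concrete, here is a toy sketch of the general idea only; it is not the EdgeVLA implementation. The function names, the callables standing in for a model, and the choice of 7 action dimensions are all hypothetical. An autoregressive decoder needs one forward pass per action dimension, while a non-autoregressive one emits all dimensions in a single pass.

```python
from typing import Callable, List

# Toy stand-ins for a model: each call represents one forward pass.
StepFn = Callable[[List[float], List[float]], float]
JointFn = Callable[[List[float], int], List[float]]


def decode_autoregressive(step_fn: StepFn, context: List[float], n_dims: int) -> List[float]:
    """Predict action dimensions one at a time, each conditioned on the previous ones."""
    actions: List[float] = []
    for _ in range(n_dims):
        actions.append(step_fn(context, actions))  # one forward pass per dimension
    return actions


def decode_non_autoregressive(joint_fn: JointFn, context: List[float], n_dims: int) -> List[float]:
    """Predict all action dimensions in a single forward pass."""
    return joint_fn(context, n_dims)


if __name__ == "__main__":
    calls = {"autoregressive": 0, "non_autoregressive": 0}

    def toy_step(context: List[float], previous: List[float]) -> float:
        calls["autoregressive"] += 1
        return sum(context) + len(previous)

    def toy_joint(context: List[float], n_dims: int) -> List[float]:
        calls["non_autoregressive"] += 1
        return [sum(context) + i for i in range(n_dims)]

    ctx = [0.5, 0.25]  # arbitrary toy context
    ar = decode_autoregressive(toy_step, ctx, 7)
    nar = decode_non_autoregressive(toy_joint, ctx, 7)
    print(ar == nar, calls)
```

In this toy setup both decoders return the same values; the difference is the number of model calls, which is what matters for latency on edge hardware.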

Setup

conda create --name evla python=3.10
conda activate evla
cd evla
pip install -e .

Next, add your Hugging Face token to a file named .hf_token so that you can run models like Llama 2/3 or Qwen2.
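As a minimal sketch of this step, the snippet below writes a token to .hf_token and reads it back. It assumes the file sits in the directory you run the scripts from and holds only the token string. The helper names save_hf_token and load_hf_token are hypothetical and not part of this repository.

```python
from pathlib import Path


def save_hf_token(token: str, path: Path = Path(".hf_token")) -> Path:
    """Write the token to `path` (hypothetical helper, not part of evla)."""
    path.write_text(token.strip() + "\n")
    # Restrict permissions so other users on the machine cannot read the token.
    path.chmod(0o600)
    return path


def load_hf_token(path: Path = Path(".hf_token")) -> str:
    """Read the token back, dropping surrounding whitespace."""
    return path.read_text().strip()
```

For example, save_hf_token("hf_your_token_here") creates the file in the current directory; the value shown is a placeholder for your own token. Keep .hf_token out of version control.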

Training/Inference

You can either train your own model from scratch or finetune a model with your own dataset. We recommend first running the debug mode to see if everything works.

CUDA_VISIBLE_DEVICES=0 LOCAL_RANK=0 MASTER_ADDR=localhost MASTER_PORT=1235 python vla-scripts/test.py \
 --vla.type "debug" \
 --data_root_dir DATA_ROOT_DIR \
 --run_root_dir RUN_ROOT_DIR

Full-scale training can be run with the 'evla' config defined in prismatic/conf/vla.py.
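The debug command above can also be assembled from Python, which is convenient when switching between configs. This is a sketch under assumptions: build_command and build_env are hypothetical helpers, and passing "evla" as the --vla.type value for full-scale training is inferred from the config name rather than confirmed here. Which script is the right entry point for full-scale training is also not stated, so the script path is left as a parameter.

```python
import os
import shlex
from typing import Dict, List


def build_command(vla_type: str, data_root_dir: str, run_root_dir: str,
                  script: str = "vla-scripts/test.py") -> List[str]:
    """Assemble the argument list for a single-process run (hypothetical helper)."""
    return [
        "python", script,
        "--vla.type", vla_type,
        "--data_root_dir", data_root_dir,
        "--run_root_dir", run_root_dir,
    ]


def build_env(gpu: int = 0, port: int = 1235) -> Dict[str, str]:
    """Copy the current environment and add the variables used by the debug command."""
    env = dict(os.environ)
    env.update({
        "CUDA_VISIBLE_DEVICES": str(gpu),
        "LOCAL_RANK": "0",
        "MASTER_ADDR": "localhost",
        "MASTER_PORT": str(port),
    })
    return env


if __name__ == "__main__":
    # DATA_ROOT_DIR and RUN_ROOT_DIR are placeholders; substitute your own paths.
    cmd = build_command("debug", "DATA_ROOT_DIR", "RUN_ROOT_DIR")
    print(shlex.join(cmd))  # dry run: print the command instead of launching it
    # To launch for real:
    #   import subprocess
    #   subprocess.run(cmd, env=build_env(), check=True)
```

Printing the command first makes it easy to check the arguments before committing GPU time to a run.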

TODO

  1. Remove the hardcoded attention setup.
  2. Export model to the HF format.
  3. Add support for LoRA.

Citation

@article{kscale2024evla,
    title={EdgeVLA: Efficient Vision-Language-Action Models},
    author={Paweł Budzianowski and Wesley Maa and Matthew Freed and Jingxiang Mo and Aaron Xie and Viraj Tipnis and Benjamin Bolte},
    year={2024}
}