Bart Trzynadlowski, 2024
To learn how end-to-end neural network-based control of robots works, I built Alex Koch's low-cost robot arm and wrote an iPhone-based teleoperation system for data collection. This repository allows you to:
- Teleoperate the robot arm by waving an iPhone around
- Collect demonstrations for training
- Use one or multiple cameras as input
- Train ACT (Action Chunking with Transformers)
- Run the resulting policies
The project has three components:
- Python-based server for operating the robot, taking camera input, and receiving poses from the iOS app
- ARKit-based iOS app that transmits poses over the network
- ACT training and inference server
Imitation learning involves teleoperating a robot to gather example data of a task. During recording, the camera image (or images if using multiple cameras) and robot actuator angles are captured for each frame. These are then used to train ACT, which takes the current camera frame(s) and actuator angles and predicts the next target actuator angles for several time steps into the future.
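To make the chunked prediction concrete, here is a minimal NumPy sketch of the temporal ensembling scheme described in the ACT paper, where every chunk that still covers the current step contributes an exponentially weighted vote. The `policy` stub, joint count, and chunk size are placeholders, not this repo's actual model interface:

```python
import numpy as np

CHUNK_SIZE = 100   # actions predicted per policy query
NUM_JOINTS = 6     # hypothetical actuator count
M = 0.01           # exponential weighting constant, as in the ACT paper

def policy(image, qpos):
    # Placeholder for the trained ACT model: predicts target joint
    # angles for the next CHUNK_SIZE time steps in one shot.
    return np.zeros((CHUNK_SIZE, NUM_JOINTS))

history = []  # (start_step, predicted_chunk) for every query so far

def act(step, image, qpos):
    history.append((step, policy(image, qpos)))
    # Collect each chunk's prediction for this step, oldest first.
    valid = [(start, chunk[step - start]) for start, chunk in history
             if 0 <= step - start < CHUNK_SIZE]
    valid.sort(key=lambda pair: pair[0])
    actions = np.array([action for _, action in valid])
    # Exponential weights w_i = exp(-M * i), with i = 0 for the oldest
    # prediction, per the paper's temporal ensembling.
    weights = np.exp(-M * np.arange(len(valid)))
    weights /= weights.sum()
    return weights @ actions
```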
Typically, this is done by puppeteering a simpler "leader" arm whose joint angles are reflected to the "follower" arm. To avoid having to build a second arm, I wrote an iOS app that uses ARKit to track the phone's pose in 6-DoF space and control the end effector (i.e., gripper) pose in real time. The server receives poses from the iOS app over the network and uses inverse kinematics to figure out how to actuate the servos. To record an episode, press the Move button, which causes the robot to mirror the phone's motions. Pressing Stop or resetting the pose terminates the episode and writes the data to disk.
*Teleoperating the arm with an iPhone.*
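The server solves IK against the arm's actual geometry; purely as an illustration of the technique, here is a damped-least-squares IK step for a planar two-link arm (link lengths, damping, and target are made up for the example):

```python
import numpy as np

L1, L2 = 0.12, 0.12  # hypothetical link lengths in meters

def forward(q):
    # End-effector position of a planar two-link arm.
    x = L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1])
    y = L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def jacobian(q):
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [ L1 * c1 + L2 * c12,  L2 * c12]])

def ik_step(q, target, damping=1e-2):
    # Damped least squares: dq = J^T (J J^T + lambda^2 I)^-1 * error
    error = target - forward(q)
    J = jacobian(q)
    JJt = J @ J.T + (damping ** 2) * np.eye(2)
    return q + J.T @ np.linalg.solve(JJt, error)

# Iterate the joint angles toward the pose received from the phone.
q = np.array([0.3, 0.5])
target = np.array([0.15, 0.10])
for _ in range(100):
    q = ik_step(q, target)
print(forward(q))  # approximately equals target
```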
To discover the serial port on which the robot is connected, run:
```
python -m server --list-serial-ports
```
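Port discovery of this sort is typically a thin wrapper around pyserial; a minimal sketch (not necessarily the exact code in this repo):

```python
from serial.tools import list_ports

# Enumerate all serial devices visible to the OS.
for port in list_ports.comports():
    print(port.device, "-", port.description)
```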
To run the robot server, specify the serial port the robot is connected to. You may use wildcards, in which case the first matching port will be used. This is useful on macOS, where port names are long and cumbersome.
```
python -m server --serial-port=*usb*
```
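Wildcard matching like this can be done with the standard library's `fnmatch`; a sketch of how the first matching port might be selected (the function name is illustrative):

```python
from fnmatch import fnmatch
from serial.tools import list_ports

def find_serial_port(pattern: str) -> str:
    # Return the first port whose name matches the wildcard pattern,
    # e.g. "*usb*" matches "/dev/cu.usbmodem014D314A31381" on macOS.
    for port in list_ports.comports():
        if fnmatch(port.device.lower(), pattern.lower()):
            return port.device
    raise RuntimeError(f"No serial port matches {pattern!r}")
```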
The default camera is camera 0. You will need to experiment to figure out which is which on your system. To record episodes to the `cube` directory using camera 1:

```
python -m server --serial-port=*usb* --camera=1 --record-to=cube
```
Now, when moving the arm using the iOS app, episode datasets will be created in that directory. Each episode is written to its own folder as an HDF5 file along with a video recording (not used for training).
```
cube/
  example-0/
    data.hdf5
    video.mp4
  example-1/
    data.hdf5
    video.mp4
  ...
```
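Each `data.hdf5` can be inspected with h5py. The dataset names below (`observations/qpos`, `action`) follow the common ACT convention and are an assumption here, so list the file's keys first to see what was actually recorded:

```python
import h5py

with h5py.File("cube/example-0/data.hdf5", "r") as f:
    # Walk the file to see which groups and datasets it contains.
    f.visit(print)
    # Assumed ACT-style layout: per-frame joint angles and target actions.
    qpos = f["observations/qpos"][:]   # (num_frames, num_joints)
    actions = f["action"][:]           # (num_frames, num_joints)
    print(qpos.shape, actions.shape)
```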
*Training episode captured with two cameras.*
To replay an episode to the robot (here assuming Windows and COM4 as the serial port):
```
python -m server --serial-port=COM4 --replay-from=cube/example-0/data.hdf5
```
Multiple cameras can be specified using a comma-delimited list:
```
python -m server --serial-port=COM4 --camera=0,1 --record-to=cube_stereo
```
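Camera indices map to OpenCV capture devices; as a sketch of what multi-camera capture involves (frames are only approximately simultaneous, since each read blocks in turn):

```python
import cv2

CAMERA_IDS = [0, 1]
captures = [cv2.VideoCapture(i) for i in CAMERA_IDS]

def grab_frames():
    # Read one frame from each camera for the current time step.
    frames = []
    for capture in captures:
        ok, frame = capture.read()
        if not ok:
            raise RuntimeError("Camera read failed")
        frames.append(frame)
    return frames
```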
Once a few dozen episodes have been collected, a policy can be trained. The ACT paper is highly recommended as prerequisite reading. The model has many parameters, but sensible defaults are used here. To train a model on episodes stored in `cube`:

```
python -m act --train --dataset-dir=cube --checkpoint-dir=cube/checkpoints
```
The resulting checkpoints will be output to `cube/checkpoints`. By default, the chunk size is 100. Increasing the batch size (and then lowering the learning rate) can help:

```
python -m act --train --dataset-dir=cube --checkpoint-dir=cube/checkpoints --batch-size=64 --lr=5e-5
```
Check out Tony Zhao's helpful ACT tuning tips.
To run a trained policy, first launch the inference server:

```
python -m act --infer --checkpoint-file=cube/checkpoints/policy_best.ckpt
```
Next, run the robot server in inference mode. It will send camera frames and actuator values to the inference server and await responses, which it will execute (see the sketch after these commands). Assuming the inference server is running on the same machine:
```
python -m server --serial-port=COM4 --infer
```
Otherwise, specify the endpoint explicitly:
```
python -m server --serial-port=COM4 --infer --inference-endpoint=192.168.0.100:8001
```
Make sure to specify the same cameras. For example, if the data was collected using two cameras:
```
python -m server --serial-port=COM4 --infer --camera=0,1
```
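The wire protocol between the two servers is internal to this repo, but the control loop amounts to something like the sketch below; all four callables are hypothetical stand-ins for the robot, camera, and network code:

```python
import time
from typing import Callable, Sequence

import numpy as np

def inference_loop(
    read_qpos: Callable[[], np.ndarray],               # current joint angles
    read_frames: Callable[[], Sequence[np.ndarray]],   # one frame per camera
    query_policy: Callable[[np.ndarray, Sequence[np.ndarray]], np.ndarray],
    command_qpos: Callable[[np.ndarray], None],        # drive the servos
    hz: float = 30.0,
) -> None:
    # Each tick: observe, ask the inference server for the next target
    # joint angles, execute them, and hold a fixed control rate.
    period = 1.0 / hz
    while True:
        start = time.time()
        target = query_policy(read_qpos(), read_frames())
        command_qpos(target)
        time.sleep(max(0.0, period - (time.time() - start)))
```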
*Robot autonomously grasping an object and placing it in a bin.*
Have fun!