thu-ml/RoboticsDiffusionTransformer

Nice work! How do you deal with different embodiments when sharing the same action space?

Hello @csuastt @ethan-iai,

Thank you for sharing it!

In the appendix, you show the following action space:

[image]

But for different robotic arms, the "arm joint position" implies different forward kinematics. How do you deal with these differences?

Another approach might be to convert all actions to end-effector 6D pose + gripper opening width. Since you did not choose this approach, I guess there are some problems with it (e.g., some datasets don't support it, or there is no forward kinematics info?). Could you please explain why it is infeasible...

It's nice work that took significant engineering effort. Many thanks for it!

Thank you for your insightful question.

Q1. How do you deal with these differences?

A: In short, we do not apply additional processing after embedding actions into the unified space. Here are the two main reasons why there is no extra pre-processing:

  1. We adopt a pretrain-finetune paradigm. The pre-training stage, which uses multi-robot data, focuses on learning general movement patterns rather than precise control, so it does not require all arm joint positions to share the same forward kinematics. This allows the model to develop a "déjà vu" understanding of robot motion without being tied to specific embodiments. During the fine-tuning stage, we align the model with the target embodiment and equip it to perform precise control with the designated kinematics.

  2. Our evaluation results demonstrate that the model maintains a consistent sampling MSE across different embodiments, even when controlled with arm joint positions and varying forward kinematics. One possible explanation is that the model may be inferring its embodiment from visual observations and proprioception, addressing the heterogeneity of embodiments on its own.
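
As a rough illustration of what "embedding actions into the unified space" without any per-robot kinematic processing could look like, here is a minimal sketch. The slot names, slot indices, and `UNIFIED_DIM` are hypothetical and chosen only for this example; the actual layout is the one given in the paper's appendix. Each robot writes its raw values into the slots it has, and a mask records which entries are filled.

```python
import numpy as np

# Hypothetical layout for illustration only; not the layout used by RDT.
UNIFIED_DIM = 16
SLOTS = {
    "arm_joint_pos": slice(0, 10),   # room for up to 10 arm joints
    "gripper_open": slice(10, 11),
    "eef_pos": slice(11, 14),
    "base_vel": slice(14, 16),
}


def embed_action(parts):
    """Place raw per-robot action values into a fixed-size vector.

    Returns (vector, mask). Unused entries stay zero and are marked False.
    No kinematic conversion is applied: joint positions are copied as-is.
    """
    vec = np.zeros(UNIFIED_DIM, dtype=np.float32)
    mask = np.zeros(UNIFIED_DIM, dtype=bool)
    for name, values in parts.items():
        slot = SLOTS[name]
        values = np.asarray(values, dtype=np.float32).ravel()
        if values.size > slot.stop - slot.start:
            raise ValueError(f"too many values for slot '{name}'")
        end = slot.start + values.size
        vec[slot.start:end] = values
        mask[slot.start:end] = True
    return vec, mask


# Two arms with different joint counts share the same vector format.
vec6, mask6 = embed_action({"arm_joint_pos": [0.1] * 6, "gripper_open": [0.5]})
vec7, mask7 = embed_action({"arm_joint_pos": [0.1] * 7, "gripper_open": [1.0]})
```

In this sketch, the same index holds "joint 1" for both arms even though the physical meaning of that joint differs between them, which is exactly the heterogeneity discussed above.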

Q2. Why is conversion to EEF not adopted?

A: We chose not to convert to EEF for several reasons:

  1. Broader Usage and Openness: By working directly with joint positions and other kinematic configurations, we ensure compatibility with a wider range of datasets and applications, which is more aligned with our openness principles.

  2. Potential for Increased Errors: Conversion to EEF can introduce additional errors that may hinder the model's performance in tasks requiring precise control.

  3. Dataset Limitations: As mentioned, some datasets only provide actions represented by joint positions, making direct conversion to EEF almost impossible.
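
To make the kinematics point concrete, here is a toy sketch using a planar two-link arm. The function `planar_fk` and all link lengths are made up for illustration and do not correspond to any robot in the datasets. It shows that identical joint positions map to different end-effector positions on different arms, and that converting to EEF requires link parameters that a dataset may not supply. The last lines show one possible way conversion could add error (a slightly wrong link length); this is an assumption for illustration, not necessarily the error source the authors have in mind.

```python
import math


def planar_fk(joint_angles, link_lengths):
    """Forward kinematics of a planar serial arm: returns the (x, y) tip position."""
    x = y = 0.0
    theta = 0.0
    for q, length in zip(joint_angles, link_lengths):
        theta += q                      # joint angles accumulate along the chain
        x += length * math.cos(theta)
        y += length * math.sin(theta)
    return x, y


# The same joint positions on two hypothetical arms with different link lengths.
q = (math.pi / 2, -math.pi / 2)
tip_a = planar_fk(q, (0.3, 0.3))
tip_b = planar_fk(q, (0.5, 0.2))
print("arm A tip:", tip_a)
print("arm B tip:", tip_b)

# A 1 cm error in an assumed link length shifts the converted EEF position.
tip_a_miscalibrated = planar_fk(q, (0.31, 0.3))
eef_error = math.dist(tip_a, tip_a_miscalibrated)
print("EEF error from miscalibration:", eef_error)
```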

Thank you again for your engagement! Please feel free to reach out if you have further questions.