the re-targeting game

What is re-targeting in robotics?

Re-targeting is the mapping of human motions to robot motions, with the goal of transferring the natural movements of a human to a robot. Most of the discussion here is about re-targeting egocentric human hand motion to robot hands (mainly the Inspire hand).

Project github: https://github.com/angkul07/egodex_retargeting

Generally, re-targeting is very successful. Re-targeted datasets are pseudo-teleoperation datasets. For example, the KL divergence between the re-targeted robot movements I build in the experiment below and the EgoDex hand movements is close to zero (0.0285). My goal was to build such re-targeted datasets from open-source egocentric datasets so that they can be used to train robot policies. However, despite being a successful alternative to teleoperation datasets, re-targeting has one major problem.
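A number like this can be estimated in several ways. Here is a minimal sketch that assumes a histogram estimate over a single 1-D feature (for example, one joint angle or one wrist coordinate over time). The function name, the bin count, and the smoothing constant are illustrative assumptions, not necessarily how the 0.0285 figure was computed.

```python
import numpy as np

def histogram_kl(p_samples, q_samples, bins=50, eps=1e-9):
    """Estimate KL(P || Q) from two 1-D sample sets using shared histogram bins.

    Illustrative helper: bin count and eps smoothing are assumptions.
    """
    p_samples = np.asarray(p_samples, dtype=float).ravel()
    q_samples = np.asarray(q_samples, dtype=float).ravel()

    # Both histograms must share the same bin edges to be comparable.
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    edges = np.linspace(lo, hi, bins + 1)

    p, _ = np.histogram(p_samples, bins=edges)
    q, _ = np.histogram(q_samples, bins=edges)

    # Smooth empty bins, then normalize to probability vectors.
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    human = rng.normal(0.0, 1.0, 5000)          # stand-in for a human hand feature
    robot = human + rng.normal(0.0, 0.05, 5000)  # stand-in for the re-targeted feature
    print(histogram_kl(human, robot))
```

A value near zero only says the two marginal distributions look alike. It says nothing about contact forces or friction, which is the gap discussed next.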

Fundamentally, human hands and robot hands work differently. Physical fidelity, such as force transfer, touch, and friction, is present in the human hand but absent in the re-targeted robot hand. But it is also absent from most teleop datasets and from every egocentric video dataset. The real issue is that when one policy is trained on a re-targeted dataset and another on an egocentric dataset, performance is worse for the former, even though the re-targeted dataset contains very similar actions and observation states to the egocentric dataset. The physical fidelity gap is hard to close. The second issue is that VLAs can’t learn from the re-targeted dataset.
Sim-to-real transfer for a re-targeted dataset is very hard. After talking to a friend, I concluded that it is just a stupid idea.

I tried to train an ACT policy on a re-targeted dataset of a basic pick-and-place task from the EgoDex dataset. The experiment is divided into four stages.

Stage 1: Stabilization

This stage takes the 30 Hz egocentric RGB video from the AVP head-mounted camera (from the EgoDex dataset) and warps every frame into the reference frame's viewpoint of the tabletop. The warp is a plane-induced homography derived analytically from the known per-frame camera pose and the estimated table-plane y-coordinate.
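The plane-induced homography can be sketched as below. Everything about conventions here is an assumption for illustration: poses are camera-to-world (X_world = R @ X_cam + c), the table is the world plane y = table_y, and K is a pinhole intrinsics matrix shared by all frames. The function names are hypothetical and are not taken from the project code.

```python
import numpy as np

def plane_homography(K, R_ref, c_ref, R_i, c_i, table_y):
    """Homography taking pixels of frame i to the reference view.

    Exact only for points on the world plane y = table_y.
    Assumed pose convention: X_world = R @ X_cam + c.
    """
    n_world = np.array([0.0, 1.0, 0.0])

    # Table plane expressed in camera-i coordinates: n_i . X_i = d_i
    n_i = R_i.T @ n_world
    d_i = table_y - n_world @ c_i

    # Rigid motion from camera i to the reference camera: X_ref = R X_i + t
    R = R_ref.T @ R_i
    t = R_ref.T @ (c_i - c_ref)

    # On the plane, (n_i . X_i) / d_i == 1, so t can be folded into a 3x3 matrix.
    H = K @ (R + np.outer(t, n_i) / d_i) @ np.linalg.inv(K)
    return H / H[2, 2]

def project(K, R, c, X_world):
    """Pinhole projection of a world point into a camera with pose (R, c)."""
    X_cam = R.T @ (X_world - c)
    x = K @ X_cam
    return x[:2] / x[2]
```

The resulting 3x3 matrix can then be handed to an image-warping routine such as OpenCV's `cv2.warpPerspective` to resample each frame. Points off the table plane (the hand, tall objects) are not mapped exactly and will show some residual parallax.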

Output:

0_stabilized.mp4

This solves an important problem. Every frame has two kinds of motions superimposed:

Signal - the hand and object moving (what the BC policy must learn).
Noise - the wearer's head bobbing (what the policy should ignore).

So, without stabilization, the ACT ResNet18 vision backbone spends capacity modelling head motion instead of manipulation dynamics, and consecutive-frame pixel displacement is dominated by ego-motion rather than the actual grasp/transport/place motion. In simple words, ACT will focus on head motion instead of learning the actual task.

The output of stage 1 was fed to the ACT policy in stage 4.

Stage 2: Action Alignment

This stage converts the active hand's world-frame wrist trajectory from an EgoDex HDF5 file into a 6-DOF UR5e joint-angle trajectory plus a binary gripper signal. It's the bridge from "where the human wrist was in 3-D space" to "what joint angles a robot arm would need to put its tool flange there".
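To make the binary gripper signal concrete, here is one possible sketch. It assumes the signal comes from the thumb-tip to index-tip distance with two thresholds (hysteresis), so the gripper does not chatter when the distance hovers near a single cutoff. The function name, the input layout, and the threshold values in metres are all hypothetical and are not taken from the project.

```python
import numpy as np

def binary_gripper(thumb_tip, index_tip, close_below=0.03, open_above=0.05):
    """Turn per-frame fingertip positions of shape (T, 3) into a 0/1 gripper signal.

    1 = closed, 0 = open. Thresholds are illustrative values in metres.
    """
    thumb_tip = np.asarray(thumb_tip, dtype=float)
    index_tip = np.asarray(index_tip, dtype=float)
    dist = np.linalg.norm(thumb_tip - index_tip, axis=-1)

    out = np.zeros(len(dist), dtype=np.int8)
    closed = False
    for i, d in enumerate(dist):
        # Hysteresis: only switch state when the distance clearly crosses a threshold.
        if closed and d > open_above:
            closed = False
        elif not closed and d < close_below:
            closed = True
        out[i] = int(closed)
    return out

if __name__ == "__main__":
    gaps = np.array([0.08, 0.04, 0.02, 0.04, 0.06])
    thumb = np.zeros((5, 3))
    index = np.stack([gaps, np.zeros(5), np.zeros(5)], axis=1)
    print(binary_gripper(thumb, index))  # [0 0 1 1 0]
```

The joint-angle half of this stage needs an inverse kinematics solver for the UR5e and is not sketched here.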