Overview of Updates

Since submission, we have added new real-world robot evaluations that demonstrate RT-Sketch's compatibility with:

New Embodiments


We deploy RT-Sketch on a Franka Panda robot, demonstrating its compatibility with various robot embodiments.

New IL Backbones


RT-Sketch is not tied to a particular IL backbone. We demonstrate this flexibility by implementing a version that uses Diffusion Policy instead of the original Transformer architecture.

New Tasks


We demonstrate two tasks on the Panda:

(1) setting the table, with different utensil and plate arrangements

(2) opening and closing cabinets

New Sketch Types


For the cabinet task, we demonstrate that the policy can learn to open/close based only on quick-to-draw arrows rather than sketches of the entire scene.


Implementation Details

For the above experiments, we use a Franka Panda robot with a Robotiq gripper. For each task, we collect on the order of 50-60 demonstrations consisting of delta actions (from a SpaceMouse) and observations from two RealSense cameras (wrist-mounted and table-mounted). We then manually sketch the goals (less than 15 minutes per task). We implement RT-Sketch with a goal-conditioned Diffusion Policy architecture, which uses ResNets to separately encode the agent image, wrist image, wrist depth image, and goal sketch, and concatenates the embeddings as input to the noise prediction network. Training takes ~3 hours per task on a single A5000 GPU, and we deploy the resulting policy onto the robot using DDIM as the denoiser and a Polymetis controller running at 10 Hz.
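As a rough illustration of the DDIM denoising mentioned above, the sketch below shows one deterministic DDIM update (eta = 0) on a flat action vector, together with an evenly spaced timestep schedule. It is not our deployment code: the function names, the pure-Python list representation, and the schedule sizes in the usage note are assumptions made for illustration, and in practice the noise prediction would come from the trained network conditioned on the concatenated embeddings.

```python
import math


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) on a flat list of floats.

    x_t            : noisy sample at the current timestep
    eps_pred       : predicted noise (hypothetical stand-in for the network output)
    alpha_bar_t    : cumulative noise-schedule product at the current timestep
    alpha_bar_prev : cumulative noise-schedule product at the next (less noisy) timestep
    """
    sqrt_a_t = math.sqrt(alpha_bar_t)
    sqrt_one_minus_a_t = math.sqrt(1.0 - alpha_bar_t)
    sqrt_a_prev = math.sqrt(alpha_bar_prev)
    sqrt_one_minus_a_prev = math.sqrt(1.0 - alpha_bar_prev)

    x_prev = []
    for x, eps in zip(x_t, eps_pred):
        # Estimate the clean sample implied by the current noise prediction.
        x0_hat = (x - sqrt_one_minus_a_t * eps) / sqrt_a_t
        # Re-noise that estimate to the less noisy level, reusing the same noise direction.
        x_prev.append(sqrt_a_prev * x0_hat + sqrt_one_minus_a_prev * eps)
    return x_prev


def ddim_timesteps(num_train_steps, num_inference_steps):
    """Evenly spaced subset of training timesteps, ordered from noisiest to cleanest."""
    stride = num_train_steps // num_inference_steps
    return [i * stride for i in range(num_inference_steps)][::-1]
```

For example, `ddim_timesteps(100, 10)` visits only 10 of 100 hypothetical training timesteps, which is the kind of reduction that makes a DDIM-style denoiser attractive when actions must be produced inside a fixed-rate control loop.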


Results

Task 1: Table Setting

We train RT-Sketch to perform table setting from 60 demonstrations.

Sketch Variations Considered

[GIF: sketch variations considered]

Successful Rollouts

Sketches can easily capture different desired arrangements of utensils, and the
policy is capable of setting the table accordingly in 10/15 trials:


[GIFs: successful table-setting rollouts]
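To put the 10/15 success count in context, the following sketch computes the success rate and a 95% Wilson score interval. The interval is not a result reported on this page; it is a standard way to express the uncertainty of a proportion estimated from a small number of trials, and the function name is hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    )
    return center - half_width, center + half_width


low, high = wilson_interval(10, 15)
print(f"success rate: {10 / 15:.1%}, 95% Wilson interval: [{low:.1%}, {high:.1%}]")
```

With 15 trials the interval is wide (roughly 42% to 85%), so the 10/15 figure is best read as a coarse indication of performance rather than a precise rate.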

Robustness to Distractors

Sketches inherently help the policy attend to task-relevant objects, so the policy is able to complete the task even when the scene changes during execution (people moving, spills, etc.) or distractor objects are present (napkins, etc.).


[GIFs: table-setting rollouts with distractors]


Failures

Occasionally, the policy is imprecise, which can lead to failed grasps, but it typically still makes partial task progress:


[GIFs: table-setting failure cases]


Task 2: Drawer Opening and Closing

We train RT-Sketch to perform drawer opening and closing from 50 demonstrations. Here, instead of sketches of the entire scene, the goals are arrows drawn over the current image. An arrow indicates which drawer (top or bottom) should be opened or closed.
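To make the arrow-based goal format concrete, the sketch below rasterizes an arrow onto a small image grid. Our goal arrows were drawn by hand, so this is only a hypothetical illustration of what "an arrow over the current image" looks like as data: the list-of-rows image, the function names, and the arrowhead geometry are all assumptions.

```python
import math


def draw_line(img, x0, y0, x1, y1, value=1):
    """Draw a line in place on a list-of-rows image using Bresenham's algorithm."""
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        # Skip pixels that fall outside the image bounds.
        if 0 <= y0 < len(img) and 0 <= x0 < len(img[0]):
            img[y0][x0] = value
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy


def draw_arrow(img, x0, y0, x1, y1, head_len=2, value=1):
    """Draw a shaft from (x0, y0) to (x1, y1) plus two short head strokes at the tip."""
    draw_line(img, x0, y0, x1, y1, value)
    angle = math.atan2(y1 - y0, x1 - x0)
    for sign in (1, -1):
        # Each head stroke points back from the tip, 30 degrees off the shaft.
        a = angle + sign * (math.pi - math.pi / 6)
        hx = round(x1 + head_len * math.cos(a))
        hy = round(y1 + head_len * math.sin(a))
        draw_line(img, x1, y1, hx, hy, value)


# Usage: a blank 9x9 "image" with a left-to-right arrow along the middle row.
image = [[0] * 9 for _ in range(9)]
draw_arrow(image, 1, 4, 7, 4)
print("\n".join("".join("#" if px else "." for px in row) for row in image))
```

In a real pipeline the arrow would be overlaid on the camera image rather than a blank grid, so the goal carries both the current scene and the intended motion.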


Successful Rollouts

We see that even from extremely minimal goals (less than 5 seconds to specify),
the policy is able to interpret and act upon the intended goal.


[GIFs: successful drawer opening and closing rollouts]

Failures

The policy does still fail at times; a typical failure mode is an improper grasp of the cabinet handle:


[GIF: failed grasp of the cabinet handle]

Limitations

We acknowledge that handling extremely minimal sketches remains a challenging problem, and we hope to explore this direction further in future work. Ongoing advances in image-to-sketch conversion and more drastic data augmentations can likely help to address this class of sketches. The above policies also generalize only within a relatively small range of the workspace, so we hope to improve the sample efficiency and generalization capabilities of these models in the future.