Beyond Action Residuals:
Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning

*Equal contribution #Corresponding authors
1CDS HKU, 2SQZ, 3SJTU, 4Institute of Automation, CAS, 5IIIS, THU

Overview

ZPRL steers a base policy learned from imitation through a bottleneck latent interface, which makes policy steering efficient and keeps robot actions smooth.

Video

Real-World Tasks

We run ZPRL on four real-world tasks (place orange, flip egg, open box, and insert bills) and on variants of these tasks that test robustness.

Deployment Demos

We present three longer real-world deployment demos that show full task execution in realistic settings: cooking a raw egg before flipping it, opening a box to retrieve the flower inside, and inserting paper bills into a wallet. These demos also appear in the teaser video above.

(a) Flip Egg in a Real Kitchen

(b) Open Box and Retrieve Flower

(c) Insert Bills after Disturbance

Evaluation

We evaluate each of the four tasks over 40 trials. Representative rollouts and training curves below illustrate policy performance.
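
With 40 trials per task, a reported success rate carries noticeable sampling uncertainty. The sketch below shows one common way to attach a 95% interval to such a rate, the Wilson score interval. It is an illustration only: the function name `wilson_interval` and the success count in the usage line are hypothetical, and this page does not state which statistic, if any, the authors report.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical count, not a result from this page: 36 successes in 40 trials.
low, high = wilson_interval(36, 40)
print(f"success rate 36/40, 95% interval: [{low:.3f}, {high:.3f}]")
```

The Wilson interval is used here instead of the simpler normal approximation because it stays inside [0, 1] and behaves sensibly when the success rate is close to 100%, which is common for small trial counts.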

(a) Place Orange

(b) Flip Egg

(c) Open Box

(d) Insert Bills

Training curve for Place Orange
Training curve for Flip Egg
Training curve for Open Box
Training curve for Insert Bills

Robustness Tests

We further evaluate robustness under task-specific perturbations. Each task is paired with two stress tests to reveal whether the policy remains stable under distribution shifts and external disturbances.

Place Orange

Disturbance

Model Variation

Flip Egg

Color Robustness

Disturbance

Open Box

Disturbance

OOD Position

Insert Bills

Distractors

Disturbance

Simulation

Main Results

ZPRL finetunes a base flow policy with high sample efficiency and reaches competitive or superior final performance across eight tasks from three benchmarks. For each task, all base policies are trained on the same mixed-quality offline dataset of 50 or 100 trajectories.

Simulation main results

Ablation

We ablate key design choices of ZPRL, including whether the bottleneck is used, the perturbation scale λ, the dimension of z, and the offline data size.

Simulation ablation results

Takeaways:

  • A compact, task-relevant bottleneck latent matters.
  • λ must be large enough to steer, but small enough to stay local.
  • More offline data gives RL a better latent manifold to exploit.
  • ZPRL improves adaptation while preserving the frozen action prior.
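
The role of λ in these takeaways can be illustrated with a toy sketch. It assumes, and this is an assumption rather than something stated on this page, that steering adds a bounded perturbation to the base latent, `z_steered = z_base + λ · tanh(raw_delta)`, and that actions come from a frozen decoder. The names `steer_latent` and `frozen_decoder`, the linear decoder, and all sizes are hypothetical stand-ins, not the authors' implementation.

```python
import math
import random


def steer_latent(z_base, raw_delta, lam):
    """Shift the base latent by a perturbation bounded elementwise by lam."""
    if len(z_base) != len(raw_delta):
        raise ValueError("z_base and raw_delta must have the same length")
    # tanh squashes each raw output into [-1, 1], so each shift is at most lam.
    return [z + lam * math.tanh(d) for z, d in zip(z_base, raw_delta)]


def frozen_decoder(z, weights):
    """Hypothetical stand-in for a frozen action decoder: a fixed linear map."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]


rng = random.Random(0)
latent_dim, action_dim = 4, 2  # illustrative sizes only
weights = [[rng.uniform(-1, 1) for _ in range(latent_dim)] for _ in range(action_dim)]
z_base = [rng.gauss(0, 1) for _ in range(latent_dim)]     # latent from the base policy
raw_delta = [rng.gauss(0, 3) for _ in range(latent_dim)]  # unbounded RL output

for lam in (0.0, 0.1, 0.5):
    z_steered = steer_latent(z_base, raw_delta, lam)
    max_shift = max(abs(a - b) for a, b in zip(z_steered, z_base))
    action = frozen_decoder(z_steered, weights)
    print(f"lam={lam:.1f}  max latent shift={max_shift:.3f}  "
          f"action={[round(a, 3) for a in action]}")
```

In this toy setup, λ = 0 reproduces the base latent exactly, and the decoder weights never change, so any change in the action comes only from moving the latent by at most λ per dimension.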

What is ZPRL Doing?

We use UMAP to visualize how ZPRL changes decoded features and actions during online RL, and we quantify how larger perturbation scales λ push the latent and the feature further out of distribution.

Simulation UMAP analysis

Takeaways:

  • ZPRL mainly remaps state-action associations instead of inventing new ones.
  • Larger λ produces stronger steering.
  • An overly large λ pushes the latent out of distribution.
  • Good performance comes from strong but still local steering.
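
One simple way to put a number on "out of distribution" is the distance from each perturbed latent to its nearest neighbor among the base policy's latents. The sketch below does this on synthetic Gaussian latents. The metric, the function names `nearest_distance` and `mean_ood_distance`, and the data are all assumptions made for illustration, since this page does not specify how the authors measure the shift.

```python
import math
import random


def nearest_distance(point, reference):
    """Euclidean distance from point to its nearest neighbor in reference."""
    return min(math.dist(point, r) for r in reference)


def mean_ood_distance(latents, directions, lam, reference):
    """Average nearest-neighbor distance after shifting each latent by lam * direction."""
    total = 0.0
    for z, d in zip(latents, directions):
        shifted = [zi + lam * di for zi, di in zip(z, d)]
        total += nearest_distance(shifted, reference)
    return total / len(latents)


def random_unit_vector(rng, dim):
    """Random direction with unit Euclidean norm."""
    v = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


rng = random.Random(0)
dim, n = 3, 200  # illustrative sizes only
# Synthetic stand-in for latents produced by the base policy.
base_latents = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
directions = [random_unit_vector(rng, dim) for _ in range(n)]

for lam in (0.0, 0.5, 1.0, 5.0, 10.0):
    score = mean_ood_distance(base_latents, directions, lam, base_latents)
    print(f"lam={lam:>4}: mean nearest-neighbor distance = {score:.3f}")
```

At λ = 0 every latent coincides with a reference point, so the score is zero. Small λ keeps perturbed latents close to the reference set, while large λ moves them away from every reference point.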

BibTeX

@misc{yu2026zprl,
      title={Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning}, 
      author={Dongjie Yu and Kun Lei and Zhennan Jiang and Jia Pan and Huazhe Xu},
      year={2026},
      eprint={2605.19919},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2605.19919}, 
}