Beyond Action Residuals:
Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning

Dongjie Yu^*,1,2, Kun Lei^*,2,3, Zhennan Jiang⁴, Jia Pan^#,1, Huazhe Xu^#,2,5

^*Equal contribution ^#Corresponding authors

¹CDS HKU, ²SQZ, ³SJTU, ⁴Institute of Automation, CAS, ⁵IIIS, THU

ZPRL steers a base policy learned from imitation through its bottleneck latent interface, enabling efficient policy steering and smooth robot actions.

Video

Real-World Tasks

We run ZPRL on four real-world tasks (place orange, flip egg, open box, and insert bills) and on variants of these tasks for robustness tests.

Deployment Demos

We present three longer real-world deployment demos that show full task execution in realistic settings: cooking a raw egg before flipping it, opening a box to retrieve the flower inside, and inserting paper bills into a wallet. These videos are included in the teaser video above.

(a) Flip Egg in a Real Kitchen

(b) Open Box and Retrieve Flower

Evaluation

We evaluate each of the four tasks over 40 trials and show representative rollouts and training curves below to illustrate policy performance.

(a) Place Orange

(b) Flip Egg

(c) Open Box

(d) Insert Bills

Robustness Tests

We further evaluate robustness under task-specific perturbations. Each task is paired with two stress tests to reveal whether the policy remains stable under distribution shifts and external disturbances.

Place Orange

Disturbance

Model Variation

Flip Egg

Color Robustness

Disturbance

Open Box

Disturbance

OOD Position

Insert Bills

Distractors

Disturbance

Simulation

Main Results

ZPRL fine-tunes a base flow policy with high sample efficiency and reaches competitive or superior final performance across eight tasks from three benchmarks. For each task, all base policies are trained on the same mixed-quality offline dataset of 50 or 100 trajectories.

Ablation

We ablate key design choices of ZPRL, including whether the bottleneck is used, the perturbation scale λ, the dimension of z, and the offline data size.

Takeaway:

A compact, task-relevant bottleneck latent matters.
λ must be large enough to steer, but small enough to stay local.
More offline data gives RL a better latent manifold to exploit.
ZPRL improves adaptation while preserving the frozen action prior.
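
The trade-off on λ can be illustrated with a minimal sketch. It assumes, which this page does not state, that the RL policy emits an unbounded vector that is squashed with tanh and added to the base policy's bottleneck latent z with scale λ. The names `steer_latent` and `max_shift` and all numbers are hypothetical, chosen only for illustration; they are not ZPRL's actual interface.

```python
import math

def steer_latent(z, raw_delta, lam):
    """Assumed steering rule (not confirmed by the source):
    z' = z + lam * tanh(raw_delta), applied elementwise."""
    if len(z) != len(raw_delta):
        raise ValueError("z and raw_delta must have the same dimension")
    return [zi + lam * math.tanh(di) for zi, di in zip(z, raw_delta)]

def max_shift(z, z_steered):
    """Largest per-dimension displacement between the two latents."""
    return max(abs(a - b) for a, b in zip(z, z_steered))

# Hypothetical base latent and hypothetical raw RL output.
z = [0.2, -0.5, 1.0]
raw_delta = [3.0, -0.1, -8.0]

# Because tanh is bounded in (-1, 1), each coordinate moves by less than lam.
for lam in (0.0, 0.1, 1.0):
    steered = steer_latent(z, raw_delta, lam)
    print(lam, round(max_shift(z, steered), 4))
```

Under this assumed rule, λ = 0 recovers the base latent exactly, so the policy cannot steer at all, while a larger λ widens the region the steered latent can reach. The frozen decoder is left out of the sketch because the page does not describe it.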

What is ZPRL Doing?

Using UMAP, we visualize how ZPRL changes decoded features and actions during online RL, and we quantify how larger perturbation scales λ push the latent and the feature further out of distribution.

Takeaway:

ZPRL mainly remaps state-action associations instead of inventing new ones.
Larger λ produces stronger steering.
An overly large λ pushes the latent out of distribution.
Good performance comes from strong but still local steering.
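
One simple way to put a number on "out of distribution" is sketched below. The page does not say which metric the authors use, so the root-mean-square z-score against per-dimension statistics of base-policy latents is an assumption made here for illustration. The function names, the toy latents, and the steering direction are all hypothetical.

```python
import math
import statistics

def fit_latent_stats(latents):
    """Per-dimension mean and population std of a set of base-policy latents."""
    dims = list(zip(*latents))
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) for d in dims]
    return means, stds

def ood_score(z, means, stds, eps=1e-8):
    """Root-mean-square z-score: 0 at the data mean, larger further away."""
    sq = [((zi - m) / (s + eps)) ** 2 for zi, m, s in zip(z, means, stds)]
    return math.sqrt(sum(sq) / len(sq))

# Hypothetical 2-D latents collected from a base policy.
base_latents = [[0.0, 1.0], [1.0, 3.0], [2.0, 2.0], [1.0, 2.0]]
means, stds = fit_latent_stats(base_latents)

# Push a latent from the data mean along a fixed, hypothetical direction.
direction = [1.0, -1.0]
for lam in (0.0, 0.5, 1.0, 2.0):
    z = [m + lam * d for m, d in zip(means, direction)]
    print(lam, round(ood_score(z, means, stds), 3))
```

In this toy setup the score grows with λ, which mirrors the qualitative claim above. A diagonal z-score ignores correlations between latent dimensions, so a full-covariance distance or a density model may be a better fit for real latents.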

BibTeX

@misc{yu2026zprl,
      title={Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning}, 
      author={Dongjie Yu and Kun Lei and Zhennan Jiang and Jia Pan and Huazhe Xu},
      year={2026},
      eprint={2605.19919},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2605.19919}, 
}
