Published as a conference paper at ICLR 2026
Train discrete autoregressive encoders without straight-through gradient estimators! DAPS uses an ESS-based trust region and a weighted maximum-likelihood update, resulting in stable training and high performance on high-dimensional reconstruction tasks.
Discrete VAEs often rely on biased surrogate gradients (e.g., straight-through estimators) or continuous relaxations with temperature-sensitive bias–variance tradeoffs. DAPS replaces these with a policy-search-style update: sample discrete latents, compute advantages, form an optimal KL-regularized target $q^*$, and update the encoder via weighted MLE.
Optimal non-parametric target distribution under a KL trust-region and entropy regularization.
Reconstructions are highly compressed: all methods use the same bottleneck capacity (1.28 KB per image), yielding a compact discrete latent.
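For intuition about that budget (the actual latent layout is not stated here), one hypothetical configuration that lands exactly on 1.28 KB is 1,024 discrete tokens drawn from a 1,024-way codebook:

import math

tokens, codebook_size = 1024, 1024                     # hypothetical layout, not the paper's setting
bits_per_image = tokens * math.log2(codebook_size)     # 10 bits per token -> 10,240 bits
print(bits_per_image / 8 / 1000)                       # 1.28 (KB per image)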
We use DAPS as a compact command space for goal-conditioned robot control. A high-level policy generates discrete latent codes autoregressively from (i) a language prompt and (ii) a desired center-of-mass (COM) velocity/trajectory. A low-level imitation policy then decodes these latents into physically consistent torques in simulation (implemented with LocoMujoco).
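A rough sketch of this hierarchy; all names (`high_level_policy.generate`, `low_level_policy.act`) and the gym-style environment loop are illustrative assumptions, not LocoMujoco's actual API:

def run_goal_conditioned_episode(env, high_level_policy, low_level_policy,
                                 prompt, com_velocity, horizon=1000):
    # High level: autoregressively generate discrete latent codes from the
    # language prompt and the desired center-of-mass velocity/trajectory.
    latent_codes = high_level_policy.generate(prompt=prompt, com_velocity=com_velocity)
    obs = env.reset()
    for _ in range(horizon):
        # Low level: decode the latent command into physically consistent torques.
        torques = low_level_policy.act(obs, latent_codes)
        obs, reward, done, info = env.step(torques)
        if done:
            break
    return obs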
The DAPS update proceeds as follows (a code sketch follows this list):
1. Stochastic rollouts: estimate the value function on-the-fly using K independent latent samples from the encoder policy $q_\theta(z\mid x)$.
2. Scoring: score each sampled latent with the reconstruction log-likelihood, then form (baseline-subtracted) advantages.
3. Closed-form target: solve the optimization problem to get a closed-form target distribution.
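A minimal sketch of these three steps, assuming numpy and placeholder callables `encoder_sample` (draws one latent from $q_\theta(z\mid x)$) and `decoder_loglik` (returns the reconstruction log-likelihood); it also assumes the self-normalized target weights reduce to a softmax of advantages at temperature $\eta$, i.e. no separate entropy term:

import numpy as np

def daps_sample_and_weight(x, encoder_sample, decoder_loglik, K=8, eta=1.0):
    # Step 1: stochastic rollouts -- draw K independent latents z_k ~ q_theta(z|x).
    zs = [encoder_sample(x) for _ in range(K)]
    # Step 2: score each latent with the reconstruction log-likelihood and
    # subtract the mean score as a baseline to form advantages.
    scores = np.array([decoder_loglik(x, z) for z in zs])
    advantages = scores - scores.mean()
    # Step 3: closed-form target -> self-normalized weights over the K samples
    # (softmax of advantages at temperature eta; max-subtraction for stability).
    w = np.exp((advantages - advantages.max()) / eta)
    return zs, w / w.sum()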
Optimal non-parametric target \(q^*\). For each \(x\), we maximize the expected advantage under an entropy regularizer \(\mathcal{H}(q)=-\sum_z q(z\mid x)\log q(z\mid x)\), subject to a KL trust region around \(q_\theta(z\mid x)\) and the normalization constraint on \(q\). Introducing Lagrange multipliers \(\eta\) (trust region) and \(\lambda(x)\) (normalization) and setting \(\partial \mathcal{L}/\partial q=0\) yields an unnormalized exponential-family form for \(q^*(z\mid x)\). Normalizing over \(z\) (using the freedom in \(\lambda(x)\)) then gives the unique distribution satisfying \(\sum_z q^*(z\mid x)=1\); one possible instantiation of the derivation is sketched below.
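For concreteness, the following sketch fills in the steps under assumptions not spelled out on this page: the advantage of a sampled latent is written \(A(x,z)\), the entropy term enters with a coefficient \(\alpha\), and the trust region is a hard KL bound \(\epsilon\) handled by the multiplier \(\eta\).

\begin{align*}
q^{*} \;=\; \arg\max_{q}\;& \sum_z q(z\mid x)\,A(x,z) \;+\; \alpha\,\mathcal{H}(q)
\quad\text{s.t.}\quad \mathrm{KL}\big(q \,\|\, q_\theta\big)\le\epsilon,\;\; \sum_z q(z\mid x)=1,\\
\mathcal{L} \;=\;& \sum_z q\,A \;+\; \alpha\,\mathcal{H}(q)
\;+\; \eta\big(\epsilon-\mathrm{KL}(q\,\|\,q_\theta)\big)
\;+\; \lambda(x)\Big(1-\sum_z q\Big),\\
\frac{\partial\mathcal{L}}{\partial q(z\mid x)} \;=\;& A(x,z)-(\alpha+\eta)\big(\log q(z\mid x)+1\big)+\eta\log q_\theta(z\mid x)-\lambda(x)\;=\;0,\\
\Longrightarrow\quad q^{*}(z\mid x)\;\propto\;& q_\theta(z\mid x)^{\frac{\eta}{\eta+\alpha}}\,
\exp\!\Big(\tfrac{A(x,z)}{\eta+\alpha}\Big),
\end{align*}

with \(\lambda(x)\) absorbed into the normalizing constant so that \(\sum_z q^{*}(z\mid x)=1\); for \(\alpha\to 0\) this reduces to the familiar \(q^{*}\propto q_\theta\exp(A/\eta)\).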
We update the encoder by minimizing $\mathrm{KL}(q^*\,\|\,q_\theta)$ (a weighted MLE objective), update the decoder by maximum likelihood, and adapt $\eta$ via effective sample size (ESS). Here $N$ is the minibatch size and $K$ is the number of latent samples per datapoint.
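A rough illustration of this step, assuming the per-datapoint weights from the sketch above and a simple multiplicative rule that nudges $\eta$ toward a target ESS fraction (the adaptation rule, loss shapes, and array layout are assumptions, not the authors' implementation):

import numpy as np

def effective_sample_size(w):
    # ESS of already-normalized importance weights (row sums to 1): 1 / sum(w^2).
    return 1.0 / np.sum(w ** 2)

def daps_update_losses(weights, logq_theta, logp_decoder, eta,
                       target_ess_frac=0.5, lr_eta=0.05):
    # weights      : (N, K) self-normalized q* weights, one row per datapoint
    # logq_theta   : (N, K) encoder log-probabilities of the sampled latents
    # logp_decoder : (N, K) decoder reconstruction log-likelihoods
    N, K = weights.shape
    # Encoder: weighted MLE, i.e. minimize KL(q* || q_theta) over the samples.
    encoder_loss = -np.mean(np.sum(weights * logq_theta, axis=1))
    # Decoder: plain maximum likelihood on the reconstructions.
    decoder_loss = -np.mean(logp_decoder)
    # Trust region: raise eta (flatter weights) when the average ESS drops
    # below a target fraction of K, lower it otherwise.
    avg_ess = np.mean([effective_sample_size(weights[i]) for i in range(N)])
    eta = eta * np.exp(lr_eta * np.sign(target_ess_frac * K - avg_ess))
    return encoder_loss, decoder_loss, eta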
@inproceedings{drolet2026daps,
title = {Discrete Variational Autoencoding via Policy Search},
author = {Drolet, Michael and Al-Hafez, Firas and Bhatt, Aditya and Peters, Jan and Arenz, Oleg},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2026},
url = {https://www.drolet.io/daps/}
}