About
Hi, I am a Marie Skłodowska-Curie postdoctoral fellow working on reinforcement learning with Prof. Aditya Mahajan at McGill University and Prof. Pierre-Luc Bacon at Mila Québec. I am interested in partial observability, representation learning, world modeling, asymmetric learning, exploration and generalization. I recently obtained my PhD with Prof. Damien Ernst at the University of Liège.
Publications
-
Maximum-Entropy Exploration with Future State-Action Visitation Measures.
Bolland, Lambrechts, Ernst. Reinforcement Learning Conference, August 2026.
Maximum entropy reinforcement learning motivates agents to explore states and actions to maximize the entropy of some distribution, typically by providing additional intrinsic rewards proportional to that entropy function. In this paper, we study intrinsic rewards proportional to the entropy of the discounted distribution of state-action features visited during future time steps. This approach is motivated by two results. First, we show that the expected sum of these intrinsic rewards is a lower bound on the negated entropy of the discounted distribution of state-action features visited in trajectories starting from the initial states, which we relate to a concurrent maximum entropy objective. Second, we show that the distribution used in the intrinsic reward definition is the fixed point of a contraction operator and can therefore be computed off-policy. Experiments highlight that the new objective leads to improved visitation of features within individual trajectories, in exchange for slightly reduced visitation of features in expectation over different trajectories, as suggested by the lower bound. It also leads to improved convergence speed for learning exploration-only agents. Control performance remains similar across most methods on the considered benchmarks.
Article.
-
Informed Asymmetric Actor-Critic: Leveraging Privileged Signals Beyond Full-State Access.
Ebi, Ernst, Böhm, Lambrechts. International Conference on Machine Learning, July 2026.
Asymmetric actor-critic methods are widely used in partially observable reinforcement learning, but typically assume full state observability to condition the critic during training, which is often unrealistic in practice. We introduce the informed asymmetric actor-critic framework, allowing the critic to be conditioned on arbitrary state-dependent privileged signals without requiring access to the full state. We show that any such privileged signal yields unbiased policy gradient estimates, substantially expanding the set of admissible privileged information. This raises the problem of selecting the most adequate privileged information in order to improve learning. For this purpose, we propose two novel informativeness criteria: a dependence-based test that can be applied prior to training, and a criterion based on improvements in value prediction accuracy that can be applied post-hoc. Empirical results on partially observable benchmark tasks and synthetic environments demonstrate that carefully selected privileged signals can match or outperform full-state asymmetric baselines while relying on strictly less state information.
Article.
-
Real-World Reinforcement Learning of Active Perception Behaviors.
Hu, Wang, Yuan, Luo, Li, Lambrechts, Rybkin, Jayaraman. Neural Information Processing Systems, December 2025.
A robot's instantaneous sensory observations do not always reveal task-relevant state information. Under such partial observability, optimal behavior typically involves explicitly acting to gain the missing information. Today's standard robot learning techniques struggle to produce such active perception behaviors. We propose a simple real-world robot learning recipe to efficiently train active perception policies. Our approach, asymmetric advantage weighted regression (AAWR), exploits access to "privileged" extra sensors at training time. The privileged sensors enable training high-quality privileged value functions that aid in estimating the advantage of the target policy. Bootstrapping from a small number of potentially suboptimal demonstrations and an easy-to-obtain coarse policy initialization, AAWR quickly acquires active perception behaviors and boosts task performance. In evaluations on 8 manipulation tasks on 3 robots spanning varying degrees of partial observability, AAWR synthesizes reliable active perception behaviors that outperform all prior approaches. When initialized with a "generalist" robot policy that struggles with active perception tasks, AAWR efficiently generates information-gathering behaviors that allow it to operate under severe partial observability for manipulation tasks.
Article.
Blog Post.
Code.
Poster.
-
Behind the Myth of Exploration in Policy Gradients.
Bolland, Lambrechts, Ernst. European Workshop on Reinforcement Learning, September 2025.
Policy-gradient algorithms are effective reinforcement learning methods for solving control problems with continuous state and action spaces. To compute near-optimal policies, it is essential in practice to include exploration terms in the learning objective. Although the effectiveness of these terms is usually justified by an intrinsic need to explore environments, we propose a novel analysis and distinguish two different implications of these techniques. First, they make it possible to smooth the learning objective and to eliminate local optima while preserving the global maximum. Second, they modify the gradient estimates, increasing the probability that the stochastic parameter update eventually provides an optimal policy. In light of these effects, we discuss and illustrate empirically exploration strategies based on entropy bonuses, highlighting their limitations and opening avenues for future work in the design and analysis of such strategies.
Article.
-
A Theoretical Justification for Asymmetric Actor-Critic Algorithms.
Lambrechts, Ernst, Mahajan. International Conference on Machine Learning, July 2025.
In reinforcement learning for partially observable environments, many successful algorithms were developed under the asymmetric learning paradigm. This paradigm leverages possible additional state information available at training for faster learning. Although the proposed learning objectives are usually theoretically sound, these methods still lack a theoretical justification for their potential benefits. We propose such a justification for asymmetric actor-critic algorithms with linear function approximators by adapting a finite-time convergence analysis to this setting. The resulting finite-time bound reveals that the asymmetric critic eliminates an error term arising from aliasing in the agent state.
Article.
Blog Post.
Poster.
Slides.
-
Reinforcement Learning in Partially Observable Markov Decision Processes.
Lambrechts. University of Liège, April 2025.
Intelligence is usually understood as the ability to make decisions, based on perception, in order to achieve objectives. In other words, intelligence is about perceiving and abstracting past information about the world for then acting on its future execution. This thesis focuses on reinforcement learning in partially observable Markov decision processes for learning intelligent behaviors through interaction. In particular, this manuscript explores and emphasizes the interplay between perception, representations, memory, predictions and decisions. After introducing the theoretical foundations, the core contributions of the thesis are presented across three thematic parts. The first part, “Learning and Remembering,” investigates how learning intelligent behaviors improves memory and vice versa. To begin with, it studies how learning to act optimally results in representations of the perception history that encode the posterior distribution over the states, known as the belief. Next, it studies how long-term memory improves the ability to learn intelligent behaviors, by designing an initialization procedure for recurrent neural networks that endows them with long-term memorization abilities. The second part, “Leveraging Additional Information,” explores how additional information about the world can be used to learn intelligent behaviors faster than when learning from perception only. It starts by empirically showing that world models predicting this additional information provide better history representations and faster learning. Then, it provides a theoretical justification for the improved convergence speed of a particular algorithm that leverages this information, namely the asymmetric actor-critic algorithm. The third part, “Entangling Predictions and Decisions,” proposes several architectural innovations for obtaining world models that efficiently generate trajectories. First, it develops new sequence models that parallelize autoregressive generation, while being implicitly recurrent to allow resuming generation. Afterwards, it elaborates on their use as new world models that are able to generate trajectories in parallel through specific latent policies. Finally, this thesis concludes by summarizing how learning adequate representations of the perception history is paramount to learning to make decisions under partial observability. In the perspective of developing general intelligence, this thesis also motivates the shift from specialized abstractions to generalizable abstractions extending across diverse environments.
Thesis.
Slides.
-
Informed POMDP: Leveraging Additional Information in Model-Based RL.
Lambrechts, Bolland, Ernst. Reinforcement Learning Conference, August 2024.
In this work, we generalize the problem of learning through interaction in a POMDP by accounting for possible additional information available at training time. First, we introduce the informed POMDP, a new learning paradigm offering a clear distinction between the training information and the execution observation. Next, we propose an objective for learning a sufficient statistic from the history for the optimal control that leverages this information. We then show that this informed objective consists of learning an environment model from which we can sample latent trajectories. Finally, we show for the Dreamer algorithm that the convergence speed of the policies is sometimes greatly improved in several environments by using this informed environment model. Those results and the simplicity of the proposed adaptation advocate for a systematic consideration of possible additional information when learning in a POMDP using model-based RL.
Article.
Poster.
Slides.
Code.
-
Parallelizing Autoregressive Generation with Variational State Space Models.
Lambrechts*, Claes*, Geurts, Ernst. ICML Workshop on Sequence Modeling Architectures, July 2024.
Attention-based models such as Transformers and recurrent models like state space models (SSMs) have emerged as successful methods for autoregressive sequence modeling. Although both enable parallel training, neither enables parallel generation due to their autoregressiveness. We propose the variational SSM (VSSM), a variational autoencoder (VAE) where both the encoder and decoder are SSMs. Since sampling the latent variables and decoding them with the SSM can be parallelized, both training and generation can be conducted in parallel. Moreover, the decoder recurrence allows generation to be resumed without reprocessing the whole sequence. Finally, we propose the autoregressive VSSM that can be conditioned on a partial realization of the sequence, as is common in language generation tasks. Interestingly, the autoregressive VSSM still enables parallel generation. We highlight on toy problems (MNIST, CIFAR) the empirical gains in speed-up and show that it competes with traditional models in terms of generation quality (Transformer, Mamba SSM).
Article.
Poster.
-
Warming Up RNNs to Maximize Reachable Multistability Greatly Improves Learning.
Lambrechts*, De Geeter*, Vecoven*, Ernst, Drion. Neural Networks, August 2023.
Training recurrent neural networks is known to be difficult when time dependencies become long. In this work, we show that most standard cells only have one stable equilibrium at initialisation, and that learning on tasks with long time dependencies generally occurs once the number of network stable equilibria increases; a property known as multistability. Multistability is often not easily attained by initially monostable networks, making learning of long time dependencies between inputs and outputs difficult. This insight leads to the design of a novel way to initialise any recurrent cell connectivity through a procedure called “warmup” to improve its capability to learn arbitrarily long time dependencies. This initialisation procedure is designed to maximise network reachable multistability, i.e., the number of equilibria within the network that can be reached through relevant input trajectories, in few gradient steps. We show on several information restitution, sequence classification, and reinforcement learning benchmarks that warming up greatly improves learning speed and performance, for multiple recurrent cells, but sometimes impedes precision. We therefore introduce a double-layer architecture initialised with a partial warmup that is shown to greatly improve learning of long time dependencies while maintaining high levels of precision. This approach provides a general framework for improving learning abilities of any recurrent cell when long time dependencies are present. We also show empirically that other initialisation and pretraining procedures from the literature implicitly foster reachable multistability of recurrent cells.
Article.
Blog Post.
Code.
-
Recurrent Networks, Hidden States and Beliefs in Partially Observable Environments.
Lambrechts, Bolland, Ernst. Transactions on Machine Learning Research, August 2022.
Reinforcement learning aims to learn optimal policies from interaction with environments whose dynamics are unknown. Many methods rely on the approximation of a value function to derive near-optimal policies. In partially observable environments, these functions depend on the complete sequence of observations and past actions, called the history. In this work, we show empirically that recurrent neural networks trained to approximate such value functions internally filter the posterior probability distribution of the current state given the history, called the belief. More precisely, we show that, as a recurrent neural network learns the Q-function, its hidden states become more and more correlated with the beliefs of state variables that are relevant to optimal control. This correlation is measured through their mutual information. In addition, we show that the expected return of an agent increases with the ability of its recurrent architecture to reach a high mutual information between its hidden states and the beliefs. Finally, we show that the mutual information between the hidden states and the beliefs of variables that are irrelevant for optimal control decreases through the learning process. In summary, this work shows that in its hidden states, a recurrent neural network approximating the Q-function of a partially observable environment reproduces a sufficient statistic from the history that is correlated to the relevant part of the belief for taking optimal actions.
Article.
Blog Post.
Poster.
Code.
Talks
-
Uncertainty in Asymmetric Reinforcement Learning.
Mila RL Sofa, April 10th, 2026.
Slides.
-
Partial Observability and Asymmetric Observability.
BeNeRL Workshop, July 4th, 2025.
Slides.
-
Reinforcement Learning in Partially Observable Markov Decision Processes.
PhD Defense, April 7th, 2025.
Slides.
-
A Theoretical Justification for Asymmetric Actor-Critic Algorithms.
Mila RL Sofa, December 20th, 2024.
Slides.
-
Informed POMDP: Leveraging Additional Information in Model-Based RL.
Reinforcement Learning Conference, August 12th, 2024.
Slides.
-
Learning to Remember the Past by Learning to Predict the Future.
VUB Reinforcement Learning Talks, November 17th, 2023.
Slides.
Code
Contact
McConnell Engineering Building
3480 Rue University
Montréal, QC H3A 2A7
Canada