Bisimulation Prioritized Experience Replay:
Enhancing Online Reinforcement Learning through
Behavioral-Based Priorities
Master Thesis 2023-2024

Abstract

Prioritized Experience Replay (PER) is a well-established technique that helps value-based reinforcement learning algorithms cope with non-stationary and correlated data. However, standard prioritization often overlooks the nuanced, task-specific behavior of states, leading to a "task-agnostic" sampling problem. This work introduces a novel non-uniform sampling approach, Bisimulation Prioritized Experience Replay (BPER), which incorporates a surrogate on-policy bisimulation metric into the experience replay prioritization process. This metric lets us measure behavioral similarity between states and diversify the training data, with the aim of improving learning by focusing on behaviorally relevant transitions. Specifically, our method uses the Matching under Independent Couplings (MICo) metric, a more general surrogate metric learned through state abstractions. The proposed method balances conventional TD-error-based prioritization and bisimulation-based prioritization by reweighting priorities with an additional hyperparameter, and it considers two possible strategies for assigning priorities. The method demonstrates superior performance in a 31-state Grid World and shows promising results in classical pixel-based environments. The 31-state Grid World experiments empirically validate the proof of concept by 1) emphasizing behaviorally relevant transitions, thereby avoiding task-agnostic sampling, 2) alleviating outdated priorities, as priorities tend to remain closer to constant, fixed values, and 3) mitigating insufficient coverage of the sample space, thereby increasing data diversity.
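
To make the reweighting idea in the abstract concrete, the following is a minimal sketch and not the thesis implementation. It assumes a convex combination of a TD-error term and a behavioral-distance term, controlled by a hypothetical mixing weight `mu`, followed by the standard PER sampling rule P(i) = p_i^alpha / sum_k p_k^alpha. The function names, the max-normalization of both terms, and the values of `mu`, `alpha`, and `eps` are all assumptions made for illustration; the per-transition behavioral distance is taken as given (for example, a learned MICo distance associated with each stored transition).

```python
import random


def mixed_priorities(td_errors, bisim_distances, mu=0.5, eps=1e-6):
    """Blend TD-error and behavioral-distance priorities (illustrative only).

    mu = 0 uses only the TD error; mu = 1 uses only the behavioral distance.
    Both terms are rescaled by their maximum so that mu acts as a pure
    mixing weight. This normalization is an assumption of the sketch.
    """
    td = [abs(e) for e in td_errors]
    bs = [abs(d) for d in bisim_distances]
    td_max = max(td) + eps
    bs_max = max(bs) + eps
    return [
        (1.0 - mu) * t / td_max + mu * b / bs_max + eps
        for t, b in zip(td, bs)
    ]


def sampling_probabilities(priorities, alpha=0.6):
    """Standard PER rule: P(i) = p_i^alpha / sum_k p_k^alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


# Toy replay buffer with four transitions (made-up numbers).
td_errors = [0.1, 2.0, 0.5, 0.05]
bisim_distances = [0.9, 0.1, 0.4, 0.8]

priorities = mixed_priorities(td_errors, bisim_distances, mu=0.5)
probs = sampling_probabilities(priorities)

# Draw a small batch of transition indices according to the probabilities.
rng = random.Random(0)
batch = rng.choices(range(len(probs)), weights=probs, k=2)
```

With `mu=0.0` the transition with the largest TD error (index 1) receives the highest priority, while with `mu=1.0` the transition with the largest behavioral distance (index 0) does; intermediate values trade off the two signals.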

NOTE: Further content will be added once the thesis is released.

Related links

- The on-policy bisimulation metric was proposed in the paper Scalable methods for computing state similarity in deterministic Markov Decision Processes by Castro (2020).

- The MICo metric was introduced in the paper MICo: Improved representations via sampling-based state similarity for Markov decision processes by Castro et al. (2022).

- An alternative way of computing the MICo metric using kernels is proposed in the paper A Kernel Perspective on Behavioural Metrics for Markov Decision Processes by Castro et al. (2023).

Powered by Jon Barron and Michaël Gharbi.