Bisimulation Prioritized Experience Replay:
Enhancing Online Reinforcement Learning through
Behavioral-Based Priorities
Master Thesis 2023-2024

Abstract


Prioritized Experience Replay (PER) has been an effective solution for value-based reinforcement learning algorithms to efficiently address non-stationary and correlated data. However, standard prioritization often overlooks the nuanced, task-specific behavior of states, leading to a "task-agnostic" sampling problem. This work introduces a novel non-uniform sampling approach, named Bisimulation Prioritized Experience Replay (BPER), which incorporates a surrogate on-policy bisimulation metric into the experience replay prioritization process. This metric measures behavioral similarity between states and diversifies the training data, aiming to enhance learning by focusing on behaviorally relevant transitions. Specifically, our method uses the Matching under Independent Couplings (MICo) metric, a more general surrogate metric learned through state abstractions. The proposed method balances conventional TD-error-based and bisimulation-based prioritization by reweighting priorities with an introduced hyperparameter, and offers two possible strategies for assigning priorities. The method demonstrates superior performance in a 31-state Grid World and shows promising results in classical pixel-based environments. The 31-state Grid World empirically validates the proof of concept: the method 1) emphasizes behaviorally relevant transitions, thereby avoiding task-agnostic sampling, 2) alleviates outdated priorities by exhibiting a stronger tendency toward constant fixed priorities, and 3) mitigates insufficient sample-space coverage by increasing data diversity.

Matching under Independent Couplings (MICo) Metric (Castro et al., 2022)

The present work employs a surrogate on-policy bisimulation metric known as the MICo metric (Castro et al., 2022), which provides a behavioral similarity measure between states. The MICo metric is a more general metric learned through state abstractions, and it works for both deterministic and stochastic MDPs.

Note that the MICo metric is a diffuse metric, which allows zero distances between distinct states and self-distances greater than zero.

This MICo operator is used to learn a parameterized state abstraction \(\phi_\omega : \mathcal{S} \to \mathcal{S}_\phi\) (parameterized by \(\omega\)), which maps a true environment state (e.g., pixels) to a lower-dimensional latent representation. The goal is to position these representations so that a chosen parameterized distance $U_\omega(x,y)$ coincides with the MICo distance in the latent space, where states are organized according to their behavioral similarity.

The parameterized distance function \(U_\omega(x, y)\) is defined as: \begin{equation} U^\pi(x, y) \approx U_\omega(x, y):=\frac{\left\|\phi_\omega(x)\right\|_2^2+\left\|\phi_\omega(y)\right\|_2^2}{2}+\beta \theta\left(\phi_\omega(x), \phi_\omega(y)\right) \end{equation} where the first term ensures positive self-distances, \(\theta\left(\phi_\omega(x), \phi_\omega(y)\right)\) denotes the angle between the two representation vectors, and \(\beta\) is a scalar hyperparameter.
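As a concrete illustration, the sketch below computes \(U_\omega(x, y)\) with NumPy. The linear encoder, the latent dimension, and the value \(\beta = 0.1\) are illustrative assumptions rather than the thesis' actual architecture or hyperparameters; the angular term \(\theta\) is recovered from the cosine similarity of the two representation vectors.

```python
import numpy as np

BETA = 0.1          # assumed value; treat as a tunable hyperparameter
LATENT_DIM = 32     # assumed latent dimension

rng = np.random.default_rng(0)
omega = rng.normal(scale=0.01, size=(LATENT_DIM, 84 * 84))  # toy linear encoder


def phi(x: np.ndarray) -> np.ndarray:
    """Stand-in state abstraction: flatten the observation and project it."""
    return omega @ x.reshape(-1)


def angle(u: np.ndarray, v: np.ndarray) -> float:
    """Angle between two representation vectors (numerically clipped)."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def u_omega(x: np.ndarray, y: np.ndarray) -> float:
    """Parameterized diffuse distance U_omega(x, y)."""
    zx, zy = phi(x), phi(y)
    return 0.5 * (np.dot(zx, zx) + np.dot(zy, zy)) + BETA * angle(zx, zy)


x, y = rng.random((84, 84)), rng.random((84, 84))
print(u_omega(x, y))   # > 0, and u_omega(x, x) > 0 as well (diffuse metric)
```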

Consequently, the recursive nature of the operator \(T_M^\pi(U)(x,y)\) is used to define a loss function, which works similarly to the Bellman recurrence process in DQN. Specifically, the loss function measures the difference between a learning target MICo metric \(T_{\bar{\omega}}^U\) and the online MICo metric \(U_\omega\), as \begin{equation} \label{eq:mico_loss} \mathcal{L}_{\mathrm{MICo}}(\omega)=\mathbb{E}_{\left\langle x, r_x, x^{\prime}\right\rangle,\left\langle y, r_y, y^{\prime}\right\rangle}\left[\left(T_{\bar{\omega}}^U\left(r_x, x^{\prime}, r_y, y^{\prime}\right)-U_\omega(x, y)\right)^2\right] \end{equation} \begin{equation} T_{\bar{\omega}}^U\left(r_x, x^{\prime}, r_y, y^{\prime}\right)=\left|r_x-r_y\right|+\gamma U_{\bar{\omega}}\left(x^{\prime}, y^{\prime}\right) \end{equation} where \(\bar{\omega}\) is a separate copy of the network parameters, synchronized with \(\omega\) at infrequent intervals, and the pairs of transitions \(\left\langle x, r_x, x^{\prime}\right\rangle\) and \(\left\langle y, r_y, y^{\prime}\right\rangle\) are sampled from the experience replay.
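A minimal sketch of this loss, assuming the relevant distances have already been evaluated with the online parameters \(\omega\) and the frozen target copy \(\bar{\omega}\) (function and variable names are illustrative):

```python
import numpy as np

GAMMA = 0.99  # discount factor, as in the DQN agent


def mico_loss(r_x: np.ndarray,
              r_y: np.ndarray,
              u_online_xy: np.ndarray,
              u_target_next: np.ndarray) -> float:
    """Squared error between the MICo learning target and the online distance.

    r_x, r_y       : rewards of the paired transitions <x, r_x, x'> and <y, r_y, y'>
    u_online_xy    : U_omega(x, y) computed with the online parameters omega
    u_target_next  : U_{omega_bar}(x', y') computed with the frozen target copy
    """
    target = np.abs(r_x - r_y) + GAMMA * u_target_next  # no gradient flows through the target
    return float(np.mean((target - u_online_xy) ** 2))


# Toy usage with a batch of 4 transition pairs.
rng = np.random.default_rng(1)
print(mico_loss(rng.random(4), rng.random(4), rng.random(4), rng.random(4)))
```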

MICo learning can be integrated into any value-based agent by learning an estimate $Q_{\xi, \omega}(x, \cdot) = \psi_\xi(\phi_\omega(x))$, where $\phi_\omega(x)$ is the representation of state $x$ and $\psi_\xi$ is the value approximator. This work focuses on the DQN algorithm, where $Q_{\xi, \omega}$ corresponds to the Q-values. In this setting, the temporal-difference loss $\mathcal{L}_{\text{TD}}$ is combined with the MICo loss $\mathcal{L}_{\text{MICo}}$ as \begin{equation} \mathcal{L}_\alpha(\xi, \omega) = (1 - \alpha)\mathcal{L}_{\text{TD}}(\xi, \omega) + \alpha \mathcal{L}_{\text{MICo}}(\omega) \end{equation} where $\alpha \in (0, 1)$.

Although the MICo loss $\mathcal{L}_{\text{MICo}}$ requires pairs of transitions $\left\langle x, r_x, x^{\prime}\right\rangle$ and $\left\langle y, r_y, y^{\prime}\right\rangle$, in practice transitions are not sampled as pairs; we only have access to a mini-batch of unpaired transitions, so a pairing method is needed. To address this, the squarify method (Castro, 2019) constructs pairs by matching each transition in the current mini-batch with every other transition, after which the MICo loss is computed over all resulting pairs.
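A minimal sketch of this pairing step, assuming the mini-batch of latent representations is stacked in a matrix; every transition is matched with every other by broadcasting the batch against itself (the helper name is illustrative):

```python
import numpy as np


def squarify_pairs(batch: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Pair every element of a mini-batch with every other element.

    batch : array of shape (n, d) holding one latent vector per transition.
    Returns two arrays of shape (n * n, d), where row i * n + j of the first
    array is paired with row i * n + j of the second.
    """
    n, _ = batch.shape
    first = np.repeat(batch, n, axis=0)   # x repeated: x1,x1,...,x2,x2,...
    second = np.tile(batch, (n, 1))       # y tiled:    y1,y2,...,y1,y2,...
    return first, second


# Toy usage: a mini-batch of 3 latent states of dimension 4 yields 9 pairs.
reps = np.arange(12, dtype=float).reshape(3, 4)
xs, ys = squarify_pairs(reps)
print(xs.shape, ys.shape)  # (9, 4) (9, 4)
```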

Bisimulation Prioritized Experience Replay (BPER)

The current work explores the behavioral relationships between states stored in the experience replay. Recall that an experience tuple in the replay is defined as \(e_t = (s_t, a_t, r_t, s_{t+1})\). Experiences with a higher MICo on-policy bisimulation distance between \(s_t\) and \(s_{t+1}\) (i.e., more behaviorally dissimilar states) are likely to be more informative (or 'surprising') than transitions between behaviorally similar states. By prioritizing transitions with greater behavioral differences, our method encourages more diversity in the sampling process. Since the MICo metric is approximated online as part of the state abstraction learning process, it suffices to use $U_\omega$ to compute the distance between the current and next states and update the priorities accordingly. We call this the current-next Strategy (BPERcn).
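A minimal sketch of the BPERcn priority signal, assuming the latent representations of the current and next states of the sampled mini-batch are available; the batched distance helper mirrors the \(U_\omega\) sketch above and is illustrative rather than the thesis' actual code:

```python
import numpy as np

BETA = 0.1  # assumed angular weight, as in the U_omega sketch above


def u_omega_batch(z_cur: np.ndarray, z_next: np.ndarray) -> np.ndarray:
    """Row-wise diffuse distance between current and next latent states."""
    sq_norms = 0.5 * ((z_cur ** 2).sum(axis=1) + (z_next ** 2).sum(axis=1))
    cos = (z_cur * z_next).sum(axis=1) / (
        np.linalg.norm(z_cur, axis=1) * np.linalg.norm(z_next, axis=1) + 1e-8)
    return sq_norms + BETA * np.arccos(np.clip(cos, -1.0, 1.0))


# BPERcn: one priority per sampled transition, larger for behaviorally
# dissimilar (s_t, s_{t+1}) pairs.
rng = np.random.default_rng(2)
z_cur, z_next = rng.random((32, 8)), rng.random((32, 8))
priorities = u_omega_batch(z_cur, z_next)
print(priorities.shape)  # (32,)
```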


The current-next Strategy faces challenges because, in practice, trajectories may overlap significantly, resulting in overlapping state distributions and, consequently, scarce high-priority transitions. An empirical solution is proposed to mitigate this issue and improve prioritization effectiveness. In this all-vs-all Strategy (BPERaa), a relative behavioral distance to the current mini-batch is computed by averaging the distance between each current state and all current states in the mini-batch, computed as \begin{equation} U^B_\omega(s_i) = \frac{1}{|B|}\sum_{k=1}^{|B|} U_\omega(s_i, s_k), \quad \forall e_k = (s_k, a_k, r_k, s_{k+1}) \in B \end{equation} where $B$ corresponds to the current mini-batch.
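A minimal sketch of the all-vs-all distance, assuming the current states' latent representations are stacked in a matrix; the row-wise average follows the equation above, and the helper names are illustrative:

```python
import numpy as np

BETA = 0.1  # assumed angular weight


def pairwise_u_omega(z: np.ndarray) -> np.ndarray:
    """Full |B| x |B| matrix of diffuse distances between latent states."""
    sq = (z ** 2).sum(axis=1)
    sq_norms = 0.5 * (sq[:, None] + sq[None, :])
    norms = np.linalg.norm(z, axis=1)
    cos = (z @ z.T) / (norms[:, None] * norms[None, :] + 1e-8)
    return sq_norms + BETA * np.arccos(np.clip(cos, -1.0, 1.0))


def bper_aa_priorities(z_cur: np.ndarray) -> np.ndarray:
    """BPERaa: average distance of each current state to the whole mini-batch."""
    return pairwise_u_omega(z_cur).mean(axis=1)


rng = np.random.default_rng(3)
z_cur = rng.random((32, 8))
print(bper_aa_priorities(z_cur).shape)  # (32,)
```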

Algorithm 3 presents the pseudo-code for the full prioritization process. As in PER, priorities are only updated for the currently sampled mini-batch to maintain computational efficiency, and alpha-weighted sampling probabilities and importance-sampling weights are used in the same way. Note that the algorithm depicts the procedure for the current-next Strategy (BPERcn); the all-vs-all Strategy (BPERaa) requires only a minor replacement in line 19, where \(U_\omega(s_i, s_{i+1})\) is replaced by the relative distance \(U^B_\omega(s_i)\).
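To make the update step concrete, the sketch below combines the absolute TD error and the bisimulation distance into a single priority and computes PER-style importance-sampling weights. The convex combination with the priority weight \(\mu\), as well as the exponent values, are assumptions for illustration; the thesis may normalize or mix these terms differently.

```python
import numpy as np

MU = 0.5         # priority weight between TD-error and bisimulation terms (assumed form)
ALPHA_PER = 0.6  # PER exponent on priorities
BETA_IS = 0.4    # importance-sampling exponent
EPS = 1e-6       # keeps priorities strictly positive


def bper_update(td_errors: np.ndarray, bisim_dists: np.ndarray,
                replay_priorities: np.ndarray, idx: np.ndarray) -> np.ndarray:
    """Update priorities of the sampled mini-batch and return its IS weights.

    Assumes a convex combination of |TD error| and the bisimulation distance;
    replay_priorities is the full priority array of the buffer, idx the
    indices of the sampled transitions.
    """
    new_p = (1.0 - MU) * np.abs(td_errors) + MU * bisim_dists + EPS
    replay_priorities[idx] = new_p

    probs = replay_priorities ** ALPHA_PER
    probs /= probs.sum()
    n = len(replay_priorities)
    weights = (n * probs[idx]) ** (-BETA_IS)
    return weights / weights.max()   # normalized IS weights for the loss


# Toy usage with a replay of 100 transitions and a mini-batch of 4.
rng = np.random.default_rng(4)
prios = np.full(100, 1.0)
idx = np.array([3, 17, 42, 99])
print(bper_update(rng.normal(size=4), rng.random(4), prios, idx))
```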


Results

A 31-state simple GridWorld was used as a proof of concept to evaluate three key problems in non-uniform sampling methods: task-agnostic sampling, outdated priorities, and state space coverage. After that, a general evaluation was performed on classical environments.

Episode Reward. The following plot illustrates (a) the episode reward over time, smoothed with a moving window of 100 episodes and averaged across 5 independent runs, and (b) the cumulative reward over time, averaged over 5 independent runs, with shaded regions representing the variability of each method.


Task-agnostic Sampling. The following plot illustrates the evolution of the priority distribution in the experience replay over time, with the final distribution (at 100k steps) highlighted at the top. The results show that both proposed strategies, BPERcn and BPERaa, consistently produce higher priority values over time, encouraging the sampling of more behaviorally dissimilar transitions, whereas PER accumulates low priority values over time. Notably, BPERaa exhibits slightly greater variability in the priority values.


Outdated Priorities. Recall that outdated priorities occur because priority updates are performed only on the current mini-batch, rather than across all possible transitions under the current policy.

The following plots show the distances between the sampling distribution $p_i$ for PER and the BPER variants and the ideal distribution $p^\ast_i$, obtained when all possible transitions under the current policy are updated at every time step. Two weighting schemes, on-policy and uniform weighting, were used, following the previous work of Pan et al. (2022). Although our method does not yield distances close to zero, which would indicate a perfect match between the distributions, it reduces the gap between them without relying on model-based techniques.


Space Coverage. Recall that the state space coverage problem arises from insufficient exploration, where the experience replay captures only a limited number of possible states, leading to suboptimal learning outcomes.

The figure below shows the state visitation distribution at the 90k time step in the 31-state Grid World, where cell values correspond to the visit count per state. The results indicate that BPERcn and BPERaa visit more states more frequently than the other methods, with BPERcn performing slightly better than BPERaa. While some of the increased exploration in BPERcn and BPERaa may be attributed to the MICo learning, the bisimulation prioritization further enhances exploration (relative to DQN + MICo) by significantly increasing the state visit counts, reaching values around 600.


Classical Environments. The figures below illustrate the episode reward computed over a moving-average window of 100 episodes in four different pixel-based environments, averaged over 5 independent runs. The priority-weight hyperparameter $\mu$ was set to the best values found. In these results, the bisimulation strategies outperform the direct baseline DQN + MICo in most of the environments, even though they do not surpass DQN or DQN + PER.


Related links

- The on-policy bisimulation metric was proposed in the paper Scalable methods for computing state similarity in deterministic Markov Decision Processes by Castro (2020).

- The MICo paper is MICo: Improved representations via sampling-based state similarity for Markov decision processes by Castro et al. (2022).

- A new alternative for calculating the MICo metric using kernels is proposed in A Kernel Perspective on Behavioural Metrics for Markov Decision Processes by Castro et al. (2023).

Powered by Jon Barron and Michaël Gharbi.