by Glen Berseth
Université de Montréal, Mila Québec AI Institute, and CIFAR.
Given the scaling of reinforcement learning to larger networks and its use by OpenAI, Google, and others to fine-tune their large models, leading to impressive results at the International Mathematical Olympiad (DeepMind, OpenAI), how much further can we improve deep reinforcement learning? A natural concern is that even if the community creates better exploration algorithms or objectives, the improved experience they generate may fall on the deaf ears of optimization difficulties.
Many papers have demonstrated that training a deep learning policy under a changing state distribution (non-IID) results in suboptimal performance [1,2,3]. However, at a macro scale, it is not completely clear what causes these issues. Do the network and regularization changes from recent work improve exploration or exploitation, and which of the two is the larger concern to address to advance deep RL algorithms? If there were a way to estimate the difference between the value of the learned policy, $V^{\pi^{\theta}}(s_0)$ (avg), and the value of the best experience it has ever generated, $V^{\hat{\pi}^{*}}(s_0)$ (replay), it would be possible to determine whether the sub-optimality comes from poor exploitation, i.e., $V^{\hat{\pi}^{*}}(s_0) \gg V^{\pi^{\theta}}(s_0)$. We can compute $V^{\hat{\pi}^{*}}(s_0)$ as the return of the best trajectory the agent ever generated during learning. Below we show an example of this concept on the left and a result on the MinAtar Space Invaders environment using the DQN algorithm on the right, where we compute the policy for $V^{\hat{\pi}^{*}}(s_0)$ as:
$$ \hat{\pi}^{*} = \operatorname*{argmax}_{\langle a_0, \ldots, a_T \rangle \in D^{\infty}} \sum_{t=0}^{T} r(s_t, a_t) $$
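As a concrete illustration, the sketch below (Python, with a hypothetical bookkeeping format for episodes) computes this best-experience return by scanning every trajectory the agent has generated so far:

```python
import numpy as np

def best_experience_return(episode_rewards):
    """Estimate V^{pi-hat*}(s_0) as the return of the best trajectory
    the agent has generated so far.

    episode_rewards: list of per-episode reward sequences, e.g.
    [[r_0, r_1, ..., r_T], ...], accumulated over the whole run.
    Returns the maximum (undiscounted) episode return.
    """
    returns = [float(np.sum(r)) for r in episode_rewards]
    return max(returns)


# Hypothetical usage: three episodes collected during training.
history = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 0.0, 2.0]]
print(best_experience_return(history))  # 3.0
```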
Figure: example exploitation sub-optimality difference (left); DQN on deterministic MinAtar Space Invaders (right).
When comparing the performance of $V^{\hat{\pi}^{*}}(s_0)$ to that of a deterministic version of the learned policy, $V^{\pi^{\theta}}(s_0)$ (deterministic), in a deterministic version of MinAtar Space Invaders, we can see a large gap in performance.
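For reference, a deterministic evaluation of a DQN policy simply takes the greedy action under the Q-network at every step. The sketch below assumes standard PyTorch and Gymnasium-style interfaces for the network and environment; it is only an illustration of the evaluation protocol:

```python
import torch

@torch.no_grad()
def evaluate_greedy(q_network, env, num_episodes=10):
    """Average return of the greedy (deterministic) policy induced by a Q-network.

    q_network maps an observation batch to Q-values; env follows the
    Gymnasium reset()/step() API. Both interfaces are assumptions here.
    """
    returns = []
    for _ in range(num_episodes):
        obs, _ = env.reset()
        done, ep_return = False, 0.0
        while not done:
            q_values = q_network(torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0))
            action = int(q_values.argmax(dim=-1).item())  # greedy: no epsilon noise
            obs, reward, terminated, truncated, _ = env.step(action)
            ep_return += float(reward)
            done = terminated or truncated
        returns.append(ep_return)
    return sum(returns) / len(returns)
```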
We construct several other estimators of optimal-policy performance from the generated experience, among them a best estimator and a recent estimator.
The best and recent estimators are more practical and less susceptible to issues when the environment is stochastic. The recent estimator is particularly interesting because we would expect an algorithm like DQN, which still has access to that better experience in its replay buffer, to learn from it well, but as we see here and later, that is not the case.
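One plausible way to implement such estimators (the exact definitions used in the plots may differ) is to track a running maximum for the best estimator and a maximum over a sliding window of recent episodes for the recent estimator:

```python
from collections import deque

class ExperienceReturnEstimators:
    """Illustrative 'best' and 'recent' estimators of attainable return.

    best:   highest episode return observed so far during training.
    recent: highest episode return within the last `window` episodes.
    The window size is a hypothetical choice.
    """

    def __init__(self, window=100):
        self.best = float("-inf")
        self._recent = deque(maxlen=window)

    def update(self, episode_return):
        self.best = max(self.best, episode_return)
        self._recent.append(episode_return)

    @property
    def recent(self):
        return max(self._recent) if self._recent else float("-inf")
```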
The large gap between the best generated experience and the learned policy holds across many difficult environments. We can also see in the environments below that the gap does not shrink with increased training time, indicating that the lack of improved performance stems from deeper optimization issues rather than a lack of good experience. In HalfCheetah there is essentially no gap, which is consistent with it being a task that most algorithms can now solve.
Figure: exploitation sub-optimality gap over training for PPO on HalfCheetah, PPO on Montezuma's Revenge, DQN on MinAtar Breakout, DQN on NameThisGame, DQN on Atari BattleZone, and DQN on Atari Asterix.
The short answer is yes. If we add RND (Random Network Distillation) as an exploration bonus, the difference $V^{\hat{\pi}^{*}_{D}}(s_0) - V^{\pi^{\theta}}(s_0)$ increases. This difference is exactly what is plotted in the next results. If this difference is increasing, the quality of the generated experience is outpacing the model's ability to learn from that experience. This is an unfortunate situation: it indicates that if we create new exploration objectives, the potential improvements are likely to be lost to these exploitation issues.
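For context, RND (Burda et al., 2018) derives an intrinsic reward from the error of a trained predictor network against a fixed, randomly initialized target network, so rarely visited states receive a larger bonus. A minimal PyTorch-style sketch (layer sizes and the bonus form are arbitrary choices here, not the exact setup used in these experiments) is:

```python
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    """Random Network Distillation intrinsic reward (sketch).

    A predictor network is trained to match a frozen, randomly initialized
    target network; the per-state prediction error is used as an exploration
    bonus, which is large for states the agent has rarely visited.
    """

    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)  # the target stays fixed forever

    def bonus(self, obs):
        # Intrinsic reward: mean squared prediction error per state.
        return (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)

    def loss(self, obs):
        # Train only the predictor to match the frozen target on visited states.
        return self.bonus(obs).mean()
```

The reward used for learning is then the environment reward plus a scaled version of this bonus, so the agent is pushed toward novel states while the underlying RL algorithm is unchanged.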