James B. Holliday and T.H. Ngan Le, University of Arkansas, USA
Combining value-based and policy-gradient methods, Asynchronous Advantage Actor Critic (A3C) by Google’s DeepMind has successfully optimized deep neural network controllers using multiple asynchronous agents. In this work we propose a novel exploration strategy, which we call “Follow then Forage Exploration” (FFE), that aims to train A3C more effectively. Unlike the original A3C, where agents rely only on entropy to improve exploration, FFE allows agents to break away from A3C's normal action selection, which we call "following", and to "forage", which means to explore randomly. The central idea behind FFE is that forcing random exploration at the right time during a training episode can lead to improved training performance. To evaluate FFE, we used the A3C implementation in OpenAI’s Universe-Starter-Agent as the baseline. The experimental results show that FFE converges faster than this baseline.
Reinforcement Learning, Multi Agents, Exploration, Asynchronous Advantage Actor Critic, Follow Then Forage.
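The distinction between "following" and "foraging" can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this work: the function names, the fixed-window trigger, and its parameters (`forage_start`, `forage_length`) are hypothetical, since the abstract does not specify when an agent should switch to foraging. The sketch only shows the two action-selection modes: sampling from the policy's action distribution versus choosing an action uniformly at random.

```python
import random


def select_action(policy_probs, foraging, rng=random):
    """Return an action index.

    Following: sample from the policy's action distribution,
    as A3C normally does.
    Foraging: ignore the policy and pick an action uniformly at random.
    """
    n_actions = len(policy_probs)
    if foraging:
        return rng.randrange(n_actions)
    return rng.choices(range(n_actions), weights=policy_probs, k=1)[0]


def should_forage(step, forage_start, forage_length):
    """Hypothetical trigger (an assumption, not taken from the paper):
    forage during one fixed window of steps inside the episode."""
    return forage_start <= step < forage_start + forage_length


if __name__ == "__main__":
    rng = random.Random(0)
    policy_probs = [0.7, 0.2, 0.1]  # made-up policy output for three actions
    for step in range(10):
        foraging = should_forage(step, forage_start=4, forage_length=3)
        action = select_action(policy_probs, foraging, rng)
        mode = "forage" if foraging else "follow"
        print(step, mode, action)
```

In this sketch the agent follows the policy for steps 0 to 3, forages for steps 4 to 6, and then resumes following. Any other rule for deciding "the right time" could be substituted for `should_forage` without changing `select_action`.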