Reinforcement Learning (RL) is a promising approach to solving complex real world tasks with physical robots, supported by recent successes, e.g. in grasping and object manipulation. In RL, a decision-making *agent* interacting with the world discovers new behaviours by trial and error, sometimes exploring new ways to do things, and sometimes exploiting what it has already found to work well. Efficient exploration of alternative behaviours is the key to reinforcement learning.

How do reinforcement learning agents explore new behaviours? Typically, exploration is implemented by adding random noise to the agent's prediction of the best action in a given state. Some tasks involve choosing the right action from a finite set (like playing chess or Go). Other tasks, especially in robotics, involve choosing actions that correspond to motor torques, arm joint velocities, and so on, which are best described with real numbers. Exploring new actions in these so-called *continuous* action spaces is typically done by adding independent Gaussian noise to the agent's best action predictions. However, independent noise results in temporally inconsistent actions. This poses two major challenges for real world tasks. First, such noise does not explore the environment effectively and becomes even less efficient as the action rate increases. In practical applications a high action rate is desirable, as it allows for short reaction times. Second, temporally incoherent exploration results in jerky movement on physical robots, leading to safety concerns and possible hardware damage.

To alleviate these challenges, we introduce *autoregressive policies* (ARPs) for continuous control deep reinforcement learning (see the paper on arXiv and the code on GitHub). ARPs are *history-dependent* policies with a special structure so that they produce action samples according to *stationary autoregressive (AR) stochastic processes*. AR processes represent sequences of random variables that are temporally coherent with each other through a simple linear relationship, which makes them an attractive model for exploration in real world learning tasks. We propose a principled way to incorporate these processes into policies parameterized by deep neural networks. The resulting policies can be used with existing learning algorithms, such as PPO or TRPO, and exhibit superior exploration and learning, especially in sparse reward tasks. By adjusting the parameters of the underlying AR process, ARPs can maintain their performance at higher action rates. In real world robotic experiments we demonstrate that ARPs result in smoother and safer motion compared to conventional Gaussian policies.

# Autoregressive Processes

We derive a general form of AR processes of an arbitrary order with two desirable features: the marginal distribution of process observations at each step is standard normal, and the degree of temporal coherence between subsequent observations is continuously adjustable, from white noise on one extreme to a constant function on the other, with a single scalar smoothing parameter. The figure below shows sample trajectories of several such processes of different order (p) and smoothing parameter (α) values.

The trajectories in the left-most plot do not represent useful and safe behaviour on a physical robot. The trajectories in the two right-most plots, on the other hand, look much more like useful exploratory behaviour in the real world. We would like to build policies that produce exploration trajectories like those.
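To make the two features concrete, here is a minimal sketch of the simplest case only: a first-order (p = 1) process of the form x_t = α x_{t-1} + sqrt(1 - α²) ε_t, with ε_t drawn from a standard normal. This is the textbook stationary AR(1) construction, not the general order-p form derived in the paper. The function names `ar1_trajectory` and `lag1_autocorrelation` are made up for this illustration.

```python
import math
import random


def ar1_trajectory(alpha, n_steps, rng):
    """Sample a stationary AR(1) trajectory whose marginals are N(0, 1).

    alpha = 0 gives white noise; alpha close to 1 gives a nearly
    constant trajectory. (Illustrative helper, first-order case only.)
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    # Scaling the fresh noise by sqrt(1 - alpha^2) keeps the variance at 1:
    # alpha^2 * 1 + (1 - alpha^2) * 1 = 1.
    scale = math.sqrt(1.0 - alpha ** 2)
    x = rng.gauss(0.0, 1.0)  # start in the stationary distribution
    xs = [x]
    for _ in range(n_steps - 1):
        x = alpha * x + scale * rng.gauss(0.0, 1.0)
        xs.append(x)
    return xs


def lag1_autocorrelation(xs):
    """Empirical correlation between consecutive samples."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    return cov / var


if __name__ == "__main__":
    rng = random.Random(0)
    for alpha in (0.0, 0.5, 0.9):
        xs = ar1_trajectory(alpha, 20000, rng)
        print(alpha, round(lag1_autocorrelation(xs), 2))
```

In this first-order case the smoothing parameter is exactly the correlation between consecutive samples, so the printed estimates should land close to the chosen α values.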

# Autoregressive Policies

To design a policy that produces samples according to realizations of an AR process, we require a fixed-size history of recent actions and observations, where the size of the history is equal to the process order (p). The AR policies that we propose therefore lie in the space of history-dependent policies. A classical result in the theory of Markov Decision Processes (MDPs) states that for any history-dependent policy there exists an equivalent Markov stochastic policy with identical expected returns from each state. This means that there exist Markov policies equivalent to ARPs. Why bother with history-dependent policies, then, instead of performing a policy search directly in the set of Markov policies? Unfortunately, we do not know how to constrain Markov policies to temporally consistent ones with a tractable equation that we could use in learning algorithms. The conventional parametrization of a Gaussian mean and variance with neural networks is not restrictive enough and allows for policies with temporally inconsistent behaviour. We are, however, able to write down a history-dependent policy formulation that produces temporally coherent behaviour right from the start, with randomly initialized neural networks.
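The role of the history can be sketched for the first-order case. The code below is an illustration under simplifying assumptions, not the formulation from the paper: actions are scalar, the standard deviation `sigma` is fixed, and `mean_fn` is a hypothetical stand-in for the neural network that predicts the mean action. The policy keeps the previous observation and action, recovers the previous noise value from them, and mixes it with fresh Gaussian noise as in a stationary AR(1) process.

```python
import math
import random


class AR1Policy:
    """History-dependent policy whose exploration noise follows AR(1).

    Hypothetical illustration: the class name, `mean_fn`, and the fixed
    `sigma` are assumptions made for this sketch.
    """

    def __init__(self, mean_fn, sigma, alpha, seed=0):
        self.mean_fn = mean_fn  # stand-in for a neural network
        self.sigma = sigma
        self.alpha = alpha
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Clear the one-step history at the start of an episode."""
        self.prev_obs = None
        self.prev_action = None

    def act(self, obs):
        eps = self.rng.gauss(0.0, 1.0)
        if self.prev_action is None:
            noise = eps  # no history yet: plain Gaussian sample
        else:
            # Recover the previous noise from the stored history.
            prev_noise = (self.prev_action - self.mean_fn(self.prev_obs)) / self.sigma
            noise = self.alpha * prev_noise + math.sqrt(1.0 - self.alpha ** 2) * eps
        action = self.mean_fn(obs) + self.sigma * noise
        self.prev_obs, self.prev_action = obs, action
        return action


def rollout(policy, n_steps):
    """Query the policy repeatedly at a fixed observation of 0.0."""
    policy.reset()
    return [policy.act(0.0) for _ in range(n_steps)]


def mean_abs_step(actions):
    """Average absolute change between consecutive actions (a jerkiness proxy)."""
    return sum(abs(b - a) for a, b in zip(actions, actions[1:])) / (len(actions) - 1)


if __name__ == "__main__":
    zero_mean = lambda obs: 0.0
    jerky = rollout(AR1Policy(zero_mean, sigma=1.0, alpha=0.0), 2000)
    smooth = rollout(AR1Policy(zero_mean, sigma=1.0, alpha=0.95), 2000)
    print(mean_abs_step(jerky), mean_abs_step(smooth))
```

Setting `alpha=0.0` reduces the sketch to a conventional Gaussian policy, which makes the two easy to compare with the same mean function. The sketch is coherent even with an arbitrary `mean_fn` because the temporal structure lives in the noise term rather than in the learned parameters.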

# Benefits of consistent exploration

To highlight the need for temporally consistent exploration, we designed a toy continuous control problem, where the agent's objective is to reach a randomly placed red target by controlling a blue dot with continuous velocity control.

We compared a conventional Gaussian random agent to an ARP random agent on this problem. Because it fails to explore the environment well, the Gaussian random agent finds the target less frequently, and its performance drops further as the action rate increases. By adjusting the parameters of the underlying AR process, on the other hand, ARPs are able to maintain stable exploration performance over a wide range of action rates.
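The effect of the action rate can be illustrated with a rough simulation. This is not a reproduction of the experiment above. It assumes a 2D dot with velocity control and no target, uses the distance travelled from the start as a crude proxy for exploration, and relies on a first-order AR process. The rule `alpha = exp(-dt / tau)`, which keeps the noise correlation time `tau` fixed as the time step `dt` shrinks, is one assumed way of adjusting the process to the action rate and is not taken from the paper.

```python
import math
import random


def final_distance(alpha, dt, duration, rng):
    """Distance from the start after `duration` seconds of random velocity control."""
    n_steps = int(round(duration / dt))
    scale = math.sqrt(1.0 - alpha ** 2)
    vx, vy = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x = y = 0.0
    for _ in range(n_steps):
        x += vx * dt
        y += vy * dt
        # alpha = 0 resamples the velocity independently at every step.
        vx = alpha * vx + scale * rng.gauss(0.0, 1.0)
        vy = alpha * vy + scale * rng.gauss(0.0, 1.0)
    return math.hypot(x, y)


def mean_distance(alpha, dt, duration=10.0, episodes=200, seed=0):
    """Average final distance over several episodes."""
    rng = random.Random(seed)
    total = sum(final_distance(alpha, dt, duration, rng) for _ in range(episodes))
    return total / episodes


if __name__ == "__main__":
    tau = 1.0  # assumed noise correlation time, in seconds
    for dt in (0.1, 0.01):  # 10 Hz and 100 Hz action rates
        gaussian = mean_distance(0.0, dt)
        arp = mean_distance(math.exp(-dt / tau), dt)
        print(f"dt={dt}: gaussian={gaussian:.2f}, ar1={arp:.2f}")
```

With independent noise the displacement variance scales with `dt`, so the Gaussian agent covers less ground as the action rate goes up. In this sketch the AR(1) agent covers roughly the same distance at both rates.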

On a physical robot, temporally coherent exploration is necessary, as inconsistent behaviour causes jerky motion that can lead to hardware damage and precludes running the robot at high speeds. We ran a set of tests on a UR5 robot with a sparse reward Reacher task designed in SenseAct. ARPs produced superior exploration performance while exhibiting smoother motion compared to a conventional Gaussian policy.

# Learning experiments

We performed a set of learning experiments in both simulated and real world robotic tasks and compared ARP and Gaussian policies with identical neural network architectures. We found that ARPs exhibited similar or slightly better learning performance in dense reward tasks, and substantially better learning performance in sparse reward tasks.