Introducing a framework and benchmark tasks for reproducible reinforcement learning research on real robots

Kindred’s mission is to build human-like intelligence in machines that solve real-world problems, and we believe that advances in reinforcement learning (RL) are key to our success. We are excited to see recent advances in RL in simulations of continuous control problems. These advances have been supported by open-source agents and benchmark tasks in simulation that allow researchers to reproduce, analyze, and quickly build upon each other’s results. Unfortunately, no such benchmarks exist for research on physical control tasks; consequently, research on real-world learning has lagged behind.

To stimulate research in learning on physical robots, we are pleased to introduce six reinforcement learning benchmark tasks based on three commercially available robots. Most of these tasks require no additional hardware installation apart from the basic robot setup. These tasks are developed in SenseAct, a new open-source framework for implementing real-time reinforcement learning tasks. We furthermore provide benchmarking results from our evaluation of several state-of-the-art learning algorithms for continuous control on these tasks (PPO, TRPO, Soft-Q, and DDPG -- see our paper on arXiv). We hope these tasks and benchmark results spur further work in RL for the real world -- both in terms of improved learning algorithms and more complex and demanding tasks.

Six Benchmark Tasks for Robots

Our benchmark tasks make use of three commercially available robots at a range of price points. The UR series, produced by Universal Robots, are precise and sensitive collaborative industrial arms with six degrees of freedom. The Dynamixel (DXL) series of programmable direct-current actuators, manufactured by Robotis, are a popular component for enthusiast-oriented robots such as grippers, arms, mobile platforms, and even full humanoids. The iRobot Create 2 is a research-oriented version of iRobot’s popular Roomba vacuum robot -- it includes range and bump sensors and wheel control, but no vacuum cleaner.


Tasks for Universal Robots UR

We use the Reacher task with the UR arm developed in our previous work, which is designed analogously to the OpenAI Gym Reacher. We develop two versions of this task, with two and six controlled degrees of freedom, respectively.


Task: UR-Reacher-2

The UR-Reacher-2 task is to move a UR arm end-effector to target locations on a two-dimensional plane by controlling only two joints.


Task: UR-Reacher-6

The UR-Reacher-6 task is to move a UR arm end-effector to target locations within a three-dimensional box-shaped space by controlling all six joints.
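Following the OpenAI Gym Reacher design, a reward of this kind is typically the negative distance from the end-effector to the target, minus a small action penalty. The sketch below is illustrative only -- the function name and coefficients are our own placeholders, not the exact reward used in the benchmark:

```python
import numpy as np

def reacher_reward(fingertip_pos, target_pos, action, action_penalty=0.001):
    """Illustrative Gym-Reacher-style reward: negative Euclidean distance
    from the end-effector to the target, plus a small quadratic penalty
    that discourages large actions. Coefficients are placeholders."""
    dist = np.linalg.norm(np.asarray(fingertip_pos) - np.asarray(target_pos))
    return -dist - action_penalty * float(np.square(action).sum())
```

The action penalty matters more on a physical arm than in simulation: it discourages aggressive motions that stress the hardware.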


Tasks for Robotis Dynamixel

We designed two simple control tasks for Robotis Dynamixel actuators based on the current-control mode: reach a stationary target (DXL-Reacher), and track a moving target (DXL-Tracker).


Task: DXL-Reacher

The DXL-Reacher task objective is to move the DXL servo joint to any given target position. This is one of the easiest and most accessible tasks in the suite, and a good place to experiment with different action representations and cycle times. A successful behaviour can be learned in about half an hour.


Task: DXL-Tracker

The DXL-Tracker task objective is to control the DXL servo joint to track a moving target. This task is a good entry point for anyone interested in extending SenseAct to implement different tasks based on Dynamixel servos.


Tasks for iRobot Create 2

We develop two tasks with Create 2 that explore learning locomotion and navigation: Create2-Mover and Create2-Docker.


Task: Create2-Mover

In Create2-Mover, the objective is to move the robot forward as fast as possible within an enclosed arena by controlling its two wheels. In order to succeed in this task, the robot needs to turn as it approaches the walls of the arena so that it can continue forward movement in a new direction.


Task: Create2-Docker

In Create2-Docker, the objective is to dock to a charging station attached to the middle of one of the wider walls of the arena. The reward function combines a large positive bonus for successful docking with a penalty for bumping and shaping terms that encourage moving forward and facing the charging station head-on. This task presents a mixture of dense and sparse rewards in a partially observable environment.
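A reward of this mixed dense/sparse form can be sketched as follows. The function name, coefficients, and input signals below are our own illustrative placeholders, not the values used in the benchmark:

```python
import numpy as np

def docking_reward(docked, bumped, forward_vel, angle_to_dock):
    """Illustrative mixed dense/sparse docking reward (all numbers are
    placeholders): a large sparse bonus for docking, a penalty for
    bumping, and dense shaping for forward motion and for facing the
    dock head-on (angle_to_dock = 0 means directly facing it)."""
    reward = 0.0
    if docked:
        reward += 100.0                     # sparse success bonus
    if bumped:
        reward -= 1.0                       # bump penalty
    reward += 0.1 * forward_vel             # encourage forward movement
    reward += 0.1 * np.cos(angle_to_dock)   # encourage facing the dock
    return reward
```

The sparse bonus dominates once docking succeeds, while the dense shaping terms give the agent a learning signal long before its first successful dock.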


Further details of each task setup and corresponding experimental results are provided in our benchmarking paper, and in the task implementations on github.


Learning tasks with physical robots differ from typical simulated tasks in a number of ways. Sensory information arrives from unsynchronized devices and requires non-trivial design choices in the agent/environment interface to support learning. Learning tasks in the physical world are real-time in nature. In simulated tasks, we can pause the world, collect the data needed for the agent, compute an action, make a learning update, and advance the simulated world. Each simulation step can advance the world by exactly the same amount of simulated “time”. In real-world robotic tasks, none of this is true. Variability in latency and system computational load results in delays and variable durations of “time steps”, which conspire to create different and more difficult learning problems.
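The real-time constraint can be made concrete with a fixed-cycle control loop: after each sense-act-learn step, the loop sleeps for whatever remains of the cycle. This is a generic illustration of the problem, not SenseAct's actual mechanism:

```python
import time

def run_control_loop(step_fn, cycle_time, n_steps):
    """Run step_fn once per fixed cycle, sleeping away whatever time
    remains after computation. On a real robot, overruns (negative
    remainders) still occur and must be handled; here we simply record
    each cycle's actual duration so its variability can be inspected."""
    durations = []
    next_deadline = time.monotonic() + cycle_time
    for _ in range(n_steps):
        start = time.monotonic()
        step_fn()                                  # sense, act, learn
        remaining = next_deadline - time.monotonic()
        if remaining > 0:
            time.sleep(remaining)                  # pad out the cycle
        next_deadline += cycle_time
        durations.append(time.monotonic() - start)
    return durations
```

Even this simple loop exhibits jitter from OS scheduling and sleep granularity -- exactly the kind of time-step variability that does not exist in simulation.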


SenseAct provides a framework for concurrent computation and processing of sensory and actuation information, aiming to minimize action-observation delays and control-cycle-time variability. Shown in the figure above, SenseAct consists of several concurrent processes for communicating and processing the incoming sensory information and the outgoing actuation commands. Overall, the architecture dictates the order of computations and the timing of the interactions between components. SenseAct provides an OpenAI Gym environment interface to the learning agent, which makes it straightforward to use the many available off-the-shelf learning-algorithm implementations on real-world tasks. Details can be found in our paper and in the GitHub repository.
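Because SenseAct exposes the standard Gym interface, an agent drives a real robot through the familiar reset/step loop. The sketch below runs that loop against a stub environment -- the class and its dynamics are stand-ins of our own, not SenseAct code -- to show the interface an off-the-shelf algorithm implementation would see:

```python
import numpy as np

class StubRobotEnv:
    """Stand-in with the Gym-style API that SenseAct environments
    expose. A real SenseAct environment would stream observations from
    the physical robot instead of generating them here."""
    def reset(self):
        self.t = 0
        return np.zeros(3)                      # initial observation

    def step(self, action):
        self.t += 1
        obs = np.random.randn(3)                # placeholder sensor reading
        reward = -float(np.linalg.norm(action)) # placeholder reward
        done = self.t >= 10                     # fixed episode length
        return obs, reward, done, {}

def run_episode(env, policy):
    """Standard Gym reset/step loop, as an off-the-shelf RL agent
    (e.g. a PPO implementation) would drive it."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
    return total
```

Since the interface is identical to simulated Gym environments, swapping a simulator for a physical robot requires no change to the agent's code.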

Benchmarking Results

Our first use of these new tasks was to benchmark several state-of-the-art deep reinforcement learning algorithms for continuous control. To summarize our recent arXiv paper, we ran more than 450 independent experiments (950+ hours of robot usage) to evaluate PPO, TRPO, Soft-Q, and DDPG on each task, over a range of hyper-parameters. The figure below shows learning curves of selected algorithms on each task.


This study demonstrates the viability of reinforcement learning research based on real-world experiments. Such research is essential for understanding the difficulties of learning with physical robots, and for mitigating those difficulties to achieve fast and reliable learning performance in dynamic environments.