Reinforcement learning beat games such as Backgammon and Go, and is paving a path for smarter robots


Reinforcement learning is a general mathematical framework for modeling learning in dynamical systems, such as software, robots, and even animals. The formalism of modern reinforcement learning (abbreviated “RL”) has been the basis for a growing body of research for several decades in computer science (where it is part of “machine learning”), in engineering (sometimes called “optimal control”), in psychology (especially “trial and error learning”) and in economics (especially “decision theory”).

The core idea of reinforcement learning is that a learning dynamical system can be modeled as two interacting components: an Agent that makes decisions (and learns to make better ones) and an Environment that responds to those choices and provides the Agent with the consequences of its actions. In RL, actions have two kinds of consequence:

  1. The sensor values that characterize what’s going on (state), and

  2. Some amount of goodness or utility that is immediately realized by the system (reward).

The interaction between Agent and Environment is canonically depicted as a cyclic flow chart of actions (A_t) leading to the next state (S_{t+1}) and reward (R_{t+1}):

[Figure: the Agent-Environment interaction loop]
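
The interaction loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy Environment (a walk along five positions), the RandomAgent, and the run function are all hypothetical names invented for this example, and the agent here does not learn anything yet.

```python
import random


class Environment:
    """Hypothetical toy environment: a walk along positions 0..4."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to 0..4
        self.state = max(0, min(4, self.state + action))
        # Assumed reward rule: 1.0 whenever the agent is at position 4
        reward = 1.0 if self.state == 4 else 0.0
        return self.state, reward


class RandomAgent:
    """Placeholder agent that ignores what it observes and acts at random."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def act(self, state):
        return self.rng.choice([-1, 1])

    def observe(self, state, action, reward, next_state):
        pass  # a learning agent would update itself here


def run(agent, env, steps):
    """Run the Agent/Environment loop and return total reward and history."""
    state = env.state
    total = 0.0
    history = []
    for _ in range(steps):
        action = agent.act(state)              # A_t
        next_state, reward = env.step(action)  # S_{t+1}, R_{t+1}
        agent.observe(state, action, reward, next_state)
        history.append((state, action, reward))
        total += reward
        state = next_state
    return total, history
```

Calling run(RandomAgent(), Environment(), steps=20) returns the total reward and the list of (state, action, reward) triples. An RL algorithm would replace RandomAgent with something that uses that history to choose better actions.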

This model makes it possible to talk about what “learning in machines” means in a precise way. RL defines learning as the Agent changing its behavior so that the Environment generates more reward, on average, as the system advances through time. In terms of utility, we would say an Agent learns when it adapts to maximize expected utility.

RL algorithms are recipes for choosing each action in a way that is informed by the history of previous actions, states, and rewards. Consider the simple case where an Agent can choose between two possible actions: “left” and “right”. If “left” always yielded more reward than “right” ever did, then after trying each choice a few times, a sensible way to maximize reward over time would be to stop choosing the inferior “right”. This is what we expect of RL algorithms too.
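
The two-action example can be made concrete with a minimal sketch. It assumes made-up reward ranges in which every “left” reward exceeds every “right” reward, and it uses a simple explore-then-commit rule: try each action a few times, then stick with whichever has the higher average reward. The function names and numbers are hypothetical, chosen only for illustration.

```python
import random


def pull(action, rng):
    # Hypothetical reward ranges: every "left" reward exceeds every "right" one.
    if action == "left":
        return rng.uniform(0.6, 1.0)
    return rng.uniform(0.0, 0.4)


def explore_then_commit(steps=100, trials_per_action=3, seed=0):
    rng = random.Random(seed)
    actions = ["left", "right"]
    totals = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}
    choices = []
    for t in range(steps):
        if t < trials_per_action * len(actions):
            # Exploration phase: alternate between the actions
            action = actions[t % len(actions)]
        else:
            # Commit phase: pick the action with the higher average reward
            action = max(actions, key=lambda a: totals[a] / counts[a])
        reward = pull(action, rng)
        totals[action] += reward
        counts[action] += 1
        choices.append(action)
    return choices, counts
```

With these reward ranges, the sketch chooses “right” only during its three exploratory trials and “left” on every step afterwards. Practical algorithms such as epsilon-greedy usually keep a small amount of exploration going, because real rewards are rarely this clearly separated.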

Researchers have modeled many systems as Agents and Environments, in order to apply RL algorithms to make the systems exhibit desirable behavior. It is popular to apply RL algorithms to games, where “desirable behavior” means winning the game. RL systems have learned to master games of planning and judgement that previously stumped AI: TD-Gammon learned Backgammon and AlphaGo learned Go, while DQN (Atari games) and OpenAI Five (Dota 2) took on popular computer games. But RL is not only about playing games. Google researchers have used it to reduce the energy used for cooling data centers by up to 40%.

RL is fundamental to Kindred’s AI because it offers a direct articulation of our product design objectives. SORT, our first commercial grasping robot, is tasked with grasping, identifying, and stowing a large variety of objects within a dynamic environment. To achieve this, Kindred’s AI must make a huge number of choices about where and how to move, with the objective of maximizing a combination of throughput and other customer metrics.

Kindred SORT



The first way that we’ve drawn on RL to power SORT was to model how SORT’s automatic grasping routine, which we call AutoGrasp™, interacts with the rest of our system. From an RL perspective, AutoGrasp™ can be seen as an Agent that makes choices on the basis of sensor readings in order to grasp, scan, and stow items quickly and accurately. When we see AutoGrasp™ as an Agent, everything else about SORT’s operation is viewed as the Environment. We model their interaction as a Contextual Bandit learning problem, also known as Associative Search. For more information about this and other RL problem statements, the book Reinforcement Learning: An Introduction by Sutton and Barto is a great place to start.
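
To give a feel for what a Contextual Bandit problem looks like, here is a minimal sketch. It is not a description of how AutoGrasp™ works: the contexts (“box”, “bag”), the actions (“suction”, “pinch”), the success probabilities, and the class and function names are all hypothetical, and a real system would observe images rather than two discrete labels. The learner keeps one average-reward estimate per (context, action) pair and mostly picks the best-looking action for the context it sees, with occasional random exploration (epsilon-greedy).

```python
import random
from collections import defaultdict


class ContextualBandit:
    """Tabular epsilon-greedy learner: one estimate per (context, action)."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)
        self.values = defaultdict(float)

    def act(self, context):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)  # explore
        # Exploit: best estimated action for this context
        return max(self.actions, key=lambda a: self.values[(context, a)])

    def update(self, context, action, reward):
        key = (context, action)
        self.counts[key] += 1
        # Incremental running average of observed rewards
        self.values[key] += (reward - self.values[key]) / self.counts[key]


# Hypothetical success probabilities, invented for this illustration only
SUCCESS = {
    ("box", "suction"): 0.9, ("box", "pinch"): 0.3,
    ("bag", "suction"): 0.2, ("bag", "pinch"): 0.8,
}


def simulate(steps=5000, seed=0):
    rng = random.Random(seed)
    learner = ContextualBandit(["suction", "pinch"], seed=seed + 1)
    for _ in range(steps):
        context = rng.choice(["box", "bag"])  # the Environment presents a state
        action = learner.act(context)         # the Agent chooses an action
        reward = 1.0 if rng.random() < SUCCESS[(context, action)] else 0.0
        learner.update(context, action, reward)
    return learner
```

After simulate() runs, learner.values holds the estimated success rate for each (context, action) pair. In this setup the best action depends on the context, which is what separates a contextual bandit from the simpler two-action case.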


Over time, AutoGrasp™ has built up a large database of states (images of item configurations), actions (grasp attempts) and rewards (outcome data). We use RL iteratively to learn better grasping behaviors, which provide continued improvements to SORT’s overall system performance.

By drawing on many Agent/Environment interpretations of SORT and other products, we are simultaneously and incrementally applying reinforcement learning algorithms to many aspects of our system’s operation. Over time, Kindred AI Agents are learning to operate diverse robots in diverse situations. Our ongoing publication-oriented research develops best practices for studying learning algorithms in hardware rather than in simulation, and our internal R&D applies those techniques to create products that learn on the job to deliver optimal performance to our customers.