Deep Reinforcement Learning: An Overview and Introduction
Deep reinforcement learning (DRL) is a captivating field that combines the power of deep neural networks with the ability to act on the world. It's about creating intelligent beings that understand the world and act accordingly, leading to exciting breakthroughs and capturing our imagination about what's possible. This makes it a favorite area in deep learning and AI.
What is Deep Reinforcement Learning?
DRL takes the power of deep learning – the ability to compress and encode data in a way that allows reasoning – and applies it to sequential decision-making in the real world. It tackles problems where an agent must make a sequence of decisions that affect the environment. The key to learning, especially when starting with limited knowledge, is trial and error. The "deep" aspect comes from using neural networks to represent the world and inform actions.
Supervised Learning vs. Reinforcement Learning: A Matter of Supervision
While supervised learning often implies manual annotation, it's crucial to remember that all machine learning is supervised, guided by a loss function that indicates what's good or bad. The difference lies in the source of supervision. Unsupervised learning minimizes human labor, but there's always some human input defining what's considered good or bad. This applies to reinforcement learning as well. The exciting challenge is finding the most efficient way to provide this supervision.
Supervised learning involves learning from a dataset with ground truth, while reinforcement learning teaches an agent through experience in an environment. The central element is designing the environment and, above all, the reward structure, which defines what's good and bad.
The Essential Element: Reward Structure
Consider a baby learning to walk. Success is reaching the destination; failure is not. The mystery lies in how humans learn so quickly from limited trial and error. It could stem from vast amounts of preexisting data (evolutionary history of bipedalism and vision), rapid learning through observation and imitation in early life, or a currently unknown algorithm used by the brain. As we explore the comparatively trivial accomplishments of DRL, let's consider these possibilities.
The Agent's Perspective: Sensing, Representing, Learning, and Acting
An agent in the world interacts through a stack of processes, going from input to output. These are:
- Sensing the Environment: Using various sensory systems (e.g., lidar, cameras, microphones) to gather raw data.
- Representing the Data: Transforming raw sensory data into meaningful representations using deep learning to form higher-order abstractions.
- Learning: Accomplishing useful tasks (discriminative or generative) based on the representation.
- Aggregating Information: Integrating past information relevant to the task at hand.
- Acting: Providing actions within the environment's constraints to achieve success.
The promise of deep learning is converting raw data into meaningful representations, while DRL aims to build agents that use those representations to act successfully in the world. The basic framework involves an agent sensing the environment, taking an action, and receiving a reward. The environment then changes, leading to a new observation.
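As a concrete sketch of this sense-act-reward loop, the snippet below runs one episode with a random policy, assuming the gymnasium package and its CartPole-v1 environment; the random action is simply a stand-in for whatever the agent has learned.

```python
import gymnasium as gym

# A minimal agent-environment interaction loop: observe, act, receive a reward,
# observe the changed environment, repeat until the episode ends.
env = gym.make("CartPole-v1")
observation, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    # A real agent would map the observation to an action via its learned policy;
    # here a random action stands in for that policy.
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Episode finished with total reward {total_reward}")
env.close()
```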
One question remains: Can everything be modeled in this way? Is this a good formulation for learning in robotic systems, both simulated and real?
Challenges and Opportunities in Real-World Applications
Supervised learning is teaching by example; reinforcement learning is teaching by experience. Currently, most DRL agents learn through simulation or highly constrained real-world scenarios. Therefore, the challenge lies in the gap between simulation and reality. Solutions include:
- Improving algorithms to create policies transferable across domains, especially to the real world.
- Improving simulations to increase fidelity, minimizing the reality-simulation gap.
Key Components of an RL Agent
- Policy: A strategy the agent uses to make decisions.
- Value Function: An estimate of how good a state or state-action pair is.
- Model: The agent's representation and understanding of the environment.
The purpose of an RL agent is to maximize reward, often using a discounted framework to prioritize near-term rewards. Discounting is used both as a mathematical tool for proving convergence and to model the uncertainty of future rewards.
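To make discounting concrete, the return from a time step is the sum of future rewards, each weighted by an additional factor of gamma. A minimal sketch follows; the reward sequence and gamma value are illustrative, not from the text.

```python
# Discounted return: G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
# A gamma < 1 weights near-term rewards more heavily than distant ones.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    # Accumulate backwards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Illustrative reward sequence (hypothetical values).
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.9**2 * 1.0 = 0.81
```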
The Robot in a Room: Example of Policy and Reward
Imagine a robot in a room made up of grid cells. It starts at the bottom left and must reach the top right (+1 reward), while avoiding the cell near the top right corner (-1 reward). There is also a cost to each step. In a deterministic world, you would always follow the shortest path to the objective, because every action reliably produces its intended result. In a stochastic world, however, there is an 80% chance you move up as intended, a 10% chance you slip to the left, and a 10% chance you slip to the right. To reach the +1, you have to plan accordingly, steering clear of the -1 when it is nearby.
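As a minimal illustration of these dynamics, the sketch below encodes the 80/10/10 transition model for such a grid world; the grid size and the bottom-left coordinate convention are assumptions made here for concreteness.

```python
# Stochastic grid-world dynamics: the intended move succeeds with probability 0.8,
# and the agent slips perpendicular to it (to either side) with probability 0.1 each.
# Coordinates: (0, 0) is the bottom-left cell, y increases upward.
UP, DOWN, LEFT, RIGHT = (0, 1), (0, -1), (-1, 0), (1, 0)
PERPENDICULAR = {UP: (LEFT, RIGHT), DOWN: (LEFT, RIGHT),
                 LEFT: (UP, DOWN), RIGHT: (UP, DOWN)}

def transition_probabilities(state, action, width=4, height=3):
    """Return {next_state: probability} for one intended action (walls clamp moves)."""
    probs = {}
    for move, p in [(action, 0.8)] + [(m, 0.1) for m in PERPENDICULAR[action]]:
        x = min(max(state[0] + move[0], 0), width - 1)
        y = min(max(state[1] + move[1], 0), height - 1)
        probs[(x, y)] = probs.get((x, y), 0.0) + p
    return probs

# From the bottom-left cell, trying to move up:
print(transition_probabilities((0, 0), UP))  # {(0, 1): 0.8, (0, 0): 0.1, (1, 0): 0.1}
```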
Important Lessons from the Example:
- Environment Model: The dynamics of the world significantly impact the optimal policy.
- Reward Structure: Defines what is considered good or bad, and how good or bad it is; controlling this has a significant influence on the policy.
Unintended Consequences and AI Safety
When formulating a reinforcement learning problem, it is important to understand that slight variations in the world's parameters or in the reward structure can drastically alter the resulting policy. The implications for AI safety, especially in areas like autonomous vehicles navigating complex intersections, are far-reaching.
An illustrative example involves a game where an agent is supposed to complete a race quickly, but it is given points for picking up green turbo boosts. The unintended consequence is that the agent focuses solely on picking up boosts repeatedly, since it gives more points than completing the race. In real-world systems, such objective function issues can have highly detrimental consequences, especially in situations involving human lives. It is crucial that objective functions accurately capture what we want to optimize, otherwise, there can be negative side effects.
Examples of Reinforcement Learning Systems
The following are a few common applications of reinforcement learning systems:
- Cart Pole: An agent has to apply horizontal force to a cart, where the ultimate goal is to keep a pole upright. A positive reward is provided for each step the pole remains upright.
- Game of Doom: An agent learning from raw pixels has to eliminate all opponents. Rewards are provided for eliminating opponents, while negative rewards are provided if the agent itself is eliminated.
- Object Manipulation: The goal is to have a robot pick up an object. An agent is rewarded if the pickup is successful.
As we increase the application of robotics in the real world, it will be important to make explicit what we humans implicitly encode. Specifically, our own objective functions, reward structures, and models of the environment need to be built into autonomous vehicles and robots in order to promote safe and responsible decision-making.
Key Takeaways for RealWorld Impact
- Deep Learning: The algorithms and data required to train models
- Reinforcement Learning: Defining the environment, the action space, and the reward structure
Types of Reinforcement Learning
- Model-Based: Interacts with the world in order to construct an estimate of the dynamics of that world
- Value-Based: Estimates the quality of taking a certain action in a certain state
- Policy-Based: Directly learns a policy function
Approaches and Algorithms
Model-based algorithms learn a model of the world and use it for planning. Value-based methods estimate the quality of states and actions, learning how good it is to be in a state and using that information to pick the best action. Policy-based methods directly learn a policy function, outputting actions based on the world representation. Examples of these algorithms include Deep Q-Networks (DQN), Policy Gradients, Advantage Actor-Critic (A2C), and Deep Deterministic Policy Gradient (DDPG).
DQN: A Breakthrough in Atari Games
Q-learning estimates how good it is to take an action in a state, assuming an optimal policy is followed afterwards. Initially there is no knowledge of this value, so it must be learned by repeatedly applying the Bellman equation: the current estimate is updated toward the reward received plus the discounted value of the best next action. Exploration is crucial in the beginning, so the agent visits enough of the state space to determine these values.
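Concretely, the tabular update nudges the current estimate toward the Bellman target (reward plus discounted value of the best next action), while an epsilon-greedy rule handles exploration. The learning rate, discount, and epsilon below are illustrative choices, not values from the text.

```python
import random
from collections import defaultdict

# Tabular Q-learning update:
#   Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
Q = defaultdict(float)  # maps (state, action) -> current value estimate

def q_update(state, action, reward, next_state, actions, alpha=0.1, gamma=0.99):
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])

def epsilon_greedy(state, actions, epsilon=0.1):
    # Explore with probability epsilon, otherwise exploit the current estimates.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```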
However, for practical real-world situations, updating a table is impractical. For example, with a sensory input of 4 game frames of 84x84 pixels, each pixel taking 256 values, the table would be far too large to store or compute. Deep RL comes to the rescue, using a neural network to learn a compressed representation and form an approximation of this Q-function.
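As a sketch of what such a function approximator can look like for this input, here is a small convolutional Q-network in PyTorch, following the architecture popularized by the original DQN work; the exact layer sizes are an assumption made here for illustration.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 grayscale 84x84 frames to one Q-value per action."""
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, frames):  # frames: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.head(self.features(frames))

q_net = QNetwork(n_actions=4)
q_values = q_net(torch.zeros(1, 4, 84, 84))  # one Q-value estimate per action
```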
Improving the DQN Network
Two important tricks that enable deep networks to learn to play games (both sketched in code after this list) are:
- Experience Replay: the agent stores its experiences in a memory buffer and replays random samples of them during training, rather than learning only from consecutive steps
- Fixing a Target Network: a separate target network supplies the targets in the loss function and is held fixed (updated only periodically), so backpropagation chases a stable target instead of one that shifts with every update
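Here is the promised sketch combining both tricks in PyTorch, assuming a hypothetical small network over 4-dimensional states with 2 actions; the buffer size, batch size, and learning rate are illustrative.

```python
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical setup: a small Q-network over 4-dimensional states with 2 actions.
def make_net():
    return nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

q_net, target_net = make_net(), make_net()
target_net.load_state_dict(q_net.state_dict())  # start with identical weights
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Each entry is (state, action, reward, next_state, done) with plain Python numbers/lists.
replay_buffer = deque(maxlen=100_000)

def train_step(batch_size=32, gamma=0.99):
    if len(replay_buffer) < batch_size:
        return
    # Experience replay: sample past transitions at random, breaking correlations.
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states, dones = map(
        lambda x: torch.as_tensor(x, dtype=torch.float32), zip(*batch))
    # Fixed target network: the regression target does not move with every gradient step.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)
    q_values = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Periodically (e.g., every few thousand steps), copy the online weights into the target:
# target_net.load_state_dict(q_net.state_dict())
```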
Policy Gradients
Unlike value-based methods, policy gradient methods directly optimize the policy by learning from the outcomes of actions. Actions taken along the way in the environment are rewarded or punished based on whether they led to victory or defeat. Credit assignment is hard in this setting, so it is quite remarkable that the method works as well as it does.
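A minimal REINFORCE-style sketch of this idea: increase the log-probability of actions in proportion to the return that followed them. The small network, state and action dimensions, and hyperparameters are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical policy network: maps a 4-dimensional state to logits over 2 actions.
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, returns):
    """states: (T, 4) float tensor, actions: (T,) long tensor,
    returns: (T,) float tensor of discounted returns observed from each step onward."""
    dist = torch.distributions.Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)
    # Policy gradient: make actions that preceded high returns more likely.
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```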
Model-Based Methods
It is possible to build model-based systems using deep learning methods. The learned models can then be used to find strong moves and promising positions in games like Chess, Go, or Shogi. For example, AlphaZero uses Monte Carlo Tree Search, guided by a neural network that evaluates board positions, to decide which moves are worth exploring. Overall, model-based methods are quite sample-efficient and extremely powerful.
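This is not AlphaZero itself, but a minimal sketch of the underlying idea: use a model of the environment to look ahead before committing to a move, scoring the resulting positions with a learned value estimate. Both model_step and value_estimate are hypothetical stand-ins for learned components.

```python
def plan_one_step(state, legal_actions, model_step, value_estimate):
    """Pick the action whose predicted next state the value estimate likes best.

    model_step(state, action) -> (predicted_next_state, predicted_reward)
    value_estimate(state)     -> scalar estimate of how good the position is
    Both are hypothetical learned components; AlphaZero goes much further by
    expanding a full Monte Carlo Tree Search guided by policy and value networks.
    """
    def score(action):
        next_state, reward = model_step(state, action)
        return reward + value_estimate(next_state)
    return max(legal_actions, key=score)
```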
The Challenge: Simulation to Real World
The majority of agents operating in the real world don't learn their actions from data; instead, they use preprogrammed rules. This is true for autonomous vehicle companies, as well as humanoid robots that operate in uncertain conditions. In order to improve this, we can:
- Develop better transfer learning algorithms
- Decrease the distance between the simulation world and the real world
- Increase the number and variety of simulations (which acts as a form of regularization)
Further Resources
- Lecture videos will be available from Deep Learning at MIT
- Tutorials on RL are available on GitHub
- Essay from OpenAI