Understanding the Landscape of Reinforcement Learning
This blog post summarizes a lecture on how reinforcement learning (RL) algorithms are organized. It builds upon previous introductory videos and moves toward the practical implementation of RL, aiming to be a useful guide to the diverse approaches within the field.
This content is based on a chapter from the second edition of the book "Data-Driven Science and Engineering".
What is Reinforcement Learning? A Quick Recap
In RL, an agent interacts with an environment. The agent takes actions, which can be discrete (e.g., chess moves) or continuous (e.g., robot arm movements). The agent observes the state of the system and uses this information to choose actions that maximize current or future rewards. A significant challenge is that rewards can be delayed, like in chess where you only know if you've won at the end of the game.
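To make this interaction loop concrete, here is a minimal sketch in Python. The `env` object and `policy` function are hypothetical placeholders rather than any specific library's API; they just mirror the observe-act-reward cycle described above.

```python
# Sketch of the agent-environment loop; `env` and `policy` are placeholders.
def run_episode(env, policy):
    state = env.reset()                          # observe the initial state
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                   # agent picks an action from its policy
        state, reward, done = env.step(action)   # environment responds with a new state and reward
        total_reward += reward                   # rewards may be sparse or delayed (e.g., end of a game)
    return total_reward
```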
The agent's control strategy is called a policy (π). It is essentially a set of rules determining actions based on the observed state, aiming to maximize future rewards. Usually the policy is defined as the probability π(a|s) of taking action a given state s.
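As a toy illustration of a stochastic policy π(a|s), here is a sketch built on a hand-made probability table; the states, actions, and probabilities are invented purely for illustration.

```python
import numpy as np

# Toy stochastic policy pi(a|s): for each state, a probability over actions.
pi = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.7, "right": 0.3},
}

def sample_action(state):
    actions, probs = zip(*pi[state].items())
    return np.random.choice(actions, p=probs)  # draw an action a ~ pi(a|s)
```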
A value function is associated with each policy. It represents the expected future reward for being in a particular state. Future rewards are often discounted to reflect that they are less valuable than current rewards.
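The discount is usually a factor gamma between 0 and 1 applied at each time step. Here is a quick sketch of how a discounted sum of rewards behaves; the reward sequences below are made up.

```python
# Discounted return: a reward t steps in the future is weighted by gamma**t.
def discounted_return(rewards, gamma=0.95):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# The same +1 reward is worth less the later it arrives.
print(discounted_return([1, 0, 0]))  # 1.0
print(discounted_return([0, 0, 1]))  # 0.9025
```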
The overarching goal of RL is to learn, through trial and error, the optimal policy that maximizes future rewards. This is a difficult problem, driving ongoing research and development of new techniques. For games like chess or Go, the state space is astronomically large, making brute-force solutions impossible.
Organizing Reinforcement Learning Techniques
Here's an overview of how the mainstream types of reinforcement learning are organized:
Model-Based vs. Model-Free Reinforcement Learning
The primary distinction is between model-based and model-free RL.
Model-Based RL
If you have a good model of the environment (e.g., a Markov Decision Process or a differential equation), you can use model-based RL.
- Markov Decision Process (MDP): If the probabilities of moving from one state to the next are known, policy iteration and value iteration can be used to optimize the policy. Both are based on dynamic programming and rely on the Bellman optimality condition for the value function (a small value iteration sketch follows this list).
- Deterministic Systems (Continuous Control): For systems like robots, whose dynamics can be written as a differential equation ẋ = f(x, u), optimal nonlinear control is governed by the Hamilton-Jacobi-Bellman equation; linear optimal control (LQR, Kalman filters) is a special case of this framework.
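To show what dynamic programming on an MDP looks like, here is a minimal value iteration sketch. The two-state MDP, its rewards, and the discount factor are invented for illustration; the update is simply the Bellman optimality condition applied repeatedly.

```python
# Value iteration on a toy MDP with known transitions.
# P[s][a] is a list of (probability, next_state, reward) tuples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
gamma = 0.9
V = {s: 0.0 for s in P}

for _ in range(100):
    for s in P:
        # Bellman optimality: V(s) = best expected reward plus discounted next-state value
        V[s] = max(
            sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
            for a in P[s]
        )

print(V)  # both states approach 1 / (1 - gamma) = 10 by always choosing action 1
```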
While mathematically elegant, dynamic programming often results in brute-force searches and doesn't scale well to high-dimensional systems. However, the principles of model-based control inform many model-free approaches.
Model-Free RL
In many real-world scenarios (e.g., chess), a complete model of the environment is unavailable. Model-free RL addresses this by learning through trial and error, approximating dynamic programming without explicitly having a model.
The main division within model-free RL is between gradient-free and gradient-based methods.
Gradient-Free Methods
Gradient-free methods are used when gradient information is unavailable. These can be further divided into on-policy and off-policy approaches.
- On-Policy: The agent always plays its best possible game, acting on its current estimate of the value function. SARSA (State-Action-Reward-State-Action) is an on-policy algorithm.
- Off-Policy: The agent deliberately tries actions it knows are suboptimal in order to explore and gather information, which can be very valuable. Q-learning is an off-policy variant of SARSA: the quality function Q encodes both the optimal policy and the value function, and it can be learned even while the agent is running suboptimal control (the update rules for both algorithms are sketched after this list).
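The difference between the two shows up most clearly in their update rules. Below is a sketch of the core temporal-difference updates for SARSA and Q-learning on a tabular Q stored as a dict of dicts; the learning rate and discount factor are placeholder values.

```python
# SARSA (on-policy): bootstrap from the action the agent actually takes next.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

# Q-learning (off-policy): bootstrap from the best action in the next state,
# regardless of what the (possibly exploratory) behavior policy actually does.
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```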
Q-learning is important for imitation learning, where you can learn by watching others play even if you don't know their strategy. Much of modern reinforcement learning builds on Q-learning.
Gradient-Based Methods
Gradient-based algorithms directly update the policy parameters using gradient optimization (e.g., Newton's method, steepest descent). When gradients are available, this is typically the fastest approach.
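As one concrete example of this family, here is a REINFORCE-style policy gradient sketch for a softmax policy over three actions. The "environment" is a made-up bandit whose expected rewards are invented, so this only shows the shape of the update, not a definitive implementation.

```python
import numpy as np

theta = np.zeros(3)                       # policy parameters, one per action
true_rewards = np.array([0.1, 0.5, 0.9])  # hypothetical expected reward of each action
alpha = 0.1                               # learning rate

for _ in range(2000):
    probs = np.exp(theta) / np.exp(theta).sum()  # softmax policy pi(a)
    a = np.random.choice(3, p=probs)
    r = np.random.normal(true_rewards[a], 0.1)   # sample a noisy reward
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0                        # gradient of log pi(a) w.r.t. theta
    theta += alpha * r * grad_log_pi             # step the parameters up the reward gradient

print(np.argmax(theta))  # usually 2, the highest-reward action
```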
Deep Reinforcement Learning
The rise of deep learning has led to deep reinforcement learning. Deep neural networks are used to:
- Learn a model for model-based RL.
- Represent model-free quantities (e.g., the Q-function or the policy), making them differentiable and amenable to gradient-based optimization (see the sketch after this list).
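As a sketch of the second idea, here is a small neural network standing in for the Q-function, written with PyTorch; the layer sizes and learning rate are arbitrary placeholders. Because the network is differentiable, a temporal-difference error can be turned into a loss and minimized with standard gradient-based optimizers.

```python
import torch
import torch.nn as nn

# A tiny Q-network: maps a state vector to one Q-value per discrete action.
class QNetwork(nn.Module):
    def __init__(self, state_dim=4, num_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state):
        return self.net(state)

q = QNetwork()
optimizer = torch.optim.Adam(q.parameters(), lr=1e-3)  # gradient-based training of Q
```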
Techniques like deep model predictive control and actor-critic methods are also gaining renewed interest because of deep neural networks. There are very impressive demonstrations of deep RL, such as computers beating human grandmasters at Go.
Summary
This organizational chart provides a foundation for choosing the right RL algorithm:
- Model Available: Use dynamic programming based on Bellman optimality.
- Model Unavailable: Choose gradient-free or gradient-based methods.
- On-Policy vs. Off-Policy: Consider the specific needs of the problem. SARSA is more conservative, while Q-learning converges faster.
- Deep Learning: Can enhance all of these methods with more powerful representations.
Future videos will delve deeper into specific techniques like policy/value iteration, Q-learning, optimal nonlinear control, and policy gradient optimization.