Actor-Critic
Actor-Critic is a hybrid reinforcement learning architecture that combines value-based and policy-based methods by maintaining two separate components working together. The "actor" maintains a policy that decides which actions to take in each state, while the "critic" maintains value estimates that evaluate how good those actions are. This dual structure allows the actor to learn what to do directly while the critic provides feedback on whether those decisions are improving, combining the stability of value-based learning with the flexibility of policy-based learning.
How Actor-Critic Works
The actor stores a policy—often as a table of action probabilities for each state—and selects actions based on these probabilities. After taking an action and observing the reward, the critic evaluates the decision by computing a TD (temporal difference) error: δ = r + γ·V(s') - V(s), which measures whether the outcome was better or worse than expected. This TD error serves as the "advantage" signal that guides learning. The actor then updates its policy to increase the probability of actions with positive TD errors (better than expected) and decrease the probability of actions with negative TD errors (worse than expected), while the critic simultaneously updates its value estimates to provide more accurate feedback in the future.
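The update loop above can be sketched as a minimal tabular actor-critic step. This is an illustrative sketch, not a canonical implementation: the problem size, learning rates (`alpha_actor`, `alpha_critic`), and the softmax parameterization of the actor are all assumptions made for the example.

```python
import numpy as np

# Hypothetical tiny problem: 5 states, 2 actions (sizes chosen for illustration).
n_states, n_actions = 5, 2
gamma = 0.9                                # discount factor
alpha_actor, alpha_critic = 0.1, 0.2       # assumed learning rates

prefs = np.zeros((n_states, n_actions))    # actor: action preferences per state
V = np.zeros(n_states)                     # critic: state-value estimates

def policy(s):
    """Softmax over the actor's preferences gives action probabilities."""
    e = np.exp(prefs[s] - prefs[s].max())
    return e / e.sum()

def update(s, a, r, s_next, done):
    """One actor-critic step driven by the TD error delta = r + gamma*V(s') - V(s)."""
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]                  # better (>0) or worse (<0) than expected
    V[s] += alpha_critic * delta           # critic: move value estimate toward target
    # Actor: raise the probability of a when delta > 0, lower it when delta < 0,
    # using the gradient of log pi(a|s) for a softmax policy.
    grad = -policy(s)
    grad[a] += 1.0
    prefs[s] += alpha_actor * delta * grad

# Single illustrative transition: from state 0, take a sampled action,
# receive reward +1, land in state 2.
rng = np.random.default_rng(0)
a = rng.choice(n_actions, p=policy(0))
update(0, a, 1.0, 2, done=False)
```

After this one positive-TD-error update, the critic's estimate `V[0]` rises toward the observed return and the actor shifts probability mass toward the action it just took, exactly the feedback loop the paragraph describes.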
When to Use Actor-Critic
Actor-Critic methods are valuable when you need to learn stochastic policies (policies that deliberately randomize actions) or when the optimal solution requires probability distributions over actions rather than deterministic choices. They learn faster than pure policy gradient methods because the critic reduces variance in the learning signal, and they handle large or continuous action spaces better than Q-Learning, which requires computing a maximum over all actions. The tradeoff is increased complexity—you maintain and update two structures instead of one—but this approach often converges more reliably than pure policy methods and handles a broader range of problems than pure value methods, making it a versatile middle-ground approach for many reinforcement learning tasks.
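The continuous-action point can be made concrete with a short sketch. Here the actor is assumed (for illustration) to be a Gaussian over a one-dimensional action; acting is just a single draw, whereas a value-only method would face a maximization over the whole real line at every step.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical actor parameters for one state: a Gaussian policy pi(a|s).
mu, sigma = 0.0, 1.0

# Acting with an actor: sample once from the policy distribution.
action = rng.normal(mu, sigma)

# A Q-Learning step would instead need max over a of Q(s, a) across a continuous
# range, which requires discretizing actions or solving an inner optimization.
```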