Tag: Off-Policy

Actor-Critic
Actor-Critic is a hybrid reinforcement learning architecture that combines value-based and policy-based methods through two components that work together. The "actor" maintains a policy that decides which action to take in each state, while the "critic" maintains value estimates that evaluate how good those actions are. This dual structure lets the actor learn what to do directly while the critic provides feedback on whether those decisions are improving, combining the stability of value-based learning with the flexibility of policy-based learning.
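As a rough illustration, a one-step tabular actor-critic update might look like the sketch below. The table shapes, step sizes, and the use of the TD error as the actor's learning signal are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    z = prefs - prefs.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      alpha_actor=0.05, alpha_critic=0.1, gamma=0.99):
    """One-step actor-critic update on tabular parameters.

    theta: (n_states, n_actions) action preferences (the actor)
    V:     (n_states,) state-value estimates (the critic)
    """
    # Critic evaluates the transition via a TD error.
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]

    # Critic update: move V(s) toward the TD target.
    V[s] += alpha_critic * td_error

    # Actor update: nudge the softmax policy toward `a` in proportion
    # to the TD error (the policy-gradient direction for softmax).
    pi = softmax(theta[s])
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0                # d/dtheta log pi(a|s)
    theta[s] += alpha_actor * td_error * grad_log_pi
```

Here the TD error plays both roles: it trains the critic, and it serves as the feedback signal telling the actor whether the chosen action did better or worse than expected.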
Q-Learning
Q-Learning is an off-policy reinforcement learning algorithm that learns the optimal action-value function independently of the agent's actual behavior. The "Q" stands for quality, representing how good each action is in each state. Unlike SARSA, Q-Learning is off-policy because it learns about the greedy (optimal) policy while potentially following a different exploratory policy. This separation allows it to learn the best possible strategy even while taking random exploratory actions during training.
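A minimal sketch of the off-policy update, assuming a tabular Q stored as a NumPy array; the epsilon-greedy helper and parameter values are illustrative choices:

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """Behavior policy: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    # Off-policy target: bootstrap from the best (max) next action,
    # no matter which action the behavior policy will actually take.
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    Q[s, a] += alpha * (target - Q[s, a])
```

The `max` in the target is what makes this off-policy: the learned values track the greedy policy even while `epsilon_greedy` occasionally explores.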
SARSA
SARSA (State-Action-Reward-State-Action) is an on-policy reinforcement learning algorithm that learns by updating its value estimates based on the actions it actually takes. The name comes from the sequence of information it uses: it observes the current state (S), takes an action (A), receives a reward (R), moves to a new state (S), and then selects the next action (A) before updating its knowledge. Unlike Q-Learning, which always assumes optimal future actions, SARSA updates its estimates based on the action it will actually take next, including any exploratory random actions.
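The same tabular setup makes the on-policy difference easy to see; in this sketch the next action a_next must already have been chosen before the update runs (parameter values again illustrative):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.99):
    # On-policy target: bootstrap from the action the agent will
    # actually take next, exploratory or not.
    target = r + (0.0 if done else gamma * Q[s_next, a_next])
    Q[s, a] += alpha * (target - Q[s, a])
```

Compared with the Q-Learning sketch above, the only change is `Q[s_next, a_next]` in place of `max(Q[s_next])`, which is exactly the on-policy/off-policy distinction described here.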
Expected SARSA
Expected SARSA is a reinforcement learning algorithm that bridges the gap between SARSA and Q-Learning by using the expected value of the next action instead of either the actual next action (like SARSA) or the maximum value (like Q-Learning). Rather than relying on a single sampled action, Expected SARSA takes the probability-weighted average of the values of all possible next actions under the current policy. This makes it more stable and robust than standard SARSA because it eliminates the update variance that comes from sampling a single next action, while still respecting the policy being followed.
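Assuming an epsilon-greedy policy over a tabular Q (a sketch, with illustrative parameters), the update replaces the sampled next action with an expectation:

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, done, epsilon,
                          alpha=0.1, gamma=0.99):
    # Probability of each next action under the epsilon-greedy policy.
    n_actions = Q.shape[1]
    probs = np.full(n_actions, epsilon / n_actions)
    probs[int(np.argmax(Q[s_next]))] += 1.0 - epsilon

    # Expected value over next actions replaces a single sampled action.
    expected_q = float(np.dot(probs, Q[s_next]))
    target = r + (0.0 if done else gamma * expected_q)
    Q[s, a] += alpha * (target - Q[s, a])
```

With epsilon set to 0 the expectation collapses to the max and the update reduces to Q-Learning; replacing the expectation with a single sampled next action recovers SARSA.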