Tag: Off-Policy

Actor-Critic
Actor-Critic is a hybrid reinforcement learning architecture that combines value-based and policy-based methods through two components that work together. The "actor" maintains a policy that decides which action to take in each state, while the "critic" maintains value estimates that evaluate how good those actions are. This dual structure lets the actor learn what to do directly while the critic provides feedback on whether those decisions are improving, combining the stability of value-based learning with the flexibility of policy-based learning.
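As a rough illustration, a one-step tabular actor-critic update might look like the sketch below. The table shapes, step sizes, and the use of the TD error as the actor's learning signal are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    z = prefs - prefs.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      alpha_actor=0.05, alpha_critic=0.1, gamma=0.99):
    """One-step actor-critic update on tabular parameters.

    theta: (n_states, n_actions) action preferences (the actor)
    V:     (n_states,) state-value estimates (the critic)
    """
    # Critic evaluates the transition via a TD error.
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]

    # Critic update: move V(s) toward the TD target.
    V[s] += alpha_critic * td_error

    # Actor update: nudge the softmax policy toward `a` in proportion
    # to the TD error (the policy-gradient direction for softmax).
    pi = softmax(theta[s])
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0                # d/dtheta log pi(a|s)
    theta[s] += alpha_actor * td_error * grad_log_pi
```

Here the TD error plays both roles: it trains the critic, and it serves as the feedback signal telling the actor whether the chosen action did better or worse than expected.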
Q-Learning
Q-Learning is an off-policy reinforcement learning algorithm that learns the optimal action-value function independently of the agent's actual behavior. The "Q" stands for quality, representing how good each action is in each state. Unlike SARSA, Q-Learning is off-policy because it learns about the greedy (optimal) policy while potentially following a different exploratory policy. This separation allows it to learn the best possible strategy even while taking random exploratory actions during training.
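A minimal sketch of the off-policy update, assuming a tabular Q stored as a NumPy array; the epsilon-greedy helper and parameter values are illustrative choices:

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """Behavior policy: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    # Off-policy target: bootstrap from the best (max) next action,
    # no matter which action the behavior policy will actually take.
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    Q[s, a] += alpha * (target - Q[s, a])
```

The `max` in the target is what makes this off-policy: the learned values track the greedy policy even while `epsilon_greedy` occasionally explores.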
SARSA
SARSA (State-Action-Reward-State-Action) is an on-policy reinforcement learning algorithm that learns by updating its value estimates based on the actions it actually takes. The name comes from the sequence of information it uses: it observes the current state (S), takes an action (A), receives a reward (R), moves to a new state (S), and then selects the next action (A) before updating its knowledge. Unlike Q-Learning, which always assumes optimal future actions, SARSA updates its estimates based on the action it will actually take next, including any exploratory random actions.
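The same tabular setup makes the on-policy difference easy to see; in this sketch the next action a_next must already have been chosen before the update runs (parameter values again illustrative):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.99):
    # On-policy target: bootstrap from the action the agent will
    # actually take next, exploratory or not.
    target = r + (0.0 if done else gamma * Q[s_next, a_next])
    Q[s, a] += alpha * (target - Q[s, a])
```

Compared with the Q-Learning sketch above, the only change is `Q[s_next, a_next]` in place of `max(Q[s_next])`, which is exactly the on-policy/off-policy distinction described here.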
Expected SARSA
Expected SARSA is a reinforcement learning algorithm that bridges the gap between SARSA and Q-Learning by using the expected value of the next action instead of either the actual next action (like SARSA) or the maximum value (like Q-Learning). Rather than relying on a single sampled action, Expected SARSA takes the probability-weighted average of the values of all possible next actions under the current policy. This makes it more stable and robust than standard SARSA because it eliminates the update variance that comes from sampling a single next action, while still respecting the policy being followed.
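Assuming an epsilon-greedy policy over a tabular Q (a sketch, with illustrative parameters), the update replaces the sampled next action with an expectation:

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, done, epsilon,
                          alpha=0.1, gamma=0.99):
    # Probability of each next action under the epsilon-greedy policy.
    n_actions = Q.shape[1]
    probs = np.full(n_actions, epsilon / n_actions)
    probs[int(np.argmax(Q[s_next]))] += 1.0 - epsilon

    # Expected value over next actions replaces a single sampled action.
    expected_q = float(np.dot(probs, Q[s_next]))
    target = r + (0.0 if done else gamma * expected_q)
    Q[s, a] += alpha * (target - Q[s, a])
```

With epsilon set to 0 the expectation collapses to the max and the update reduces to Q-Learning; replacing the expectation with a single sampled next action recovers SARSA.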