Tag: SARSA

Monte Carlo

Monte Carlo methods are reinforcement learning algorithms that learn from complete episodes: the agent plays each episode through to termination, then updates its value estimates from the total returns it actually received. Unlike temporal difference methods (such as Q-Learning or SARSA), which update estimates after every step, Monte Carlo methods wait until an episode terminates before making any updates. The name comes from the Monte Carlo casino, reflecting the method's reliance on random sampling and on averaging over many complete episodes to estimate the value of states and actions.
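As a rough illustration of the averaging described above, here is a minimal sketch of first-visit Monte Carlo value estimation. The function name `first_visit_mc`, the episode format (a list of `(state, reward)` pairs, with the reward received on leaving the state), and the default `gamma` are assumptions made for this example only.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=1.0):
    """Estimate state values by averaging first-visit returns.

    `episodes` is a list of complete episodes; each episode is a list of
    (state, reward) pairs (a hypothetical format chosen for this sketch).
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    for episode in episodes:
        g = 0.0
        first_return = {}
        # Walk backwards so the discounted return G accumulates step by step.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            # Overwritten on each earlier visit, so the value that remains
            # is the return following the FIRST visit to this state.
            first_return[state] = g

        # Only after the episode has ended do we touch the estimates.
        for state, g_first in first_return.items():
            returns_sum[state] += g_first
            returns_count[state] += 1

    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


episodes = [
    [("A", 1.0), ("B", 2.0)],
    [("A", 0.0), ("B", 4.0)],
]
values = first_visit_mc(episodes)
```

Note that no estimate changes until an episode has been fully processed, which is the key contrast with step-by-step temporal difference updates.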

Fri Sep 26 2025

Q-Learning

Q-Learning is an off-policy reinforcement learning algorithm that learns the optimal action-value function independently of the policy the agent actually follows. The "Q" stands for quality, representing how good each action is in each state. Unlike SARSA, Q-Learning learns about the greedy (optimal) policy while potentially following a different, exploratory policy. This separation lets it learn the optimal policy even while taking random exploratory actions during training.
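The off-policy idea can be sketched as a single tabular update step. This is a minimal sketch, not a full training loop: the function name `q_learning_update`, the dictionary-based Q-table keyed by `(state, action)`, and the default `alpha` and `gamma` values are all assumptions for illustration.

```python
def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-Learning update in place and return the new value.

    `q` maps (state, action) pairs to estimates; missing entries count as 0.0.
    """
    # Off-policy target: bootstrap from the BEST next action, whatever the
    # agent actually ends up doing in next_state.
    if done:
        best_next = 0.0
    else:
        best_next = max(q.get((next_state, a), 0.0) for a in actions)

    td_target = reward + gamma * best_next
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (td_target - old)
    return q[(state, action)]


q = {("s1", "left"): 1.0, ("s1", "right"): 3.0}
q_learning_update(q, "s0", "right", 1.0, "s1", ["left", "right"],
                  alpha=0.5, gamma=0.9)
```

The `max` over next actions is what makes the update independent of the behavior policy: an exploratory action taken in `next_state` never enters the target.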

Thu Sep 25 2025

SARSA

SARSA (State-Action-Reward-State-Action) is an on-policy reinforcement learning algorithm that updates its value estimates based on the actions it actually takes. The name comes from the sequence of information it uses: it observes the current state (S), takes an action (A), receives a reward (R), moves to a new state (S), and then selects the next action (A) before updating its knowledge. Unlike Q-Learning, which bootstraps from the best available next action regardless of what the agent does, SARSA updates its estimates using the action it will actually take next, including any random exploratory actions.
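The S, A, R, S, A sequence maps directly onto the arguments of a single tabular update. The following is a minimal sketch under assumed conventions: the function name `sarsa_update`, the dictionary-based Q-table keyed by `(state, action)`, and the default `alpha` and `gamma` values are illustrative choices, not a reference implementation.

```python
def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular SARSA update in place and return the new value.

    `q` maps (state, action) pairs to estimates; missing entries count as 0.0.
    """
    # On-policy target: bootstrap from the action the agent has ALREADY
    # chosen for next_state, even if that choice was exploratory.
    next_q = 0.0 if done else q.get((next_state, next_action), 0.0)

    td_target = reward + gamma * next_q
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (td_target - old)
    return q[(state, action)]


q = {("s1", "left"): 1.0, ("s1", "right"): 3.0}
# The agent picked "left" in s1 (say, through exploration), so the update
# uses Q(s1, left) = 1.0 rather than the larger Q(s1, right) = 3.0.
sarsa_update(q, "s0", "right", 1.0, "s1", "left", alpha=0.5, gamma=0.9)
```

Because `next_action` must be known before the update runs, the caller selects it first and then reuses it as the action for the following step.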

Sun Sep 21 2025

Successor Representation

Successor Representation (SR) is a reinforcement learning framework that decomposes value functions into two separate components: a representation of expected future state occupancy and a prediction of immediate rewards. Instead of directly learning the value of being in a state, SR learns the expected discounted future visitation frequencies, essentially asking "if I start in state s and follow my policy, how much time will I spend in each state?" This representation, combined with separate reward predictions, creates a middle ground between model-free methods (like Q-Learning) and model-based methods, enabling faster adaptation when rewards change but the environment dynamics remain constant.
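To make the decomposition concrete, here is a minimal sketch that computes the SR matrix for a small, known Markov chain and then combines it with two different reward vectors. Several things here are assumptions for illustration: the policy is already folded into the transition matrix `P`, rewards depend only on the state occupied, the SR counts the starting state itself, and the matrix is obtained by fixed-point iteration rather than learned from experience. The function names are hypothetical.

```python
def successor_representation(P, gamma=0.9, iters=500):
    """Iterate M = I + gamma * P @ M for a policy-induced transition matrix P.

    M[i][j] approximates the expected discounted number of visits to
    state j when starting in state i (the start state counts as a visit).
    """
    n = len(P)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [row[:] for row in identity]
    for _ in range(iters):
        M = [[identity[i][j]
              + gamma * sum(P[i][k] * M[k][j] for k in range(n))
              for j in range(n)]
             for i in range(n)]
    return M


def values_from_sr(M, rewards):
    """Combine occupancies with per-state rewards: V = M @ r."""
    return [sum(m * r for m, r in zip(row, rewards)) for row in M]


# Two-state chain: state 0 always moves to state 1; state 1 loops on itself.
P = [[0.0, 1.0],
     [0.0, 1.0]]
M = successor_representation(P, gamma=0.5)

# Same M, two reward vectors: no relearning of the dynamics is needed.
v_before = values_from_sr(M, [0.0, 1.0])
v_after = values_from_sr(M, [1.0, 0.0])
```

When the rewards change, only the reward vector is swapped and `M` is reused, which is the source of the fast adaptation described above.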

Sun Sep 21 2025
