SARSA

Sunday, September 21, 2025

SARSA (State-Action-Reward-State-Action) is an on-policy reinforcement learning algorithm that learns by updating its value estimates based on the actions it actually takes. The name comes from the sequence of information it uses: it observes the current state (S), takes an action (A), receives a reward (R), moves to a new state (S), and then selects the next action (A) before updating its knowledge. Unlike Q-learning, which assumes optimal future actions in its update, SARSA updates its estimates based on the action it will actually take next, including any exploratory random actions.

How SARSA Works

The algorithm maintains a Q-table that stores value estimates for every state-action pair. When the agent takes an action and observes the result, it updates the Q-value using the formula:

Q(s,a) ← Q(s,a) + α[r + γ·Q(s',a') - Q(s,a)],

where α is the learning rate, γ is the discount factor, and crucially, a' is the actual next action the agent will take (chosen by its current policy, often ε-greedy). This means SARSA learns the value of the policy it's actually following, including the exploration strategy, making it an on-policy learner that's more conservative and realistic about its own behavior. This contrasts with the off-policy nature of Q-learning, whose update uses the best possible next action regardless of the action the agent will actually take.
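The difference between the two update rules can be sketched directly in code. This is a minimal illustration of the formula above, not a full implementation; the function names and the NumPy Q-table layout are my own choices:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: the TD target uses a', the action the agent will actually take."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: the TD target assumes the greedy (best possible) next action."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```

Note that the only difference is the second term of the target: `Q[s_next, a_next]` versus `np.max(Q[s_next])`. Whenever the ε-greedy policy picks an exploratory (non-greedy) a', the two updates diverge.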

When to Use SARSA

SARSA is particularly valuable in real-world applications where safety during learning matters, such as robotics, autonomous vehicles, or healthcare, because it accounts for the fact that the agent might take exploratory actions that could be risky. While this makes SARSA more conservative and potentially slower to converge to the optimal policy compared to Q-learning, it learns policies that are safer during training because it doesn't assume perfect future behavior. It's the preferred choice when you can't afford catastrophic mistakes during the learning process and need the agent to be cautious about risky states even during exploration.
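To make the on-policy training loop concrete, here is a sketch of full SARSA training with an ε-greedy policy. The toy corridor environment (states 0–4, goal at the right end, small per-step penalty) and all names are illustrative assumptions, not from the text; the key point is that a' is selected by the same ε-greedy policy before the update is applied:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corridor (illustrative): states 0..4, start at 0, goal at 4.
# Actions: 0 = left, 1 = right. Small step cost, +1 on reaching the goal.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(s, a):
    s_next = min(max(s + (1 if a == 1 else -1), 0), GOAL)
    done = s_next == GOAL
    return s_next, (1.0 if done else -0.01), done

def eps_greedy(Q, s, eps):
    # Exploratory actions taken here are the same ones used in the update target.
    if rng.random() < eps:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[s]))

def train(episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        s = 0
        a = eps_greedy(Q, s, eps)
        done = False
        while not done:
            s_next, r, done = step(s, a)
            a_next = eps_greedy(Q, s_next, eps)  # chosen BEFORE updating: on-policy
            target = r + gamma * (0.0 if done else Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```

Because the target is built from the action actually chosen (including exploratory ones), the learned Q-values reflect the ε-greedy policy the agent really follows during training, which is exactly why SARSA tends toward safer behavior near risky states.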