Expected SARSA

Expected SARSA is a reinforcement learning algorithm that sits between SARSA and Q-Learning. Its update target uses the expected value of the next state's action values under the current policy, rather than the value of the action actually taken next (as in SARSA) or the maximum value (as in Q-Learning). Instead of relying on a single sampled action, Expected SARSA computes the average of the next state's action values, weighted by the probability the current policy assigns to each action. This tends to make it more stable than standard SARSA, because it removes the variance that comes from sampling the next action, while still respecting the policy being followed. Randomness in the reward and in the next state still contributes variance.
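To make the comparison concrete, the following sketch computes the three bootstrap targets for a single transition. The numbers (q_next, pi_next, reward, gamma) are made up for illustration and are not taken from any benchmark.

```python
# Hypothetical values, chosen only for illustration.
q_next = [1.0, 3.0, 2.0]    # Q(s', a') for three actions in the next state
pi_next = [0.1, 0.8, 0.1]   # pi(a'|s') under the current policy
reward, gamma = 0.5, 0.9

# SARSA: the target depends on which next action happens to be sampled,
# so there is one possible target per action.
sarsa_targets = [reward + gamma * q for q in q_next]

# Q-Learning: bootstrap from the maximum next action value.
q_learning_target = reward + gamma * max(q_next)

# Expected SARSA: bootstrap from the policy-weighted average.
expected_value = sum(p * q for p, q in zip(pi_next, q_next))
expected_sarsa_target = reward + gamma * expected_value

# Averaging the SARSA targets under the policy gives the same number,
# which is why Expected SARSA removes the action-sampling variance.
mean_sarsa_target = sum(p * t for p, t in zip(pi_next, sarsa_targets))

print(sarsa_targets, q_learning_target, expected_sarsa_target)
```

With these numbers the SARSA target can land on any of three values depending on the sampled action, while the Expected SARSA target is their policy-weighted mean and does not change from sample to sample.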

How Expected SARSA Works

The algorithm maintains a Q-table like SARSA and Q-Learning, but updates values using the formula:

Q(s,a) ← Q(s,a) + α[r + γ·Σ_a' π(a'|s')·Q(s',a') - Q(s,a)],

where α is the learning rate, γ is the discount factor, r is the reward, and π(a'|s') is the probability of taking action a' in state s' under the current policy. The sum runs over all actions available in s'. For an ε-greedy policy that explores uniformly over all n actions, the greedy action gets probability (1-ε) + ε/n and every other action gets probability ε/n, so the probabilities sum to 1. By averaging over all possible next actions rather than sampling just one, Expected SARSA reduces update variance and often learns faster and more reliably than standard SARSA, especially when the policy has significant randomness. The trade-off is that each update sums over all actions instead of looking up a single value.
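The update can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reference implementation: Q is a list of lists indexed as Q[state][action], the helper names epsilon_greedy_probs and expected_sarsa_update are hypothetical, exploration is uniform over all n actions (so the greedy action ends up with probability (1-ε) + ε/n), ties for the greedy action go to the lowest index, and a done flag, which the formula above does not mention, zeroes the bootstrap term at terminal states.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return pi(a|s) for an epsilon-greedy policy over q_values."""
    n = len(q_values)
    # Every action receives the uniform exploration share epsilon / n.
    probs = [epsilon / n] * n
    # Assumption: ties for the greedy action are broken by lowest index.
    greedy = max(range(n), key=lambda a: q_values[a])
    # The greedy action additionally receives the remaining 1 - epsilon.
    probs[greedy] += 1.0 - epsilon
    return probs


def expected_sarsa_update(Q, s, a, r, s_next, done, alpha, gamma, epsilon):
    """Apply one Expected SARSA update to Q[s][a] in place; return the TD error."""
    if done:
        # Assumption: terminal states have zero future value.
        expected_next = 0.0
    else:
        probs = epsilon_greedy_probs(Q[s_next], epsilon)
        expected_next = sum(p * q for p, q in zip(probs, Q[s_next]))
    td_error = r + gamma * expected_next - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


# Usage on a tiny two-state, two-action table with made-up values.
Q = [[0.0, 0.0], [1.0, 3.0]]
td = expected_sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, done=False,
                           alpha=0.5, gamma=0.9, epsilon=0.2)
print(td, Q[0][1])
```

Because the expectation is computed from the Q-values and the policy's probabilities, the update does not need to know which action the agent actually takes in s'.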

When to Use Expected SARSA

Expected SARSA is particularly useful when you want the on-policy behavior of SARSA, which accounts for the exploratory actions the agent will actually take, but with lower-variance updates. It tends to do well when the policy itself is highly random, because it replaces the sampled next action with an expected value. Randomness in the environment's rewards and transitions is still sampled, so stochastic environments remain noisy, just less so than under standard SARSA. Expected SARSA is a reasonable choice for applications that call for conservative behavior during learning without sacrificing too much learning speed, such as robotics with noisy sensors or financial systems with market uncertainty. It often outperforms standard SARSA and can match Q-Learning's performance on many tasks while remaining on-policy, at the cost of a sum over all actions on every update.