Monte Carlo

Friday, September 26, 2025

Monte Carlo methods are reinforcement learning algorithms that learn by playing out complete episodes from start to finish and then updating value estimates based on the actual total returns received. Unlike temporal difference methods (like Q-Learning or SARSA) that update estimates after every step, Monte Carlo waits until an episode terminates before making any updates. The name comes from the Monte Carlo casino, reflecting the method's reliance on random sampling and averaging over many complete experiences to estimate the true value of states and actions.

How Monte Carlo Works

The agent follows a policy and records the entire sequence of states, actions, and rewards until reaching a terminal state. It then calculates the actual return (cumulative discounted reward) from each state visited during that episode:

G = r₁ + γr₂ + γ²r₃ + ... + γⁿ⁻¹rₙ.
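The return above is cheapest to compute by walking the recorded rewards backward, accumulating G = r + γG at each step. A minimal sketch (the function name and reward list are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted return G = r1 + gamma*r2 + gamma^2*r3 + ...

    rewards: list of rewards received over one episode, in order.
    Computed backward so each reward is multiplied by gamma once.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With γ = 1 this reduces to the plain sum of rewards; smaller γ weights later rewards less.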

These returns are used to update value estimates, typically by averaging:

V(s) ← V(s) + α[G - V(s)],

where G is the actual observed return and α is a learning rate. There are two main variants: first-visit Monte Carlo (which only updates the first time a state is visited in an episode) and every-visit Monte Carlo (which updates every time a state appears). Because updates use complete actual returns rather than estimates, Monte Carlo doesn't suffer from bootstrapping errors but requires full episodes to learn.
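The first-visit variant and the incremental update rule above can be sketched together. This is an illustrative implementation, assuming an episode is recorded as (state, reward) pairs where each reward is the one received after leaving that state:

```python
from collections import defaultdict

def first_visit_mc_update(V, episode, gamma=0.9, alpha=0.1):
    """One first-visit Monte Carlo pass over a single finished episode.

    V:       mapping from state to estimated value (mutated in place)
    episode: list of (state, reward) pairs in the order visited
    """
    # Walk backward to compute the return G from each visited state.
    g = 0.0
    returns = []
    for state, reward in reversed(episode):
        g = reward + gamma * g
        returns.append((state, g))
    returns.reverse()  # restore episode order

    # First-visit: only the first occurrence of each state is updated.
    seen = set()
    for state, g in returns:
        if state in seen:
            continue
        seen.add(state)
        V[state] += alpha * (g - V[state])  # V(s) <- V(s) + a[G - V(s)]
    return V
```

Dropping the `seen` check turns this into every-visit Monte Carlo, which updates a state once per occurrence in the episode.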

When to Use Monte Carlo

Monte Carlo is ideal for episodic tasks with clear endings (games, simulations with terminal states) where you can afford to wait for complete episodes before learning. It's particularly valuable when the environment dynamics are unknown or complex, since it learns purely from experience without needing a model. Monte Carlo also handles non-Markovian environments better than TD methods because it uses actual outcomes rather than estimates of future values. However, it's unsuitable for continuing tasks without natural episode boundaries, has high variance since it relies on complete sampled trajectories, and can be sample-inefficient since it must finish entire episodes before learning anything. Despite these limitations, Monte Carlo remains foundational for understanding reinforcement learning and is still used in tree search methods like Monte Carlo Tree Search (MCTS), which powers game-playing AI.