Convergence

SDDP is an iterative algorithm, so it needs well-defined criteria for when to stop. This page explains how convergence is monitored through bounds, what the five available stopping rules are and how their formulas work, and how to obtain a deterministic upper bound independent of the risk measure.

Lower Bound

The first-stage LP objective — immediate cost plus the cut-based approximation of future cost — provides a deterministic lower bound $\underline{z}^k$ at iteration $k$:

$$\underline{z}^k = \min_{x_1 \in X_1} \; c_1^\top x_1 + \underline{Q}_2^k(x_1),$$

where $\underline{Q}_2^k$ is the outer (cut-based) approximation of the stage-2 cost-to-go function.

Because each iteration adds new cuts that can only tighten the approximation, the lower bound increases monotonically. It never decreases, making it a reliable progress indicator.

Statistical Upper Bound

The average total cost across all $M$ forward-pass trajectories provides a statistical upper bound $\bar{z}^k$:

$$\bar{z}^k = \frac{1}{M} \sum_{j=1}^{M} \sum_{t=1}^{T} c_t^\top x_t^{j}.$$

This is a Monte Carlo estimate of the true policy cost, so it carries sampling noise. It is typically reported with a confidence interval based on the sample standard deviation $\sigma$ across trajectories, e.g. $\bar{z}^k \pm 1.96\,\sigma/\sqrt{M}$ at the 95% level.
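A minimal sketch of this estimate, assuming a list of total trajectory costs from the forward pass (the function name and interface are illustrative, not Cobre's API):

```python
import math

def statistical_upper_bound(trajectory_costs, z=1.96):
    """Sample mean and confidence interval of forward-pass costs.

    trajectory_costs: total cost of each forward-pass trajectory.
    z: normal quantile for the interval (1.96 ~ 95% level).
    Illustrative sketch; not Cobre's actual interface.
    """
    m = len(trajectory_costs)
    mean = sum(trajectory_costs) / m
    # Unbiased sample variance across trajectories.
    var = sum((c - mean) ** 2 for c in trajectory_costs) / (m - 1)
    half_width = z * math.sqrt(var / m)
    return mean, (mean - half_width, mean + half_width)
```

The half-width shrinks as $1/\sqrt{M}$, which is why the statistical bound tightens only slowly with more forward trajectories.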

For risk-averse problems (e.g., CVaR objectives), the statistical upper bound is not valid in the classical sense — the sample average is not a valid upper bound on a risk measure. In such cases, the deterministic upper bound described below is necessary.

Deterministic Upper Bound via Inner Approximation

For a convergence certificate valid under any risk measure, Cobre supports a deterministic upper bound through inner approximation (SIDP). The idea is to build a concave piecewise-linear overestimate of the value function from the inside, complementing the convex piecewise-linear underestimate built by cuts from the outside.

The inner approximation stores vertices $(\hat{x}_t^{j}, \bar{v}_t^{j})$ — visited state-value pairs collected during forward passes. At any query state $x$, the upper bound is interpolated via the Lipschitz condition:

$$\overline{Q}_t(x) = \min_{\sigma \ge 0,\, y} \left\{ \sum_{j} \sigma_j \bar{v}_t^{j} + L_t \lVert x - y \rVert_1 \;:\; y = \sum_{j} \sigma_j \hat{x}_t^{j},\; \sum_{j} \sigma_j = 1 \right\},$$

where $L_t$ is the Lipschitz constant for stage $t$, computed backward from the maximum penalty coefficient $c^{\text{pen}}_{\max}$:

$$L_T = c^{\text{pen}}_{\max}, \qquad L_t = c^{\text{pen}}_{\max} + L_{t+1}.$$

This formula uses the largest penalty cost (the dominant coefficient in the objective) as a bound on how fast the value function can change per unit change in state. The upper bound at stage 1 then gives:

$$\bar{z}^k_{\text{det}} = \min_{x_1 \in X_1} \; c_1^\top x_1 + \overline{Q}_2(x_1).$$

As iterations proceed, more vertices are collected and the inner approximation tightens. Both bounds converge to the same value, and the gap goes to zero.
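As a simplified illustration of the Lipschitz-based bound (not Cobre's actual interpolation: this pointwise variant takes the best single vertex rather than a convex combination, which is still a valid overestimate for an $L$-Lipschitz value function):

```python
def lipschitz_upper_bound(x, vertices, L):
    """Pointwise upper bound on the stage value function at state x.

    vertices: list of (state, value) pairs collected on forward passes.
    L: stage Lipschitz constant.
    If the value function Q is L-Lipschitz and each stored value v_j
    overestimates Q at its vertex, then v_j + L * ||x - x_j||_1
    overestimates Q(x), so the minimum over vertices is a valid bound.
    Illustrative sketch, weaker than the full interpolation LP.
    """
    return min(
        v + L * sum(abs(xi - xj) for xi, xj in zip(x, xv))
        for xv, v in vertices
    )
```

The full interpolation over convex combinations can only be tighter than this vertex-wise minimum, at the cost of solving a small LP per query.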

Optimality Gap

The gap between the bounds measures how far the current policy may be from optimal:

$$\text{gap}^k = \frac{\bar{z}^k - \underline{z}^k}{\bar{z}^k}.$$

A small gap means the cut approximation is close to the true value function and the policy is near-optimal. In practice, the gap is monitored over iterations and used by the stopping rules described below.

Stopping Rules

Cobre supports five stopping criteria, evaluated at the end of each iteration. Multiple rules can be combined with OR logic (stop when any rule triggers) or AND logic (stop when all rules trigger simultaneously).
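The OR/AND combination can be sketched by modeling each rule as a predicate over the solver state (hypothetical names, not Cobre's actual API):

```python
# Each stopping rule is a predicate state -> bool; names are illustrative.
def make_iteration_limit(max_iters):
    return lambda state: state["iteration"] >= max_iters

def make_gap_rule(tol):
    def rule(state):
        ub, lb = state["upper"], state["lower"]
        # Relative gap; requires an upper bound to be available.
        return ub is not None and (ub - lb) / abs(ub) <= tol
    return rule

def should_stop(state, rules, mode="or"):
    """OR: stop when any rule triggers; AND: stop when all do."""
    hits = [rule(state) for rule in rules]
    return any(hits) if mode == "or" else all(hits)
```

Under OR logic the iteration limit acts as a safety net behind the tighter criteria; under AND logic every criterion must be satisfied simultaneously before the algorithm stops.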

Iteration Limit

A hard cap on the number of iterations. This rule must always be present as a safety bound:

$$\text{stop if } k \ge k_{\max}.$$

Time Limit

A wall-clock time budget, checked at the end of each iteration:

$$\text{stop if } t_{\text{elapsed}} \ge t_{\max}.$$

Bound Stalling

Detects when the lower bound has plateaued. The rule computes the relative improvement $\delta^k$ of $\underline{z}$ over a sliding window of $w$ iterations:

$$\delta^k = \frac{\underline{z}^k - \underline{z}^{k-w}}{\lvert \underline{z}^{k-w} \rvert}.$$

If $\delta^k$ falls below the tolerance, further iterations are unlikely to improve the policy significantly.
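A minimal sketch of the stalling check, assuming a per-iteration history of lower bounds (parameter names are illustrative):

```python
def bound_stalled(lower_bounds, window, tol):
    """True when the lower bound's relative improvement over the last
    `window` iterations falls below `tol`. Sketch; names are assumptions.
    """
    if len(lower_bounds) <= window:
        return False  # not enough history to evaluate the window yet
    old, new = lower_bounds[-window - 1], lower_bounds[-1]
    # Guard against division by zero when the old bound is ~0.
    return (new - old) / max(abs(old), 1e-12) < tol
```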

Gap Convergence

Stops when the optimality gap falls below a threshold $\varepsilon_{\text{gap}}$:

$$\text{stop if } \text{gap}^k \le \varepsilon_{\text{gap}}.$$

Requires an upper bound to be available (statistical or deterministic).

Policy Stability

The most robust criterion combines bound stability with policy stability. Periodically (e.g., every 20 iterations), the rule:

  1. Checks bound stability: $\left(\underline{z}^k - \underline{z}^{k-w}\right) / \lvert \underline{z}^{k-w} \rvert \le \varepsilon_{\text{bound}}$.

  2. If stable, runs a batch of Monte Carlo simulations and computes the normalized distance between the per-stage cost profiles $\mathbf{c}^{(i)}$ of consecutive simulation batches: $d^{(i)} = \lVert \mathbf{c}^{(i)} - \mathbf{c}^{(i-1)} \rVert_2 / \lVert \mathbf{c}^{(i-1)} \rVert_2$.

  3. Stops when both converge: the bound-stability criterion holds and $d^{(i)} \le \varepsilon_{\text{sim}}$.

This avoids premature termination from statistical noise and ensures the policy is genuinely stable, not just the bound.
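The normalized distance between per-stage cost profiles in step 2 can be sketched as follows (hypothetical helper, assuming each profile is the vector of per-stage mean costs from one simulation batch):

```python
import math

def profile_distance(profile_new, profile_old):
    """Normalized Euclidean distance between per-stage mean-cost
    profiles of two consecutive simulation batches. Sketch only;
    the normalization Cobre uses may differ in detail.
    """
    diff = math.sqrt(sum((a - b) ** 2
                         for a, b in zip(profile_new, profile_old)))
    norm = math.sqrt(sum(b ** 2 for b in profile_old))
    # Guard against a zero-cost reference profile.
    return diff / max(norm, 1e-12)
```

Comparing whole per-stage profiles, rather than just total costs, detects a policy that is still shifting cost between stages even when its total cost looks stable.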

Termination Output

When the algorithm stops, the output reports which rule triggered, the final iteration count, the lower and upper bounds, and the optimality gap. This information supports post-hoc analysis of whether the policy is sufficiently converged or whether additional iterations would be beneficial.