Optimal Checkpointing Frequency

When training large-scale models on distributed systems, hardware failures are inevitable. The question isn’t if a failure will occur, but when—and how much work you’ll lose when it does. Checkpoint too frequently and you waste compute on I/O overhead; checkpoint too rarely and you risk losing hours of training progress to a single node failure.

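The Young/Daly model provides an analytical solution to this tradeoff, choosing the checkpoint interval that maximizes training goodput. The optimal interval, expressed as a number of training steps between checkpoints $n_{freq}$, is the optimal training time between checkpoints $T_{c, opt}$ divided by the step time $t_{step}$: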

$$ n_{freq, opt} = \frac{T_{c, opt}}{t_{step}} = \frac{\sqrt{2 \cdot (t_{mtbi} + t_{mttr}) \cdot t_{ckpt}}}{t_{step}} $$

For example, given the following settings:

Mean time between interruptions $t_{mtbi}$: 6000 s
Mean time to recover $t_{mttr}$: 600 s
Time to checkpoint $t_{ckpt}$: 10 s
Train step time $t_{step}$: 15 s

Substituting gives $T_{c, opt} = \sqrt{2 \cdot 6600 \cdot 10} \approx 363$ s, so the optimal checkpointing frequency is ~24 steps per checkpoint.

This means that for this configuration, checkpointing every 24 training steps (every 6 minutes of training) maximizes your effective throughput—balancing the cost of checkpointing against the risk of losing work to failures.
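The calculation above can be wrapped in a small helper. This is a minimal sketch rather than a reference implementation: the function names are hypothetical, all times are assumed to be in seconds, and rounding to the nearest whole step (with a floor of one step) is an illustrative choice.

```python
import math


def optimal_checkpoint_interval(t_mtbi, t_mttr, t_ckpt):
    """Optimal training time between checkpoints, in seconds."""
    return math.sqrt(2.0 * (t_mtbi + t_mttr) * t_ckpt)


def optimal_checkpoint_steps(t_mtbi, t_mttr, t_ckpt, t_step):
    """Optimal number of training steps between checkpoints (at least 1)."""
    t_c = optimal_checkpoint_interval(t_mtbi, t_mttr, t_ckpt)
    # Rounding to the nearest step is an assumption made for this sketch.
    return max(1, round(t_c / t_step))


# Settings from the example above.
t_c = optimal_checkpoint_interval(t_mtbi=6000, t_mttr=600, t_ckpt=10)
n_freq = optimal_checkpoint_steps(t_mtbi=6000, t_mttr=600, t_ckpt=10, t_step=15)
print(f"T_c = {t_c:.1f} s, n_freq = {n_freq} steps")
```

With the example settings this reports an interval of roughly 363 s, which rounds to 24 steps.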

Derivation: Why This Formula Works

The key insight is that we want to maximize effective throughput—not just raw training speed, but the rate at which we make durable progress. To do this, we need to maximize the efficiency ($E$) of the system, which is the ratio of useful work time to total time:

$$E = \frac{T_{good}}{T_{total}} = \frac{T_{good}}{T_{good} + T_{overhead}}$$

The total overhead comes from three sources:

Checkpointing Overhead ($T_{ckpt}$): Time spent saving checkpoints to persistent storage.
Wasted Work Overhead ($T_{waste}$): Time spent re-computing lost work after a failure.
Downtime Overhead ($T_{down}$): Time spent recovering from failures ($t_{mttr}$).

We model this by analyzing the total time required to successfully complete one “block” of useful work, where $T_c = n_{freq} \cdot t_{step}$ represents the training time between checkpoints.

Step 1: Time for one block (no failures)

The time to complete the work and save it is $T_{block} = T_c + t_{ckpt}$.

Step 2: Probability of failure during a block

The probability ($P_{fail}$) of an incident occurring during this $T_{block}$ time is $P_{fail} \approx \frac{T_{block}}{t_{mtbi}}$. This assumes $t_{mtbi} \gg T_{block}$, which is typically true in practice.
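To get a feel for how good this approximation is, the sketch below compares it with the failure probability under one common modeling assumption: interruptions arriving as a Poisson process with rate $1/t_{mtbi}$. That assumption is not stated in this article and is introduced here only for illustration; the function names are hypothetical.

```python
import math


def p_fail_linear(t_block, t_mtbi):
    """First-order approximation used in the text: T_block / t_mtbi."""
    return t_block / t_mtbi


def p_fail_exponential(t_block, t_mtbi):
    """Probability of at least one interruption in t_block seconds,
    ASSUMING exponentially distributed times between interruptions."""
    return 1.0 - math.exp(-t_block / t_mtbi)


# A block of 24 steps of 15 s plus a 10 s checkpoint, with t_mtbi = 6000 s.
t_block = 24 * 15 + 10
print(p_fail_linear(t_block, 6000))
print(p_fail_exponential(t_block, 6000))
```

For this block length the two values are roughly 0.062 and 0.060. The gap widens as $T_{block}$ approaches $t_{mtbi}$, which is why the approximation is only used when $t_{mtbi} \gg T_{block}$.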

Step 3: Cost of a failure

If a failure occurs, the total time cost includes:

The downtime to recover: $t_{mttr}$
The average work that must be re-done: Since the failure can happen at any point during the interval, the expected lost training time is $T_c / 2$.
Total Cost = $t_{mttr} + \frac{T_c}{2}$

Step 4: Total Time ($T_{total}$) to complete one block

This is the time for the block itself, plus the expected cost of failure:

$$T_{total} = T_{block} + (P_{fail} \cdot \text{Total Cost})$$

$$T_{total} \approx (T_c + t_{ckpt}) + \left( \frac{T_c + t_{ckpt}}{t_{mtbi}} \right) \cdot \left(t_{mttr} + \frac{T_c}{2}\right)$$

Step 5: Efficiency ($E$)

The efficiency is the useful work ($T_c$) divided by the total time ($T_{total}$):

$$E(T_c) = \frac{T_c}{(T_c + t_{ckpt}) \cdot \left[ 1 + \frac{1}{t_{mtbi}} \left(t_{mttr} + \frac{T_c}{2}\right) \right]}$$
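Steps 1 through 5 translate directly into a few lines of code. The sketch below evaluates $E(T_c)$ for the example settings at three candidate intervals; the function name is hypothetical and all times are assumed to be in seconds.

```python
def efficiency(t_c, t_mtbi, t_mttr, t_ckpt):
    """Efficiency E(T_c) as modeled in Steps 1-5."""
    t_block = t_c + t_ckpt                      # Step 1: block time without failures
    p_fail = t_block / t_mtbi                   # Step 2: first-order failure probability
    failure_cost = t_mttr + t_c / 2.0           # Step 3: downtime plus expected rework
    t_total = t_block + p_fail * failure_cost   # Step 4: expected total time per block
    return t_c / t_total                        # Step 5: useful time over total time


# Compare a short, a moderate, and a long checkpoint interval.
for t_c in (60, 360, 3600):
    e = efficiency(t_c, t_mtbi=6000, t_mttr=600, t_ckpt=10)
    print(f"T_c = {t_c:>4} s -> E = {e:.3f}")
```

With these inputs the efficiency is roughly 0.78 at 60 s, 0.86 at 360 s, and 0.71 at 3600 s, so both very frequent and very infrequent checkpointing cost throughput. The 3600 s case stretches the $t_{mtbi} \gg T_{block}$ assumption, so its value is only indicative.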

Step 6: Optimization

To find the optimal $T_c$ that maximizes efficiency, we take the derivative $dE/dT_c$, set it to zero, and solve for $T_c$. The calculus simplifies elegantly to:

$$T_{c, opt}^2 = 2 \cdot (t_{mtbi} + t_{mttr}) \cdot t_{ckpt}$$
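For readers who want the intermediate algebra, one way to reach this result is to minimize $1/E$ instead of maximizing $E$, since the two are equivalent for positive $E$. Expanding the reciprocal of the efficiency from Step 5 gives:

$$\frac{1}{E(T_c)} = \frac{(T_c + t_{ckpt}) \left( t_{mtbi} + t_{mttr} + \frac{T_c}{2} \right)}{t_{mtbi} \cdot T_c} = \frac{1}{t_{mtbi}} \left[ (t_{mtbi} + t_{mttr}) + \frac{t_{ckpt}}{2} + \frac{T_c}{2} + \frac{t_{ckpt} \cdot (t_{mtbi} + t_{mttr})}{T_c} \right]$$

Only the last two terms depend on $T_c$, so setting the derivative to zero yields:

$$\frac{d}{dT_c} \left( \frac{1}{E} \right) = \frac{1}{t_{mtbi}} \left[ \frac{1}{2} - \frac{t_{ckpt} \cdot (t_{mtbi} + t_{mttr})}{T_c^2} \right] = 0 \quad \Longrightarrow \quad T_c^2 = 2 \cdot (t_{mtbi} + t_{mttr}) \cdot t_{ckpt}$$

The second derivative of $1/E$ is positive for $T_c > 0$, so this stationary point is a minimum of $1/E$ and therefore a maximum of $E$.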