Optimal Checkpointing Frequency

When training large-scale models on distributed systems, hardware failures are inevitable. The question isn’t if a failure will occur, but when—and how much work you’ll lose when it does. Checkpoint too frequently and you waste compute on I/O overhead; checkpoint too rarely and you risk losing hours of training progress to a single node failure.
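The Young/Daly model provides an analytical solution to this tradeoff, optimizing checkpointing frequency to maximize training goodput. The optimal frequency, expressed as the number of training steps between checkpoints ($N^*$), is

$$N^* = \frac{1}{t_{\text{step}}}\sqrt{2\,C\,(M + R)}$$

where $C$ is the time to write one checkpoint, $M$ is the mean time between interruptions, $R$ is the mean time to recover, and $t_{\text{step}}$ is the time per training step.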
For example, given the following settings:
- Mean time between interruptions: 6000s
- Mean time to recover: 600s
- Time to checkpoint: 10s
- Train step time: 15s
The optimal checkpointing frequency is ~24 steps per checkpoint.
This means that for this configuration, checkpointing every 24 training steps (every 6 minutes of training) maximizes your effective throughput—balancing the cost of checkpointing against the risk of losing work to failures.
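The arithmetic behind this example can be sketched in a few lines of Python. This is a minimal sketch, assuming the first-order form of the model in which the optimal training interval is $\sqrt{2C(M+R)}$; the function name `optimal_checkpoint_steps` and its parameter names are hypothetical and not part of any library.

```python
import math


def optimal_checkpoint_steps(mtbi_s, mttr_s, checkpoint_s, step_s):
    """Steps between checkpoints under a first-order Young/Daly-style model.

    Assumption: the optimal training interval is sqrt(2 * C * (M + R)),
    with C = checkpoint time, M = mean time between interruptions,
    and R = mean time to recover (all in seconds).
    """
    # Optimal training time between checkpoints, in seconds.
    interval_s = math.sqrt(2.0 * checkpoint_s * (mtbi_s + mttr_s))
    # Convert the interval from seconds to training steps.
    return interval_s / step_s


steps = optimal_checkpoint_steps(mtbi_s=6000, mttr_s=600, checkpoint_s=10, step_s=15)
print(f"{steps:.1f} steps per checkpoint")
```

In practice the result would be rounded to a whole number of steps, which gives the 24 steps quoted above.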
Derivation: Why This Formula Works
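The key insight is that we want to maximize effective throughput: not just raw training speed, but the rate at which we make durable progress. To do this, we need to maximize the efficiency ($E$), the fraction of wall-clock time spent on useful training, which is equivalent to minimizing the total overhead.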
The total overhead comes from three sources:
- Checkpointing Overhead: Time spent saving checkpoints to persistent storage.
- Wasted Work Overhead: Time spent re-computing lost work after a failure.
- Downtime Overhead: Time spent recovering from failures (the mean time to recover).
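We model this by analyzing the total time required to successfully complete one “block” of useful work, where a block is $T$ seconds of training (the interval between checkpoints) followed by one checkpoint write that takes $C$ seconds.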
Step 1: Time for one block (no failures)
The time to complete the work and save it is $T + C$: the training time in the block ($T$) plus the time to write the checkpoint ($C$).
Step 2: Probability of failure during a block
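The probability ($P$) that a failure occurs during a block is approximately the block duration divided by the mean time between interruptions ($M$):

$$P \approx \frac{T + C}{M}$$

Here $T$ is the training time in the block and $C$ is the checkpoint time. The approximation is accurate when $T + C \ll M$.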
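To get a feel for how good a linear estimate of the failure probability is, the sketch below compares it with an exponential model. It assumes that interruptions arrive as a Poisson process with mean spacing equal to the mean time between interruptions, which is an assumption of this example and not something the model above requires; both function names are hypothetical.

```python
import math


def failure_probability_linear(block_s, mtbi_s):
    """First-order estimate: block duration divided by mean time between interruptions."""
    return block_s / mtbi_s


def failure_probability_exponential(block_s, mtbi_s):
    """Probability of at least one interruption, assuming Poisson arrivals (an assumption)."""
    return 1.0 - math.exp(-block_s / mtbi_s)


# Example settings: 360 s of training plus a 10 s checkpoint, 6000 s between interruptions.
block_s = 360 + 10
p_lin = failure_probability_linear(block_s, 6000)
p_exp = failure_probability_exponential(block_s, 6000)
print(f"linear: {p_lin:.4f}, exponential: {p_exp:.4f}")
```

For blocks that are short relative to the mean time between interruptions, the two estimates differ only slightly, which is why the simpler linear form is usually adequate here.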
Step 3: Cost of a failure
If a failure occurs, the total time cost includes:
- The downtime to recover: $R$, the mean time to recover.
- The average work that must be re-done: Since the failure can happen at any point during the interval, the expected lost training time is $T/2$, half of the training interval $T$.
- Total Cost = $R + T/2$
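Step 4: Total Time ($T_{\text{total}}$)
This is the time for the block itself, plus the expected cost of failure (the failure probability multiplied by the cost of a failure):

$$T_{\text{total}} = (T + C) + \frac{T + C}{M}\left(R + \frac{T}{2}\right)$$

where $T$ is the training time per block, $C$ is the checkpoint time, $M$ is the mean time between interruptions, and $R$ is the mean time to recover.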
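As a rough worked example, the expected block time can be evaluated with the example settings from the start of this section. The sketch assumes a first-order failure probability of $(T + C)/M$, an expected failure cost of $R + T/2$, and a checkpoint interval of $T = 360$ s (24 steps of 15 s). The symbols $T$, $C$, $M$, and $R$ (training interval, checkpoint time, mean time between interruptions, mean time to recover) are notation assumed for this example.

$$
\begin{aligned}
T + C &= 360 + 10 = 370\ \text{s} \\
P &\approx \frac{370}{6000} \approx 0.0617 \\
R + \frac{T}{2} &= 600 + 180 = 780\ \text{s} \\
T_{\text{total}} &\approx 370 + 0.0617 \times 780 \approx 418.1\ \text{s}
\end{aligned}
$$

Under these assumptions, roughly 48 s of each block is expected failure cost, on top of the 10 s spent writing the checkpoint.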
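Step 5: Efficiency ($E$)
The efficiency is the useful work ($T$, the training time in the block) divided by the total time needed to complete the block ($T_{\text{total}}$):

$$E = \frac{T}{T_{\text{total}}}$$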
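To make the tradeoff concrete, the sketch below evaluates the efficiency for a few checkpoint intervals using the example settings (10 s checkpoint, 6000 s between interruptions, 600 s recovery, 15 s steps). It assumes the first-order model described in the steps above, with a failure probability equal to block time over mean time between interruptions and an expected failure cost of recovery time plus half the training interval; the function name `efficiency` is hypothetical.

```python
def efficiency(interval_s, checkpoint_s, mtbi_s, mttr_s):
    """Fraction of wall-clock time spent on useful training (first-order model)."""
    # Time for one block with no failures: training plus one checkpoint write.
    block_s = interval_s + checkpoint_s
    # Assumed first-order probability of a failure during the block.
    p_fail = block_s / mtbi_s
    # Expected cost of a failure: recovery time plus half the interval re-done.
    failure_cost_s = mttr_s + interval_s / 2.0
    total_s = block_s + p_fail * failure_cost_s
    return interval_s / total_s


for steps in (4, 24, 96):
    interval_s = steps * 15  # 15 s per training step
    e = efficiency(interval_s, checkpoint_s=10, mtbi_s=6000, mttr_s=600)
    print(f"{steps:3d} steps per checkpoint -> efficiency {e:.3f}")
```

Checkpointing every 4 steps loses efficiency to checkpoint writes, and checkpointing every 96 steps loses it to re-computed work, so the 24-step setting comes out ahead of both.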
Step 6: Optimization
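To find the optimal training interval $T$, we maximize the efficiency $E$, or equivalently minimize its reciprocal, the total time per unit of useful work:

$$\frac{1}{E} = \left(1 + \frac{C}{T}\right)\left(1 + \frac{R}{M} + \frac{T}{2M}\right)$$

where $C$ is the checkpoint time, $M$ is the mean time between interruptions, and $R$ is the mean time to recover. Setting the derivative with respect to $T$ to zero gives

$$\frac{1}{2M} - \frac{C}{T^2}\left(1 + \frac{R}{M}\right) = 0 \quad\Longrightarrow\quad T^* = \sqrt{2\,C\,(M + R)}$$

Dividing $T^*$ by the train step time converts the interval into steps per checkpoint. For the example settings, $T^* = \sqrt{2 \times 10 \times 6600} \approx 363$ s, or about 24 steps of 15 s each.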
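A closed-form optimum can be sanity-checked numerically. The sketch below is a brute-force search over whole-second training intervals, assuming the first-order model in which the total time per unit of useful work is $(1 + C/T)(1 + (R + T/2)/M)$ and the closed-form optimum is $\sqrt{2C(M+R)}$. The function name `time_per_unit_work` and the search range are choices made for this example.

```python
import math


def time_per_unit_work(interval_s, checkpoint_s, mtbi_s, mttr_s):
    """Reciprocal of efficiency: wall-clock seconds per second of useful training."""
    block_s = interval_s + checkpoint_s
    # Expected failure cost per block under the assumed first-order model.
    expected_failure_s = (block_s / mtbi_s) * (mttr_s + interval_s / 2.0)
    return (block_s + expected_failure_s) / interval_s


C, M, R = 10.0, 6000.0, 600.0

# Closed-form optimum under the stated assumptions.
closed_form_s = math.sqrt(2.0 * C * (M + R))

# Brute-force search over whole-second intervals from 1 s to 3000 s.
best_s = min(range(1, 3001), key=lambda t: time_per_unit_work(t, C, M, R))

print(f"closed form: {closed_form_s:.1f} s, grid search: {best_s} s")
```

The search is restricted to whole seconds, so it should agree with the closed form only to within one second.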