Optimal Checkpointing Frequency

Checkpointing Frequency

When training large-scale models on distributed systems, hardware failures are inevitable. The question isn’t if a failure will occur, but when—and how much work you’ll lose when it does. Checkpoint too frequently and you waste compute on I/O overhead; checkpoint too rarely and you risk losing hours of training progress to a single node failure.

The Young/Daly model provides an analytical solution to this tradeoff, optimizing checkpointing frequency to maximize training goodput. The optimal frequency nfreq can be found by:

nfreq,opt=Tc,opttstep=2(tmtbi+tmttr)tckpttstep

For example, given the following settings:

  • Mean time between interruptions tmtbi: 6000s
  • Mean time to recover tmttr: 600s
  • Time to checkpoint: 10s
  • Train step time: 15s

The optimal checkpointing frequency is ~24 steps per checkpoint.

This means that for this configuration, checkpointing every 24 training steps (every 6 minutes of training) maximizes your effective throughput—balancing the cost of checkpointing against the risk of losing work to failures.


Derivation: Why This Formula Works

The key insight is that we want to maximize effective throughput—not just raw training speed, but the rate at which we make durable progress. To do this, we need to maximize the efficiency (E) of the system, which is the ratio of useful work time to total time:

E=TgoodTtotal=TgoodTgood+Toverhead

The total overhead comes from three sources:

  • Checkpointing Overhead (Tckpt): Time spent saving checkpoints to persistent storage.
  • Wasted Work Overhead (Twaste): Time spent re-computing lost work after a failure.
  • Downtime Overhead (Tdown): Time spent recovering from failures (tmttr).

We model this by analyzing the total time required to successfully complete one “block” of useful work, where Tc=nfreqtstep represents the training time between checkpoints.

Step 1: Time for one block (no failures)

The time to complete the work and save it is Tblock=Tc+tckpt.

Step 2: Probability of failure during a block

The probability (Pfail) of an incident occurring during this Tblock time is PfailTblocktmtbi. This assumes tmtbiTblock, which is typically true in practice.

Step 3: Cost of a failure

If a failure occurs, the total time cost includes:

  • The downtime to recover: tmttr
  • The average work that must be re-done: Since the failure can happen at any point during the interval, the expected lost training time is Tc/2.
  • Total Cost = tmttr+Tc2

Step 4: Total Time (Ttotal) to complete one block

This is the time for the block itself, plus the expected cost of failure:

Ttotal=Tblock+(PfailTotal Cost)

Ttotal(Tc+tckpt)+(Tc+tckpttmtbi)(tmttr+Tc2)

Step 5: Efficiency (E)

The efficiency is the useful work (Tc) divided by the total time (Ttotal):

E(Tc)=Tc(Tc+tckpt)[1+1tmtbi(tmttr+Tc2)]

Step 6: Optimization

To find the optimal Tc that maximizes efficiency, we take the derivative dE/dTc, set it to zero, and solve for Tc. The calculus simplifies elegantly to:

Tc,opt2=2(tmtbi+tmttr)tckpt