Flyte’s workflow model provides implicit checkpoints at task boundaries: if one task fails, only that task retries. However, when a single task runs a long computation (such as a training loop over many epochs), you may want checkpoints within the task itself. Intratask checkpointing lets you save progress as a file during task execution and resume from the latest checkpoint on retry, rather than restarting from scratch.
Why intratask checkpoints?
Spot and preemptible instances
AWS spot and GCP preemptible instances can be reclaimed at any time. Checkpointing lets long tasks survive these interruptions, so you can run them on cheaper instances without losing completed work.
Long training loops
ML model training over many epochs or large datasets is expensive to restart. Checkpoint each epoch to resume from the last completed one.
Avoid task fan-out overhead
Breaking a loop into individual tasks via dynamic workflows adds overhead per iteration. A single checkpointed task avoids that cost.
Tight computation loops
Some computations are logically a single unit but run long enough to need fault tolerance. Checkpointing meets both requirements: the work stays in one task, and a failure does not discard the progress already made.
How it works
Flytekit exposes a checkpoint API through the execution context. The checkpoint object, called cp here, provides:
cp.write(bytes): save bytes to a checkpoint file
cp.read(): read the latest checkpoint bytes (returns None if no checkpoint exists)
On retry, cp.read() returns the bytes written during the previous attempt, allowing the task to resume from where it left off.
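As a minimal sketch of this read/write contract, the snippet below uses a hypothetical in-memory stand-in (InMemoryCheckpoint, which is not part of flytekit) so that it runs without a cluster. In a real task the object would come from flytekit's current_context().checkpoint. The resume_point helper is also an illustrative name.

```python
from typing import Optional


class InMemoryCheckpoint:
    """Hypothetical stand-in that mimics the write(bytes) / read() contract."""

    def __init__(self) -> None:
        self._data: Optional[bytes] = None

    def write(self, data: bytes) -> None:
        # Overwrite the stored checkpoint with the latest bytes.
        self._data = data

    def read(self) -> Optional[bytes]:
        # Return the latest checkpoint, or None if nothing was written yet.
        return self._data


def resume_point(cp: InMemoryCheckpoint) -> int:
    """Decode the stored iteration number, defaulting to 0 on a first attempt."""
    prev = cp.read()
    return int(prev.decode()) if prev else 0


cp = InMemoryCheckpoint()
first_start = resume_point(cp)   # no checkpoint exists yet
cp.write(b"3")                   # progress saved by an earlier attempt
second_start = resume_point(cp)  # a retry picks up the saved value
```

Because the checkpoint payload is raw bytes, the task decides the encoding. A small integer encoded as text is enough for a loop counter, while a model checkpoint would typically be serialized to bytes first.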
Example: checkpointed iteration task
Import the required libraries and configure retries, then define a task that iterates n_iterations times, checkpoints its progress, and can recover from simulated failures.
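The sketch below illustrates such a checkpoint-and-resume loop together with a simulated retry, under stated assumptions. RecoverableError, InMemoryCheckpoint, and run_with_retries are hypothetical stand-ins so that the code runs locally. On a cluster, the function would be decorated with @task(retries=RETRIES), cp would come from current_context().checkpoint, the exception would be flytekit's FlyteRecoverableException, and Flyte itself would perform the retries. The RETRIES value and the failure rule are chosen only for illustration.

```python
from typing import Optional, Tuple

RETRIES = 2  # illustrative retry budget (an assumption, not a Flyte default)


class RecoverableError(Exception):
    """Hypothetical stand-in for flytekit's FlyteRecoverableException."""


class InMemoryCheckpoint:
    """Hypothetical stand-in for the checkpoint object: write(bytes) / read()."""

    def __init__(self) -> None:
        self._data: Optional[bytes] = None

    def write(self, data: bytes) -> None:
        self._data = data

    def read(self) -> Optional[bytes]:
        return self._data


def use_checkpoint(cp: InMemoryCheckpoint, n_iterations: int) -> int:
    # Resume from the stored iteration, or start at 0 on the first attempt.
    prev = cp.read()
    start = int(prev.decode()) if prev else 0

    # Fail deterministically at multiples of this interval to simulate interruptions.
    failure_interval = max(1, n_iterations // RETRIES)

    index = start
    for index in range(start, n_iterations):
        if index > start and index % failure_interval == 0:
            raise RecoverableError(f"Simulated failure at iteration {index}")
        # Record the next iteration to run, so a retry skips completed work.
        cp.write(str(index + 1).encode())
    return index


def run_with_retries(n_iterations: int) -> Tuple[int, int]:
    """Stand-in for the platform's retry loop; returns (result, attempts used)."""
    cp = InMemoryCheckpoint()
    for attempt in range(1, RETRIES + 2):  # one initial attempt plus RETRIES retries
        try:
            return use_checkpoint(cp, n_iterations), attempt
        except RecoverableError:
            if attempt == RETRIES + 1:
                raise
    raise AssertionError("unreachable")


result, attempts = run_with_retries(10)
```

With these illustrative values, the first attempt checkpoints iterations 0 through 4 and fails at iteration 5, and the second attempt resumes at 5 and finishes. Writing index + 1 rather than index is a design choice: the checkpoint then names the next iteration to run, so completed work is never repeated.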
Only FlyteRecoverableException triggers a retry. Other exceptions cause the task to fail immediately without retry.
Execution flow on the Flyte cluster
First attempt
The task starts, checkpoints progress through iterations 0–4, then raises FlyteRecoverableException at iteration 5.
Retry
Flyte retries the task. On startup, cp.read() returns b"5" (the last checkpoint), so the task resumes from iteration 5 rather than 0.
Using interruptible tasks with checkpoints
Set interruptible=True on a task to allow Flyte to schedule it on spot or preemptible instances. Combined with checkpointing, this gives you fault-tolerant execution at reduced cost.
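A configuration sketch of that combination might look like the following. It assumes flytekit is installed. The task name train, the retries value, and the placeholder epoch body are illustrative and not taken from the original page.

```python
from flytekit import current_context, task


# retries=3 is an illustrative value; interruptible=True allows scheduling
# on spot or preemptible instances.
@task(retries=3, interruptible=True)
def train(n_epochs: int) -> int:
    cp = current_context().checkpoint

    # Resume from the last saved epoch, or start at 0 on the first attempt.
    prev = cp.read()
    start = int(prev.decode()) if prev else 0

    for epoch in range(start, n_epochs):
        # ... one epoch of work would go here (placeholder) ...
        # Save the next epoch to run so an interrupted attempt can resume.
        cp.write(str(epoch + 1).encode())
    return n_epochs
```

Checkpointing once per epoch keeps the amount of repeated work after a reclaim bounded by a single epoch. For very short epochs, writing every few iterations instead would reduce checkpoint overhead.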