The Flyte KFPyTorch plugin dispatches
@task(task_config=PyTorch(...)) tasks to the Kubeflow training-operator, which manages PyTorchJob Kubernetes resources for distributed training.
Prerequisites
- A running Kubernetes cluster with Flyte installed
- helm and kubectl configured
- GPU nodes (optional but typical for training workloads)
Step 1: Install the Kubeflow training-operator
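One common way to install it is from the upstream standalone manifests; the release tag below is an example, so pin whichever version you have validated against your cluster:

```shell
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
```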
The training-operator also supports TensorFlow (TFJob), MPI (MPIJob), XGBoost (XGBoostJob), and Paddle (PaddleJob). Installing it once enables all these job types.

Step 2: Enable the PyTorch plugin in Flyte
Create a values-pytorch.yaml override file:
- flyte-binary
- flyte-core
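A sketch for the flyte-binary chart follows; the flyte-core chart takes the same task-plugins block under configmap.enabled_plugins.tasks instead. Verify the layout against the values of your chart version:

```yaml
# values-pytorch.yaml (flyte-binary; sketch, confirm against your chart version)
configuration:
  inline:
    tasks:
      task-plugins:
        enabled-plugins:
          - container
          - sidecar
          - k8s-array
          - pytorch
        default-for-task-types:
          container: container
          sidecar: sidecar
          container_array: k8s-array
          pytorch: pytorch
```

Apply the override with a helm upgrade; the release name and namespace here are assumptions, substitute your own:

```shell
helm upgrade flyte-backend flyteorg/flyte-binary -n flyte --values values-pytorch.yaml --reuse-values
```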
Step 3: Write a distributed PyTorch task
Install the flytekit PyTorch plugin with pip install flytekitplugins-kfpytorch.

Single-worker PyTorch task
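A minimal sketch of a single-worker task; the PyTorch/Worker field names follow recent flytekitplugins-kfpytorch releases and may differ in older versions:

```python
import torch
from flytekit import Resources, task
from flytekitplugins.kfpytorch import PyTorch, Worker


@task(
    task_config=PyTorch(worker=Worker(replicas=1)),
    requests=Resources(cpu="2", mem="4Gi", gpu="1"),
)
def train_single_worker() -> float:
    # Plain PyTorch code; with a single worker there is no process group
    # to initialize.
    model = torch.nn.Linear(10, 1)
    loss = model(torch.randn(32, 10)).sum()
    loss.backward()
    return float(loss.item())
```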
Multi-worker distributed training
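A sketch of a multi-worker task, assuming the same recent plugin API. The training-operator injects the rendezvous environment variables into every pod, so the default env:// initialization works without extra configuration:

```python
import torch
import torch.distributed as dist
from flytekit import Resources, task
from flytekitplugins.kfpytorch import PyTorch, Worker


@task(
    task_config=PyTorch(worker=Worker(replicas=4)),
    requests=Resources(cpu="4", mem="16Gi", gpu="1"),
)
def train_distributed() -> float:
    # MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK are set by the operator.
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.nn.Linear(10, 1).to(device)
    ddp_model = torch.nn.parallel.DistributedDataParallel(model)
    loss = ddp_model(torch.randn(32, 10, device=device)).sum()
    loss.backward()  # gradients are all-reduced across workers here
    rank = dist.get_rank()
    dist.destroy_process_group()
    return float(loss.item()) if rank == 0 else 0.0
```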
Elastic training (torchrun)
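With the Elastic config, flytekit launches the task body via torchrun, so each process starts with RANK, WORLD_SIZE, and LOCAL_RANK already set. A sketch, with field names per recent plugin releases:

```python
import os

import torch
import torch.distributed as dist
from flytekit import Resources, task
from flytekitplugins.kfpytorch import Elastic


@task(
    task_config=Elastic(nnodes=2, nproc_per_node=4),
    requests=Resources(cpu="8", mem="32Gi", gpu="4"),
)
def train_elastic() -> float:
    # Each torchrun-launched process pins itself to its local GPU.
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
    model = torch.nn.Linear(10, 1).to(device)
    ddp_model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank] if torch.cuda.is_available() else None
    )
    loss = ddp_model(torch.randn(16, 10, device=device)).sum()
    loss.backward()
    dist.destroy_process_group()
    return float(loss.item())
```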
Gang scheduling (optional)
For distributed training jobs, all worker pods must be scheduled simultaneously to avoid deadlocks. Enable gang scheduling using one of the schedulers below (see the configuration sketch after the list):

- Kubernetes scheduler plugins (co-scheduling)
- Apache YuniKorn
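Whichever scheduler you pick, the training-operator must be told to use it. The following is a sketch of the relevant part of the operator Deployment; the --gang-scheduler-name flag is from training-operator v1, so verify it against your installed version:

```yaml
# kubectl -n kubeflow edit deployment training-operator (sketch; names assume
# the standalone manifests and may differ in your installation)
spec:
  template:
    spec:
      containers:
        - name: training-operator
          command:
            - /manager
          args:
            - --gang-scheduler-name=scheduler-plugins   # or "volcano"
```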
Verify
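A quick way to confirm the plugin is wired up is to run a PyTorch task remotely and watch the PyTorchJob it creates. The namespace below follows the default project-domain pattern and the workflow/job names are placeholders:

```shell
# Run a workflow containing a PyTorch task, e.g.:
#   pyflyte run --remote train.py train_distributed
# Then inspect the PyTorchJob the plugin created:
kubectl -n flytesnacks-development get pytorchjobs
kubectl -n flytesnacks-development get pods -l training.kubeflow.org/job-name=<job-name>
kubectl -n flytesnacks-development describe pytorchjob <job-name>
```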
Troubleshooting
Workers stuck in Pending state
Common causes:
- Insufficient GPU nodes — check kubectl get nodes -l cloud.google.com/gke-accelerator or equivalent
- Missing GPU tolerations — add GPU tolerations to your default PodTemplate
- Missing NVIDIA device plugin — install nvidia-device-plugin as a DaemonSet
NCCL communication errors
NCCL requires that all worker pods can communicate directly. Ensure:
- No NetworkPolicy blocks pod-to-pod communication within the namespace
- Pods can resolve each other’s hostnames via the headless Service created by the training-operator
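Two quick checks, sketched with an assumed namespace and placeholder job name; worker hostnames resolve through the Services the operator creates:

```shell
# Any NetworkPolicy in the namespace that selects the training pods is suspect:
kubectl -n flytesnacks-development get networkpolicy
# Confirm the master pod can resolve a worker hostname (assumes getent is
# available in the task image):
kubectl -n flytesnacks-development exec <job-name>-master-0 -- getent hosts <job-name>-worker-0
```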
Job timeout
Set an activeDeadlineSeconds in the run_policy to prevent stuck jobs from consuming resources indefinitely:
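A sketch using the plugin's RunPolicy; field names follow recent flytekitplugins-kfpytorch releases and the 6-hour deadline is only an example:

```python
from flytekit import task
from flytekitplugins.kfpytorch import CleanPodPolicy, PyTorch, RunPolicy, Worker


@task(
    task_config=PyTorch(
        worker=Worker(replicas=4),
        run_policy=RunPolicy(
            active_deadline_seconds=6 * 60 * 60,  # terminate the PyTorchJob after 6 hours
            clean_pod_policy=CleanPodPolicy.ALL,  # remove pods once the job finishes
        ),
    ),
)
def train_with_deadline() -> None:
    ...
```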