Quicklinks
- User guide on configuring storage for Ray Train/Tune
- User guide on checkpointing and how checkpoints interact with storage
Summary
Starting in Ray 2.7, Ray Train and Tune will require users to pass in a cloud storage or NFS path if running distributed training or tuning jobs.
In other words, Ray Train / Tune will no longer support the synchronization of checkpoints and other artifacts from worker nodes to the head node.
In Ray 2.6, syncing directories to the head node is no longer the default storage configuration. Attempting to rely on it will raise an error telling you to switch to one of the recommended alternatives: cloud storage or NFS.
Please leave any comments or concerns on this thread below -- we would be happy to better understand your perspective.
Code Changes
For single-node Ray Train and Ray Tune experiments, nothing changes, and no modifications to your code are required.
For multi-node Ray Train and Ray Tune experiments, you should switch to using one of the following persistent storage options:
- Cloud storage. See here for a configuration guide.
from ray import tune
from ray.train.torch import TorchTrainer
from ray.air.config import RunConfig
run_config = RunConfig(
name="experiment_name",
storage_path="s3://bucket-name/experiment_results",
)
# Use cloud storage in Train/Tune by configuring `RunConfig(storage_path)`.
trainer = TorchTrainer(..., run_config=run_config)
tuner = tune.Tuner(..., run_config=run_config)
# All experiment results will be persisted to s3://bucket-name/experiment_results/experiment_name
- A network filesystem mounted on all nodes. See here for a configuration guide.
from ray import tune
from ray.train.torch import TorchTrainer
from ray.air.config import RunConfig
run_config = RunConfig(
name="experiment_name",
storage_path="/mnt/shared_storage/experiment_results",
)
# Use NFS in Train/Tune by configuring `RunConfig(storage_path)`.
trainer = TorchTrainer(..., run_config=run_config)
tuner = tune.Tuner(..., run_config=run_config)
# All experiment results will be persisted to /mnt/shared_storage/experiment_results/experiment_name
If needed, you can re-enable the legacy head-node syncing behavior by setting the environment variable RAY_AIR_REENABLE_DEPRECATED_SYNC_TO_HEAD_NODE=1.
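The variable needs to be visible to the process that launches the experiment (typically the driver) before Train/Tune starts. A minimal sketch, assuming you set it from Python; depending on your setup, you may instead set it at the shell level or in your cluster/job configuration:
import os

# Opt back into the deprecated head-node syncing behavior (not recommended).
# This must be set before the Trainer / Tuner runs the experiment.
os.environ["RAY_AIR_REENABLE_DEPRECATED_SYNC_TO_HEAD_NODE"] = "1"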
Background Context
In a multi-node Ray cluster, Ray Train and Ray Tune assume access to some form of persistent storage that stores outputs from all worker nodes. This includes files such as logged metrics, artifacts, and checkpoints.
Without some form of external shared storage (cloud storage or NFS):
- Ray AIR cannot restore a training run from the latest checkpoint for fault tolerance. If checkpoints are not persisted to external storage, the latest checkpoint may no longer exist once the node it was saved on has crashed (see the restore sketch after this list).
- You cannot access results after training has finished. If the Ray cluster has already been terminated (e.g., from automatic cluster downscaling), then the trained model checkpoints cannot be accessed if they have not been persisted to external storage.
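As a concrete illustration of the fault-tolerance point above, a run whose results were persisted to external storage can be restored from that same path, even on a fresh cluster. This is a minimal sketch using the placeholder bucket and experiment name from the examples above:
from ray import tune
from ray.train.torch import TorchTrainer

# Restore the experiment from the results persisted at <storage_path>/<experiment_name>.
tuner = tune.Tuner.restore(
    path="s3://bucket-name/experiment_results/experiment_name",
    trainable=TorchTrainer(...),  # the same trainable originally passed to the Tuner
)
results = tuner.fit()  # resumes unfinished trials from their latest checkpoints
Without external storage, this restore path no longer exists once the original head node is gone.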
Why are we removing support?
- Cloud storage and NFS are cheap, easy to set up, and ubiquitous in today's machine learning landscape.
- Syncing to the head node introduces major performance bottlenecks and does not scale to a large number of worker nodes or larger model sizes.
- The speed of communication is limited by the network bandwidth of a single (head) node, and with large models, disk space on the head node also becomes a constraint.
- Generally, putting more load on the head node increases the risk of cluster-level failures.
- The maintenance burden of the legacy sync has become substantial. The ML team wants to focus on making the cloud storage path robust and performant, which is much easier without having to maintain two duplicate synchronization stacks.