
[Train][Tune] Syncing files to the head node to be removed in Ray 2.7 in favor of cloud storage/NFS #37177

@justinvyu

Description

Quicklinks

User guide on configuring storage for Ray Train/Tune
User guide on checkpointing and how they interact with storage

Summary

Starting in Ray 2.7, Ray Train and Tune will require users to pass in a cloud storage or NFS path if running distributed training or tuning jobs.

In other words, Ray Train / Tune will no longer support the synchronization of checkpoints and other artifacts from worker nodes to the head node.

In Ray 2.6, syncing directories to the head node will no longer be the default storage configuration; attempting to use it will instead raise an error telling you to switch to one of the recommended alternatives: cloud storage or NFS.

Please leave any comments or concerns on this thread below -- we would be happy to better understand your perspective.

Code Changes

For single-node Ray Train and Ray Tune experiments, this change requires no modifications to your code.

For multi-node Ray Train and Ray Tune experiments, you should switch to using one of the following persistent storage options:

  1. Cloud storage. See here for a configuration guide.
from ray import tune
from ray.train.torch import TorchTrainer
from ray.air.config import RunConfig

run_config = RunConfig(
    name="experiment_name",
    storage_path="s3://bucket-name/experiment_results",
)

# Use cloud storage in Train/Tune by configuring `RunConfig(storage_path)`.
trainer = TorchTrainer(..., run_config=run_config)
tuner = tune.Tuner(..., run_config=run_config)

# All experiment results will be persisted to s3://bucket-name/experiment_results/experiment_name
  2. A network filesystem mounted on all nodes. See here for a configuration guide.
from ray import tune
from ray.train.torch import TorchTrainer
from ray.air.config import RunConfig

run_config = RunConfig(
    name="experiment_name",
    storage_path="/mnt/shared_storage/experiment_results",
)

# Use NFS in Train/Tune by configuring `RunConfig(storage_path)`.
trainer = TorchTrainer(..., run_config=run_config)
tuner = tune.Tuner(..., run_config=run_config)

# All experiment results will be persisted to /mnt/shared_storage/experiment_results/experiment_name

If needed, you can re-enable this behavior by setting the environment variable: RAY_AIR_REENABLE_DEPRECATED_SYNC_TO_HEAD_NODE=1
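
For example, one way to opt back in (a minimal sketch; whether the variable also needs to be set on worker nodes depends on your cluster setup) is to set it in the driver process before the Train/Tune run is started:

import os

# Opt back into the deprecated sync-to-head-node behavior.
# This must be set before the Train/Tune run starts.
os.environ["RAY_AIR_REENABLE_DEPRECATED_SYNC_TO_HEAD_NODE"] = "1"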

Background Context

In a multi-node Ray cluster, Ray Train and Ray Tune assume access to some form of persistent storage that stores outputs from all worker nodes. This includes files such as logged metrics, artifacts, and checkpoints.

Without some form of external shared storage (cloud storage or NFS):

  1. Ray AIR cannot restore a training run from the latest checkpoint for fault tolerance. If checkpoints are not saved to external storage, the latest checkpoint may no longer exist if the node it was saved on has already crashed (see the restore sketch after this list).
  2. You cannot access results after training has finished. If the Ray cluster has already been terminated (e.g., from automatic cluster downscaling), then the trained model checkpoints cannot be accessed if they have not been persisted to external storage.
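
As a rough sketch of how restoration from external storage looks (not taken verbatim from this issue), assuming the s3://bucket-name path and the same trainer/run_config definitions from the example above:

from ray import tune
from ray.train.torch import TorchTrainer

# Re-create the same trainer definition used in the original run.
trainer = TorchTrainer(..., run_config=run_config)

# Restore the experiment from its persisted state in cloud storage and resume.
tuner = tune.Tuner.restore(
    "s3://bucket-name/experiment_results/experiment_name",
    trainable=trainer,
)
results = tuner.fit()  # continues from the latest checkpoint persisted to S3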

Why are we removing support?

  1. Cloud storage and NFS are cheap, easy to set up, and ubiquitous in today's machine learning landscape.
  2. Syncing to the head node introduces major performance bottlenecks and does not scale to a large number of worker nodes or larger model sizes.
    1. The speed of communication is limited by the network bandwidth of a single (head) node, and with large models, disk space on the head node can even become an issue.
    2. Generally, putting more load on the head node increases the risk of cluster-level failures.
  3. The maintenance burden of the legacy sync has become substantial. The ML team wants to focus on making the cloud storage path robust and performant, which is much easier without having to maintain two duplicate synchronization stacks.
