ray.rllib.algorithms.algorithm_config.AlgorithmConfig.fault_tolerance
- AlgorithmConfig.fault_tolerance(recreate_failed_workers: Optional[bool] = NotProvided, max_num_worker_restarts: Optional[int] = NotProvided, delay_between_worker_restarts_s: Optional[float] = NotProvided, restart_failed_sub_environments: Optional[bool] = NotProvided, num_consecutive_worker_failures_tolerance: Optional[int] = NotProvided, worker_health_probe_timeout_s: int = NotProvided, worker_restore_timeout_s: int = NotProvided)
Sets the config’s fault tolerance settings.
- Parameters
recreate_failed_workers – Whether, upon a worker failure, RLlib will try to recreate the lost worker as an identical copy of the failed one. The new worker will only differ from the failed one in its self.recreated_worker=True property value. It will have the same worker_index as the original one. If True, the ignore_worker_failures setting will be ignored.
max_num_worker_restarts – The maximum number of times a worker is allowed to be restarted (if recreate_failed_workers is True).
delay_between_worker_restarts_s – The delay (in seconds) between two consecutive worker restarts (if recreate_failed_workers is True).
restart_failed_sub_environments – If True and any sub-environment (within a vectorized env) throws any error during env stepping, the Sampler will try to restart the faulty sub-environment. This is done without disturbing the other (still intact) sub-environments and without the RolloutWorker crashing.
num_consecutive_worker_failures_tolerance – The number of consecutive times a rollout worker (or evaluation worker) failure is tolerated before finally crashing the Algorithm. Only useful if either ignore_worker_failures or recreate_failed_workers is True. Note that for restart_failed_sub_environments and sub-environment failures, the worker itself is NOT affected and won’t throw any errors, as the flawed sub-environment is silently restarted under the hood.
worker_health_probe_timeout_s – Max amount of time (in seconds) we should spend waiting for health probe calls to finish. Health pings are very cheap, so the default is 1 minute.
worker_restore_timeout_s – Max amount of time (in seconds) we should wait to restore states on recovered worker actors. Default is 30 mins.
- Returns
This updated AlgorithmConfig object.
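Every argument in the signature defaults to a "not provided" sentinel, and the method returns the updated config object. The sketch below is a simplified, pure-Python stand-in (not RLlib's actual implementation) that illustrates what this pattern implies: only arguments passed explicitly overwrite existing settings, and calls can be chained. The names SketchConfig and NOT_PROVIDED are hypothetical, only a subset of the parameters is modeled, and the initial values other than the two timeouts (1 minute and 30 minutes, as stated above) are placeholders.

```python
class _NotProvided:
    """Sentinel type: tells "argument omitted" apart from an explicit None."""


NOT_PROVIDED = _NotProvided()


class SketchConfig:
    """Hypothetical stand-in for a config object; not RLlib's code."""

    def __init__(self):
        # The two timeouts follow the defaults stated in the parameter
        # descriptions; the other initial values are placeholders.
        self.recreate_failed_workers = False
        self.max_num_worker_restarts = 1000
        self.worker_health_probe_timeout_s = 60
        self.worker_restore_timeout_s = 1800

    def fault_tolerance(
        self,
        recreate_failed_workers=NOT_PROVIDED,
        max_num_worker_restarts=NOT_PROVIDED,
        worker_health_probe_timeout_s=NOT_PROVIDED,
        worker_restore_timeout_s=NOT_PROVIDED,
    ):
        # Collect the arguments, then overwrite only those that the
        # caller passed explicitly.
        updates = {
            "recreate_failed_workers": recreate_failed_workers,
            "max_num_worker_restarts": max_num_worker_restarts,
            "worker_health_probe_timeout_s": worker_health_probe_timeout_s,
            "worker_restore_timeout_s": worker_restore_timeout_s,
        }
        for name, value in updates.items():
            if value is not NOT_PROVIDED:
                setattr(self, name, value)
        # Returning the object itself is what makes chaining possible.
        return self


# Two chained calls; each one touches only the setting it names.
config = (
    SketchConfig()
    .fault_tolerance(recreate_failed_workers=True)
    .fault_tolerance(worker_health_probe_timeout_s=120)
)
```

With a real AlgorithmConfig, the equivalent call would pass the same keyword arguments listed under Parameters, for example fault_tolerance(recreate_failed_workers=True), and settings that are not passed keep their current values.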