Ray: Distributed Computing Framework

Ray is an open-source, high-performance distributed execution framework designed primarily for scalable, parallel Python and machine learning applications. It lets developers scale Python code from a single machine to a cluster with only small code changes. Ray is widely used for ML workloads such as hyperparameter tuning, training and serving at scale.

Key Features of Ray

Easy-to-Use API: Ray provides decorators like @ray.remote to parallelize tasks with minimal code changes. It abstracts away low-level distributed computing complexity from developers. The Python-first design makes it highly accessible for ML practitioners and data scientists.
Scalable Parallelism: Ray supports fine-grained and coarse-grained parallelism, enabling many concurrent tasks. It distributes tasks across the cores of one machine and across the machines in a cluster. Tasks can declare the CPUs, GPUs or other accelerators (such as TPUs) they need, and Ray places them on nodes that have those resources.
Built-in Libraries: Ray includes ML-centric libraries like Tune, RLlib, Serve and Train. These help in hyperparameter tuning, reinforcement learning, model training and deployment. All libraries are fully distributed and optimized for performance.
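Fault Tolerance & Checkpointing: Ray can automatically retry tasks that fail, for example when a worker process or node dies, and libraries such as Tune and Train support checkpointing. This lets long-running jobs resume from the last checkpoint instead of restarting from scratch, which makes Ray suitable for long-running production workloads.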
Support for Multiple Backends: Ray integrates with frameworks such as PyTorch, TensorFlow and Scikit-learn. Its distributed tasks and actors run on cloud-native or on-prem systems. It also works with Docker, Kubernetes and cloud platforms like AWS/GCP.

Installation of Ray

To install Ray, follow these steps:

First, install the Ray library in your environment. The leading ! runs the command from a notebook cell; omit it when installing from a terminal.

!pip install ray

Next, import the library so that its functions are available in your code.

import ray

Architecture of Ray

1. Ray Cluster: A Ray cluster consists of a head node and multiple worker nodes. The head node manages metadata and task scheduling. Worker nodes execute the actual tasks and return results.

2. Ray Driver: The driver is your Python script that submits tasks to the Ray cluster. It interacts with the Ray runtime to create and manage remote functions and actors. It can be run on the head node or any external node.

3. Ray Scheduler: Ray keeps cluster metadata in the Global Control Store (GCS) on the head node, while a scheduler on each node tracks task dependencies and assigns tasks to available resources once their inputs are ready. Scheduling decisions are therefore made in a distributed fashion, with the GCS providing the shared view of the cluster.

4. Tasks and Actors:

Tasks: Stateless functions run remotely with @ray.remote.
Actors: Stateful objects that maintain local state across multiple method calls.
Both can run concurrently and independently across nodes.

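5. Object Store: The object store holds the results of remote tasks, which callers refer to through ObjectRef handles (futures). Workers on the same node read objects from shared memory, which allows zero-copy access for data such as NumPy arrays and fast inter-process communication. Each node has its own local object store, known as Plasma, and objects are transferred between nodes when needed.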

Workflow of a Ray Program

Start Ray Cluster: Start Ray with ray.init() locally or connect to a cluster using ray.init(address="..."). The head node initializes the runtime and worker nodes register themselves with it. Ray also starts writing logs and, if the dashboard is installed, serves it for monitoring.
Define Remote Functions: Use the @ray.remote decorator for any function you want to run in parallel. Ray automatically serializes and distributes function calls. Calling a remote function returns an ObjectRef, a future-like handle to the result.
Execute Tasks in Parallel: Call remote functions using .remote() and gather results with ray.get(). Use multiple calls to .remote() to fan out parallel computations. This is useful for batch processing or simulations.
Use Actors for Stateful Tasks: Define classes with @ray.remote to maintain state between invocations. Actors are ideal for environments, services or simulations. Each actor runs in its own worker process, which Ray places and manages on a node in the cluster.
Monitor Execution and Logs: Use the Ray Dashboard to view tasks, memory and CPU/GPU usage. It helps you track performance and debug issues in distributed applications. Ray also logs task failures and retries automatically.

Some Important Ray Libraries

Ray Tune: Runs distributed hyperparameter searches and supports grid, random and Bayesian optimization.
Ray Serve: A scalable, fast model serving framework that deploys ML models as REST APIs or microservices.
Ray Train: Simplifies distributed deep learning training and scales it across multiple CPUs/GPUs.
Ray RLlib: A high-level library for scalable RL algorithms. Includes support for multi-agent and offline RL. Optimized for large-scale training across clusters.
Modin: A separate project, not part of Ray itself, that offers a drop-in replacement for pandas and can use Ray as its execution backend. It parallelizes DataFrame operations and can speed up computation on large datasets.

Applications of Ray

Hyperparameter Tuning: Efficiently explores hyperparameter space with distributed trials. Used in AutoML systems and model optimization. Integrates with MLflow, WandB and Optuna.
Reinforcement Learning: Ray RLlib enables training on massive RL environments at scale. Supports distributed simulation and parallel policy optimization. Used in robotics, games and industrial control.
Data Preprocessing at Scale: Modin accelerates pandas workflows on Ray. Use the same pandas syntax while scaling to cluster memory. Used in data cleaning, transformation and EDA.
Online Model Serving: Ray Serve enables real-time model inference with REST APIs. Ideal for microservice-style deployment of ML models. Supports batching, dynamic scaling and A/B testing.
Large-scale Simulations: Ideal for scientific computing, simulations and Monte Carlo methods. Runs thousands of tasks concurrently for stochastic models. Used in finance, weather prediction and supply chain analysis.