If you created your VM using the raysort AMI, you should already have a Conda environment ready. Otherwise, install Miniconda3 and run `conda create -n raysort python=3.9.13`. Then run:
conda activate raysort
pip install -Ur requirements/dev.txt
pip install -Ur requirements/worker.txt
pip install -e .
pushd raysort/sortlib && python setup.py build_ext --inplace && popd
scripts/installers/install_binaries.sh
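A quick way to check that the environment is usable (a minimal sketch; it only assumes the editable install above succeeded and that Ray is among the installed requirements):

```bash
# Both commands should succeed without import or command-not-found errors
python -c "import raysort"
ray --version
```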
You can use an existing S3 bucket or create one using the provided Terraform script as follows:
- Install Terraform: `scripts/installers/install_terraform.sh`
- Navigate to `scripts/config/terraform/aws-s3-template`
- Run `terraform init`
- Run `terraform apply -var="bucket_count=1" -var="bucket_prefix=YOUR_PREFIX_HERE"`. Replace `YOUR_PREFIX_HERE` with a bucket name prefix.
- Your bucket should be created with a name similar to `YOUR_PREFIX_HERE-000`.
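To double-check that the bucket exists, a plain AWS CLI listing works (nothing repo-specific here; `YOUR_PREFIX_HERE` is the same placeholder as above):

```bash
# Should print the newly created bucket, e.g. YOUR_PREFIX_HERE-000
aws s3 ls | grep YOUR_PREFIX_HERE
```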
Edit `.envrc` and change `USERNAME` and `S3_BUCKET` to your own. Set up direnv so that the `.envrc` files are sourced automatically when you `cd` into a directory. Otherwise, manually source `.envrc`.
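A minimal sketch of the relevant `.envrc` edits, assuming the bucket created above (your file may contain other variables as well):

```bash
# .envrc (sketch): point these at your own username and bucket
export USERNAME=your_username
export S3_BUCKET=YOUR_PREFIX_HERE-000

# With direnv installed, run `direnv allow` once so this file is loaded
# automatically whenever you cd into the repository.
```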
A good first step to sanity check your setup is to run raysort on a single node with:
CONFIG=LocalNative python raysort/main.py
It should complete without errors with an `All OK!` message. The detailed configuration is in `config.py`.
- Install Terraform: `scripts/installers/install_terraform.sh`
- Run `export CONFIG=1tb-1gb-s3-native-s3 && python scripts/cls.py up --ray` to launch a Ray cluster, or `--yarn` to launch a YARN cluster for Spark.
- Start a test run on the cluster:
python raysort/main.py 2>&1 | tee main.log
The `1tb-1gb-s3-native-s3` config launches 10 `r6i.2xlarge` nodes, and runs a 1TB sort with 1GB partitions using S3 for I/O and for shuffle spilling.
`scripts/cls.py` is the centralized place for cluster management code.

- `python scripts/cls.py up` launches a cluster via Terraform and configures it via Ansible. Add `--ray` or `--yarn` to start a Ray or a YARN cluster.
- `python scripts/cls.py setup` skips Terraform and only runs Ansible for software setup. Add `--ray` or `--yarn` to start a Ray or a YARN cluster.
- `python scripts/cls.py down` terminates the cluster via Terraform. Tip: when you're done for the day, run `python scripts/cls.py down && sudo shutdown -h now` to terminate the cluster and stop your head node.
- `python scripts/cls.py start/stop/reboot` calls the AWS CLI tool to start/stop/reboot all your machines in the cluster. Useful when you want to stop the cluster but not terminate the machines.
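Putting these together, a typical session with `cls.py` might look like the following (using the example config from above):

```bash
# Launch and configure a Ray cluster for the chosen config
export CONFIG=1tb-1gb-s3-native-s3
python scripts/cls.py up --ray

# Run the experiment, keeping a copy of the logs
python raysort/main.py 2>&1 | tee main.log

# Terminate the cluster and stop the head node when done for the day
python scripts/cls.py down && sudo shutdown -h now
```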
While `scripts/cls.py` uses Terraform to manage the cluster, `scripts/autoscaler.py` uses the Ray autoscaler to manage the cluster.

- `python scripts/autoscaler.py up -y` launches a cluster via the Ray autoscaler.
- `python scripts/autoscaler.py submit scripts/main.py` submits a job to be executed on the cluster that the autoscaler has launched.
- `python scripts/autoscaler.py down -y` terminates the cluster.
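The same lifecycle with the autoscaler, end to end:

```bash
python scripts/autoscaler.py up -y                   # launch the cluster
python scripts/autoscaler.py submit scripts/main.py  # submit a job to it
python scripts/autoscaler.py down -y                 # terminate the cluster
```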
- All of Ray's system configuration parameters can be found in `ray_config_defs.h`.
- You only need to specify the config on the head node. All worker nodes will use the same config.
- There are two ways to specify a configuration value. Suppose you want to set `min_spilling_size` to 0, then:
  - You can set it in Python, where you do `ray.init(..., _system_config={"min_spilling_size": 0, ...})`.
  - You can set it in an environment variable by running `export RAY_min_spilling_size=0` before running your `ray start` command or your Python program that calls `ray.init()`. This is preferred as our experiment tracker will automatically pick up these environment variables and log them in the W&B trials. Again, it suffices to only set this environment variable on the head node. (See the sketch after this list.)
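For example, applying the environment-variable approach on the head node (a minimal sketch; your actual `ray start` arguments will come from your cluster setup):

```bash
# Set the system config before starting Ray; worker nodes use the same config,
# and the experiment tracker logs the RAY_* variable in the W&B trials.
export RAY_min_spilling_size=0
ray start --head
```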
Useful Ray environment variables:
# Enable debug logging for all raylets and workers
export RAY_BACKEND_LOG_LEVEL=debug

- Create a new volume in the same region as your machine on the AWS Dashboard.
- Attach it to your machine and format it as follows (a consolidated command sketch follows this list):
  - Create a partition by running `sudo parted /dev/nvme1n1`, where `nvme1n1` is the device name, which you can find with `lsblk`.
  - If a partition table does not exist, create it with `mklabel gpt`.
  - Run `mkpart part0 ext4 0% 100%`. Make sure no warnings appear.
  - Exit `parted` and run `sudo mkfs.ext4 /dev/nvme1n1p1`. Note the extra `p1`.
  - Run `sudo mount -o sync path_to_volume /mnt/data0`. Only use `-o sync` if you are running microbenchmarks.
- Verify that the mounting worked with `lsblk`.
  - If the desired volume is not mounted, edit `/etc/fstab` to remove any conflicting lines. Then, restart your machine and remount.
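For reference, the interactive `parted` steps above can also be run non-interactively, roughly as follows (a sketch assuming the device is `/dev/nvme1n1`; substitute the name reported by `lsblk`):

```bash
# Create a GPT partition table and one partition spanning the whole disk
sudo parted --script /dev/nvme1n1 mklabel gpt
sudo parted --script /dev/nvme1n1 mkpart part0 ext4 0% 100%

# Format the new partition (note the extra p1) and mount it
sudo mkfs.ext4 /dev/nvme1n1p1
sudo mkdir -p /mnt/data0
sudo mount /dev/nvme1n1p1 /mnt/data0   # add -o sync only for microbenchmarks

# Confirm the mount
lsblk
```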
- After launching a cluster via `cls.py up`, forward port 3000 from the head node to your laptop (see the port-forwarding sketch after this list), then go to http://localhost:3000/.
- Default login is username `admin` and password `admin`.
- Add a new data source here: http://localhost:3000/datasources/new. Select `Prometheus`, set the URL to `http://localhost:9090`, then select `Save & Test`.
- Import the dashboard here: http://localhost:3000/dashboard/import. Use the JSON file here: https://github.com/franklsf95/raysort/tree/master/scripts/config/grafana/Exoshuffle-AWS.json. Select the Prometheus data source in Options.
- You should see CPU, memory, disk, and network activities in the dashboard. The Application Progress only shows up when you are running Exoshuffle jobs.
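One way to do the port forwarding (a sketch; the key path follows the SSH example later in this document, and the `ubuntu` user and `<head_node_ip>` are assumptions to replace with your own):

```bash
# Forward the head node's Grafana port, then open http://localhost:3000/ locally
ssh -i ~/.aws/login-us-west-2.pem -N -L 3000:localhost:3000 ubuntu@<head_node_ip>
```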
Make sure the Conda environment is activated. Install missing packages with:
pip install package_name
Install the AWS CLI and set credentials with:
pip install awscli
aws configure
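To confirm the credentials are picked up, a standard AWS CLI identity check works:

```bash
aws sts get-caller-identity
```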
Verify that the image the nodes are being created from matches expectations. The image `raysort-worker-20230111` is currently in use.
First, try connecting manually: `ssh -i ~/.aws/login-us-west-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null <worker_ip>`. If this doesn't work, it's likely because your current VM is not in the same security group as the worker nodes (which are in the default security group). The easiest solution is to find your instance in the AWS EC2 UI, right-click it, select "Security -> Change Security Groups", and add your instance to the default security group. TODO: this might be possible to automate in Terraform.
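Until that is automated, one possible CLI alternative to the console steps (a sketch; the instance and security group IDs are placeholders, and `--groups` replaces the instance's whole security group set, so list the existing groups too):

```bash
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --groups sg-existing-group-id sg-default-group-id
```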
Commands for testing disk bandwidth:
sudo fio --directory=. --ioengine=psync --name fio_test_file --direct=1 --rw=write --bs=1m --size=1g --numjobs=8
sudo fio --directory=. --ioengine=psync --name fio_test_file --direct=1 --rw=read --bs=1m --size=1g --numjobs=8