This repository is a snapshot of Exoshuffle-CloudSort, the winning entry of the 2022 CloudSort Benchmark in the Indy category.
To run Exoshuffle-CloudSort, you will need:
- AWS credentials with access to EC2 and S3
- A head node of size r6i.2xlarge
- 40 empty Amazon S3 buckets (you can use the Terraform template to create them; see the sketch below for an AWS CLI alternative)
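If you would rather not use the Terraform template, the buckets can also be created with the AWS CLI. The snippet below is a minimal sketch; the bucket name prefix and region are placeholders, so substitute your own values and remember that bucket names must be globally unique.

```bash
# Sketch: create 40 empty S3 buckets named <prefix>-000 ... <prefix>-039.
# PREFIX and REGION are placeholders; substitute your own values.
PREFIX=my-cloudsort-io
REGION=us-west-2
for i in $(seq -f "%03g" 0 39); do
  aws s3api create-bucket \
    --bucket "${PREFIX}-${i}" \
    --region "${REGION}" \
    --create-bucket-configuration LocationConstraint="${REGION}"  # omit this flag for us-east-1
done
```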
The easiest way to set up the head node is to launch it with the provided image raysort-worker-20230111. Alternatively, install Python 3.9.13 with Anaconda, then run:
pip install -Ur requirements/dev.txt
pip install -Ur requirements/worker.txt
pip install -e .
pushd cloudsort/sortlib && python setup.py build_ext --inplace && popd
scripts/installers/install_binaries.sh

Edit .envrc and change USER and S3_BUCKET to your own. Set up direnv so that the .envrc files are sourced automatically when you cd into a directory. Otherwise, manually source .envrc.
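For reference, the values in question are plain environment variable exports. A minimal sketch of the relevant part of .envrc (your values will differ, and the file in the repository may set additional variables):

```bash
# .envrc (sketch) -- picked up automatically by direnv, or sourced manually.
# Replace the placeholder values with your own username and S3 bucket name.
export USER=your-username
export S3_BUCKET=your-bucket-name
```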
The easiest way to start a cluster of worker nodes is with the `cls.py` script, which launches VMs with Terraform and sets them up with Ansible. Some values are hardcoded for our experiments, but it should generally run with few changes. If something does not work, file an issue.
- Install Terraform: `scripts/installers/install_terraform.sh`
- Run `export CONFIG=2tb-2gb-i4i4x-s3 && python scripts/cls.py up --ray` to launch a Ray cluster
- Start a test run on the cluster: `python cloudsort/main.py 2>&1 | tee main.log`
The `2tb-2gb-i4i4x-s3` config launches 10 i4i.4xlarge nodes and runs a 1TB sort with 2GB partitions, using 10 S3 buckets for I/O. The expected sorting time is around 400 seconds.
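To confirm that all 10 workers have joined the cluster before kicking off a run, you can check its status from the head node. This assumes the standard Ray CLI, which the setup steps above install:

```bash
# Lists the nodes and available resources registered with the running Ray cluster.
# You should see the head node plus the 10 i4i.4xlarge workers before starting the sort.
ray status
```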
To run the 100TB CloudSort benchmark, use the following command:
export STEPS= && export CONFIG=100tb-2gb-i4i4x-s3 && python scripts/cls.py up --ray && python cloudsort/main.py 2>&1 | tee main.log

If `STEPS` is empty, the program will run all three steps: generate input, sort, and validate output. You can also specify the steps to run, e.g. `STEPS=sort,validate_output`. The expected sorting time is around 5400 seconds.
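For example, to generate the input data once and then re-run only the sort and validation later, something along these lines should work. The step name `generate_input` below is an assumption inferred from the `STEPS=sort,validate_output` example; check cloudsort/main.py for the exact names.

```bash
# Sketch: run the benchmark in stages by selecting steps explicitly.
export CONFIG=100tb-2gb-i4i4x-s3

# 1. Generate the input data set only ("generate_input" is assumed; see cloudsort/main.py).
STEPS=generate_input python cloudsort/main.py 2>&1 | tee generate.log

# 2. Later: sort and validate the output without regenerating the input.
STEPS=sort,validate_output python cloudsort/main.py 2>&1 | tee sort.log
```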
You can get runtime metrics using Prometheus and Grafana.
`scripts/cls.py` is the central place for cluster management code.
- `python scripts/cls.py up` launches a cluster via Terraform and configures it via Ansible. Add `--ray` or `--yarn` to start a Ray or a YARN cluster.
- `python scripts/cls.py setup` skips Terraform and only runs Ansible for software setup. Add `--ray` or `--yarn` to start a Ray or a YARN cluster.
- `python scripts/cls.py down` terminates the cluster via Terraform. Tip: when you're done for the day, run `python scripts/cls.py down && sudo shutdown -h now` to terminate the cluster and stop your head node.
- `python scripts/cls.py start/stop/reboot` calls the AWS CLI to start/stop/reboot all the machines in the cluster. Useful when you want to stop the cluster but not terminate the machines.
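Putting these together, a typical session might look like the following (illustrative only; substitute your own CONFIG):

```bash
# Launch the cluster and start Ray on it.
export CONFIG=2tb-2gb-i4i4x-s3
python scripts/cls.py up --ray

# Run an experiment.
python cloudsort/main.py 2>&1 | tee main.log

# Re-run only the Ansible software setup (no Terraform), e.g. after changing the environment.
python scripts/cls.py setup --ray

# Stop the cluster machines without terminating them, to resume later.
python scripts/cls.py stop

# When you're done for the day: terminate the cluster and stop the head node.
python scripts/cls.py down && sudo shutdown -h now
```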