ARC runners config for Pytorch

Dependencies

This project depends on:

Python 3.10
virtualenv
AWS CLI
Terraform
kubectl CLI
Helm CLI
CMake
1Password CLI
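
A quick way to confirm the dependencies above are installed is to check that each one resolves on your PATH. The sketch below is not part of the project: `missing_tools` is a hypothetical helper, and the executable names in the usage comment are assumptions about how each dependency is typically installed.

```shell
# Hypothetical helper (not part of this repo): print every command name
# passed as an argument that cannot be found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed executable names for the dependencies listed above:
# missing_tools python3.10 virtualenv aws terraform kubectl helm cmake op
```

If the call prints nothing, every listed tool was found. Anything it prints still needs to be installed or added to your PATH.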

Design

This project creates a VPC and an EKS cluster. On top of that, it uses Helm to set up GitHub's first-party ARC solution for GHA runners.

Setup

To deploy, you'll need to set up the AWS CLI and the 1Password CLI.

AWS CLI Setup

Get an AWS account. You may need to ask someone with admin access to send you an invite.
Ensure 2FA is set up on your AWS account.
Install the AWS CLI.
To authenticate with the AWS CLI, create a new AWS Access Key ID and Secret Access Key. In the AWS console, go to IAM->Users->Your user->Security credentials->Create access key.
In your terminal, run aws configure --profile {account} to set up your login (currently {account} is always 391835788720). It will ask for the AWS access key ID and secret access key from the previous step. For the default region name, enter us-east-1. For the default output format, enter json.
1. This sets up your .aws folder and creates the config and credentials files in it.
2. At this point the AWS CLI has a profile named after the account where the target will be deployed (matching the path of each account module, aws/<acc-id>/<region>/), with all the permissions and keys set up locally.
To use the above config as your default setup, run aws configure a second time without the --profile param.
Run aws ec2 describe-instances to verify that you're properly authenticated.
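
As an additional check, you can confirm that the profile resolves to the expected account. This is a minimal sketch, not part of the project: `check_aws_account` is a hypothetical helper, and it assumes your IAM user lives in the same account the profile is named after.

```shell
# Hypothetical helper (not part of this repo): succeed only if the given
# profile authenticates as the expected AWS account ID.
# Usage: check_aws_account <expected-account-id> [profile-name]
check_aws_account() {
    local expected="$1" profile="${2:-$1}" actual
    actual="$(aws sts get-caller-identity --profile "$profile" \
        --query Account --output text)" || return 1
    if [ "$actual" = "$expected" ]; then
        echo "OK: profile $profile is authenticated as account $actual"
    else
        echo "Mismatch: expected $expected, got $actual" >&2
        return 1
    fi
}

# Example: check_aws_account 391835788720
```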

1Password setup

You need 1Password to fetch environment secrets and pass them to make.

Create a 1Password account. The 1Password organization is owned by the Linux Foundation, so ask a team member there to invite you.
Install and set up the 1Password CLI as per their docs.

The root folder's make.env contains paths to various secrets defined in 1Password. To actually use those secrets, you'll want to prefix any command you run with op run --env-file make.env -- [YOUR_COMMAND]. This is particularly important for the make commands.

You can see what your combined environment contains by running op run --env-file make.env -- env
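
If you only want to know which variables make.env defines, without resolving or printing any secret, you can list the keys directly from the file. This is a sketch under assumptions: `env_file_keys` is a hypothetical helper, and it assumes make.env uses plain `NAME=value` lines, where the value is typically a 1Password secret reference of the form `op://<vault>/<item>/<field>`.

```shell
# Hypothetical helper (not part of this repo): print the variable names
# defined in an env file, one per line, without showing their values.
# Comment lines and blank lines are skipped.
env_file_keys() {
    grep -E '^[A-Za-z_][A-Za-z0-9_]*=' "$1" | cut -d= -f1
}

# Example: env_file_keys make.env
```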

Invoke 1Password more easily

You can add the following function to your .bashrc or .zshrc file to simplify adding the op prefix. It'll traverse up the tree to find the first file named make.env and pass the path to that into op.

# Alias the 1Password CLI. This is for the ci-tools repo. See https://support.1password.com/command-line-getting-started/
# It makes calling "op make" equivalent to "op run --env-file PATH_TO_make.env -- make"
op() {
    # Walk up from the current directory to the nearest make.env.
    # The $(...) subshell keeps the caller's working directory unchanged.
    local env_file
    env_file="$(file="make.env"; while [[ "$PWD" != "/" && ! -e "$file" ]]; do cd ..; done; if [[ -e "$file" ]]; then echo "$PWD/$file"; fi)"
    # Quote the path so directories containing spaces still work.
    command op run --env-file "$env_file" -- "$@"
}

Python setup

Ensure you have Python 3.10 installed.

Terraform Lint setup

Optional, but this lets you run Terraform lint locally via make tflint.

Setup Terraform

Install Terraform: https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli
Enable tab completion on bash/zsh: terraform -install-autocomplete (optional)

Install TFLint

Instructions: https://github.com/terraform-linters/tflint

Run tflint using the op prefix: op run --env-file make.env -- make tflint

Or, if you set up the shortcut function, you can run op make tflint

Install eksctl

Instructions: https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/setting-up-eksctl.html

Authenticate to the Canary clusters

To authenticate to the Kubernetes cluster, you need to be added to the list of authorized users (the EKS list). The process is somewhat involved:

Get the current EKS_USERS list from 1Password. It's stored as a base64 string.
Use base64 to decode it (e.g. echo "the_string" | base64 --decode > users.txt).
Add a line for yourself to the resulting users list.
Encode the new list back to base64: base64 -i users.txt
Replace the old EKS_USERS value in 1Password with this new value.
Go to the GitHub secrets for this repo. Replace the EKS_USERS secret with this base64-encoded value.
Get someone who already has access to the clusters to run CLUSTER_TARGET=[cluster-you-want] make clean apply-arc-canary to actually propagate your name to the relevant clusters.
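
The decode, append, and re-encode steps above can be sketched as a single round trip. `add_eks_user` is a hypothetical helper, not part of the project, and it treats each user entry as an opaque line: copy the exact format from the entries already in the decoded list.

```shell
# Hypothetical helper (not part of this repo): decode a base64-encoded
# user list, append one line to it, and print the re-encoded result.
# Usage: add_eks_user <current-base64-value> <new-line>
add_eks_user() {
    local current_b64="$1" new_line="$2" users
    users="$(printf '%s' "$current_b64" | base64 --decode)" || return 1
    # tr removes the line wrapping GNU base64 adds, so the output is a
    # single line that can be pasted into 1Password and GitHub secrets.
    printf '%s\n%s\n' "$users" "$new_line" | base64 | tr -d '\n'
}
```

Before replacing the stored value, it is worth decoding the new string once more to check that the list looks the way you expect.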

Now you can deploy bits to the cluster.

Deploy

Once the above setup steps are complete you can run make as follows:

$ op run --env-file make.env -- make

Or if invoking make from a different folder, pass a path to the make.env file:

# If invoking from aws/<acc-id>/<region>
$ op run --env-file ../../../make.env -- make

Debug/develop

If you're testing changes in packages and want to force make to install newer dependencies, run make clean. It should remove any dependency or package installed locally in the project.

In some situations kubectl/helm fail to detect changes. Short of fixing the tool, submitting a PR, and waiting for a newer version, you can delete part of the K8s setup with make delete to force it to be replaced.

Canary environments are available to help with development. To update Terraform in all canary environments:

$ cd aws/<acc-id>/<region>
$ op run --env-file ../../../make.env -- make apply-arc-canary

There are 3 canary environments, and they can be deployed in steps. The optional variable CLUSTER_TARGET selects one of the environments:

# installs/updates the docker registry and mirrors
$ cd aws/<acc-id>/<region>
$ CLUSTER_TARGET="ghci-arc-c-runners-eks-I" op run --env-file ../../../make.env -- make install-docker-registry-canary

# installs/updates karpenter and node config
$ cd aws/<acc-id>/<region>
$ CLUSTER_TARGET="ghci-arc-c-runners-eks-I" op run --env-file ../../../make.env -- make karpenter-autoscaler-canary

# installs/updates ARC and runner config
$ cd aws/<acc-id>/<region>
$ CLUSTER_TARGET="ghci-arc-c-runners-eks-I" op run --env-file ../../../make.env -- make k8s-runner-scaler-canary

# do it all inside K8s
$ cd aws/<acc-id>/<region>
$ CLUSTER_TARGET="ghci-arc-c-runners-eks-I" op run --env-file ../../../make.env -- make arc-canary
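
If you prefer to run the three individual steps in order and stop at the first failure, rather than using arc-canary, a small wrapper can help. This is only a sketch: `deploy_canary_steps` is a hypothetical helper, the step order simply follows the listing above, and the relative make.env path assumes you are in aws/<acc-id>/<region>.

```shell
# Hypothetical helper (not part of this repo): run the three canary steps
# for one cluster, in the order listed above, stopping on the first failure.
# Assumes the current directory is aws/<acc-id>/<region>.
# Usage: deploy_canary_steps <cluster-target>
deploy_canary_steps() {
    local cluster="$1" step
    for step in install-docker-registry-canary \
                karpenter-autoscaler-canary \
                k8s-runner-scaler-canary; do
        echo ">>> ${step} on ${cluster}"
        CLUSTER_TARGET="$cluster" op run --env-file ../../../make.env -- make "$step" || return 1
    done
}

# Example: deploy_canary_steps "ghci-arc-c-runners-eks-I"
```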

To save resources, the minimum number of runners in the canary cluster is set to 0 by default for all runner types. If testing requires a different value, set the variable CANARY_MIN_RUNNERS:

$ CANARY_MIN_RUNNERS=1 CLUSTER_TARGET="ghci-arc-c-runners-eks-I" op run --env-file ../../../make.env -- make k8s-runner-scaler-canary

Upgrading EKS clusters

To upgrade EKS clusters to a new version:

Go to the AWS Console (https://us-east-1.console.aws.amazon.com/eks/home?region=us-east-1#/clusters).
For each cluster you wish to upgrade, delete the node groups associated with it.
Delete the cluster.
Run make apply (more specifically: apply-canary apply-vanguard apply-prod).
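
The console steps above have AWS CLI equivalents. The sketch below is not the project's documented procedure: `print_eks_teardown_plan` is a hypothetical helper that only prints the commands for review and does not run them. It assumes the profile name 391835788720 and region us-east-1 used elsewhere in this README.

```shell
# Hypothetical helper (not part of this repo): print, but do NOT run, the
# AWS CLI commands that mirror the console steps for one cluster.
# Usage: print_eks_teardown_plan <cluster-name> [profile] [region]
print_eks_teardown_plan() {
    local cluster="$1" profile="${2:-391835788720}" region="${3:-us-east-1}"
    local ng nodegroups
    nodegroups="$(aws eks list-nodegroups --cluster-name "$cluster" \
        --profile "$profile" --region "$region" \
        --query 'nodegroups[]' --output text)" || return 1
    # Node groups must be gone before the cluster itself can be deleted.
    for ng in $nodegroups; do
        echo "aws eks delete-nodegroup --cluster-name $cluster --nodegroup-name $ng --profile $profile --region $region"
        echo "aws eks wait nodegroup-deleted --cluster-name $cluster --nodegroup-name $ng --profile $profile --region $region"
    done
    echo "aws eks delete-cluster --name $cluster --profile $profile --region $region"
}
```

Review the printed commands carefully before running any of them, since deleting a cluster is destructive.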

Release to Prod

To release the latest main branch to prod do the following:

Trigger the "Runners Open Release PR" workflow to create a release PR. This PR will be used to manage the deployment.
Once that PR is ready, get it approved by teammates.
On that PR, comment PROCEED_TO_VANGUARD to deploy the bits to Vanguard (our staging environment).
Once Vanguard has been successfully deployed, comment PROCEED_TO_PRODUCTION to deploy to prod.
Once that's done, comment CLEANUP_DEPLOYMENT to finish up the deployment.
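
The release comments can also be posted from a terminal. This sketch assumes the GitHub CLI (gh) is installed and authenticated, which is not one of this project's listed dependencies. `release_comment` is a hypothetical helper, and the PR number in the example is a placeholder.

```shell
# Hypothetical helper (not part of this repo): post one of the release
# keywords as a comment on the release PR, rejecting anything else so a
# typo cannot be posted by accident. Assumes the GitHub CLI is available.
# Usage: release_comment <pr-number> <keyword>
release_comment() {
    local pr="$1" stage="$2"
    case "$stage" in
        PROCEED_TO_VANGUARD|PROCEED_TO_PRODUCTION|CLEANUP_DEPLOYMENT) ;;
        *) echo "unknown release keyword: $stage" >&2; return 2 ;;
    esac
    gh pr comment "$pr" --body "$stage"
}

# Example (123 is a placeholder PR number):
# release_comment 123 PROCEED_TO_VANGUARD
```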

Project organization decision

Everything considered critical or secret should be placed under the path starting with aws/. The idea is that all other paths could be open-sourced, while any config that is specific to the cluster being deployed, or to the account managed by the responsible team, lives there. Eventually those configs should be split into separate repositories, enabling collaborators to reuse the project in a modular way.

Update/Upgrade/Deploy monitoring infra

The monitoring infrastructure (except scrapers) is deployed in a separate cluster and is not integrated with the current Makefile targets or with the rollout procedure. The reasoning is that it should not need frequent updates, and that deploying both simultaneously risks leaving us blind right when monitoring matters most for the infra. To update it, first communicate with everyone on Slack, then set up a pair programming session with another person on the team and run:

$ cd aws/391835788720/us-east-1 && make clean && op run --env-file ../../../make.env -- make apply-arc-canary-monitoring arc-canary-monitoring apply-arc-prod-monitoring arc-prod-monitoring
