Allocation Filesystems

This page provides conceptual information about how Nomad uses allocation working directories to store job task templates, storage volumes, artifacts, dispatch payloads, and logs. Review image and chroot isolation, as well as when Nomad does not use any isolation mode.

Nomad creates a working directory for each allocation on a client. Find this directory in the Nomad data_dir at ./alloc/«alloc_id». The allocation working directory is where Nomad creates task directories and directories shared between tasks, writes logs for tasks, and downloads artifacts or templates.

An allocation with two tasks (named task1 and task2) has an allocation working directory like the following:

.
├── alloc
│   ├── data
│   ├── logs
│   │   ├── task1.stderr.0
│   │   ├── task1.stdout.0
│   │   ├── task2.stderr.0
│   │   └── task2.stdout.0
│   └── tmp
├── task1
│   ├── local
│   ├── private
│   ├── secrets
│   └── tmp
└── task2
    ├── local
    ├── private
    ├── secrets
    └── tmp
  • alloc/: This directory is shared across all tasks in an allocation and can be used to store data that needs to be used by multiple tasks, such as a log shipper. This is the directory that's provided to the task as the NOMAD_ALLOC_DIR. Note that this alloc/ directory is not the same as the "allocation working directory", which is the top-level directory. All tasks in a task group can read and write to the alloc/ directory. However, the full host path may differ depending on the task driver's filesystem isolation mode, so tasks should always use the NOMAD_ALLOC_DIR environment variable to find this path rather than relying on the specific implementation of the none, chroot, or image modes. Within the alloc/ directory are three standard directories:

    • alloc/data/: This directory is the location used by the ephemeral_disk block for shared data.

    • alloc/logs/: This directory is the location of the log files for every task within an allocation. The nomad alloc logs command streams these files to your terminal.

    • alloc/tmp/: A temporary directory used as scratch space by task drivers.

  • «taskname»: Each task has a task working directory with the same name as the task. Tasks in a task group can't read each other's task working directory. Depending on the task driver's filesystem isolation mode, a task may not be able to access the task working directory. Within the «taskname»/ directory are four standard directories:

    • «taskname»/local/: This directory is the location provided to the task as the NOMAD_TASK_DIR. Note this is not the same as the "task working directory". This directory is private to the task.

    • «taskname»/private/: This directory is used by Nomad to store private files related to the allocation, such as Vault tokens, that are not shared with tasks when using image isolation. The contents of files in this directory cannot be read by the nomad alloc fs command or via Nomad's API. While not shared with tasks that use image isolation, this path is still accessible by tasks using chroot or none isolation.

    • «taskname»/secrets/: This directory is the location provided to the task as NOMAD_SECRETS_DIR. The contents of files in this directory cannot be read by the nomad alloc fs command. It can be used to store secret data that should not be visible outside the task. Where possible it is backed by an in-memory filesystem and mounted noexec.

    • «taskname»/tmp/: A temporary directory used as scratch space by task drivers.
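
As a minimal sketch of the shared alloc/ directory in use, the following job runs two tasks that exchange a file through alloc/data/. The job, group, and task names and the heartbeat file are hypothetical, and the example assumes the exec driver is available on the client. Both tasks locate the directory through the NOMAD_ALLOC_DIR environment variable instead of hard-coding a path.

```hcl
job "shared-dir-example" {
  datacenters = ["dc1"]

  group "example" {
    # alloc/data/ is the location used by ephemeral_disk for shared data.
    ephemeral_disk {
      size    = 300
      sticky  = true
      migrate = true
    }

    # Hypothetical producer: appends a timestamp to a file in alloc/data/.
    task "writer" {
      driver = "exec"

      config {
        command = "/bin/sh"
        # "$NOMAD_ALLOC_DIR" (no braces) is expanded by the shell at runtime.
        args = ["-c", "while true; do date >> \"$NOMAD_ALLOC_DIR/data/heartbeat\"; sleep 10; done"]
      }
    }

    # Hypothetical consumer: follows the same file from a separate task.
    task "reader" {
      driver = "exec"

      config {
        command = "/bin/sh"
        args    = ["-c", "tail -F \"$NOMAD_ALLOC_DIR/data/heartbeat\""]
      }
    }
  }
}
```

Because neither task assumes a particular host path, the same commands should keep working if the tasks move to a driver with a different filesystem isolation mode.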

The allocation working directory is the directory you see when using the nomad alloc fs command. If you were to run nomad alloc fs against the allocation that made the working directory shown above, you'd see the following:

$ nomad alloc fs c0b2245f
Mode        Size     Modified Time         Name
drwxrwxrwx  4.0 KiB  2020-10-27T18:00:39Z  alloc/
drwxrwxrwx  4.0 KiB  2020-10-27T18:00:32Z  task1/
drwxrwxrwx  4.0 KiB  2020-10-27T18:00:39Z  task2/

$ nomad alloc fs c0b2245f alloc/
Mode        Size     Modified Time         Name
drwxrwxrwx  4.0 KiB  2020-10-27T18:00:32Z  data/
drwxrwxrwx  4.0 KiB  2020-10-27T18:00:39Z  logs/
drwxrwxrwx  4.0 KiB  2020-10-27T18:00:32Z  tmp/

$ nomad alloc fs c0b2245f task1/
Mode         Size     Modified Time         Name
drwxrwxrwx   4.0 KiB  2020-10-27T18:00:33Z  local/
drwxrwxrwx   60 B     2020-10-27T18:00:32Z  private/
drwxrwxrwx   60 B     2020-10-27T18:00:32Z  secrets/
dtrwxrwxrwx  4.0 KiB  2020-10-27T18:00:32Z  tmp/

Task Drivers and Filesystem Isolation Modes

Depending on the task driver, the task's working directory may also be the root directory for the running task. This is determined by the task driver's filesystem isolation capability.

image isolation

Task drivers like docker or qemu use image isolation, where the task driver isolates task filesystems as machine images. These filesystems are owned by the task driver's external process and not by Nomad itself. These filesystems will not typically be found anywhere in the allocation working directory. For example, Docker containers will have their overlay filesystem unpacked to /var/run/docker/containerd/«container_id» by default.

Nomad will provide the NOMAD_ALLOC_DIR, NOMAD_TASK_DIR, and NOMAD_SECRETS_DIR to tasks with image isolation, typically by bind-mounting them to the task driver's filesystem.

You can see an example of image isolation by running the following minimal job:

job "example" {
  datacenters = ["dc1"]

  task "task1" {
    driver = "docker"

    config {
      image = "redis:6.0"
    }
  }
}

If you look at the allocation working directory from the host, you'll see a minimal filesystem tree:

.
├── alloc
│   ├── data
│   ├── logs
│   │   ├── task1.stderr.0
│   │   └── task1.stdout.0
│   └── tmp
└── task1
    ├── local
    ├── private
    ├── secrets
    └── tmp

The nomad alloc fs command shows the same bare directory tree:

$ nomad alloc fs b0686b27
Mode        Size     Modified Time         Name
drwxrwxrwx  4.0 KiB  2020-10-27T18:51:54Z  alloc/
drwxrwxrwx  4.0 KiB  2020-10-27T18:51:54Z  task1/

$ nomad alloc fs b0686b27 task1
Mode         Size     Modified Time         Name
drwxrwxrwx   4.0 KiB  2020-10-27T18:51:54Z  local/
drwxrwxrwx   60 B     2020-10-27T18:51:54Z  private/
drwxrwxrwx   60 B     2020-10-27T18:51:54Z  secrets/
dtrwxrwxrwx  4.0 KiB  2020-10-27T18:51:54Z  tmp/

$ nomad alloc fs b0686b27 task1/local
Mode  Size  Modified Time  Name

If you inspect the Docker container that's created, you'll see three directories bind-mounted into the container:

$ docker inspect 32e | jq '.[0].HostConfig.Binds'
[
  "/var/nomad/alloc/b0686b27-8af3-8252-028f-af485c81a8b3/alloc:/alloc",
  "/var/nomad/alloc/b0686b27-8af3-8252-028f-af485c81a8b3/task1/local:/local",
  "/var/nomad/alloc/b0686b27-8af3-8252-028f-af485c81a8b3/task1/secrets:/secrets"
]

The root filesystem inside the container can see these three mounts, along with the rest of the container filesystem:

$ docker exec -it 32e /bin/sh
# ls /
alloc  boot  dev  home  lib64  media  opt   root  sbin     srv  tmp  var
bin    data  etc  lib   local  mnt    proc  run   secrets  sys  usr
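
To check which paths a task with image isolation actually receives, one option is a throwaway job that prints the three environment variables. This is a sketch only: the busybox image and its tag are assumptions, and any small image that ships /bin/sh should behave the same way.

```hcl
job "example" {
  datacenters = ["dc1"]

  group "example" {
    task "task1" {
      driver = "docker"

      config {
        # Assumption: a small image that provides /bin/sh.
        image   = "busybox:1.36"
        command = "/bin/sh"
        # Variables written as "$VAR" (no braces) are expanded by the shell
        # inside the container, so they show the container's view of the paths.
        args = ["-c", "echo \"$NOMAD_ALLOC_DIR $NOMAD_TASK_DIR $NOMAD_SECRETS_DIR\"; sleep 600"]
      }
    }
  }
}
```

Given the bind mounts shown in the docker inspect output, the task's stdout would be expected to list /alloc, /local, and /secrets. The nomad alloc logs command streams that output.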

Because only these three directories are bind-mounted into the container filesystem, anything written elsewhere in the allocation working directory is not accessible inside the container. This means templates, artifacts, and dispatch payloads for tasks with image isolation must be written into the NOMAD_ALLOC_DIR, NOMAD_TASK_DIR, or NOMAD_SECRETS_DIR.

To work around this limitation, you can use the task driver's mounting capabilities to mount one of the three directories to another location in the task. For example, with the Docker driver you can use the driver's mounts block to bind a secret written by a template block to the NOMAD_SECRETS_DIR into a configuration directory elsewhere in the task:

job "example" {
  datacenters = ["dc1"]

  task "task1" {
    driver = "docker"

    config {
      image = "redis:6.0"

      # Bind the task's secrets directory to a second path in the container.
      mounts = [{
        type     = "bind"
        source   = "secrets"
        target   = "/etc/redis.d"
        readonly = true
      }]
    }

    # template is a task-level block, so it belongs outside the driver config.
    template {
      destination = "${NOMAD_SECRETS_DIR}/redis.conf"
      data        = <<EOT
{{ with secret "secrets/data/redispass" }}
requirepass {{ .Data.data.passwd }}{{ end }}
EOT
    }
  }
}

Note that relative mount source paths are relative to the task working directory, so to bind the NOMAD_ALLOC_DIR as a mount source, you need a relative path that traverses up into the allocation working directory (for example, source = "../alloc").
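
A sketch of that relative path in context follows. The target path /srv/shared is a hypothetical location chosen for illustration; the rest mirrors the Docker example on this page.

```hcl
job "example" {
  datacenters = ["dc1"]

  task "task1" {
    driver = "docker"

    config {
      image = "redis:6.0"

      mounts = [{
        type = "bind"
        # Relative to the task working directory (task1/), so "../alloc"
        # resolves to the shared alloc/ directory one level up.
        source = "../alloc"
        # Hypothetical second location for the shared directory.
        target   = "/srv/shared"
        readonly = false
      }]
    }
  }
}
```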

chroot isolation

Task drivers like exec or java (on Linux) use chroot isolation, where the task driver isolates task filesystems with chroot or pivot_root. These isolated filesystems will be built inside the task working directory.

You can see an example of chroot isolation by running the following minimal job on Linux:

job "example" {
  datacenters = ["dc1"]

  task "task2" {
    driver = "exec"

    config {
      command = "/bin/sh"
      args = ["-c", "sleep 600"]
    }
  }
}

If you look at the allocation working directory from the host, you'll see a filesystem tree that has been populated with the task driver's chroot contents, in addition to the NOMAD_ALLOC_DIR, NOMAD_TASK_DIR, and NOMAD_SECRETS_DIR:

.
├── alloc
│   ├── container
│   ├── data
│   ├── logs
│   └── tmp
└── task2
    ├── alloc
    ├── bin
    ├── dev
    ├── etc
    ├── executor.out
    ├── lib
    ├── lib32
    ├── lib64
    ├── local
    ├── private
    ├── proc
    ├── run
    ├── sbin
    ├── secrets
    ├── sys
    ├── tmp
    └── usr

Likewise, the root directory of the task is now available in the nomad alloc fs command output:

$ nomad alloc fs eebd13a7
Mode        Size     Modified Time         Name
drwxrwxrwx  4.0 KiB  2020-10-27T19:05:24Z  alloc/
drwxrwxrwx  4.0 KiB  2020-10-27T19:05:24Z  task2/

$ nomad alloc fs eebd13a7 task2
Mode         Size     Modified Time         Name
drwxrwxrwx   4.0 KiB  2020-10-27T19:05:24Z  alloc/
drwxr-xr-x   4.0 KiB  2020-10-27T19:05:22Z  bin/
drwxr-xr-x   4.0 KiB  2020-10-27T19:05:24Z  dev/
drwxr-xr-x   4.0 KiB  2020-10-27T19:05:22Z  etc/
-rw-r--r--   297 B    2020-10-27T19:05:24Z  executor.out
drwxr-xr-x   4.0 KiB  2020-10-27T19:05:22Z  lib/
drwxr-xr-x   4.0 KiB  2020-10-27T19:05:22Z  lib32/
drwxr-xr-x   4.0 KiB  2020-10-27T19:05:22Z  lib64/
drwxrwxrwx   4.0 KiB  2020-10-27T19:05:22Z  local/
drwxrwxrwx   60 B     2020-10-27T19:05:22Z  private/
drwxr-xr-x   4.0 KiB  2020-10-27T19:05:24Z  proc/
drwxr-xr-x   4.0 KiB  2020-10-27T19:05:22Z  run/
drwxr-xr-x   12 KiB   2020-10-27T19:05:22Z  sbin/
drwxrwxrwx   60 B     2020-10-27T19:05:22Z  secrets/
drwxr-xr-x   4.0 KiB  2020-10-27T19:05:24Z  sys/
dtrwxrwxrwx  4.0 KiB  2020-10-27T19:05:22Z  tmp/
drwxr-xr-x   4.0 KiB  2020-10-27T19:05:22Z  usr/

Nomad will provide the NOMAD_ALLOC_DIR, NOMAD_TASK_DIR, and NOMAD_SECRETS_DIR to tasks with chroot isolation. But unlike with image isolation, Nomad does not need to bind-mount the NOMAD_TASK_DIR directory because it can be directly created inside the chroot.

$ nomad alloc exec eebd13a7 /bin/sh
$ mount
...
/dev/mapper/root on /alloc type ext4 (rw,relatime,errors=remount-ro,data=ordered)
tmpfs on /private type tmpfs (rw,noexec,relatime,size=1024k)
tmpfs on /secrets type tmpfs (rw,noexec,relatime,size=1024k,noswap)
...

none isolation

The raw_exec task driver (or the java task driver on Windows) uses the none filesystem isolation mode. This means the task driver does not isolate the filesystem for the task, and the task can read and write anywhere the user that's running Nomad can.

You can see an example of none isolation by running the following minimal raw_exec job on Linux or Unix.

job "example" {
  datacenters = ["dc1"]

  task "task3" {
    driver = "raw_exec"

    config {
      command = "/bin/sh"
      args = ["-c", "sleep 600"]
    }
  }
}

If you look at the allocation working directory from the host, you'll see a minimal filesystem tree:

.
├── alloc
│   ├── data
│   ├── logs
│   │   ├── task3.stderr.0
│   │   └── task3.stdout.0
│   └── tmp
└── task3
    ├── executor.out
    ├── local
    ├── private
    ├── secrets
    └── tmp

The nomad alloc fs command shows the same bare directory tree:

$ nomad alloc fs 87ec7d12 task3
Mode         Size     Modified Time         Name
-rw-r--r--   140 B    2020-10-27T19:15:33Z  executor.out
drwxrwxrwx   4.0 KiB  2020-10-27T19:15:33Z  local/
drwxrwxrwx   60 B     2020-10-27T19:15:33Z  private/
drwxrwxrwx   60 B     2020-10-27T19:15:33Z  secrets/
dtrwxrwxrwx  4.0 KiB  2020-10-27T19:15:33Z  tmp/

But if you use nomad alloc exec to view the filesystem from inside the task, you'll see that the task has access to the entire root filesystem. The NOMAD_ALLOC_DIR, NOMAD_TASK_DIR, and NOMAD_SECRETS_DIR point to the filepath on the host, not a path anchored in the task working directory. The task also runs as root, because the Nomad client agent is running as root. This is why the raw_exec driver is disabled by default.

$ nomad alloc exec 87ec7d12 /bin/sh
# ls /
bin   dev  home        lib    lib64   lost+found  mnt  proc  run   snap  sys  usr  vmlinuz
boot  etc  initrd.img  lib32  libx32  media       opt  root  sbin  srv   tmp  var

# echo $NOMAD_SECRETS_DIR
/var/nomad/alloc/87ec7d12-5e35-8fba-96cc-09e5376be15a/task3/secrets

# whoami
root

Templates, Artifacts, and Dispatch Payloads

The other contents of the allocation working directory depend on what features the job specification uses. The allocation working directory is populated by other features in a specific order:

  • The allocation working directory is created.
  • The ephemeral disk data is migrated from any previous allocation.
  • CSI volumes are staged.
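  • Then, for each task:

    • The task working directory is created.
    • Dispatch payloads are written.
    • Artifacts are downloaded.
    • Templates are rendered.
    • The task driver starts the task, which includes all bind mounts and volume mounts.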

Dispatch payloads, artifacts, and templates are written to the task working directory before a task can start because the resulting files may be a binary or image run by the task. For example, an artifact can be used to download a Docker image or .jar file, or a template can be used to render a shell script that's run by exec.
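
The last case, a template that renders a script for exec to run, might look like the following sketch. The job name, script name, and script contents are hypothetical. Because templates are written before the task starts, the script is already in place when the driver launches /bin/sh.

```hcl
job "render-script-example" {
  datacenters = ["dc1"]

  group "example" {
    task "task2" {
      driver = "exec"

      # Written into the task's local/ directory before the task starts.
      template {
        destination = "local/run.sh"
        data        = <<EOT
#!/bin/sh
echo "started from a rendered template"
exec sleep 600
EOT
      }

      config {
        command = "/bin/sh"
        # NOMAD_TASK_DIR is the task's local/ directory as seen by the task.
        args = ["${NOMAD_TASK_DIR}/run.sh"]
      }
    }
  }
}
```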

The artifact and template blocks write their data to a destination relative to the task working directory, not the NOMAD_TASK_DIR. For task drivers with image filesystem isolation, this means the destination field path should be prefixed with either NOMAD_TASK_DIR or NOMAD_SECRETS_DIR. Otherwise, the file will not be visible from inside the resulting container. (The dispatch_payload block always writes its data to the NOMAD_TASK_DIR.)
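
As a sketch of that prefixing for a task with image isolation, the following Docker task keeps both destinations inside directories that are bind-mounted into the container. The job name, artifact URL, and file names are hypothetical placeholders.

```hcl
job "destinations-example" {
  datacenters = ["dc1"]

  group "example" {
    task "task1" {
      driver = "docker"

      config {
        image = "redis:6.0"
      }

      # Downloaded into local/, which the container sees as NOMAD_TASK_DIR.
      artifact {
        source      = "https://example.com/app-config.tar.gz" # hypothetical URL
        destination = "${NOMAD_TASK_DIR}/config"
      }

      # Rendered into secrets/, which the container sees as NOMAD_SECRETS_DIR.
      template {
        destination = "${NOMAD_SECRETS_DIR}/app.env"
        data        = <<EOT
LOG_LEVEL=info
EOT
      }

      # By contrast, destination = "etc/app.env" would be written to
      # task1/etc/ on the host, outside the three bind-mounted directories,
      # so the container could not see it.
    }
  }
}
```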

For CSI volumes, the client will stage the volume before setting up the task working directory. Staging typically involves mounting the volume into the CSI plugin's task directory, sending commands to the plugin to format the volume as required, and making a volume claim to the Nomad server.

The behavior of the volume_mount block is controlled by the task driver. The client builds a mount configuration describing the host volume or CSI volume and passes it to the task driver to execute. Because the task driver mounts the volume, it is not possible to have artifact, template, or dispatch_payload blocks write to a volume.
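
The following sketch shows where each piece is declared. It assumes the client configuration defines a host volume named "shared-data"; that name, the job name, and the /srv/data destination are hypothetical.

```hcl
job "volume-example" {
  datacenters = ["dc1"]

  group "example" {
    # Assumption: the client agent configuration defines a host volume
    # named "shared-data".
    volume "data" {
      type      = "host"
      source    = "shared-data"
      read_only = false
    }

    task "task1" {
      driver = "docker"

      config {
        image = "redis:6.0"
      }

      # The task driver performs this mount when it starts the task.
      volume_mount {
        volume      = "data"
        destination = "/srv/data"
        read_only   = false
      }

      # Nomad renders this file before the driver mounts the volume, so the
      # destination is a task directory rather than a path on the volume.
      template {
        destination = "${NOMAD_TASK_DIR}/redis.conf"
        data        = "appendonly yes\n"
      }
    }
  }
}
```

If a file has to end up on the volume, one option is to have the task copy it from NOMAD_TASK_DIR after it starts.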