0% found this document useful (0 votes)

112 views

Guide Magv

MOSIX is a cluster operating system targeted for High performance Computing on x86 architectures. It incorporates dynamic resource discovery and automatic workload distribution. In a MOSIX system, users can run applications by creating multiple processes.

Uploaded by

miguelg_216

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

112 views

Guide Magv

Uploaded by

miguelg_216

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 112

MOSIX

Cluster Operating System

Users and Administrators Guides and Manuals

Revised for MOSIX-3.4.0.6

May 2013

Preface
MOSIX1 is a cluster operating system targeted for High Performance Computing on x86 architectures Linux clusters, multi-clusters and Clouds. It incorporates dynamic resource discovery and automatic workload distribution, commonly found on single computers with multiple processors. In a MOSIX system, users can run applications by creating multiple processes, then let MOSIX seek resources and automatically migrate processes among nodes to improve the overall performance, without changing the run-time environment of the migrated processes.

Audience
This document is provided for general information. It is intended for users and system administrators who are familiar with general Linux concepts. Users of MOSIX are advised to rely on the documents provided with their specic MOSIX distribution.

Main changes in version 3

Using Linux kernel 3 for only the 64-bit X86 architecture. Removed support for tuning, topology and cluster partitions. All MOSIX programs start with mos, i.e., mosmon, mosmigrate, mosnative, mosbestnode, mosqmd, mossetpe, mostestload, mospostald, mosrc, mosrcd, mosremoted, mossetcl, mostimeof. Separated mosbatch from mosrun for running batch jobs (formerly mosrun -E). Separated mosbatchenv from mosenv for running batch jobs with protected environment variables (formerly mosenv -E). mosps dierentiates between batch jobs and migratable programs with an alternative home-node. Default for both mosrun -M and mosbatch is -b (select best node). mosctl name-changes. Updated manuals and guides.

Organization
The chapters in this document are arranged in four parts. The rst part provides general information about MOSIX, terminology and system requirements. The users guide is presented in the second part. It includes chapters on how to run MOSIX programs, running batch jobs, operating on your existing programs and jobs and the MOSIX Reach the Clouds package. The
1

MOSIX R is a registered trademark of A. Barak and A. Shiloh.

iii

PREFACE

administrator guide is in the third part. It includes chapters about congurations, storage allocation, managing jobs and security. The forth part includes the MOSIX manuals. Further information is available in the MOSIX web at http : //www.M OSIX.org.

Contents
Preface iii

Part I General
1 Terminology 2 What MOSIX is and is not 2.1 What MOSIX is . . . . . . . . . . . . . . . 2.1.1 The main cluster features of MOSIX 2.1.2 Additional multi-cluster features . . 2.1.3 Additional Cloud features . . . . . . 2.2 What MOSIX is not . . . . . . . . . . . . . 3 System requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1
3 5 5 5 6 6 6 7

Part II Users Guide

4 Running MOSIX programs 4.1 The basics . . . . . . . . . . . . . . . . . . 4.2 Advance options . . . . . . . . . . . . . . 4.3 Manual migration . . . . . . . . . . . . . . 4.4 Spawning Native Linux child programs . . 4.5 Using pipes between migrated programs . 4.6 Find how long your program was running 4.7 Error messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9
11 11 12 14 14 15 15 15 19 19 19 19 21 21 22 22 23 v

5 Running batch jobs 5.1 The dierences between batch jobs and migratable programs . . . . . . . . . . . 5.2 When to run a batch job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3 How to run a batch job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Operating on your existing programs and jobs 6.1 Listing your MOSIX-related processes . . . . . . . . . . . . . . . . . . . . . . . . 6.2 Listing and controlling queued programs and jobs . . . . . . . . . . . . . . . . . . 6.3 Killing your programs and jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 MOSIX Reach the Clouds (MOSRC)

CONTENTS

Part III Administrators Guide

8 Conguration 8.1 General . . . . . . . . . . . . . . . . . . . . . . . 8.2 Conguring the single cluster . . . . . . . . . . . 8.2.1 Participating nodes . . . . . . . . . . . . . 8.2.2 Advanced options . . . . . . . . . . . . . 8.3 Conguring the multi-cluster . . . . . . . . . . . 8.3.1 Partner-clusters . . . . . . . . . . . . . . . 8.3.2 Which nodes are in a partner-cluster . . . 8.3.3 Partner-cluster relationship . . . . . . . . 8.3.4 Priorities . . . . . . . . . . . . . . . . . . 8.3.5 Priority stabilization . . . . . . . . . . . . 8.3.6 Maximum number of guests . . . . . . . . 8.4 Conguring the queuing system . . . . . . . . . . 8.4.1 Queuing is an option . . . . . . . . . . . . 8.4.2 Selecting queue-managers . . . . . . . . . 8.4.3 Advanced . . . . . . . . . . . . . . . . . . 8.5 Conguring the freezing policies . . . . . . . . . . 8.5.1 Overview . . . . . . . . . . . . . . . . . . 8.5.2 Process classes . . . . . . . . . . . . . . . 8.5.3 Freezing-policy details . . . . . . . . . . . 8.5.4 Disk-space for freezing . . . . . . . . . . . 8.5.5 Ownership of freezing-les . . . . . . . . . 8.6 Conguring parameters of mosrun . . . . . . . . 8.7 Conguring MOSRC (MOSIX Reach the Clouds) 8.7.1 Which nodes . . . . . . . . . . . . . . . . 8.7.2 Which users and groups . . . . . . . . . . 8.7.3 Prohibited directories . . . . . . . . . . . 8.7.4 Empty directories . . . . . . . . . . . . . 8.7.5 Predened security levels . . . . . . . . . 8.7.6 On the launching side . . . . . . . . . . . 9 Storage allocation 9.1 Swap space . . . 9.2 MOSIX les . . . 9.3 Freezing space . . 9.4 Private-le space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

25
27 27 28 28 28 29 29 29 30 30 30 31 31 31 31 31 33 33 33 33 35 35 35 36 36 36 36 37 37 37 39 39 39 39 40 41 41 41 41 41 41 42

. . . .

10 Managing jobs 10.1 Monitoring (mosmon) . . . . . . . . . . . . . . . 10.2 Listing MOSIX processes (mosps) . . . . . . . . . 10.3 Controlling running processes (migrate) . . . . . 10.4 Viewing and controlling queued processes (mosq) 10.5 Controlling the MOSIX node (mosctl) . . . . . . 10.6 If you wish to limit what users can run . . . . . .

. . . . . .

CONTENTS 11 Security 11.1 Abuse by gaining control of a node . . 11.2 Abuse by connecting hostile computers 11.3 Multi-cluster password . . . . . . . . . 11.4 Organizational multi-cluster . . . . . . 11.5 Batch password . . . . . . . . . . . . .

vii 45 45 45 45 45 46

. . . . .

Part IV Manuals
12 Manuals 12.1 For users . . . . . . 12.2 For programmers . 12.3 For administrators 12.4 Special package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

47
49 49 49 49 49

viii

CONTENTS

Part I General

PART I

Chapter 1

Terminology
The following terms are used throughout this document: Node - a participating computer (physical or virtual), whose unique IP address is congured to be part of a MOSIX cluster or multi-cluster. Processor - a CPU (Central Processing Unit or a Core): most recent computers have several processors. (Hyper-Threads do not constitute dierent processors). Process - a unit of computation that is started by the fork (or vfork) system call and maintains a unique identier (PID) throughout its life-time (for the purpose of this document, units of computation that are started by the clone system call are called threads and are not included in this denition). Program - An instance of running an executable le: a program can result in one or more processes. Batch Job - An instance of running a non-migratable executable le (along with given parameters and environment). Launching-node - The node from which a user (or a script) invokes mosrun, mosbatch or mosrc (note that MOSIX has no concept of a head-node: while the system- administrator may choose to assign head-nodes, MOSIX does not require it and is not aware of that). Home-node - The node to which a migratable process belongs: a migratable process sees the world (le-systems, network, other processes, etc.) from the perspective of this node. The home-node is usually the launching node, except when using mosrun -M when they can dier. Home-cluster - the cluster to which the home-node of a process belongs. Local process - a process that runs in its home-node. Guest process - a process whose home-node is elsewhere, but is currently running here (on the node being administered). Cluster - one or more computers - workstations, servers, blades, multi-core computers, etc. possibly of dierent speeds and number of processors, called nodes, that are owned and managed by the same entity (a person, a group of people or a project) in which all the nodes run the same version of MOSIX and are congured to work tightly together. Note that a MOSIX cluster can at times be dierent than hardware clusters. For example, it can consist of several hardware-clusters or just part of a hardware-cluster. 3

CHAPTER 1. TERMINOLOGY

Multi-cluster - a collection of MOSIX clusters that run the same version of MOSIX and are congured to work together. A MOSIX multi-cluster usually belongs to the same organization, but each cluster may be administrated by a dierent owner or belongs to a dierent group. These owners trust each other and wish to share some computational resources among them. Cloud - a collection of entities such as MOSIX clusters, MOSIX multi-clusters, Linux clusters (such as a group of Linux servers), individual workstations and Virtual Machines (VM), in which nodes in each entity are aware of one or more nodes in other entities. Each entity may possibly run a dierent version of Linux or MOSIX. In a MOSIX Cloud, dierent entities are usually administrated by dierent owners and rarely share any le systems. Your - cluster, addresses, nodes, computers, users, etc. that the sysadmin currently administer or congure.

Chapter 2

What MOSIX is and is not

2.1 What MOSIX is

MOSIX is an extension of the Linux operating system for managing clusters, multi-clusters and Clouds eciently. MOSIX is intended primarily for High Performance Computing (HPC). The main tool employed by MOSIX is preemptive process migration - a process may start on one node, then transparently move to other nodes. Migration is repeated as necessary (possibly even returning to where it started). Process migration occurs automatically and transparently, in response to resource availability. Process migration is utilized to optimize the overall performance.

2.1.1

The main cluster features of MOSIX

Provides a single-system image. Users can login on any node and do not need to know where their programs run. No need to modify or link applications with special libraries. No need to copy les to remote nodes. Automatic resource discovery and workload distribution: Load-balancing by process migration. Migrating processes from slower to faster nodes and from nodes that run out of free memory. Migratable sockets for direct communication between migrated processes. Provides a secure run time environment (sandbox) for guest processes. Supports live queuing - queued jobs preserve their full generic Linux environment. Supports batch jobs. Supports checkpoint and recovery. Supports 64-bit x86 architectures. Includes tools for automatic installation and conguration. Includes an on-line monitor. 5

CHAPTER 2. WHAT MOSIX IS AND IS NOT

2.1.2

Additional multi-cluster features

Supports disruptive congurations: Clusters can join or leave the multi-cluster at any time. Guest processes move out before disconnecting a cluster. Clusters can be shared symmetrically or asymmetrically. Cluster owner can assign dierent priorities to guest processes from other clusters.

2.1.3

Additional Cloud features

Supports all the above clusters and multi-clusters features. Nodes can be added or be disconnected from the Cloud at any time. Can run in non-virtualized or VM environments. Can run in any platform that supports virtualization, e.g. Linux or Windows. MOSIX Reach the Clouds (MOSRC) is a tool that allows applications to start on one computer and run in remote nodes on other clusters, e.g. on Clouds, without pre-copying les to these remote nodes. The main features of MOSRC are: Runs on both MOSIX clusters and Linux computers (with unmodied kernel). No need to pre-copy les to remote clusters. Applications can access both local and remote les. Supports le sharing among dierent computers. Stdin/out/err are preserved locally. Can be combined with mosrun on remote MOSIX clusters.

2.2

What MOSIX is not

A Linux distribution. A Linux kernel. A cluster set-up and installation tool. MOSIX does not: Improve performance of intensive I/O programs. Improve performance of non-computational server-applications, such as web or mail servers. Support high-availability. Support shared-memory and threaded programs.

Chapter 3

System requirements
Any combination of 64-bit computers of the x86 architecture. Multiprocessor computers (SMP, dual-core, quad-core or multi-core) are supported, but all the processors/cores of each node must be of the same speed. All the nodes must be connected to a network that supports TCP/IP and UDP/IP. Each node should have a unique IP address in the range 0.1.0.0 to 255.255.254.255 that is accessible to all the other nodes. TCP/IP ports 250-253 and UDP/IP ports 249-250 and 253 should be reserved for MOSIX (not used by other applications or blocked by a rewall). MOSIX can be installed on top of any Linux distribution: mixing of dierent Linux distributions on dierent nodes is allowed.

CHAPTER 3. SYSTEM REQUIREMENTS

Part II Users Guide

PART II

Chapter 4

Running MOSIX programs

4.1 The basics

MOSIX allows your programs to be migrated within a computer-cluster, or even within a multicluster. A process can migrate either automatically such as to achieve load-balancing, or by manual user requests. No matter how many times a process migrates, it continues to transparently behave as if it was running on its home-node, which is usually where it was launched. For example, it only uses the les, sockets and similar resources of its home-node. To run a migratable program, use: mosrun [flags] {program} [arguments] Following are the basic ags to use: -b Start your program on the best available node (computer). If unsure, use it! -g Allow your program to run on the whole multi-cluster. If unsure, use it! -m {mb} Tell MOSIX how much memory your program may require, in Megabytes. If the amount changes during execution, use the maximum. Use this option when you can, also as a courtesy to other users. -e OR -w If mosrun complains about unsupported functions, then these can help relax some restrictions so you can run your program anyway. In almost all cases, if your program runs, it will still run correctly. Running the C-shell (csh) for example, requires this option. The dierence between -e and -w is that -w also logs all occurrences of unsupported functions to the standard-error and is therefore useful for debugging, whereas -e is silent. -M OR -M/tmp Run your program from a possibly dierent node, using the les and other resources of that node instead of the launching node. Use -M if your current-directory exists on all the other nodes of your cluster, or -M/tmp if it doesnt. -q Queue your program, rather than start it straight away. In some installations, your system-administrator can make queuing compulsory and even enforce it automatically 11

12 without the -q argument.

CHAPTER 4. RUNNING MOSIX PROGRAMS

Queuing is important when you want to run too many programs at once how many programs are too-many depends on the size of your cluster or multi-cluster and on what other users are doing). Your program will then share the common per-cluster queue with other users and/or your other programs.
-S{maxprograms} The common queue is good because it adapts well to the changing conditions in the cluster, but it can only handle a limited number of programs, typically a few 1000s (and your systemadministrator may enforce a xed limit). If you want to run a larger set of programs, you should create your own private queue, using: mosrun -S{maxprograms} [other-flags] {commands-file} where {commands-le} contains one command per line. mosrun will then run {maxprograms} commands at any given time until all programs complete. You may still use the common queue AS WELL (using the -q ag) and the above ags (-b, -G, -mmb, -e or -w) are still recommended.

Example 1: Running a single program without queuing. The program is estimated to require 400MB of memory at most: % mosrun -b -G -m400 simple program Example 2: Running a large set of programs using both queues. Each program requires a memory of 500MB, the private queue is used to limit the maximum number of simultaneous programs to 800. The programs are allowed to run with a dierent home-node. % cat my my my my my my script program program program program program

-x -x -x -x -x

-y -y -y -y -y

-z -z -z -z -z

< < < < <

/home/myself/file1 /home/myself/file2 /home/myself/file3 /home/myself/file4 /home/myself/file5

> > > > >

/home/myself/output1 /home/myself/output2 /home/myself/output3 /home/myself/output4 /home/myself/output5

% mosrun -M/tmp -b -G -m500 -q -e -S200 my script

4.2

Advance options

You can tell mosrun where to start your program, using one of the following arguments: -b -h -r{host} -r{IP-address} -{number} -j{list} Allow MOSIX to choose the best place to start. On your own computer. On the given computer. On the computer with the given IP-address. On the given node-number (set by the system-administrator). Select randomly out of the comma-separated list.

4.2. ADVANCE OPTIONS

While this usually determines where the program will start running, when the -M ag is used, it determines where the programs home- node will be. When -M is used without -b, the program will also start running in its home-node, but when both the -M and -b ags are used, the program may start elsewhere. When -M is used in a multi-cluster, the selection must be within the local cluster. If you request to start your program on a computer that is down or refuses to run it, the program will fail to start unless you also provide the -F ag, in which case the program will then start on the launching node. By default, programs are only allowed to migrate within their home- nodes cluster. To allow them to migrate to other clusters within the organizational multi-cluster, use the -G ag. A class-number may follow the -G (such as -G15): class-numbers are usually supplied by your system-administrator. The -L ag prevents automatic migrations. Your program may still be migrated manually, and will still forcefully migrate back to its home-node in certain emergencies. The -l ag cancels -L: this is useful, for example, when mosrun is called by a shell-script that already runs under mosrun -L. Certain Linux features, such as shared-memory, certain system-calls and certain system-call parameters, are not supported. By default, when such features are encountered, the program is terminated. The -e (silent) or -w (verbose) options allow the program to continue and instead only fail the unsupported system-calls as they occur. The -u option cancels both -e and -w. By default, the gettimeofday() system-call returns the clock-time of the home-node, so programs need to contact their home-node to obtain it. Some programs use gettimeofday() very frequently, making this quite expensive. The -t option allows your program to fetch the clock-time from the node where it is running instead. The -T option cancels -t. To run a program with a home-node other than its launching-node (where mosrun is called), use the -M ag. The letter M can be immediately followed by a directorys name (for example, M/tmp), to become the current-directory for the program - otherwise, the program will start in the directory named as the current-directory on the launching node (and if that directory is missing or inaccessible there, the program will fail). With -M, the -i ag gives the program an exclusive use of its standard-input: it is usually advisable to use it, except when several programs can read from the same input, such as your terminal (which is not common). When using the common queue (the -q argument), if you expect your program to split into a number of parallel processes, please use the Pnumber-of-processes option, so the queuingsystem can take this into account. Doing this is also a matter of courtesy to other users. You may replace the -q argument with -Q. The dierence is that while normally, if the program is aborted from the queue (See man mosq), or if the queuing system fails, the program is killed, while with the -Q argument it starts running instead as if it was never queued. When using your private queue, if you want to nd which programs (if any) failed, you can specify a second le-name: mosrun -S{number} {script-file},{fail-file} Command-lines that failed will then be listed in {fail-le}. The -J{JobID} argument associates several instances of mosrun with a single job ID for easy identication and manipulation (by mosq, migrate, mosps and moskillall - see the corresponding manual pages). Job-IDs can be either a non-negative integer or a token from the le $HOME/.jobids: if this le exists, each line in it contains a number (JobID) followed

CHAPTER 4. RUNNING MOSIX PROGRAMS

by a token that can be used as a synonym to that JobID. The default JobID is 0. Job IDs are inherited by child processes. Job IDs are not supported in combination with the -M ag. The -D{timespec} allows you to provide an estimate on how long you believe your job should run. For details about {timespec}, see man mosrun. This allows you to view the estimated remaining time using mosps -D. Periodic automatic checkpoints can be produced using the A{minutes} argument. By default, checkpoints are saved to les whose names start with ckpt.{process-ID}: the C{lename} argument can be used to select dierent le-names. Checkpoint-les have a numeric extension to determine the checkpoint-number, such as myckpt.1, myckpt.2, myckpt.3. The N{max} argument can be used to limit the number of checkpoint- les: once that maximum is reached, checkpoint-numbers will start again at 1, so new checkpoints will override the earlier checkpoints. To resume running from a checkpoint le, use mosrun -R{checkpoint- le}. It is also possible to resume a program with dierent opened-les than the les that were open when the checkpoint was taken - for details, see man mosrun. Private temporary-directories can be specied with the -X option. Other advanced options (-c, -n, -d) can aect the automatic migration-considerations of a program (see man mosrun for details).

4.3

Manual migration

The following commands can be used for manual migration: mosmigrate {pid} {hostanme or IP-address or node-number} to migrate a process to the given computer. mosmigrate {pid} home to migrate a process back to to its home-node. mosmigrate {pid} freeze to freeze your process. mosmigrate {pid} continue to unfreeze your process. mosmigrate {pid} checkpoint to cause your process to generate a checkpoint. mosmigrate {pid} chkstop to cause your process to generate a checkpoint and stop with a SIGSTOP signal. mosmigrate {pid} chkexit to cause your process to generate a checkpoint and exit.

4.4

Spawning Native Linux child programs

Once a program (including a shell) runs under mosrun, all its child-processes will be migratable as well, but also subject to the limitations of migratable programs (such as being unable to use shared-memory and threading). If your shell, or shell-script, is already running under mosrun and you want to run some program as a standard Linux program, NOT under MOSIX, use the command: mosnative {program} [args]...

4.5. USING PIPES BETWEEN MIGRATED PROGRAMS

4.5

Using pipes between migrated programs

It is common to run two or more programs in a pipeline, so that the output of the rst becomes the input of the second, etc. You can do this using the shell: program1 | program2 | program3 ... If your shell (or shell-script) that generates the pipeline is running under mosrun and the amount of data transferred between the programs is large, this operation can be quite slow. Eciency can be gained by using the MOSIX direct-communication feature. Use: mospipe program1 [args1]..." program2 [args2]..." program3... mospipe can substitute mosrun, so you do not need to use mosrun mospipe and the arguments that inform mosrun where to run can be given to mospipe instead. Complete details can be found in man mospipe.

4.6

Find how long your program was running

Run timeof {pid} to nd out the total user-level running-time that was accumulated by a MOSIX process. {pid} is the process-ID (which could be obtained by mosps or mosq). Several process- IDs may be specied at once (timeof {pid1} {pid2} ...).

4.7

Error messages

The following group of errors indicate that the program encountered a feature that is not supported by mosrun: system-call {system-call-name} not supported under MOSIX Shared memory (MAP SHARED) not supported under MOSIX Attaching SYSV shared-memory not supported under MOSIX Prctl option #{number} not supported under MOSIX IPC system-call #{number} not supported under MOSIX Sysfs option {number} not supported under MOSIX Ioctl 0x{hexadecimal-number} not supported under MOSIX Mapping special character les not supported under MOSIX getpriority/setpriority supported under MOSIX only for self If you see any of the above errors you may either: 1. Use mosrun -e (or mosrun -w) to make the program continue anyway (although the unsupported feature will fail) 2. Run the program without mosrun or as a batch job (using mosbatch). 3. Modify your program so it does not use the unsupported feature.

16 Other errors include:

CHAPTER 4. RUNNING MOSIX PROGRAMS

kernel does not support full ptrace options - Make sure that a kernel with the MOSIX kernel-patch is properly installed. failed allocating memory - There is not enough memory available on this computer: try again later. illegal system call #{number} - The program attempted to run a system-call with a bad number: there could be a bug in the program, or the MOSIX version is very old and new system-calls were added since. sysfs detected an unreasonably long le-system name - The size of the buer provided to the sysfs() system-call is unreasonably large (more than 512 bytes - probably a fault in the library). WARNING: setrlimit(RLIMIT NOFILE) ignored by MOSIX - MOSIX does not allow programs to change their open-les limit. File-descriptor #{number} is open (only 1024 les supported under MOSIX) mosrun was called with a open le-descriptor numbered 1024 or higher: this is not supported. Failed reading memory-maps - Either /proc is not mounted, or the kernel is temporarily out of resources. Failed opening memory le - Either proc is not mounted, or the kernel is temporarily out of resources. Kernel too secure to run MOSIX (by non-Super-User) - In older MOSIX releases, the CONFIG SECURITY kernel-option conicted with MOSIX (allowing only the SuperUser to use mosrun). This is no longer a problem in the latest MOSIX releases. Kernel missing the MOSIX patch - mosrun cannot run without the MOSIX kernelpatch. failed migrating to {computer}: {reason} - Failed attempt to start the program on the requested computer. Reasons include: not in map - other computer is not recognized as part of this cluster or multi-cluster. outside cluster - requested computer is in a dierent cluster, but no mosrun -G option was given. no response - the most likely reasons are that MOSIX is not running on the requested computer, or a re-wall blocking TCP/IP port 253 on the requested computer. other node refused - the requested computer was not willing to accept the program. other node has no MOSIX kernel - the requested computer must have the MOSIX kernel-patch installed in order to be able to accept guest programs. other node has a wrong kernel - the requested computer must have a MOSIX kernel-patch that matches its usermode MOSIX version in order to be able to accept guest programs.

4.7. ERROR MESSAGES

did not complete (no memory there?) - there were probably not enough resources to complete the migration, or perhaps the requested computer just crashed or was powered-o. To run outside the cluster ({computer}), you must use -G - To allow your program to run in another cluster of a multi-cluster, you must use the mosrun -G option. failed sending job to {computer} - The batch-job failed to start on the given computer: the most common reason is that the requested computer is down or does not run MOSIX. {computer} is too busy to receive now - The requested computer refused to run the batch-job. could not enter directory ({directory}) on {computer} - The requested computer does not have the given directory, where the batch job is supposed to start (or perhaps that directory exists but you have no permission to enter it): consider using mosrun -E{directory} (or mosrun -M{directory}). connection timed out - The other computer stopped responding while preparing to run a batch-job. Perhaps it crashed, or perhaps it runs a dierent version of MOSIX, or perhaps even a dierent daemon is listening to TCP/IP port 250. batch refused by other party ({computer}) - The requested computer refused to run the batch-job from this computer. Lost communication with {computer} - The TCP/IP connection with the computer that was running the program was severed. Unfortunately this means that the program had to be killed. Process killed while attempting to migrate from {computer1} to {computer2} Connection was severed while the program was migrating from one remote computer to another. Unfortunately this means that the program had to be killed. Unfreeze failed - The program was frozen (usually due to a very high load), then an attempt to un-freeze it failed, probably because there was not enough memory on this computer. Recovery was not possible. Failed decompressing freeze le - The program was frozen (usually due to a very high load), but there were not enough resources (memory/processes) to complete the operation and recovery was not possible. Re-freezing because unfreeze failed - The program was frozen (usually due to a very high load), then an attempt to un-freeze it failed, probably because there was not enough memory on this computer. Recovery was possible by re-freezing the program: you may want to manually un-freeze it later when more memory is available. No space to freeze - The disk-space that was allocated for freezing (usually by the systemadministrator) was insucient and so freezing failed. The MOSIX conguration indicated not to recover in this situation, so the program was killed. Security Compromised - Please inform this to your system-administrator and ask them to run mosconf, select the Authentication section and set new passwords immediately. Authentication violation with {computer} - The given computer does not share the same password as this computer: perhaps someone connected a dierent computer to the network which does not really belong to the cluster? Please inform your system-administrator! Target node runs an incompatible version - Your computer and the computer on which you want to start your batch job do not run the same (or a compatible) version of MOSIX.

CHAPTER 4. RUNNING MOSIX PROGRAMS

{program} is a 32-bit program - will run in native Linux mode - 32-bit programs are not migratable: your program will run instead as a standard Linux program: consider re-compiling your program for the 64-bit architecture. remote-site (i{computer}) seems to be dead - No heart-beat detected from the computer on which your program runs. remote-site ({computer}) seems to be dead - No heart-beat detected from the computer on which your program runs. Corrupt or improper checkpoint le - Perhaps this is the wrong le, or was tempered with, or the checkpoint was produced by an older version of MOSIX that is no longer compatible. Could not restart with {lename}: {reason} - Failed to open the checkpoint-le. File-descriptor {le-number} was not open at the time of checkpoint! - When continuing from a checkpoint, attempted to redirect a le (using mosrun -R -O, see man mosrun) that was not open at the time of checkpoint. Restoration failed: {reason} - Insucient resources to restart from the checkpoint: try again later. checkpoint le is compressed - but no /usr/bin/lzop here! The program usr/bin/lzop is missing on this computer (perhaps the checkpoint was taken on another computer?). WARNING: no write-access in checkpoint directory! - You requested to take checkpoints, but have no permission to create new checkpoint-les in the specied directory (the current-directory by default). Checkpoint le {lename}: {reason} - Failed to open the checkpoint-le for inspection. Could not restore le-descriptor {number} with {lename}: {reason} - When continuing from a checkpoint, the attempt to redirect the given opened-le failed. Restoration failed - The checkpoint le is probably corrupt. problem with /bin/mosqueue - The program mosqueue is missing - please contact your system- administrator: mosqueue should be in either /bin, /usr/bin or /usr/local/bin. Line #{line-number} is too long or broken! - The given line in the script-le (mosrun -S) is either too long or does not end with a line-feed character. Commands-le changed: failed-commands le is incomplete! - The script-le was modied while mosrun -S is running: you should not do that! Failed writing to failed-commands le! - Due to some write-error, you will not be able to know from {fail-le} which of your commands (if any) failed. Invalid private-directory name ({name}) - Private-directories (where private-temporaryles live) must start with / and not include ... Disallowed private-directory name ({name}) - Private-directories (where privatetemporary-les live) must not be within /etc, /proc, /sys or /dev. Too many private directories - The maximum is 10. Private directory name too long - The maximum is 256 characters. Insucient disk-space for private les - As your program migrated back to your computer, it was found that it used more private-temporary-le space than allowed. This is usually a conguration problem (has your system- administrator decreased this limit while your program was running on another computer?).

Chapter 5

Running batch jobs

5.1 The dierences between batch jobs and migratable programs

The main dierences between migratable programs and batch jobs are: 1. Batch jobs are stationary - they begin, stay and end in the same node. 2. Batch jobs use the les that are available where they run (except for their standardinput/output/error, which is redirected from/to their launching-node), whereas migratable programs do all their Input/Output on their home-node. Note that the fact that some les (such as NFS-mounted les) happen to be available on both nodes (the home-node and the remote-node where they run) makes no dierence for migratable programs - it is a common mistake to believe that such les will be read/written directly on the remote-node: for security and compatibility reasons, les will still be read/written via the home-node. 3. MOSIX does not support certain functions (such as shared-memory and threading) for migratable programs. Batch jobs may access all Linux functions.

5.2

When to run a batch job

You should run a batch job when either: 1. Your program cannot run under mosrun because it uses unsupported function(s). 2. Your program cannot run under mosrun because you have not installed the MOSIX kernel patch. 3. Your program would be slow under mosrun because it does a considerable amount of I/O. 4. You do not require process-migration.

5.3

How to run a batch job

To run a batch job, use: mosbatch [flags] {program} [arguments] 19

CHAPTER 5. RUNNING BATCH JOBS The following commonly-used ags are similar to those of mosrun (above): -m{mb} -i -q -S{maxjobs} (recommended) (recommended unless interactive) (to queue with the common queue) (to queue with a private queue)

The following ags are also similar to those of mosrun, but are rarely used: -b OR -h OR -r{host/IP} OR -{number} OR -j{list} -Q -P{number-of-processes} -F -z -J -D If you run mosbatch from a directory that is not present on all the nodes of the cluster, or if you wish to run your program with a dierent current-directory than your current one on the launching- node, use the ag: -E{/directory} (e.g. -E/tmp)

This causes the job to run in the given current-directory (otherwise, the batch job will attempt to use the directory named as the current-directory on the launching-node as its currentdirectory). Example 1: A single job requiring 1GB memory and using the common queue: % mosbatch -q -i -m1024 my program arg1 arg2 arg3 Example 2: A large set jobs requiring 250MB each, using the common queue as well as a private in order to release no more than 200 jobs at a time: % cat my script batch program batch program batch program batch program batch program

-a -a -a -a -a

-b -b -b -b -b

-c -c -c -c -c

< < < < <

/home/myself/data1 /home/myself/data2 /home/myself/data3 /home/myself/data4 /home/myself/data5

> > > > >

/home/myself/out1 /home/myself/out2 /home/myself/out3 /home/myself/out4 /home/myself/out5

% mosbatch -E/tmp -m250 -q -S200 my script

Chapter 6

Operating on your existing programs and jobs

6.1 Listing your MOSIX-related processes

The program mosps is similar to ps and shares many of its parameters. The main dierences are that: 1. mosps shows only MOSIX and related processes. 2. mosps shows relevant MOSIX information such as where your processes are running. 3. mosps does not show running statistics such as CPU time and memory-usage (because this information is not readily available for processes that run on remote computers). The most important information is under the column WHERE, showing where your processes are running. This can be a node-number, an IP address, or the word here (if the process is running on your own computer). If you prefer IP addresses, use the -I ag; if you prefer the full host-name, use the -h ag; and if you prefer just the host-name (without the domain), use the -M ag. Other special values for WHERE are: queue - on the common queue. Mwait - migratable program waiting for a suitable node to start on. Hwait - mosrun -M program waiting for a suitable home-node. Bwait - batch job waiting for a suitable node to start on. The CLASS column shows whether your processes can migrate outside the cluster (in a multi-cluster conguration): the local class cannot migrate outside the cluster; otherwise the class-number is usually determined by the mosrun -G{class} option. The FRZ column shows whether your processes are frozen and if so why. The possible reasons are: A: Automatic freezing occurred (usually due to a high load - once the local load is reduced it will be automatically unfrozen). E: The process was frozen because it was expelled from another cluster: it should be automatically unfrozen as soon as the local load drops. 21

CHAPTER 6. OPERATING ON YOUR EXISTING PROGRAMS AND JOBS

P: An external package requested to freeze the process (it is up to that package to unfreeze it). M: The process was frozen manually (by yourself or the system- administrator). If you run mosps -N, you also see the NMIGS column, listing how many times your processes have ever migrated. When running programs or jobs with a private queue, you can use mosps -S to nd out how many programs and/or jobs completed and how many failed so far. Only programs that were started with mosrun -S or mosbatch -S will be shown.

6.2

Listing and controlling queued programs and jobs

Running mosq list lets you to see which programs and jobs are waiting in the common queue (not just yours), in the order that they are queued. Running mosq listall shows also programs and jobs that were started by the common queue and are now already running. Mosq provides the following columns of information: PID USER MEM(MB) MULTI PRI FROM COMMAND process-ID user-name amount of memory requested in MegaBytes whether the programs can migrate outside the cluster the lower the number, the higher the priority originating computer command-line

Use [mosq run {pid}] to force your queued-program to bypass the queue and start immediately (your system-administrator may not allow this - please use with caution and be considerate to other users). Use [mosq abort {pid}] to abort a queued program. Use [mosq cngpri {new-priority} {pid}] to modify the priority of a queued program: the lowest/best priority is 0 and the lower the new priority, the sooner the program will start (your system-administrator may not allow to decrease the priority - please be considerate to other users). Use [mosq advance {pid}] to advance your queued program to the head of the queue among all programs of the same priority (your system- administrator may not allow this - please be considerate to other users). Use [mosq retard {pid}] to bring your queued program to the end of the queue among all programs of the same priority.

6.3

Killing your programs and jobs

Use [moskillall] to kill all your MOSIX programs and batch jobs (with the SIGTERM signal). Use [moskillall -{signal}] to send a signal to all your MOSIX processes and jobs. Use [moskillall -G{class}] to kill/signal all your migratable processes of a given class (moskillall -G0 to kill/signal all processes that were not started by mosrun -G).

Chapter 7

MOSIX Reach the Clouds (MOSRC)

The mosrc program can be used to run programs on other computers which are not necessarily part of your MOSIX cluster(s) and perhaps do not even run MOSIX at all. The dierence between mosrc and programs like rsh, rlogin or ssh is that selected local directories are made accessible to your program on the target computer as part of its le-system (but not accessible to other users there). You may export (make accessible) any number of directories to the target computer, using: mosrc -d{/dir1,/dir2,/dir3} -r{target-computer} program [args]... (directory-names must begin with a /). Some directories of the target le-system are not permitted to be replaced by exported directories (the list is subject to conguration on the target computer, but always includes the root /, /proc, /sys, /dev and /mosix and often also /lib and /usr/lib). If you want to export one of these, you can export it under a dierent name, for example: mosrc -d/lib=/tmp/mylib,/=/tmp/myroot {program} [args]... Your program normally runs with a current-directory named as your current-directory (but on the target computer). To run it with a dierent current-directory, use the -c/other-dir argument. Complete details can be found in man MOSRC.

CHAPTER 7. MOSIX REACH THE CLOUDS (MOSRC)

Part III Administrators Guide

PART III

Chapter 8

Conguration
8.1 General

The script mosconf will lead you step-by-step through the MOSIX conguration. Mosconf can be used in two ways: 1. You can congure (or re-congure) MOSIX on the cluster-node where it should run. If so, just press <Enter> at the rst question that mosconf presents. 2. In clusters (or parts of clusters) that have a central repository of system-les, containing their root image(s), you can make changes in the central repository instead of having to manually update each node separately. This repository can for example be NFS-mounted by the cluster as the root le-system, or it can be copied to the cluster at boot time, or perhaps you have some cluster-installation package that uses other methods to reect those les to the cluster. Whichever method is used, you must have a directory on one of your servers, where you can nd the hierarchy of system-les for the clusters (in it you should nd subdirectories such as /etc, /bin, /sbin, /usr, lib, mnt, proc and so on). At the rst question of mosconf, enter the full pathname to this repository. When modifying the conguration there is no need to stop MOSIX - most changes will take eect within a minute. However, after modifying any of the following: The list of nodes in the cluster (/etc/mosix/mosix.map). The IP address used for MOSIX (/etc/mosix/mosip). The nodes topological features (/etc/mosix/myfeatures), you must commit your changes by running the command mossetpe - however, this is not necessary when you are using mosconf locally (option 1 above). The MOSIX conguration is maintained in the directory etc/mosix. 27

CHAPTER 8. CONFIGURATION

8.2
8.2.1

Conguring the single cluster

Participating nodes

The most important conguration task is to inform MOSIX which nodes participate in your cluster. In mosconf you do this by selecting Which nodes are in this cluster. Nodes are identied by their IP address (see the advanced options below if they have more than one): commonly the nodes in a cluster have consecutive IP addresses, so it is easy to dene them using the IP address of the rst node followed by the number of nodes in the range, for example, if you have 10 nodes starting from 192.168.3.1 to 192.168.3.10, type 192.168.3.1 followed by 10. If there are several such ranges, you need to specify all of them and if there are nodes with an isolated IP address, you need to specify them as ranges of 1. If your IP addresses are mostly consecutive, but there are a few holes due to some missing computers, it is not a big deal - you can still specify the full range, including the missing computers (so long as the IP addresses of the holes do not belong to other computers elsewhere). Specifying too many nodes that do not actually exist (or are down) has been known to produce excessive ARP broadcasts on some networks due to attempts to contact the missing nodes. This was found to be due to a bug in some routers, but unfortunately many routers have this bug. It is always possible to add or delete nodes without stopping MOSIX: if you do it from a central repository, you need to run mossetpe on All your cluster nodes for the changes to take eect.

8.2.2

Advanced options

The following are advanced options (if no advanced options were previously congured, type + in mosconf). As above, it is not necessary to stop MOSIX for modifying advanced options, just run mossetpe after making the changes from a central repository. Nearby or distant nodes To optimize process migration, for each range of nodes, you can dene whether they are distant or near the nodes that you are conguring. The reason is that when networking is slow, it is better to compress the memory image of migrating processes: it takes CPU time, but saves on network transfer time and volume. If however the nodes are near, it is better not to compress. As a general guideline, specify distant if the network is slower than 1GB/sec, or is 1GB/sec and the nodes are in dierent buildings, or if the nodes are several kilometers away. Outsider nodes For each range of nodes, you can dene whether they are outsiders. Only processes that are allowed to migrate to other clusters in the multi-cluster are allowed to migrate to outsider nodes. This option was intended to allow users to prevent certain programs from migrating to unsuitable computers, such as computers that do not support the full machine instruction-set of their home-node. Aliases Some nodes may have more than one IP address so that network packets that are sent from them to dierent nodes can be seen as arriving from dierent IP addresses. For example, a junction

8.3. CONFIGURING THE MULTI-CLUSTER

node can have a dual function of both being part of a logical MOSIX cluster as well as serve as a router to a physical cluster: nodes inside the physical cluster and outside it may see dierent IP addresses coming from the junction node. In MOSIX, each node must be identied by a unique IP address, so one of the junction-nodes IP addresses is used as its main address, while the others can be congured as aliases: when MOSIX receives TCP/UDP connections from an alias IP address, it recognizes them as actually coming from the main address. Unusual circumstances with IP addresses There are rare cases when the IP address of a node does not appear in the output of ifcong and even more rare cases when more than one IP address that belongs to a node is congured as part of the MOSIX cluster AND appears in the output of ifcong (for example, a node with two Network-Interface-Cards sometimes boots with one, sometimes with the other and sometimes with both, so MOSIX has both addresses congured just in case). When this happens, you need to manually congure the main MOSIX address (using Miscellaneous policies of mosconf).

8.3
8.3.1

Conguring the multi-cluster

Partner-clusters

Now is the time to inform MOSIX which other clusters (if any) are part of your MOSIX multicluster. In a MOSIX multi-cluster, there is no need for each cluster to be aware of all the other clusters, but only of those partner-clusters that we want to send processes to or are willing to accept processes from. You should identify each partner-cluster with a name: usually just one word (if you need to use more, do not use spaces, but - or to separate the words). Note that this name is for your own use and does not need to be identical across the multi-cluster. Next you can add a longer description (in a few words), for better identication.

8.3.2

Which nodes are in a partner-cluster

In most cases, you do not want to know exactly which nodes are in a partner-cluster - otherwise you would need to update your conguration whenever system-administrators make changes to partner-clusters: instead you only need to know about a few nodes (usually one or two are sucient) that belong to each partner-cluster - these are called core-nodes. If possible, choose the core-nodes so that at any given time at least one of them would up and running. There are three methods of determining which nodes are in a partner-cluster: 1. The default and easiest method of operation is to trust the core-nodes to correctly inform your cluster which nodes are in their cluster. 2. MOSIX obtains the list of nodes from the core-nodes, but you also congure a list of allowed nodes. If a core-node informs us that its cluster includes node(s) that are not on our list - ignore them. The result is the intersection of our list and their list. 3. Congure the list of nodes of the partner-cluster locally, without consulting any core-nodes.

CHAPTER 8. CONFIGURATION

Even when trusting the core-nodes, you can still specify particular nodes that you want to exclude. Nodes of partner-clusters are dened by ranges of IP addresses, just like in the local cluster - see above. As above, a few holes are acceptable. For each range of nodes that you dene, you will be asked (the questions are in the singular case if the range is of only one node): 1. Are these core-nodes [Y/n]? 2A. Should these nodes be excluded [y/N]? or for core-nodes: 2B. The following option is extremely rare, but is permitted: are these nodes ONLY used as core nodes, but not as part of {cluster-name} [y/N]? Note: it is permitted to dene nodes that are both core-nodes AND excluded: they tell which nodes are in their cluster, but are not in it themselves. 3. Are these nodes distant [Y/n]? nearby and distant are dened in Section 8.2.2 above. Unlike the single cluster, the default here is distant. Note: all core-nodes must be either nearby or distant, you cannot have both for the same partner.

8.3.3

Partner-cluster relationship

By default, migration can occur in both directions: local processes are allowed to migrate to partner-clusters and processes from partners-clusters are allowed to migrate to the local cluster (subject to priorities, see below). As an option, you can allow migration only in one direction (or even disallow migration altogether if all you want is to be able to view the load and status of the other cluster).

8.3.4

Priorities

Each cluster is given a priority: this is a number between 0 and 65535 (0 is not recommended as it is the local clusters own priority) - the lower it is, the higher the priority. When one or more processes originating from the local cluster, or from partner-clusters of higher priority (lower number), wish to run on a node from our cluster, all processes originating from clusters of a lower priority (higher number) are immediately moved out (evacuated) from this node (often, but not always, back to their home cluster). When you dene a new partner-cluster, the default priority is 50.

8.3.5

Priority stabilization

The following option is suitable for situations where the local node is normally occupied with privileged processes (either local processes, processes from your own cluster or processes from more privileged clusters), but repeatedly becomes idle for short periods. If you know that this is the pattern, you may want to prevent processes from other clusters from arriving during these short gaps when the local node is idle, only to be sent away shortly after. You can dene a minimal gap-period (in seconds) once all higher-privileged processes terminated (or left). During that period processes of less-privileged clusters cannot arrive: use Miscellaneous policies of mosconf to dene the length of this period.

8.4. CONFIGURING THE QUEUING SYSTEM

8.3.6

Maximum number of guests

The maximal number of simultaneous guest-processes from partner-clusters is limited: the default limit is 8 times the number of local processors, but you can change it using Miscellaneous policies of mosconf (note that the number of processes from your own cluster is not limited).

8.4

Conguring the queuing system

MOSIX processes can be queued, so as more processors and memory become available, more new jobs are started.

8.4.1

Queuing is an option

Queuing is an option - if it is not needed, there is no need to congure it. As the system-administrator, it is up to you to set (and enforce) a policy whether or not your users should use queuing, because if some users do not use it, they gain an advantage over the users that do use it. Similarly, you should also set a policy of whether and when users can use priorities other than the default.

8.4.2

Selecting queue-managers

Your rst (and often the only) task, is to select a queue-manager node per cluster. Any node can be a queue manager (it requires very resources), but it is best to select node(s) that are most stable and unlikely to come down. When conguring the nodes in your cluster, mosconf will suggest making the rst node in your cluster the queue-manager: you may accept this suggestion or select a dierent node. Queue-manager nodes are not supposed to be turned o, but if you do need to take down a queue-manager for more than a few minutes (while the rest of your cluster remains operational), you should rst assign another node to take its place as queue-manager. You should be aware that, although no jobs will be lost, rebooting or changing the queue-manager can distort the order of the queue (between jobs that originated from dierent nodes - the order of jobs that originated from the same node is always preserved).

8.4.3

Advanced

Now for the advanced options: Default queuing priority per node The default priority of queued jobs is normally 50 (the lower the better), no matter from which node they originated. If you want jobs that originate from specic nodes to receive a dierent default priority, you can congure that on a per-node basis (but this requires those nodes to have separate MOSIX conguration les). User-ID equivalence It is assumed that the user-IDs are identical in all the nodes of a cluster: this allows the user to cancel or modify the priority of their jobs from any node (of the same cluster) - not just the one from which they started their job. Otherwise (if user-IDs are not identical), you must congure

CHAPTER 8. CONFIGURATION

that fact. Note that in such a case, users will only be able to control their jobs from the node where they started them. Limiting the number of running jobs You can x a maximal number of queued jobs that are allowed to run at any given time - even when there are sucient resources to run more processes. Limiting the total number of queued jobs per user You can x a maximal number of queued jobs that are allowed to run simultaneously per user. This number includes previously-queued jobs that are still running. This parameter is per-node (jobs submitted on another node are counted on the other node). The Super-User is exempt from this limitation. Target processes per processor You can request MOSIX to attempt to run X queued jobs per processor at any given time, instead of the default of 1. The range is 1 to 8. Provision for urgent jobs You can assign an additional number of urgent jobs (priority-0, the highest possible priority) to run regardless of the available resources and other limitations. If you want to use this option, you rst need to discuss with your users which jobs should be considered as urgent. It is then your responsibility to ensure that at any given time, running those additional urgent jobs will in fact have sucient memory/swap-space to proceed reasonably. The default is 0 additional jobs and it is highly recommended to keep this number small. Note that if there are more urgent jobs in the queue, those above this congured number will still need to wait in the queue for resources, as usual. Guarantee a number of jobs per-user You can guarantee a small, minimum number of jobs per user to run, if necessary even out of order and when resources are insucient. This, for example, allows users to run and get results from short jobs while very long jobs of other users are running. Along with this option, you usually want to set a memory limit, so jobs that require much memory are not started out of order. Jobs (per user) above this number and jobs that require more memory, will be queued as usual. Note that when users do not specify the memory requirements of their jobs, (using mosrun -m{mb}), their jobs are considered to require no signicant memory, so when using this option you should request your users to always specify their maximum memory-requirement for their queued jobs. Fair-share policy The default queue policy is rst-come-rst-serve, regardless of which users sent the jobs. If you prefer, you may congure a fair-share policy, where jobs (of the same priority) from dierent users are interleaved, with each user receiving an equal share. If you want to grant dierent

8.5. CONFIGURING THE FREEZING POLICIES

share to certain users, read the section about Fair-share policy in the MOSIX manual (man mosix).

8.5
8.5.1

Conguring the freezing policies

Overview

When too many processes are running on their home node, the risk is that memory will be exhausted, the processes will be swapped out and performance will decline drastically. In the worst case, swap-space may also exhausted and then the Linux kernel will start killing processes. This scenario can happen for many reasons, but the most common one is when another cluster shuts down, forcing a large number of processes to return home simultaneously. The MOSIX solution is to freeze such returning processes (and others), so they do not consume precious memory, then restart them again later when more resources become available. Note that this section only deals with local processes: guest processes are not subject to freezing because at any time when the load rises, they can instead simply migrate back to their home-nodes (or elsewhere). Every process can be frozen, but not every process can be frozen and restarted safely without ill side eects. For example, if even one among communicating parallel processes are frozen, all the others also become blocked. Other examples of processes that should not be frozen, are processes that can time-out or provide external services (such as over the web). While both the user and the system-administrator can freeze any MOSIX process manually at any time (using migrate {pid} freeze), below we shall discuss how to set up a policy for automatic freezing to handle dierent scenarios of process-ooding.

8.5.2

Process classes

The freezing policies are based on process-classes: Each MOSIX process can be assigned to a class, using the mosrun -G{class} option. Processes that do not use this option are of class 0 and cannot migrate outside their cluster, hence the main cause for ooding is eliminated. Common MOSIX processes are run with mosrun -G, which brings them into the default, class 1. As the need arises, you should identify with your users dierent classes of applications that require dierent automatic-freezing policies. Example 1: if some of your users run parallel jobs that should not be frozen, you can assign for them a specic class-number (for example 20), and tell them: in this case, use mosrun -G20, then as the system-administrator make sure that no freezing-policy is dened for class #20. Example 2: if a certain user has long batch jobs with large memory demands, you can assign a dierent class number (for example 8), and tell them: for those batch jobs, use mosrun -G8, then as the system-administrator create a freezing policy for class #8 that will start freezing processes of this class earlier (when the load is still relatively lower) than processes of other classes.

8.5.3

Freezing-policy details

In this section, the term load refers to the local node. The policy for each class that you want to auto-freeze consists of:

CHAPTER 8. CONFIGURATION The Red-Mark: when the load reaches above this level, processes (of the given class) will start to be frozen until the load drops below this mark. The Blue-Mark: when the load drops below this level, processes start to un-freeze. Obviously the Blue-Mark must be signicantly less than the Red-Mark. Home-Mark: when the load is at this level or above and processes are evacuated from other clusters back to their home-node, they are frozen on arrival (without consuming a signicant amount of memory while migrating). Cluster-Mark: when the load is at this level or above and processes from this home-node are evacuated from other clusters back to this cluster, they are instead brought frozen to their home-node. Whether the load for the above 4 load marks (Red, Blue, Home, Cluster) is expressed in units of processes or in standardized MOSIX load: The number of processes is more natural and easier to understand, but the MOSIX load is more accurate and takes into account the number and speed of the processors: roughly, a MOSIX load unit is the number of processes divided by the number of processors (CPUs) and by their speed relative to a standard processor. Using the MOSIX standardized load is recommended in clusters with nodes of dierent types - if all nodes in the cluster have about the same speed and the same number of processors/cores, then it is recommended to use the number of processes. Whether to keep a given, small number of processes from this class running (not frozen) at any time despite the load. Whether to allow only a maximum number of processes from this class to run (that run on their home-node - not counting migrated processes), freezing any excess processes even when the load is low. Time-slice for switching between frozen processes: whenever some processes of a given class are frozen and others are not, MOSIX rotates the processes by allowing running processes a given number of minutes to run, then freezing them to allow another process to run instead. Policy for killing processes that failed to freeze, expressed as memory-size in MegaBytes: in the event that freezing fails (due to insucient disk-space), processes that require less memory are kept alive (and in memory) while process requiring the given amount of memory or more, are killed. Setting this value to 0, causes all processes of this class to be killed when freezing fails. Setting it to a very high value (like 1000000 MegaBytes) keeps all processes alive. When dening a freezing policy for a new class, the default is: RED-MARK BLUE-MARK HOME-MARK CLUSTER-MARK MINIMUM-UNFROZEN MAXIMUM-RUNNING TIME-SLICE KILLING-POLICY = = = = = = = = 6.0 MOSIX standardized load units 4.0 MOSIX standardized load units 0.0 (eg. always freeze evacuated processes) -1.0 (eg. never freeze evacuated processes) 1 (process) unlimited 20 minutes always

8.6. CONFIGURING PARAMETERS OF MOSRUN

8.5.4

Disk-space for freezing

Next, you need inform MOSIX where to store the memory-image of frozen processes, which is congured as directory-name(s): the exact directory name is not so important (because the memory-image les are unlinked as soon as they are created), except that it species particular disk partition(s). The default is that all freeze-image les are created in the directory (or symbolic-link) freeze (please make sure that it exists, or freezing will always fail). Instead, you can select a dierent directory(/disk-partition) or up to 10 dierent directories. If you have more than one physical disk, specifying directories on dierent disks can help speeding up freezing by writing the memory-image of dierent processes in parallel to dierent disks. This can be important when many large processes arrive simultaneously (such as from other clusters that are being shut-down). You can also specify a probability per directory (eg. per disk): This denes the relative chance that a freezing process will use that directory for freezing. The default probability is 1 (unlike in statistics, probabilities do not need to add up to 1.0 or to any particular value). When freezing to a particular directory (eg. disk-partition) fails (due to insucient space), MOSIX will try to use the other freezing directories instead, thus freezing fails only when all directories are full. You can specify a directory with probability 0, which means that it will be used only as a last resort (it is useful when you have faster and slower disks).

8.5.5

Ownership of freezing-les

Freezing memory-image les are usually created with Super-User (root) privileges. If you do your freezing via NFS (it is slow, but sometimes you simply do not have a local disk), some NFS servers do not allow access to root: if so, you can select a dierent user-name, so that memory-image les will be created under its privileges.

8.6

Conguring parameters of mosrun

Some system-administrators prefer to limit what their users can do, or at least to set some defaults for their less technically-inclined users. You can control some of the options of mosrun by using the Parameters of mosrun option of mosconf. The parameters you can control are: 1. Queuing users jobs: either let the user decide; silently make queuing (mosrun -q) the default; or enforce queuing so that ordinary users cannot circumvent the queue. 2. Selection of the best node to start on: either let the user decide; make mosrun -b the default when no other location parameter is specied; or force the mosrun -b option on all ordinary users. 3. Handling of unsupported system-calls: either leave the default of killing the process if an unsupported system-call is encountered (unless the user species mosrun -e or mosrun -w; making mosrun -e the default; or making mosrun -w the default. 4. Whether or not to make the mosrun -mmb parameter mandatory: this may burden users, but it can help protecting your computers against memory/swap exhaustion and even loss or processes as a result. To begin with, no defaults or enforcement are active when MOSIX is shipped.

CHAPTER 8. CONFIGURATION

8.7

Conguring MOSRC (MOSIX Reach the Clouds)

Conguring MOSRC is all about security - protecting your computer(s) from being broken into by MOSRC jobs from other computers. Note that the security level of MOSRC is not as high as the rest of MOSIX and is designed only to protect your computer(s) from ordinary users - not against sophisticated intruders that use reverse-engineering techniques (if this concerns you, do not congure MOSRC) As the system-administrator you must determine the following questions: 1. Which nodes/computers may run MOSRC jobs on your computer(s). 2. Which users and groups may run MOSRC jobs on your computers (from the above nodes). 3. Which directories may callers override with directories from their calling nodes. 4. Where in the le-system may callers create empty directories.

8.7.1

Which nodes

Clearly, not just everyone from the Internet may run jobs on your computer, so you must list IP addresses (single or in ranges) of trusted callers. When specifying a large range of IP addresses of valid callers, you may also exclude from this list specic IP addresses, or sub-ranges of IP addresses. A general security policy applies to all listed IP addresses. You can also set up specic/alternative security policies for a specic IP address or a specic range of IP addresses. Each security policy consists of the combination of items described in the next three sections.

8.7.2

Which users and groups

You can list specic users, or user-IDs that are allowed to run MOSRC jobs on your computer(s). You may also map each user from the calling user-ID to a local user-ID (this is particularly important when the calling nodes/computers and your computer do not share the same user-ID scheme). You can also allow all other users to run MOSRC jobs, either using their original user-ID, or any local user-ID of your choice (the user nobody is probably the safest choice). If so, you have the option of blocking certain users. mosconf provides a quick option to exclude all system-users (such as root, bin, news, all with user-IDs under 1000). It is a good practice, when possible, to let dierent users run with dierent user-IDs otherwise they can interfere with and even kill each others jobs. Similarly, the above discussion applies to user-groups as well.

8.7.3

Prohibited directories

In MOSRC, jobs run in a hybrid environment: some of their directory-trees are overridden (similar to being mounted) by directories from their calling computer. In this section, you list prohibited directories, which callers are not allowed to override. When prohibiting a directory, you automatically prohibit all its parent directories, and you can also determine whether or not to prohibit all its sub-directories as well. For correct operation, MOSRC prohibits the following directories:

8.7. CONFIGURING MOSRC (MOSIX REACH THE CLOUDS) / /proc /sys /dev /mosix (the root) and all its subdirectories and all its subdirectories and all its subdirectories and all its subdirectories

While not automatically enforced by MOSRC, it is strongly recommended to also prohibit the etc directory (otherwise callers can use their own etc/passwd and easily gain super-user access on your computer). It is also strongly recommended to prohibit all system libraries, such as lib and usr/lib (otherwise, callers can gain control by running setuid programs with their own library-functions). Similarly, if you have users with publically-accessible setuid programs and libraries, you should protect them by prohibiting their home-directories.

8.7.4

Empty directories

Callers often want to use, and therefore override, directory-pathnames that exist on their computer, but not on yours. Further two assumptions are that they otherwise have no permission to create new directory(s) with the given pathname(s) on your computer(s), and that the pathname(s) are not prohibited by the previous section. The question is whether and where to allow callers to create for that purpose new (and empty) directories on your computer. This section lists directories where callers can create subdirectories (to any depth). Such directories, if created, will belong to root and remain empty of les (but may have sub-directories that were created in the same way). Empty directories generally do not pose a security threat, but you should consider the risk of deliberate attackers lling your disk with empty directories, thereby preventing your users from creating new les, and making you work hard to clean up those directories, so if this can happen in your environment, do not allow this to happen on le-systems where you cannot aord. It is quite safe to allow empty directories to be created in tmp or var/tmp, because those will eventually be auto-removed.

8.7.5

Predened security levels

For your convenience, instead of conguring the security-policy manually, you can also choose from three predened security policies: Low security: All callers can run with their own user/group-ID. No directories are prohibited (other than those built-into MOSRC). Empty directories may be created anywhere. Medium security: System-callers (such as root, bin, daemon, etc.) run as user/group nobody. Other callers run with their own user/group-ID. The following directories and all their sub-directories are prohibited: etc, lib, lib64, usr/lib, usr/local/lib, usr/X11R6/lib, share, emul. Empty directories may be created anywhere. Top security; All callers run as user nobody and group nobody. The following directories and all their sub-directories are prohibited: etc, usr, lib, lib64, emul, media. Empty directories may be created in tmp, var/tmp and guest.

8.7.6

On the launching side

No conguration is necessary on the launching side, but if you want you can grant some of your users permission to present their MOSRC jobs as coming from dierent user(s). You do this

CHAPTER 8. CONFIGURATION

by editing the le etc/mosix/mrc users, where each line contains a user-name or a user-ID (better to use numeric IDs where possible) terminated by a colon (:) and followed by one or more (space-separated) other user-names or numeric user-IDs (preferred) which the given user is permitted to present their MOSRC jobs as. Similarly, you can edit the le etc/mosix/mrc groups to grant permissions for user-groups to present their MOSRC jobs as coming from dierent group(s). This feature can be useful in allowing system-administrators of dierent computers/clusters to cooperate and set up detailed MOSRC permission schemes independently of their local user/group-ID settings.

Chapter 9

Storage allocation
9.1 Swap space

As on a single computer, you are responsible to make sure that there is sucient swap-space to accommodate the memory demands of all the processes of your users: the fact that processes can migrate does not preclude the possibility of them arriving at times back to their home-node for a variety of reasons: please consider the worst-case and have sucient swap-space for all of them. You do not need to take into account batch jobs that are sent to other nodes in your cluster.

9.2

MOSIX les

During the course of its operation, MOSIX creates and maintains a number of small les in the directory etc/mosix/var. When there is no disk-space to create those les, MOSIX operation (especially load-balancing and queuing) will be disrupted. When MOSIX is installed for the rst time (or when upgrading from an older MOSIX version that had no etc/mosix/var), you are asked whether you prefer etc/mosix/var to be a regular directory or a symbolic link to var/mosix. However, you can change it later. Normally the disk-space in the root partition is never exhausted, so it is best to let etc/mosix/var be a regular directory, but some diskless cluster installations do not allow modications within etc: if this is the case, then etc/mosix/var should be a symbolic link to a directory on another partition which is writable and have the least chance of becoming full. This directory should be owned by root, with chmod 755 permissions and contain a sub-directory multi/.

9.3

Freezing space

MOSIX processes can be temporarily frozen for a variety of reasons: it could be manually using the command: migrate {pid} freeze (which as the Super-User you can also use to freeze any users processes), or automatically as the load increases, or when evacuated from another cluster. In particular, when another cluster(s) shuts down, many processes can be evacuated back home and frozen simultaneously. Frozen processes keep their memory-contents on disk, so they can release their main-memory image. By default, if a process fails to write its memory- contents to disk because there is insucient space, that process is killed: this is done in order to save the system from lling 39

CHAPTER 9. STORAGE ALLOCATION

up the memory and swap-space, which causes Linux to either be deadlocked or start killing processes at random. As the system-administrator, you want to keep the killing of frozen processes only as the last resort: use either or both of the following two methods to achieve that: 1. Allocate freezing directory(s) on disk partitions with sucient free disk-space: freezing is by default to the freeze directory (or symbolic-link), but you can re-congure it to any number of freezing directories. 2. Congure each class of processes that are automatically frozen so processes of that class are not killed when freeze-space is unavailable unless their memory-size is extremely big (specify that threshold in MegaBytes - a value such as 1000000MB would prevent killing altogether).

9.4

Private-le space

MOSIX users have the option of creating private les that migrate with their processes. If the les are small (up to 10MB per process) they are kept in memory - otherwise they require backing storage on disk and as the system-administrator it is your responsibility to allocate sucient disk-space for that. You can set up to 3 dierent directories (therefore up to 3 disk partitions) for the private les of local processes; guest processes from the same cluster; and guest processes from other clusters. For each of those you can also dene a per-process quota. When a guest process fails to nd disk-space for its private les, it will transparently migrate back to its home-node, where it is more likely to nd the needed space; but when a local process fails to nd disk-space, it has nowhere else to go, so its write() system-call will fail, which is likely to disrupt the program. Eorts should therefore be made to protect local processes from the risk of nding that all the disk-space for their private les was already taken by others: the best way to do it is to allocate a separate partition at least for local processes (by default, space for private les is allocated in private for both local and guest processes). For the same reason, local processes should usually be given higher quotas than guest processes (the default quotas are 5GB for local processes, 2GB for guests from the cluster and 1GB for guests from other clusters).

Chapter 10

Managing jobs
As the system administrator you can make use of the following tools:

10.1

Monitoring (mosmon)

mosmon (man mosmon): monitor the load, memory-use and other parameters of your MOSIX cluster or even the whole multi-cluster.

10.2

Listing MOSIX processes (mosps)

mosps (man mosps): view information about current MOSIX processes. In particular, mosps a shows all users, and mosps -V shows guest processes. Please avoid using ps because each MOSIX process has a shadow son process that ps will show, but you should only access the parent, as shown by mosps.

10.3

Controlling running processes (migrate)

migrate (man migrate): you can manually migrate the processes of all users - send them away; bring them back home; move them to other nodes; freeze; or unfreeze (continue) them, overriding the MOSIX system decisions as well as the placement preferences of users. Even though as the Super-User you can technically do so, you should never kill (signal) guest processes. Instead, if you nd guest processes that you dont want running on one of your nodes, you can use migrate to send them away (to their home-node or to any other node).

10.4

Viewing and controlling queued processes (mosq)

mosq (man mosq): list the jobs waiting on the MOSIX queue and possibly modify their priority or even start them running out of the queue.

10.5

Controlling the MOSIX node (mosctl)

mosctl (man mosctl): This utility provides a variety of functions. The most important are: 41

CHAPTER 10. MANAGING JOBS

mosctl stay - prevent automatic migration away from this node. (mosctl nostay to undo). mosctl lstay - prevent automatic migration of local processes away from this node. (mosctl nolstay to undoa.) mosctl block - do not allow further migrations into this node. (mosctl noblock to undo). mosctl bring - bring back all processes whose home-node is on this node. You would usually combine it with using mosctl lstay rst. mosctl expel - send away all guest processes. You would usually combine it with using mosctl block rst. mosctl shutdown - disconnect this node from the cluster. All processes are brought back home, guest processes expelled and the node is isolated from its cluster (and the multicluster). mosctl isolate - isolate the node from the multi-cluster (but not from its cluster) (mosctl rejoin to undo) mosctl cngpri {partner} {newpri} - modify the guest-priority of another cluster in the multi-cluster (the lower the better). mosctl localstatus - check the health of MOSIX on this node.

10.6

If you wish to limit what users can run

Some installations want to restrict access to mosrun or force its users to comply with a local policy by using (or not using) some of mosruns options. For example: Force users to use queuing. Disallow (most) users to queue their jobs with a higher priority. Force users to specify how much memory their program needs. Limit the number of mosrun jobs that a user can run simultaneously (or per day). Log all calls to mosrun by certain users. Limit certain users to run only in their local cluster, but not in their multi-cluster (using the G parameter). Force users to use job-IDs from a certain range. etc. Here is a technique that you can use to achieve this: 1. Allocate a special (preferably new) user-group for mosrun (we shall call it mos in this example). 2. Run: chgrp mos /bin/mosrun 3. Run: chmod 4750 /bin/mosrun (steps 2 and 3 must be repeated every time you upgrade MOSIX)

10.6. IF YOU WISH TO LIMIT WHAT USERS CAN RUN

4. Write a wrapper program (we shall call it bin/wrapper in this example), which receives the same parameters as mosrun, checks and/or modies its parameters according to your desired local policies, then executes: bin/mosrun -g {mosrun-parametrs}. Below is the C code of a primitive wrapper prototype that passes its arguments to mosrun without modications: #include <malloc.h> main(int na, char *argv[]) { char **newargs = malloc((na + 2) * sizeof(char *)); int i; newargs[0] = mosrun; newargs[1] = g; for(i = 1 ; i < na ; i++) newargs[i+1] = argv[i]; newargs[i+1] = (char *)0; execv(bin/mosrun, newargs); }

5. chgrp mos /bin/wrapper 6. chmod 2755 /bin/wrapper 7. Tell your users to use wrapper (or any other name you choose) instead of mosrun.

CHAPTER 10. MANAGING JOBS

Chapter 11

Security
11.1 Abuse by gaining control of a node

A hacker that gains Super-User access on any node of any cluster could intentionally use MOSIX to gain control of the rest of the cluster and multi-cluster. Therefore, before joining into a MOSIX multi-cluster, trust needs to be established among the owners (Super-Users) of all clusters involved (but not necessarily among ordinary users). In particular, system-administrators within a MOSIX multi-cluster need to trust that all the other system-administrators have their computers well protected against theft of Super-User rights.

11.2

Abuse by connecting hostile computers

Another risk is of hostile computers gaining physical access to the internal clusters network and masquerading the IP address of a friendly computer, thus pretending to be part of the MOSIX cluster/multi-cluster. Normally within a hardware cluster, as well as within a well-secured organization, the networking hardware (switches and routers) prevents this, but you should especially watch out for exposed Ethernet sockets (or wireless connections) where unauthorized users can plug their laptop computers into the internal network. Obviously, you must trust that the other system-administrators in your multi-cluster maintain a similar level of protection from such attacks.

11.3

Multi-cluster password

Part of conguring MOSIX (Authentication of mosconf) is selecting a multi-clusterprotection key (password), which is shared by the entire multi-cluster. Please make this key highly-secure - a competent hacker that obtains it can gain control over your computers and thereby the entire multi-cluster.

11.4

Organizational multi-cluster

This level of security is usually only achievable within the same organization, hence we use the term organizational multi-cluster, but if it can exist between dierent organizations, nothing else prevents them from sharing a MOSIX multi-cluster. 45

CHAPTER 11. SECURITY

11.5

Batch password

If you intend to run MOSIX batch-jobs, you also need to select batch keys: a client-key and a server-key. These keys should be dierent in each cluster-partition. A node will only provide batch-service to nodes whose client-key is identical to its server-key (and are both present). In the usual case, when you want to allow all your nodes to be both batch-clients and batch-servers, set the same key as both the client-key and the server-key. If, however, you want some nodes to only be clients and others to only be servers, set the client-key on the clients identical to the server-key on the servers, and use no server-key on the clients and no client-key on the servers. Again, please make this key highly-secure.

Part IV Manuals

PART IV

Chapter 12

Manuals
The manuals in this chapter are arranged in 4 sets and are provided for general information. Users are advised to rely on the manuals that are provided with their specic MOSIX distribution.

12.1

For users

mosbatch - running MOSIX non-migratable batch programs mosbestnode - select the best node to run on mosmigrate - manual control of running MOSIX processes mosmon - MOSIX monitor moskillall - kill or signal all your MOSIX processes mospipe - run pipelined jobs eciently using Direct Communication mosps - list information about MOSIX processes mosq - MOSIX queue control mosrun - running MOSIX programs mostestload - MOSIX test program mostimeof - report CPU usage of migratable processes

12.2

For programmers

direct communication - migratable sockets between MOSIX processes mosix - sharing the power of clusters and multi-clusters

12.3

For administrators

mosctl - miscellaneous MOSIX functions mossetpe - congure the local cluster

12.4

Special package

MOSRC - MOSIX Reach the Clouds

MOSBATCH (M1)

MOSIX Commands

MOSBATCH (M1)

NAME MOSBATCH - Running MOSIX non-migratable batch jobs SYNOPSIS mosbatch [options] program [args] . . . mosbatch -S{maxjobs} [options] {commands-file} [,{failed-file}] Options: [-r{host} | -{a.b.c.d} | -{n} | -h | -b | -jID1-ID2[,ID3-ID4]...] [-F] [-D{DD:HH:MM}] [-{q|Q}[{pri}|n]] [-P{parallel_processes}] [-J{JobID}] [-m{mb}] [-z] [-E{/cwd}] [-i] mosbatchenv { same-arguments-as-mosbatch } DESCRIPTION Mosbatch runs batch, non-migratable jobs on MOSIX cluster-nodes. "Non-migratable" means that the job begins, stays and ends on the same node, typically a computer other than their launching-node (where mosbatch is run): for migratable programs, see mosrun(1). Batch jobs are connected to their launching-node through their standard-input, standard-output and standarderror (le-descriptors 0, 1 and 2) and receive signals from their launching-node (including terminal interrupt/quit/suspend keystrokes). The following arguments are available: Attempt to run on the best available node. (no need to specify because this is the default anyway!) -r{hostname} Run on the given host -{a.b.c.d} Run on the given IP address -{n} Run on the given MOSIX node-number -h Run in the launching-node -jID1-ID2[,ID3-ID4]... Select where to run at random from the given list of hosts, IP-addresses and/or MOSIX node numbers. -b -m{mb} Specify the maximum amount of memory (in Megabytes) that the job requires. As a result: 1. When the default -b argument is selected, mosbatch will only consider running the job on nodes with sufcient avaliable memory and will not begin until at least one such node is found. 2. The common queuing system (see below) will take the jobs memory requirements into account when deciding which and how many jobs to allow to run at any point in time. The system-administrator can make the -m{mb} argument mandatory. If they do, then a -m0 argument is not allowed. -q[{priority}] Queue the job with the common, per-cluster queue. Unqueued jobs start immediately, but note that in some installations, the system-administrators can make queuing the default even without -q, or even make queing mandatory. The MOSIX common queue allows your jobs to run only when sufcient resources become available in the cluster. Please note that the common queue can handle only up to a limited number of jobs, typically a few 1000s. The system-administrator may also limit the number of queued jobs per user. To queue more

January 2012

MOSIX

MOSBATCH (M1)

MOSIX Commands

MOSBATCH (M1)

jobs simultaneously, use private queues (see the -S argument, below). Queued jobs can also be controlled using mosq(1). The letter q may optionally be followed by a non-negative integer, specifying the jobss priority, for example -q50: the lower the number, the higher the priority. The default priority is usually 50 (but the system-administrator may vary it). While being queued, mosps(1) and ps(1) display the command of queued jobs as "mosqueue". -Q[{priority}] Similar to -q, except that if the job is aborted from the queue (using mosq(1)), or if the queuing system fails, the job will be run anyway, immediately, instead of being killed. -qn Do not queue. This is useful where the system-administrator made queuing the default (but not where they made queuing mandatory). -P{parallel_processes} When using the common queue, this tells MOSIX that the job will split into a given number of parallel processes (so more resources must be reserved for it). -S{maxjobs} Queue a list of jobs (potentially a very long list) with a private queue. Instead of a program and its arguments, following the list of arguments is a commands-file, containing one command per line (interpreted by the standard shell, bash(1)): all other arguments will apply to each of the command-lines. This option is commonly used to run the same program with many different sets of arguments. For example, the contents of commands-file could be: my_program -a1 < ile1 > output1 my_program -a2 < ile2 > output2 my_program -a3 < ile3 > output3 Command-lines are started in the order they appear in commands-file. While the number of command-lines is unlimited, mosbatch will run concurrently up to maxjobs (1-30000) command-lines at any given time: when any command-line terminates, a new command-line is started. Lines should not be terminated by the shells background ampersand sign ("&"). As bash spawns a sonprocess to run the command when redirection is used, when the number of processes is an issue, it is recommended to prepend the keyword exec before each command line that uses redirection. For example: exec my_program -a1 < ile1 > output1 exec my_program -a2 < ile2 > output2 exec my_program -a3 < ile3 > output3 The exit status of mosbatch -S{maxjobs} is the number of command-lines that failed (255 if more than 255 command-lines failed). As a further option, the commands-file argument can be followed by a comma and another le-name: commands-file,failed-commands. Mosbatch will then create the second le and write to it the list of all the commands (if any) that failed (this provides an easy way to re-run only those commands that failed). The -S{maxjobs} argument combines well with the common queue (the -q argument): it sets an absolute upper limit on the number of simultaneous jobs whereas the number of jobs allowed to run by the common queue depends on the available cluster resources.

MOSIX

January 2012

MOSBATCH (M1)

MOSIX Commands

MOSBATCH (M1)

-F Run the job even if the requested node is unavailable (otherwise, an error-message will be displayed and the job will not start). -z The programs arguments begin at argument #0 (usually the arguments, if any, begin at argument #1 and argument #0 is assumed to be identical to the program-name). -J{JobID} Associate several instances of mosbatch with a single "job" ID for collective identication and manipulation by the mosq(1), mosps(1) and moskillall(1) utilities. What a "job" means is left at each users discretion. Job-IDs can be either a non-negative integer or a token from the le $HOME/.jobids: if this le exists, each line in it contains a number (JobID) followed by a token that can be used as a synonym to that JobID. Using -J on its own implies -J0. -D{timespec} provide an estimate on how long the job should run. The only effect of this option is to be able to view the estimated remaining time using mosps -D (See mosps(1)). timespec can be specied in any of the following formats (DD/HH/MM are numeric for days, hours and minutes respectively): DD:HH:MM; HH:MM; DDd; HHh; MMm; DDdHHhMMm; DDdHHh; DDdMMm; HHhMMm. -E{/current-directory} Run with the given current-directory (otherwise, the job will run in the same current-directory on the node where they run as the path-name of their current directory on their launching-node). -i Grant the job exclusive use of its standard-input. This is especially recommended when the job gets its input from a le. Jobs that use poll(2) or select(2) to check for input before reading from their standard-input can only work correctly with the -i ag. This ag can also improve the performance. However, the -i ag should not be used when an interactive shell places a batch job in the background (because keyboard input may be intended for the shell itself). The variant mosbatchenv is used to circumvent the loss of certain environment variables by the GLIBC library due to the fact that mosbatch is a "setuid" program: if your program relies on the settings of dynamic-linking environment variables (such as LD_LIBRARY_PATH) or malloc(3) debugging (MALLOC_CHECK_), use mosbatchenv instead of mosbatch. SEE ALSO mosrun(1), mosq(1), mosps(1), moskillall(1), mosix(7).

January 2012

MOSIX

MOSBESTNODE (M1)

MOSIX Commands

MOSBESTNODE (M1)

NAME MOSBESTNODE - Select best node to run on SYNOPSIS mosbestnode [-u] [-n] [-G] [-w] [-m{mb}] DESCRIPTION Mosbestnode selects the best node to run a new job. Mosbestnode normally prints the selected nodes IP address. If the -u argument is used and the node has an associated MOSIX node number, mosbestnode prints its MOSIX node number instead. The selection is normally for the immediate sending of a job to the selected node by means other than mosrun(1) and mosbatch(1) (such as "rsh", "ssh", or MPI). MOSIX is updated to assume that a new process will soon start on the selected node: when calling mosbestnode for any other purpose (such as information gathering), use the -n argument, to prevent misleading MOSIX as if a new process is about to be started. The -G argument widens the node selection to the whole multi-cluster - otherwise, only nodes within the local cluster are chosen. The -m{mb} argument narrows the selection to nodes that have at least mb Megabytes of free memory. When the -w argument is used, mosbestnode waits until an appropriate node is found: otherwise, if no appropriate node is found, mosbestnode prints "0" and exits. SEE ALSO mosrun(1), mosix(7).

January 2012

MOSIX

MOSMIGRATE (M1)

MOSIX Commands

MOSMIGRATE (M1)

NAME MOSMIGRATE - Manual control of running MOSIX processes SYNOPSIS mosmigrate mosmigrate mosmigrate mosmigrate mosmigrate mosmigrate mosmigrate mosmigrate

{{pid}|-j{jobID}} {{pid}|-j{jobID}} {{pid}|-j{jobID}} {{pid}|-j{jobID}} {{pid}|-j{jobID}} {{pid}|-j{jobID}} {{pid}|-j{jobID}} {{pid}|-j{jobID}}

{node-number|IP-address|host} home out freeze continue checkpoint chkstop chkexit

DESCRIPTION Mosmigrate {pid} manually migrates or otherwise affects a given migratable process (pid). Mosmigrate -j{jobID} does the same to all of the users processes with the given jobID (see mosrun(1)). The rst option ({node-number|IP-address|host}) species a recommended target node to which to migrate the process(es). Note that no error is returned if MOSIX ignores this recommendation. The home option forces the process(es) to migrate back to its home-node. The out option forces a guest process(es) to move out of this node (this option is available only to the Super-User, The freeze option freezes the process(es) (guest processes may not be frozen). The continue option unfreeze the process(es). The checkpoint option requests the process(es) to take a checkpoint. The chkstop option requests the process(es) to take a checkpoint and stop: the process(es) may then be resumed with SIGCONT. The chkexit option requests the process(es) to take a checkpoint and exit. Mosmigrate sends the instruction, but does not wait until the process(es) respond to it (or reject it). Except for the Super-User, one can normally migrate or affect only their own processes. The rule is: if you can kill it, you are also allowed to migrate/affect it. The best way to locate and nd the PID of MOSIX processes is by using mosps(1). SEE ALSO mosrun(1), mosps(1), mosix(7).

January 2012

MOSIX

MOSMON (M1)

MOSIX Commands

MOSMON (M1)

NAME mosmon MOSIX monitor SYNOPSIS mosmon [ v | V | w ] [ t ] [ d ] [ s | m | m m | f | c | l ] DESCRIPTION Mosmon displays useful current information about MOSIX nodes. The information is represented as a barchart, which allows a comparison between different nodes. The display shows nodes that are assigned node numbers in /etc/mosix/userview.map (see mosix(7)). The following options are available: w v V t d s m m m Display the used vs. total swap space. f l Display the number of frozen processes. Display the load (default). Select horizontal numbering: better display - but less nodes are visible. Select vertical numbering: more nodes are visible - but denser display. Select tight vertical numbering: maximal number of nodes are visible - but even denser display. Display the number of operational nodes and number of CPUs. Display also dead (not responding) nodes. Display the CPU-speeds. Display the used vs. total memory.

While in mosmon, the following keys may be used: v V w a s m m m Select vertical numbering. Select tight vertical numbering. Select horizontal numbering. Select automatic numbering (vertical numbering will be selected if it would make the difference between viewing all nodes or not - otherwise, horizontal numbering is selected). Display CPU-speeds and number of nodes (if more than 1). 10000 units represent a standard 3GHz processor. Display used memory vs. total memory: the used memory is displayed as a solid bar while the total memory is extended with + signs. The memory is shown in Megabytes. (pressing m twice) Display used swap-space vs. total swap-space: the used swap-space is displayed as a solid bar while the total swap-space is extended with + signs. Swap-space is shown in Gigabytes and is accurate to 0.1GB. Display the number of frozen processes. Display loads. The load represents one CPU-bound process that runs on a node with a single CPU under normal conditions. The load increases proportionally on slower CPUs, and decreases proportionally on faster CPUs and on nodes with more than one CPU.

f l

May 2013

MOSIX

MOSMON (M1)

MOSIX Commands

MOSMON (M1)

d D t

Display also dead (not-responding) nodes. Stop displaying dead nodes. Toggle displaying the count of operational nodes.

Right-Arrow Move one node to the right (when not all nodes t on the screen). Left-Arrow Move one node to the left (when not all nodes t on the screen). n p h Move one screen to the right (when not all nodes t on one screen). Move one screen to the left (when not all nodes t on one screen). Bring up a help screen.

Enter Redraw the screen. q Quit Mosmon.

[-j] [-j] [-j] [-j] [-j] [-j] [-j] [-j] [-j]

[-p] list [-p] listall [-p] locallist [-p] locallistall run {pid|jobID} [{hostname}|{IP}|{node-number}] abort {pid|jobID} [{hostname}|{IP}|{node-number}] cngpri {newpri} {pid|jobID} [{hostname}|{IP}|{node-number}] advance {pid|jobID} [{hostname}|{IP}|{node-number}] retard {pid} [{hostname}|{IP}|{node-number}]

DESCRIPTION Mosq displays and controls the content of the common queue - e.g., programs and jobs that were submitted using mosrun -q or mosbatch -q. mosq list displays an ordered table of all queued programs and jobs: their process-ID; user-name; memory requirement (if any); whether conned to the local cluster or allowed to use other clusters in the multicluster; their priority (the lower the better); the node where they were launched; and the command line (when available). mosq listall is similar to list, except that it also shows programs and jobs that were once queued and are now running. For those, the PRI eld shows "RUN" instead of a priority. mosq locallist is similar to list, but displays only programs and jobs that were launched on the local node. The FROM eld is not shown; and unlike list, the order of programs and jobs in locallist (within each priority and unless affected by the actions below), is according to the submission time of the programs and jobs and not their actual place in the queue. mosq locallistall is similar to locallist, except that it also shows programs and jobs that were once queued and are now running. While list and listall may be blocked when the per-cluster node that is responsible for queuing is inaccessible, locallist and locallistall can not be blocked because they depend only on the local node. The -p argument adds the number of parallel processes ("NPROC") to the listing. When the -j argument is used in conjunction with list, listall, locallist or locallistall, the Job-ID eld is included in the listing (it is assigned by mosrun(1) and mosbatch(1) using the "mosrun -J{jobID}" or "mosbatch -J{jobID}" argument). The following commands operate on selected programs or jobs from the queue: when the -j argument is not specied, a single program or job is selected by its process-ID and launching node, but when the -j argument is specied, all programs and jobs with the same User-ID as the caller, and the given Job-ID and launching node, are selected. The launching node can be specied as either an IP address, a host-name, a MOSIX logical node-number, or omitted if the programs(s)/job(s) were launched from the current node. mosq cngpri modies the priority of the selected program(s)/job(s): the lower the [non-negative] number - the higher the priority. mosq run force the release of the selected program(s)/job(s) from the queue and cause them to start running (regardless of the available cluster/multi-cluster resources).

January 2012

MOSIX

MOSQ (M1)

MOSIX Commands

MOSQ (M1)

mosq abort removes the selected program(s)/job(s) from the queue, normally killing them (but if they were started by "mosrun -Q" or "mosbatch -Q", they will start running instead). mosq advance move the selected program(s)/job(s) forward in the queue, making them the rst among the queued programs/jobs with the same priority. mosq retard move the selected program(s)/job(s) backward in the queue, making them the last among the queued program(s)/jobs with the same priority. SYNONYMS The following synonyms are provided for convenience and may be used interchangeably: locallist - listlocal locallistall - listalllocal; listlocalall cngpri - changepri; newpri run - launch; release; activate abort - cancel; kill; delete SEE ALSO mosrun(1), mosbatch(1), mosix(7).

MOSIX

January 2012

MOSRUN (M1)

MOSIX Commands

MOSRUN (M1)

NAME MOSRUN - Running MOSIX programs SYNOPSIS mosrun [location_options] [program_options] program [args] . . . mosrun -S{maxjobs} [location_options] [program_options] {commands-file} [,{failed-file}] mosrun -R{filename} [-O{fd=filename}[,{fd2=fn2}]]... [location_options] mosrun -I{filename} mosenv { same-arguments-as-mosrun } mosnative program [args]... Location options: [-r{host} | -{a.b.c.d} | -{n} | -h | -b | -jID1-ID2[,ID3-ID4]... }] [-G[{class}]] [-L] [-l] [-{q|Q}[{pri}|n]] [-P{parallel_processes}] [-F] [-J{JobID}] [-A{minutes}] [-N{max}] [-D{DD:HH:MM}] Program Options: [-m{mb}] [-e] [-w] [-u] [-t] [-T] [-M[/{cwd}]] [-i] [-C{filename}] [-z] [-c] [-n] [-d {0-10000}] [-X{/directory}]...

DESCRIPTION Mosrun runs migratable programs that can potentially migrate to other nodes within a MOSIX(7) cluster (or multi-cluster). All child processes of migratable programs can migrate as well, independently of their parent. Migratable programs act strictly as if they were running on their home-node - which is usually the computer from which mosrun is launched. Specically, all their le and socket operations (opening, reading, writing, sending, receiving) is performed on their home-node, so even when a migrated process is running on another node, it uses the network to perform the operation on the home-node, then returns the results to the program. Certain Linux features, such as shared-memory, are not available to migratable programs (see the LIMITATIONS section below). If you are not interested in process-migration, you may prefer to run batch jobs (see mosbatch(1)). Following are the arguments of mosrun, beginning with where to start the program: -b -r{hostname} -{a.b.c.d} -{n} -h -jID1-ID2[,ID3-ID4]... attempt to run on the best available node on the given host on the given IP address on the given MOSIX node-number in the home-node select at random from the given list of hosts, IP-addresses and/or MOSIX node numbers.

(in some installations, the system-adminstrator may force the -b option, disallowing the rest of the above options)

January 2012

MOSIX

MOSRUN (M1)

MOSIX Commands

MOSRUN (M1)

-m{mb} Specify the maximum amount of memory (in Megabytes) that the program requires. There are several implications: 1. Combined with the -b ag, the program will only consider to start running on nodes with sufcient memory and will not begin until at least one such node is found. 2. Programs will not automatically migrate to nodes with insufcient available memory (except to their home-node in circumstances when a program must return home). 3. The common queuing system (see below) will take the programs memory requirements into account when deciding which and how many jobs to allow to run at any point in time. The system-administrator can make the -m{mb} argument mandatory. If they do, then a -m0 argument is not allowed. -G[{class}] Allow the program to migrate to nodes in other clusters within the MOSIX multi-cluser. Otherwise, the program will only migrate within the local cluster (where there is only one cluster, -G makes no difference). The letter G may optionally be followed by a positive integer (such as -G15) that specify the programs class for freezing purposes (class numbers may be provided by the system-administrator). -G on its own is equivalent to -G1, while -G0 reverses the effect of -G (this can be used when a migratable program, especially a shell-script, wants to spawn a son-process that should not migrate outside the local cluster). -e Usually, a program that encounters an unsupported feature (see LIMITATIONS below) terminates. This ag allows the program to continue and instead behave as follows: 1. mmap(2) with (ags & MAP_SHARED) - but !(prot & PROT_WRITE), replaces the MAP_SHARED with MAP_PRIVATE (using MAP_SHARED without PROT_WRITE seems unusual or even faulty, but is unnecessarily used within some Linux libraries). 2. all other unsupported system-call return -1 and "errno" is set to ENOSYS. -w Same as -e, but whereas -e is silent, -w causes mosrun to print an error message to the standard-error whenever an unsupported system-call is encountered. -u Reverse the effect of -e or -w: this can be used when a program (especially a shell-script) wants to spawn a son-process that should terminate upon encountering an unsupported feature. Also, as the systemadministrator can make the -e or -w arguments the default, the -u ag can be used to reverse their effect. -L Do not migrate the program automatically. The program may still be migrated manually or when circumstances do not allow it to continue running where it is and force it to migrate back to its home-node. -l reverse the effect of -L and allow automatic migrations. -q[{priority}] Queue the program with the common, per-cluster queue. Unqueued programs start immediately, but note that in some installations, the system-administrators can make queuing the default even without -q, or even make queing mandatory.

MOSIX

January 2012

MOSRUN (M1)

MOSIX Commands

MOSRUN (M1)

The MOSIX common queue allows your programs to run only when sufcient resources become available in the cluster or multi-cluster. Please note that the common queue can handle only up to a limited number of programs, typically a few 1000s. The system-administrator may also limit the number of queued jobs per user. To queue more programs simultaneously, use private queues (see the -S argument, below). Queued jobs can also be controlled using mosq(1). The letter q may optionally be followed by a non-negative integer, specifying the programs priority, for example -q50: the lower the number, the higher the priority. The default priority is usually 50 (but the system-administrator may vary it). While being queued, mosps(1) and ps(1) display the command of queued programs as "mosqueue". -Q[{priority}] Similar to -q, except that if the program is aborted from the queue (using mosq(1)), or if the queuing system fails, the program will be run anyway, immediately, instead of being killed. -qn Do not queue. This is useful where the system-administrator made queuing the default (but not where they made queuing mandatory). -P{parallel_processes} When using the common queue, this tells MOSIX that the job will split into a given number of parallel processes (so more resources must be reserved for it). -S{maxjobs} Queue a list of jobs (potentially a very long list) with a private queue. Instead of a program and its arguments, following the list of arguments is a commands-file, containing one command per line (interpreted by the standard shell, bash(1)): all other arguments will apply to each of the command-lines. This option is commonly used to run the same program with many different sets of arguments. For example, the contents of commands-file could be: my_program -a1 < ile1 > output1 my_program -a2 < ile2 > output2 my_program -a3 < ile3 > output3 Command-lines are started in the order they appear in commands-file. While the number of command-lines is unlimited, mosrun will run concurrently up to maxjobs (1-30000) command-lines at any given time: when any command-line terminates, a new command-line is started. Lines should not be terminated by the shells background ampersand sign ("&"). As bash spawns a sonprocess to run the command when redirection is used, when the number of processes is an issue, it is recommended to prepend the keyword exec before each command line that uses redirection. For example: exec my_program -a1 < ile1 > output1 exec my_program -a2 < ile2 > output2 exec my_program -a3 < ile3 > output3 The exit status of mosrun -S{maxjobs} is the number of command-lines that failed (255 if more than 255 command-lines failed). As a further option, the commands-file argument can be followed by a comma and another le-name: commands-file,failed-commands. Mosrun will then create the second le and write to it the list of all the commands (if any) that failed (this provides an easy way to re-run only those commands that

January 2012

MOSIX

MOSRUN (M1)

MOSIX Commands

MOSRUN (M1)

failed). The -S{maxjobs} argument combines well with the common queue (the -q argument): it sets an absolute upper limit on the number of simultaneous jobs whereas the number of jobs allowed to run by the common queue depends on the available multi-cluster resources. -M[{/current-directory}] Run the program with a potentially different home-node. The only connection of programs that run with the -M ag with their launching node is through their standard input, output and error (le-descriptors 0, 1 and 2) and by receiving signals (including terminal interrupt/quit/suspend keystrokes. Note that when such programs have child-processes, signals are sent to the whole process-group). By default, programs that run with the -M ag use the path-name of their current-directory on their launching node as their current-directory on their home-node. A alternate /current-directory may optionally be selected (for example -M/tmp). -i Grant programs that run with the -M ag an exclusive use of their standard-input. This is especially recommended when the program gets its input from a le. Programs that use poll(2) or select(2) to check for input before reading from their standard-input can only work correctly with the -i ag. This ag can also improve the performance. However, the -i ag should not be used when an interactive shell runs a program in the background (because keyboard input may be intended for the shell itself). -F Run the program even if the requested node is unavailable (otherwise, an error-message will be displayed and the program will not start). -t Even when running elsewhere, migratable programs obtain the results of the gettimeofday(2) system-call from their home-nodes. Use this ag to allow your program to obtain the time from the local node where it currently runs instead, thus saving on network delays. Note that this can be a problem when the clocks are not synchronized. -T Reverse the effect of -t, causing the time to be fetched from the home-node. -z The programs arguments begin at argument #0 (usually the arguments, if any, begin at argument #1 and argument #0 is assumed to be identical to the program-name). -C{filename} Select an alternative le-basename for checkpoints (See CHECKPOINTS below). -N{max} Limit the number of checkpoint les (See CHECKPOINTS below). -A{minutes} Perform an automatic checkpoint every given number of minutes (See CHECKPOINTS below). -R{filename} Recover and continue to run from a saved checkpoint le (See CHECKPOINTS below). [-O{fd=filename}[,{fd2=filename2}]]... When using the -R{filename} argument to recover after a checkpoint, replace one or more ledescriptors (See CHECKPOINTS below).

MOSIX

January 2012

MOSRUN (M1)

MOSIX Commands

MOSRUN (M1)

-I{filename} Inspect a checkpoint le (See CHECKPOINTS below). -X {/directory} Declare a private temporary directory (see PRIVATE TEMPORARY FILES below). -c System calls and I/O operations are monitored and taken into account in automatic migration considerations, tending to pull processes towards their home-nodes. Use this ag if you want to tell mosrun to not take system calls and I/O operations into the migration considerations. -n Reverse the effect of the -c ag: include system-calls and I/O operations into account in automatic migration considerations. -d{decay} Set the rate of decay of process-statistics for automatic migration considerations as a fraction of 10000 per second (see mosix(7)). decay must be an integer between 0 (immediate decay) and 10000 (no decay at all). The default decay is 9976. -J{JobID} Associate several instances of mosrun with a single "job" ID for collective identication and manipulation by the mosq(1), mosmigrate(1), mosps(1) and moskillall(1) utilities. Note that the meaning of "job" is left at each users discretion and bears no relation to mosbatch jobs. Job-IDs can be either a non-negative integer or a token from the le $HOME/.jobids: if this le exists, each line in it contains a number (JobID) followed by a token that can be used as a synonym to that JobID. Using -J on its own implies -J0. Job IDs are inherited by child processes. Job IDs are not supported with the -M option. -D{timespec} provide an estimate on how long the program should run. The only effect of this option is to be able to view the estimated remaining time using mosps -D (See mosps(1)). timespec can be specied in any of the following formats (DD/HH/MM are numeric for days, hours and minutes respectively): DD:HH:MM; HH:MM; DDd; HHh; MMm; DDdHHhMMm; DDdHHh; DDdMMm; HHhMMm. Periods when the process is frozen are automatically added to that estimate. Note that the following arguments may also be changed at run time by the program itself: -m, -G, -e/-w/-u, -L/-l, -c, -t/-T, -C, -N, -A, -c/-n/-d (See mosix(7)). CHECKPOINTS Most CPU-intensive processes running under mosrun can be checkpointed: this means that an image of those processes is saved to a le, and when necessary, the process can later recover itself from that le and continue to run from that point. For successful checkpoint and recovery, the process must not depend heavily on its Linux environment. Specically, the following processes cannot be checkpointed at all: 1. Processes with setuid/setgid privileges (for security reasons). 2. Processes with open pipes or sockets. The following processes can be checkpointed, but may not run correctly after being recovered:

January 2012

MOSIX

MOSRUN (M1)

MOSIX Commands

MOSRUN (M1)

1. Processes that rely on process-IDs of themselves or other processes (parent, sons, etc.). 2. Processes that rely on parent-child relations (e.g. use wait(2), Terminal job-control, etc.). 3. Processes that coordinate their input/output with other running processes. 4. Processes that rely on timers and alarms. 5. Processes that cannot afford to lose signals. 6. Processes that use system-V IPC (semaphores and messages). The -C{filename} argument species where to save checkpoints: when a new checkpoint is saved, that le-name is given a consecutive numeric extension (unless it already has one). For example, if the argument -Cmysave is given, then the rst checkpoint will be saved to mysave.1, the second to mysave.2, etc., and if the argument -Csave.4 is given, then the rst checkpoint will be saved to save.4, the second to save.5, etc. If the -C argument is not provided, then the checkpoints will be saved to the default: ckpt.{pid}.1, ckpt.{pid}.2 ... The -C argument is NOT inherited by child processes. The -N{max} argument species the maximum number of checkpoints to produce before recycling the checkpoint versions. This is mainly needed in order to save disk space. For example, when running with the arguments: -Csave.4 -N3, checkpoints will be saved in save.4, save.5, save.6, save.4, save.5, save.6, save.4 . . . The -N0 argument returns to the default of unlimited checkpoints; an argument of -N1 is risky, because if there is a crash just at the time when a backup is taken, there could be no remaining valid checkpoint le. Similarly, if the process can possibly have open pipe(s) or socket(s) at the time a checkpoint is taken, a checkpoint le will be created and counted - but containing just an error message, hence this argument should have a large-enough value to accommodate this possibility. The -N argument is NOT inherited by child processes. Checkpoints can be triggered by the program itself, by a manual request (see mosmigrate(1)) and/or at regular time intervals. The -A{minutes} argument requests that checkpoints be automatically taken every given number of minutes. Note that if the process is within a blocking system-call (such as reading from a terminal) when the time for a checkpoint comes, the checkpoint will be delayed until after the completion of that system call. Also, when the process is frozen, it will not produce a checkpoint until unfrozen. The -A argument is NOT inherited by child processes. With the -R{filename} argument, mosrun recovers and continue to run the process from its saved checkpoint le. Program options are not permitted with -R, since their values are recovered from the checkpoint le. It is not always possible (or desirable) for a recovered program to continue to use the same les that were open at the time of checkpoint: mosrun -I{filename} inspects a checkpoint le and lists the open les, along with their modes, ags and offsets, then the -O argument allows the recovered program to continue using different les. Files specied using this option, will be opened (or created) with the previous modes, ags and offsets. The format of this argument is usually a comma-separated list of le-descriptor integers, followed by a = sign and a le-name. For example: -O1=oldstdout,2=oldstderr,5=tmpfile, but in case one or more le-names contain a comma, it is optional to begin the argument with a different separator, for example: -O@1=file,with,commas@2=oldstderr@5=tmpfile. In the absence of the -O argument, regular les and directories are re-opened with the previous modes, ags and offsets. Files that were already unlinked at the time of checkpoint, are assumed to be temporary les belonging to the process, and are also saved and recovered along with the process (an exception is if an unlinked le was opened for write-only). Unpredictable results may occur if such les are used to communicate with other processes. As for special les (most commonly the users terminal, used as standard input, output or error) that were open at the time of checkpoint - if mosrun is called with their le-descriptors open, then the existing open les are used (and their modes, ags and offsets are not modied). Special les that are neither specied in

MOSIX

January 2012

MOSRUN (M1)

MOSIX Commands

MOSRUN (M1)

the -O argument, nor open when calling mosrun, are replaced with /dev/null. While a checkpoint is being taken, the partially-written checkpoint le has no permissions (chmod 0). When the checkpoint is complete, its mode is changed to 0400 (read-only). PRIVATE TEMPORARY FILES Normally, all les are created on the home-node by migratable programs and all le-operations are performed there. This is important because programs often share les, but can be costly: many programs use temporary les which they never share - they create those les as secondary-memory and discard them when they terminate. It is best to migrate such les with the process rather than to keep them in the home-node. The -X {/directory} argument tells Mosrun that a given directory is only used for private temporary les: all les that the program creates in this directory are kept with the process that created them and migrate with it. The -X argument may be repeated, specifying up to 10 private temporary directories. The directories must start with /; can be up to 256 characters long; cannot include ".."; and for security reasons cannot be within "/etc", "/proc", "/sys" or "/dev". Only regular les are permitted within private temporary directories: no sub-directories, links, symbolic-links or special les are allowed (but sub-directories can be specied by an extra -X argument). Private temporary le names must begin with / (no relative pathnames) and contain no ".." components. The only le operations currently supported for private temporary les are: open, creat, lseek, read, write, close, chmod, fchmod, unlink, truncate, ftruncate, access, stat. File-access permissions on private temporary les are provided for compatibility, but are not enforced: the stat(2) system-call returns 0 in st_uid and st_gid. stat(2) also returns the le-modication times according to the node where the process was running when making the last change to the le. The per-process maximum total size of all private temporary les is set by the system-administrator. Different maximum values can be imposed when running on the home-node, in the local cluster and on other clusters in the multi-cluster - exceeding this maximum will cause a process to migrate back to its home-node. ALTERNATIVE FREEZING SPACE Migratable processes can sometimes be frozen (you can freeze your processes manually and the systemadministrator usually sets an automatic-freezing policy - See mosix(7)). The memory-image of frozen processes is saved to disk. Normally the system-administrator determines where on disk to store your frozen processes, but you can override this default and set your own freezingspace. One possible reason to do so is to ensure that your processes (or some of them) have sufcient freezing space regardless of what other users do. Another possible reason is to protect other users if you believe that your processes (or some of them) may require so much memory that they could disturb other users. Setting your own freezing space can be done either by setting the environment-variable FREEZE_DIR to an alternative directory (starting with /); or if you wish to specify more than one freeze-directory, by creating a le: $HOME/.freeze_dirs where each line contains a directory-name starting with /. For more details, read about "lines starting with /" within the section about conguring /etc/mosix/freeze.conf in the mosix(7) manual. You must have write-access to the your alterantive freeze-directory(s). The space available in alternative freeze-directories is subject to possible disk quotas. RECURSIVE MOSRUN It is possible to invoke mosrun within a program that is already running under mosrun. This is common, for example, within shell-scripts or a Makefile that contains calls to mosrun.

January 2012

MOSIX

MOSRUN (M1)

MOSIX Commands

MOSRUN (M1)

Unless requesting explicitly otherwise, recursive programs will inherit the following (and only the following) arguments: -c, -d, -e, -J, -G, -L, -l, -m, -n, -T, -t, -u, -w. If you want a program running under mosrun (including a shell or shell-script) to fork a non-migratable child-program, use the utility: mosnative {program} [args]... Mosnative programs are run directly under Linux in their parents home-node and are not subject to the limitations of migratable programs (but cannot migrate to other nodes either). MOSENV The variant mosenv is used to circumvent the loss of certain environment variables by the GLIBC library due to the fact that mosrun is a "setuid" program: if your program relies on the settings of dynamic-linking environment variables (such as LD_LIBRARY_PATH) or malloc(3) debugging (MALLOC_CHECK_), use mosenv instead of mosrun. LIMITATIONS Some system-calls are not supported by migratable programs, including system-calls that are tightly connected to resources of the local node or intended for system-administration. These are: acct, add_key, adjtimex, afs_syscall, bdush, capget, capset, chroot, clock_getres, clock_nanosleep, clock_settime, create_module, delete_module, epoll_create, epoll_create1, epoll_ctl, epoll_pwait, epoll_wait, eventfd, eventfd2, fanotify_init, fanotify_mark, futex, get_kernel_syms, get_mempolicy, get_robust_list, getcpu, getpmsg, init_module, inotify_add_watch, inotify_init, inotify_init1, inotify_rm_watch, io_cancel, io_destroy, io_getevents, io_setup, io_submit, ioperm, iopl, ioprio_get, ioprio_set, kexec_load, keyctl, lookup_dcookie, madvise, mbind, migrate_pages, mlock, mlockall, move_pages, mq_getsetattr, mq_notify, mq_open, mq_timedreceive, mq_timedsend, mq_unlink, munlock, munlockall, nfsservctl, perf_event_open, personality, pivot_root, prlimit64, prof_counter_open, ptrace, quotactl, reboot, recvmmsg, remap_le_pages, request_key, rt_sigqueueinfo, rt_sigtimedwait, rt_tgsigqueueinfo, sched_get_priority_max, sched_get_priority_min, sched_getafnity, sched_getparam, sched_getscheduler, sched_rr_get_interval, sched_setafnity, sched_setparam, sched_setscheduler, security, set_mempolicy, setdomainname, sethostname, set_robust_list, settimeofday, shmat, signalfd, signalfd4, swapoff, swapon, syslog, timer_create, timer_delete, timer_getoverrun, timer_gettime, timer_settime, timerfd, timerfd_gettime, timerfd_settime, tuxcall, unshare, uselib, vmsplice, waitid. In addition, mosrun supports only limited options for the following system-calls: clone The only permitted ags are CLONE_CHILD_SETTID, CLONE_PARENT_SETTID, CLONE_CHILD_CLEARTID, and the combination CLONE_VFORK|CLONE_VM; the child-termination signal must be SIGCLD and the stack-pointer (child_stack) must be NULL. getpriority may refer only to the calling process. ioctl The following requests are not supported: TIOCSERGSTRUCT, TIOCSERGETMULTI, TIOCSERSETMULTI, SIOCSIFFLAGS, SIOCSIFMETRIC, SIOCSIFMTU, SIOCSIFMAP, SIOCSIFHWADDR, SIOCSIFSLAVE, SIOCADDMULTI, SIOCDELMULTI, SIOCSIFHWBROADCAST, SIOCSIFTXQLEN, SIOCSMIIREG, SIOCBONDENSLAVE, SIOCBONDRELEASE, SIOCBONDSETHWADDR, SIOCBONDSLAVEINFOQUERY, SIOCBONDINFOQUERY, SIOCBONDCHANGEACTIVE, SIOCBRADDIF, SIOCBRDELIF. Non-standard requests that are dened in drivers that are not part of the standard Linux kernel are also likely to not be supported.

MOSIX

January 2012

MOSRUN (M1)

MOSIX Commands

MOSRUN (M1)

ipc

the following SYSV-IPC calls are not supported: shmat, semtimedop, new-version calls (bit 16 set in call-number). mmap MAP_SHARED and mapping of special-character devices are not permitted. prctl only the PR_SET_DEATHSIG and PR_GET_DEATHSIG options are supported. setpriority may refer only to the calling process. setrlimit it is not permitted to modify the maximum number of open les (RLIMIT_NOFILES): mosrun xes this limit at 1024. Users are not permitted to send the SIGSTOP signal to programs run by mosrun. SIGTSTP should be used instead (and moskillall(1) changes SIGSTOP to SIGTSTP).

Attempts to run 32-bit programs by mosrun will result in the program running in native mode (as if it was run by mosnative). SEE ALSO mosbatch(1), mosmigrate(1), mosq(1), moskillall(1), mosps(1), direct_communication(7), mosix(7).

January 2012

MOSIX

MOSTESTLOAD (M1)

MOSIX EXTRAS

MOSTESTLOAD (M1)

NAME mostestload - MOSIX test program SYNOPSIS mostestload [OPTIONS] DESCRIPTION A test program that generates articial load and consumes memory for testing the operation of MOSIX. OPTIONS -t{seconds} | --time={seconds} Run for a given number of CPU seconds: the default is 1800 seconds (30 minutes). A value of 0 causes mostestload to run indenitely. OR: -t{min},{max} | --time={min},{max} Run for a random number of seconds between min and max. -m{mb}, --mem={mb} amount of memory to consume in Megabytes (by default, mostestload consumes no signicant amount of memory). --random-mem Fill memory with a random pattern (otherwise, memory is lled with the same byte-value). --cpu={N} When testing pure CPU jobs - perform N units of CPU work, then exit. When also doing system-calls (--read, --write, --noiosyscall) - perform N units of CPU work between chunks of system-calls. --read[={size}[,{ncalls}[,{repeats}]] --write[={size}[,{ncalls}[,{repeats}]] perform read OR write system calls of size KiloBytes (default=1KB). These calls are repeated in a chunk of ncalls times (default=1024), then those chunks are repeated repeats times (default=indenitely), with optional CPU work between chunks if the --cpu option is also set. --noiosyscall={ncalls}[,{repeats}] perform some other system call that does not involve I/O ncalls times (default=1024), repeat this {repeats} times (default=indenitely), with optional CPU work in between if the --cpu option is also set. -d, --dir={directory} -f, --file={filename} select a directory OR a le on which to perform reading or writing (the default is to create a le in the /tmp directory). --maxiosize={SIZE} Once the le size reaches SIZE megabytes, further I/O will resume at the beginning of the le. -v, --verbose produce debug-output. --report-migrations Report when mostestload migrates. -r, --report Produce summary at end of run. --sleep SEC Sleep for SEC seconds before starting -h, --help Display a short help screen.

January 2012

MOSIX

MOSTESTLOAD (M1)

MOSIX EXTRAS

MOSTESTLOAD (M1)

EXAMPLES mostestload -t 20 run CPU for 20 seconds mostestload -l 10 -h 20 runs CPU for a random period of time between 10 and 20 seconds. mostestload -f /tmp/20MB --write 32,640,1 writes 32 KiloBytes of data 640 times (total 20 megabytes) to the le /tmp/20MB. mostestload -f /tmp/10MB --write 32,640 --maxiosize 10 --cpu=20 writes 32 KiloBytes of data 640 times (total 20 megabytes) to the le /tmp/10MB, alternating this indenitely with running 20 units of CPU. The le "/tmp/10MB" is not allowed to grow beyond 10 MegaBytes: once reaching that limit, writing resumes at the beginning of the le. AUTHOR Adapted from code by Lior Amar

MOSIX

January 2012

MOSTIMEOF (M1)

MOSIX Commands

MOSTIMEOF (M1)

NAME MOSTIMEOF - Report CPU usage of migratable processes SYNOPSIS timeof {pid}... DESCRIPTION Mostimeof reports the amount of CPU-time accumulated by one or more MOSIX migratable processes, no matter where they run. Its argument(s) are the process-IDs of processes to inspect. NOTES 1. Mostimeof must run on the process home-node, so if the process was generated by mosrun -M, then mostimeof has to be invoked on the remote home-node. 2. The report is of user-level CPU-time: system-time is not included. 3. In clusters (or multi-clusters) where different nodes have different CPU speeds, the results could be the sum of CPU times from slower and faster processors. Such reslts cannot be used for determining how long the inspected process(es) are still expected to run. SEE ALSO mosps(1), mosrun(1), mosix(7).

January 2012

MOSIX

DIRECT COMMUNICATION (M7)

MOSIX Description

DIRECT COMMUNICATION (M7)

NAME DIRECT COMMUNICATION migratable sockets between MOSIX processes PURPOSE Normally, migratable MOSIX processes do all their I/O (and most system-calls) via their home-node: this can be slow because operations are limited by the network speed and latency. Direct communication allows processes to pass messages directly between them, bypassing their home-nodes. For example, if process X whose home-node is A and runs on node B wishes to send a message over a socket to process Y whose home-node is C and runs on node D, then the message has to pass over the network from B to A to C to D. Using direct communication, the message will pass directly from B to D. Moreover, if X and Y run on the same node, the network is not used at all. To facilitate direct communication, each MOSIX process (running under mosrun(1)) can own a "mailbox". This mailbox can contain at any time up to 10000 unread messages of up to a total of 32MB. MOSIX Processes can send messages to mailboxes of other processes anywhere within the multi-cluster (that are willing to accept them). Direct communication makes the location of processes transparent, so the senders do not need to know where the receivers run, but only to identify them by their home-node and process-ID (PID) in their home-node. Direct communication guarantees that the order of messages per receiver is preserved, even when the sender(s) and receiver migrate - no matter where to and how many times they migrate. SENDING MESSAGES To start sending messages to another process, use: them = open("/proc/mosix/mbox/{a.b.c.d}/{pid}", 1); where {a.b.c.d} is the IP address of the receivers home-node and {pid} is the process-ID of the receiver. To send messages to a process with the same home-node, you can use 0.0.0.0 instead of the local IP address (this is even preferable, because it allows the communication to proceed in the rare event when the home-node is shut-down from its cluster). The returned value (them) is not a standard (POSIX) le-descriptor: it can only be used within the following system calls: w = write(them, message, length); fcntl(them, F_SETFL, O_NONBLOCK); fcntl(them, F_SETFL, 0); dup2(them, 1); dup2(them, 2); close(them); Zero-length messages are allowed. Each process may at any time have up to 128 open direct communication le-descriptors for sending messages to other processes. These le-descriptors are inherited by child processes (after fork(2)). When dup2 is used as above, the corresponding le-descriptor (1 for standard-output; 2 for standard-error) is associated with sending messages to the same process as them. In that case, only the above calls (write, fcntl, close, but not dup2) can then be used with that descriptor.

Janury 2012

MOSIX

DIRECT COMMUNICATION (M7)

MOSIX Description

DIRECT COMMUNICATION (M7)

RECEIVING MESSAGES To start receiving messages, create a mailbox: my_mbox = open("/proc/mosix/mybox", O_CREAT, flags); where flags is any combination (bitwise OR) of the following: 1 2 4 8 Allow receiving messages from other users of the same group (GID). Allow receiving messages from all other users. Allow receiving messages from processes with other home-nodes. Do not delay: normally when attempting to receive a message and no tting message was received, the call blocks until either a message or a signal arrives, but with this ag, the call returns immediately a value of -1 (with errno set to EAGAIN). Receive a SIGIO signal (See signal(7)) when a message is ready to be read (for assynchroneous operation). Normally, when attempting to read and the next message does not t in the read buffer (the message length is bigger than the count parameter of the read(2) system-call), the next message is truncated. When this bit is set, the rst message that ts the read-buffer will be read (even if out of order): if none of the pending messages ts the buffer, the receiving process either waits for a new message that ts the buffer to arrive, or if bit 8 ("do not delay") is also set, returns -1 with errno set to EAGAIN. Treat zero-length messages as an end-of-le condition: once a zero-length message is read, all further reads will return 0 (pending and future messages are not deleted, so they can still be read once this ag is cleared).

16 32

The returned value (my_mbox) is not a standard (POSIX) le-descriptor: it can only be used within the following system calls: r = read(my_mbox, buf, count); r = readv(my_mbox, iov, niov); dup2(my_mbox, 0); close(my_mbox); ioctl(my_mbox, SIOCINTERESTED, addr); ioctl(my_mbox, SIOCSTOREINTERESTS, addr); ioctl(my_mbox, SIOCWHICH, addr); (see FILTERING below) Reading my_mbox always reads a single message at a time, even when count allows reading more messages. A message can have zero-length, but count cannot be zero. A count of -1 is a special request to test for a message without actually reading it. If a message is present for reading, read(my_mbox, buf, -1) returns its length - otherwise it returns -1 with errno set to EAGAIN. unlike in "SENDING MESSAGES" above, my_mbox is NOT inherited by child processes. When dup2 is used as above, le-descriptor 0 (standard-input) is associated with receiving messages from other processes, but only the read, readv and close system-calls can then be used with le-descriptor 0. Closing my_mbox (or close(0) if dup2(my_mbox, 0) was used - whichever is closed last) discards all pending messages. To change the flags of the mailbox without losing any pending messages, open it again (without using close): my_mbox = open("/proc/mosix/mybox", O_CREAT, new_flags);

MOSIX

Janury 2012

DIRECT COMMUNICATION (M7)

MOSIX Description

DIRECT COMMUNICATION (M7)

Note that when removing permission-ags (1, 2 and 4) from new_flags, messages that were already sent earlier will still arrive, even from senders that are no longer allowed to send messages to the current process. Re-opening always returns the same value (my_mbox) as the initial open (unless an error occurs and -1 is returned). Also note that if dup2(my_mbox, 0) was used, new_flags will immediately apply to ledescriptor 0 as well. Extra information is available about the latest message that was read (including when the count parameter of the last read() was -1 and no reading actually took place). To get this information, you should rst dene the following macro: static inline unsigned int GET_IP(char file_name) { int ip = open(file_name, 0); return((unsigned int)((ip==-1 && errno>255) ? -errno: ip)); } To nd the IP address of the senders home, use: sender_home = GET_IP("/proc/self/sender_home"); To nd the process-ID (PID) of the sender, use: sender_pid = open("/proc/self/sender_pid", 0); To nd the IP address of the node where the sender was running when the message was sent, use: sender_location = GET_IP("/proc/self/sender_location"); (this can be used, for example, to request a manual migration to bring together communicating processes to the same node) To nd the length of the last message, use: bytes = open("/proc/self/message_length", 0); (this makes it possible to detect truncated messages: if the last message was truncated, bytes will contain the original length) FILTERING The following facility allows the receiver to select which types of messages it is interested to receive: struct interested { unsigned char conditions; / bitmap of conditions / unsigned char testlen; / length of test-pattern (1-8 bytes) / int pid; / Process-ID of sender / unsigned int home; / home-node of sender (0 = same home) / int minlen; / minimum message length / int maxlen; / maximum message length / int testoffset; / offset of test-pattern within message / unsigned char testdata[8]; / expected test-pattern / int msgno; / pick a specic message (starting from 1) / int msgoffset; / start reading from given offset / }; / conditions: / #dene INTERESTED_IN_PID 1 #dene INTERESTED_IN_HOME 2

Janury 2012

MOSIX

DIRECT COMMUNICATION (M7)

MOSIX Description

DIRECT COMMUNICATION (M7)

#dene INTERESTED_IN_MINLEN #dene INTERESTED_IN_MAXLEN #dene INTERESTED_IN_PATTERN #dene INTERESTED_IN_MESSAGENO #dene INTERESTED_IN_OFFSET #dene PREVENT_REMOVAL struct interested lter; struct interests { long number; struct interested lters; } lters;

4 8 16 32 64 128

/ number of lters / / lters to store /

#dene SIOCINTERESTED 0x8985 #dene SIOCKSTOREINTERESTS 0x8986 #dene SIOCWHICH 0x8987 A call to: ioctl(my_mbox, SIOCINTERESTED, &filter); starts applying the given filter, while a call to: ioctl(my_mbox, SIOCINTERESTED, NULL); cancels the ltering. Closing my_mbox also cancels the ltering (but re-opening with different ags does not cancel the ltering). Calls to this ioctl return the address of the previous lter. When ltering is applied, only messages that comply with the lter are received: if there are no complying messages, the receiving process either waits for a complying message to arrive, or if bit 8 ("do not delay") of the flags from open("/proc/self/mybox", O_CREAT, flags) is set, read(my_mbox,...) and readv(my_mbox,...) return -1 with errno set to EAGAIN. Filtering can also be used to test for particular messages using read(my_mbox, buf, -1). Different types of messages can be received simply by modifying the contents of the filter between calls to read(my_mbox,...) (or readv(my_mbox,...)). filter.conditions is a bit-map indicating which condition(s) to consider: When INTERESTED_IN_PID is set, the process-ID of the sender must match filter.pid. When INTERESTED_IN_HOME is set, the home-node of the sender must match filter.home (a value of 0 can be used to match senders from the same home-node). When INTERESTED_IN_MINLEN is set, the message length must be at least filter.minlen bytes long. When INTERESTED_IN_MAXLEN is set, the message length must be no longer than filter.maxlen bytes. When INTERESTED_IN_PATTERN is set, the message must contain a given pattern of data at a given offset. The offset within the message is given by filter.testoffset, the patterns length (1 to 8 bytes) in filter.testlen and its expected contents in filter.testdata. When INTERESTED_IN_MESSAGENO is set, the message numbered filter.msgno (numbering starts from 1) will be read out of the queue of received messages.

MOSIX

Janury 2012

DIRECT COMMUNICATION (M7)

MOSIX Description

DIRECT COMMUNICATION (M7)

When INTERESTED_IN_OFFSET is set, reading begins at the offset filter.msgoffset of the messages data. When PREVENT_REMOVAL is set, read messages are not removed from the message-queue, so they can be re-read until this ag is cleared. A call to: ioctl(my_mbox, SIOCSTOREINTERESTS, &filters); stores an array of lters for later use by MOSIX: filters.number should contain the number of lters (0-1024) and filters.filters should point to an array of lters (in which the conditions INTERESTED_IN_MESSAGENO, INTERESTED_IN_OFFSET and PREVENT_REMOVAL are ignored). Successful calls return 0. Closing my_mbox also discards the stored lters (but re-opening with different ags does not). A call to: ioctl(my_mbox, SIOCWHICH, &bitmap) lls the given bitmap with information, one bit per lter, about whether (1) or not (0) there are any pending messages that match the lters that were previously stored by SIOCSTOREINTERESTS (above). The number of bytes affected in bitmap depends on the number of stored lters. If unsure, reserve the maximum of 128 bytes (for 1024 lters). Successful calls return the number of lters previously stored by SIOCSTOREINTERESTS. ERRORS Sender errors: ENOENT Invalid pathname in open: the specied IP address is not part of this cluster/multi-cluster, or the process-ID is out of range (must be 2-32767). ESRCH No such process (this error is detected only when attempting to send - not when opening the connection). EACCES No permission to send to that process. ENOSPC Non-blocking (O_NONBLOCK) was requested and the receiver has no more space to accept this message - perhaps try again later. ECONNABORTED The home-node of the receiver is no longer in our multi-cluster. EMFILE The maximum of 128 direct communicaiton le-descriptors is already in use. EINVAL When opening, the second parameter does not contain the bit "1"; When writing, the length is negative or more than 32MB. ETIMEDOUT Failed to establish connection with the mail-box managing daemon (mospostald). ECONNREFUSED The mail-box managing (mospostald) refused to serve the call (probably a MOSIX installation error).

Janury 2012

MOSIX

DIRECT COMMUNICATION (M7)

MOSIX Description

DIRECT COMMUNICATION (M7)

EIO

Communication breakdown with the mail-box managing daemon (mospostald).

Receiver errors: EAGAIN No message is currently available for reading and the "Do not delay" ag is set (or count is -1). EINVAL One or more values in the ltering structure are illegal or their combination makes it impossible to receive any message (for example, the offset of the data-pattern is beyond the maximum message length). Also, an attempt to store either a negative number or more than 1024 lters. ENODATA The INTERESTED_IN_MESSAGENO lter is used, and either "no truncating" was requested (32 in the open-ags) while the message does not t the read buffer, or the message does not full the other ltering conditions. Errors that are common to both sender and receiver: EINTR Read/write interrupted by a signal.

ENOMEM Insufcient memory to complete the operation. EFAULT Bad read/write buffer address. ENETUNREACH Could not establish a connection with the mail-box managing deamon (mospostald). ECONNRESET Connection lost with the mail-box managing daemon (mospostald). POSSIBLE APPLICATIONS The scope of direct communication is very wide: almost any program that requires communication between related processes can benet. Following are a few examples: 1. 2. 3. Use direct communication within standard communication packages and libraries, such as MPI. Pipe-like applications where one process output is the others input: write your own code or use the existing mospipe(1) MOSIX utility. Direct communiction can be used to implement fast I/O for migrated processes (with the cooperation of a local process on the node where the migrated process is running). In particular, it can be used to give migrated processes access to data from a common NFS server without causing their home-node to become a bottleneck.

LIMITATIONS Processes that are involved in direct communication (having open le-descriptors for either sending or receiving messages) cannot be checkpointed and cannot execute mosrun recursively or mosnative (see mosrun(1)). SEE ALSO mosrun(1), mospipe(1), mosix(7).

MOSIX

Janury 2012

MOSIX (M7)

MOSIX Description

MOSIX (M7)

NAME MOSIX sharing the power of clusters and multi-clusters INTRODUCTION MOSIX is a generic solution for dynamic management of resources in a cluster or in an organizational multicluster. MOSIX allows users to draw the most out of all the connected computers, including utilization of idle computers. At the core of MOSIX are adaptive resource sharing algorithms, applying preemptive process migration based on processor loads, memory and I/O demands of the processes, thus causing the cluster or the multicluster to work cooperatively similar to a single computer with many processors. Unlike earlier versions of MOSIX, only programs that are started by the mosrun(1) utility are affected and can be considered "migratable" - other programs are considered as "standard Linux programs" and are not affected by MOSIX. MOSIX maintains a high level of compatiblity with standard Linux, so that binaries of almost every application that runs under Linux can run completely unmodied under the MOSIX "migratable" category. The exceptions are usually system-administration or graphic utilities that would not benet from process-migration anyway. If a "migratable" program that was started by mosrun(1) attempts to use unsupported features, it will either be killed with an appropriate error message, or if a do not kill option is selected, an error is returned to the program: such programs should probably run as standard Linux programs. In order to improve the overall resource usage, processes of "migratable" programs may be moved automatically and transparently to other nodes within the cluster or even the multi-cluster. As the demands for resources change, processes may move again, as many times as necessary, to continue optimizing the overall resource utilization, subject to the inter-cluster priorities and policies. Manual-control over process migration is also supported. MOSIX is particularly suitable for running CPU-intensive computational programs with unpredictable resource usage and run times, and programs with moderate amounts of I/O. Programs that perform large amounts of I/O should better be run as standard Linux programs. Apart from process-migration, MOSIX can provide both "migratable" and "standard Linux" programs with the benets of optimal initial assignment and live-queuing. The unique feature of live-queuing means that although a job is queued to run later, when resources are available, once it starts, it remains attached to its original Unix/Linux environment (standard-input/output/error, signals, etc.). REQUIREMENTS 1. All nodes must run Linux (any distribution - mixing allowed). 2. All participating nodes must be connected to a network that supports TCP/IP and UDP/IP, where each node has a unique IP address in the range 0.1.0.0 to 255.255.254.255 that is accessible to all the other nodes. TCP/IP ports 249-254 and UDP/IP ports 249-250 and 253 must be available for MOSIX (not used by other applications or blocked by a rewall). The architecture of all nodes must be x86_64 (64-bit). In multiprocessor nodes (SMP), all the processors must be of the same speed. The system-administrators of all the connected nodes must be able to trust each other (see more on SECURITY below).

3. 4. 5. 6.

January 2012

MOSIX

MOSIX (M7)

MOSIX Description

MOSIX (M7)

CLUSTER AND MULTI-CLUSTER The MOSIX concept of a "cluster" is a collection of computers that are owned and managed by the same entity (a person, a group of people or a project) - this can at times be quite different than a hardware cluster, as each MOSIX cluster may range from a single workstation to a large combination of computers - workstations, servers, blades, multi-core computers, etc. possibly of different speeds and number of processors and possibly in different locations. A MOSIX multi-cluster is a collection of clusters that belong to different entities (owners) who wish to share their resources subject to certain administrative conditions. In particular, when an owner needs its computers - these computers must be returned immediately to the exclusive use of their owner. An owner can also assign priorities to guest processes of other owners, dening who can use their computers and when. Typically, an owner is an individual user, a group of users or a department that own the computers. The multicluster is usually restricted, due to trust and security reasons, to a single organization, possibly in various sites/branches, even across the world. MOSIX supports dynamic multi-cluster congurations, where clusters can join and leave at any time. When there are plenty of resources in the multi-cluster, the MOSIX queuing system allows more processes to start. When resources become scarce (because other clusters leave or claim their resources and processes must migrate back to their home-clusters), MOSIX has a freezing feature that can automatically freeze excess processes to prevent memory-overload on the home-nodes. CONFIGURATION To congure MOSIX interactively, simply run mosconf: it will lead you step-by-step through the various conguration items. Mosconf can be used in two ways: 1. 2. To congure the local node (press <Enter> at the rst question). To congure MOSIX for other nodes: this is typically done on a server that stores an image of the rootpartition for some or all of the cluster-nodes. This image can, for example, be NFS-mounted by the cluster-nodes, or otherwise copied or reected to them by any other method: at the rst question, enter the path to the stored root-image.

There is no need to stop MOSIX in order to modify the conguration - most changes will take effect within a minute. However, after modifying the list of nodes in the cluster (/etc/mosix/mosix.map) or /etc/mosix/mosip or /etc/mosix/myfeatures, you should run the command "mossetpe" (but when you are using mosconf to congure your local node, this is not necessary). Below is a detailed description of the MOSIX conguration les (if you prefer to edit them manually). The directory /etc/mosix should include at least the subdirectories /etc/mosix/partners, /etc/mosix/var, /etc/mosix/var/multi and the following les: /etc/mosix/mosix.map This le denes which computers participate in your MOSIX cluster. The le contains up to 256 data lines and/or alias lines that can be in any order. It may also include any number of comment lines beginning with a #, as well as empty lines. Data lines have 2 or 3 elds: 1. 2. 3. The IP ("a.b.c.d" or host-name) of the rst node in a range of nodes with consecutive IPs. The number of nodes in that range. Optional combination of letter-ags:

MOSIX

January 2012

MOSIX (M7)

MOSIX Description

MOSIX (M7)

p[roximate] do not use compression on migration, e.g., over fast networks or slow CPUs. o[utsider] inaccessible to local-class processes. Alias lines are of the form: a.b.c.d=e.f.g.h or a.b.c.d=host-name They indicate that the IP address on the left-hand-side refers to the same node as the right-hand-side. NOTES: 1. 2. 3. 4. 5. It is an error to attempt to declare the local node an "outsider". When using host names, the rst result of gethostbyname(3) must return their IP address that is to be used by MOSIX: if in doubt - specify the IP address. The right-hand-side in alias lines must appear within the data lines. IP addresses 0.0.x.x and 255.255.255.x are not allowed in MOSIX. If you change /etc/mosix/mosix.map while MOSIX is running, you need to run mossetpe to notify MOSIX of the changes.

/etc/mosix/secret This is a security le that is used to prevent ordinary users from interfering and/or compromizing security by connecting to the internal MOSIX TCP ports. The le should contain just a single line with a password that must be identical on all the nodes of the cluster/multi-cluster. This le must be accessible to ROOT only (chmod 600!) /etc/mosix/ecsecret Like /etc/mosix/secret, but used on the client-side for running programs with a different home-node (mosrun -M) as well as batch jobs (see mosrun(1) and mosbatch(1)). If you do not wish to allow this node to use these features, then do not create this le. /etc/mosix/essecret Like /etc/mosix/ecsecret, but on the server side, where the password must match the clients . If you do not wish to allow this node to be a home-node for programs from other nodes, or to serve batch-jobs, then do not create this le. The following les are optional: /etc/mosix/mosip This le contains our IP address, to be used for MOSIX purposes, in the regular format - a.b.c.d. This le is only necessary when the nodes IP address is ambiguous: it can be safely omitted if the output of ifconfig(8) ("inet addr:") matches exactly one of the IP addresses listed in the data lines of /etc/mosix/mosix.map. /etc/mosix/freeze.conf This le sets the automatic freezing policies on a per-class basis for MOSIX processes originating in this node. Each line describes the policy for one class of processes. The lines can be in any order and classes that are not mentioned are not touched by the automatic freezing mechanisms. The space-separated constants in each line are as follows: 1. class-number A positive integer identifying a class of processes. 2. load-units: Used in elds #3-#6 below: 0=processes; 1=standard-load

January 2012

MOSIX

MOSIX (M7)

MOSIX Description

MOSIX (M7)

3. RED-MARK (oating point) Freeze when load is higher. 4. BLUE-MARK (oating point) Unfreeze when load is lower. 5. minautofreeze (oating point) Freeze processes that are evacuated back home on arrival if load gets equal or above this. 6. minclustfreeze (oating point) Freeze processes that are evacuated back to this cluster on arrival if load gets equal or above this. 7. min-keep Keep running at least this number of processes - even if load is above RED-MARK. 8. max-procs Freeze excess processes above this number - even if load is below BLUE-MARK. 9. slice Time (in minutes) that a process of this class is allowed to run while there are automatically-frozen process(es) of this class. After this period, the running process will be frozen and a frozen process will start to run. 10. killing-memory Freezing fails when there is insufcient disk-space to save the memory-image of the frozen process - kill processes that failed to freeze and have above this number of MegaBytes of memory. Processes with less memory are kept alive (and in memory). Setting this value to 0, causes all processes of this class to be killed when freezing fails. Setting it to a very high value (like 1000000 MegaBytes) keeps all processes alive. NOTES: 1. The load-units in elds #3-#6 depend on eld #2. If 0, each unit represents the load created by a CPU-bound process on this computer. If 1, each unit represents the load created by a CPUbound process on a "standard" MOSIX computer (e.g. a 3GHz Intel Core 2 Duo E6850). The difference is that the faster the computer and the more processors it has, the load created by each CPU process decreases proportionally. Fields #3,#4,#5,#6 are oating-point, the rest are integers. A value of "-1" in elds #3,#5,#6,#8 means ignoring that feature. The rst 4 elds are mandatory: omitted elds beyond them have the following values: minautofreeze=-1,minclusterfreeze=-1,min-keep=0, max-procs=-1,slice=20. The RED-MARK must be signicantly higher than BLUE-MARK: otherwise a perpetual cycle of freezing and unfreezing could occur. You should allow at least 1.1 processes difference between them. Frozen processes do not respond to anything, except an unfreeze request or a signal that kills them. Processes that were frozen manually are not unfrozen automatically.

2. 3. 4. 5.

6. 7.

This le may also contain lines starting with / to indicate freezing-directory names. A "Freezing directory" is an existing directory (often a mount-point) where the memory contents of frozen process is saved. For successful freezing, the disk-partition of freezing-directories should have sufcient free disk-space to contain the memory image of all the frozen processes. If more than one freezing directory is listed, the freezing directory is chosen at random by each freezing process. It is also possible to assign selection probabilities by adding a numeric weight after the directory-name, for example:

MOSIX

January 2012

MOSIX (M7)

MOSIX Description

MOSIX (M7)

/tmp 2 /var/tmp 0.5 /mnt/tmp 2.5 In this example, the total weight is 2+0.5+2.5=5, so out of every 10 frozen processes, an average of 4 (102/5) will be frozen to /tmp, an average of 1 (100.5/5) to /var/tmp and an average of 5 (102.5/5) to /mnt/tmp. When the weight is missing, it defaults to 1. A weight of 0 means that this directory should be used only if all others cannot be accessed. If no freezing directories are specied, all freezing will be to the /freeze directory (or symboliclink). Freezing les are usually created with "root" (Super-User) permissions, but if /etc/mosix/freeze.conf contains a line of the form: U {UID} then they are created with permissions of the given numeric UID (this is sometimes needed when freezing to NFS directories that do not allow "root" access). /etc/mosix/partners/ If your cluster is part of a multi-cluster, then each le in /etc/mosix/partners describes another cluster that you want this cluster to cooperate with. The le-names should indicate the corresponding cluster-names (maximum 128 characters), for example: "geography", "chemistry", "management", "development", "sales", "students-lab-A", etc. The format of each le is a follows: Line #1: A verbal human-readable description of the cluster. Line #2: Four space-separated integers as follows: 1. Priority: 0-65535, the lower the better. The priority of the local cluster is always 0. MOSIX gives precedence to processes with higher priority - if they arrive, guests with lower priority will be expelled. 2. Cango: 0=never send local processes to that cluster. 1=local processes may go to that cluster. 3. Cantake: 0=do not accept guest-processes from that cluster. 1=accept guest-processes from that cluster. 4. Canexpand: 0=no: Only nodes listed in the lines below may be recognized as part of that cluster: if a core node from that cluster tells us about other nodes in their cluster - ignore those unlisted nodes. 1=yes: Core-nodes of that cluster may specify other nodes that are in that cluster, and this node should believe them even if they are not listed in the lines below.

January 2012

MOSIX

MOSIX (M7)

MOSIX Description

MOSIX (M7)

-1=do not ask the other cluster: do not consult the other cluster to nd out which nodes are in that cluster: instead just rely on and use the lines below. Following lines: Each line describes a range of consecutive IP addresses that are believed to be part of the other cluster, containing 5 space-separated items as follows: 1. IP1 (or host-name): First node in range. 2. n: Number of nodes in this range. 3. Core: 0=no: This range of nodes may not inform us about who else is in that cluster. 1=yes: This range of nodes could inform us of who else is in that cluster. 4. Participate: 0=no This range is (as far as this node is concerned) not part of that cluster. 1=yes This range is probably a part of that cluster. 5. Proximate: 0=no Use compression on migration to/from that cluster. 1=yes Do not use compression when migrating to/from that cluster (network is very fast and CPU is slow). NOTES: 1. From time-to-time, MOSIX will consult one or more of the "core" nodes to nd the actual map of their cluster. It is recommended to list such core nodes. The alternative is to set canexpand to -1, causing the map of that cluster to be determined solely by this le. Nodes that do not "participate" are excluded even if listed as part of their cluster by the corenodes (but they could possibly still be used as "core-nodes" to list other nodes) All core-nodes must have the same value for "proximate", because the "proximate" eld of unlisted nodes is copied from that of the core-node from which we happened to nd about them and this cannot be ambiguous. When using host names rather than IP addresses, the rst result of gethostbyname(3) must return their IP address that is used by MOSIX: if in doubt - specify the IP address instead. IP addresses 0.0.x.x and 255.255.255.x cannot be used in MOSIX.

2. 3.

4. 5.

/etc/mosix/userview.map Although it is possible to use only IP numbers and/or host-names to specify nodes in your cluster (and multi-cluster), it is more convenient to use small integers as node numbers: this le allows you to map integers to IP addresses. Each line in this le contains 3 elements: 1. 2. 3. A node number (1-65535) IP1 (or host-name, clearly identiable by gethostbyname(3)) Number of nodes in range (the number of the last one must not exceed 65535)

It is up to the cluster administrator to map as few or as many nodes as they wish out of their cluster and multi-cluster - the most common practice is to map all the nodes in ones cluster, but not in other clusters. /etc/mosix/queue.conf This le congures the queueing system (see mosrun(1), mosq(1)). All lines in this le are optional and may appear in any order.

MOSIX

January 2012

MOSIX (M7)

MOSIX Description

MOSIX (M7)

Usually, one node in each cluster is elected by the system-administrator to manage the queue, while the remaining nodes point to that manager. Defining the queue manager: The line: C {hostname} assigns a specic node from the cluster (hostname) to manage the job queue. In the absence of this line, each node manages its own queue (which is usually undesirable). Defining the default priority: The line: P {priority} assigns a default job-priority to all the jobs from this node. The lower this value - the higher the priority. In the absence of this line, the default priority is 50. Commonly, user-IDs are identical on all the nodes in the cluster. The line (with a single letter): S indicates that this is not the case, so users on other nodes (except the Super-User) will be prevented from sending requests to modify the status of queued jobs from this node. Configuring the queue manager: The following lines are relevant only in the queue manager node and are ignored on all other nodes: The MOSIX queueing system determines dynamically how many processes to run. The line: M {maxproc} if present, imposes a maximal number of processes that are allowed to run from the queue simultaneously on top of the regular queueing policy. For example, M 20 sets the upper limit to 20 processes, even when more resources are available. The line: X {1 <= x <= 8} denes the maximal number of queued processes that may run simultaneously per CPU. This option applies only to processors within the cluster and is not available for other clusters in a multicluster (where the queueing system assigns at most one process per CPU). In the absence of this line the default is X 1 The line: Z {n} causes the rst n jobs of priority 0 to start immediately (out of order), without checking whether resources are available, leaving that responsibility to the system administrator. Example: the cluster has 10 dual-CPU nodes, so the queueing system normally allows 20 jobs to run. In order to allow urgent jobs to run immediately (without waiting for regular jobs to complete), the system administrator congures a line: Z 10, thus allowing each node to run a maximum of 3 jobs. The line: N {n} [{mb}] causes the rst n jobs of each user to start immediately (out of order), without checking whether resources are available. Only jobs above that number, per user, will be queued and whenever the number of a users running jobs drops below this number, a new job of that user (if there is any waiting) will start to run.

January 2012

MOSIX

MOSIX (M7)

MOSIX Description

MOSIX (M7)

When the mb parameter is given, only jobs that do not exceed this amount of memory in MegaBytes will be started this way. The system-administrator should weigh carefully, based on knowledge about the patterns of jobs that users typically run, the benets of this option against its risks, such as having at times more jobs in their cluster(s) than available memory to run them efciently. If this option is selected with a memory-limitation (mb), then the system-administrator should request that users always specify the maximum memory-requiremnts for all their queued jobs (using mosrun -m). The line: T {max} limits the number of simultaneous jobs that any user can queue. Previously-queued jobs that are still running are included in that count, but jobs submitted from other nodes are not included. The Super-User is exempt from this limit. Fair-share policy: The fairness policy determine the order in which jobs are initially placed in the queue. Note that fairness should not be confused with priority (as dened by the P {priority} line or by mosrun -q{pri} and possibly modied by mosq(1)): priorities always take precedence, so here we only discuss the initial placement in the queue of jobs with the same priority. The default queueing policy is "rst-come-rst-served". Alternatively, jobs of different users can be placed in the queue in an interleaved manner. The line (with a single letter): F switches the queueing policy to the interleaved policy. The advantage of the interleaved approach is that a user wishing to run a relatively small number of processes, does not need to wait for all the jobs that were already placed in the queue. The disadvantage is that older jobs need to wait longer. Normally, the interleaving ratio is equal among all users. For example, with two users (A and B) the queue may look like A-B-A-B-A-B-A-B. Each user is assigned an interleave ratio which determines (proportionally) how well their jobs will be placed in the queue relative to other users: the smaller that ratio - the better placement they will get (and vice versa). Normally all users receive the same default interleave-ratio of 10 per process. However, lines of the form: U {UID} {1 <= interleave <= 100} can set a different interleave ratio for different users. UID can be either numeric or symbolic and there is no limit on the number of these U lines. Examples: 1. Two users (A & B): U userA 5 (userB is not listed, hence it gets the default of 10) The queue looks like: A-A-B-A-A-B-A-A-B... 2. Two users (A & B): U userA 20 U userB 15 The queue looks like: B-A-B-A-B-A-B-B-A-B-A-B-A-B-B-A... 3. Three users (A, B & C): U userA 25 U userB 7 (userC is not listed, hence it gets the default of 10) The queue looks like: B-C-BC-B-A-B-C-B-C-B-A-B-C-B-C...

MOSIX

January 2012

MOSIX (M7)

MOSIX Description

MOSIX (M7)

Note that since the interleave ratio is determined per process (and not per job), different (more complex) results will occur when multi-process jobs are submitted to the queue. /etc/mosix/private.conf This le species where Private Temporary Files (PTFs) are stored: PTFs are an important feature of mosrun(1) and may consume a signicant amount of disk-space. It is important to ensure that sufcient disk-space is reserved for PTFs, but without allowing them to disturb other jobs by lling up disk-partitions. Guest processes can also demand unpredictable amounts of disk-space for their PTFs, so we must make sure that they do not disturb local operations. Up to 3 different directories can be specied: for local processes; guest-processes from the local cluster; and guest-processes from other clusters in the multi-cluster. Accordingly, each line in this le has 3 elds: 1. 2. 3. A combination of the letters: O (own node), C (own cluster) and G (other clusters). For example, OC, C, CG or OCG. A directory name (usually a mount-point) starting with /, where PTFs for the above processes are to be stored. An optional numeric limit, in Megabytes, of the total size of PTFs per-process.

If /etc/mosix/private.conf does not exist, then all PTFs will be stored in "/private". If the directory "/private" also does not exist, or if /etc/mosix/private.conf exists but does not contain a line with an appropriate letter in the rst eld (O, C or G), then no disk-space is allocated for PTFs of the affected processes, which usually means that processes requiring PTFs will not be able to run on this node. Such guest processes that start using PTFs will migrate back to their home-nodes. When the third eld is missing, it defaults to: 5 Gigabytes for local processes. 2 Gigabytes for processes from the same cluster. 1 Gigabyte for processes from other clusters. In any case, guest processes cannot exceed the size limit of their home-node even on nodes that allow them more space. /etc/mosix/target.conf This le contains the MOSRC (MOSIX Reach the Clouds) conguration, which determines who can launch MOSRC jobs that run on this node and what privileges and restrictions those launched jobs may have. Each line begins with a colon-terminated keyword, followed by specic parameters for that keyword. Keywords can be listed more than once. The keywords are: accept: An IP address, or a range of consecutive IP addresses from where this node is willing to accept MOSRC jobs. An example of a single IP address is: accept: 101.102.103.104 An example of a range of IP address is: accept: 101.102.103.1 - 101.102.104.254 The address(es) may be followed by an alternative le-name (starting in /): in that case, the priviliges and restrictions for jobs from the given address(es) are contained in the given le INSTEAD of /etc/mosix/target.conf. For example:

January 2012

MOSIX

MOSIX (M7)

MOSIX Description

MOSIX (M7)

accept: 1.2.3.1 /etc/mosix/special_case_1.2.3

1.2.3.254

Alternative les have the same format as /etc/mosix/target.conf, except that they do not contain the keywords accept: and reject:. reject: IP addresses are specied as in accept: all MOSRC jobs will be rejected from those address(es). This option is useful for excluding particular addresses in the middle of a larger range that is dened by accept:, for example: accept: 10.20.30.1 - 10.20.31.254 reject: 10.20.30.255 - 10.20.31.0 nodir: Prevent callers from overriding a given directory with a directory from their calling computer. Note that overriding all ancesstor-directories is also prevented (since overriding them would override everything inside them as well, including the given directory). For example: nodir: /usr/share/X11 prevents callers from overriding the directories "/usr/share/X11", "/usr/share" and "/usr" (it is anyway prohibited to override the root-directory). nodir_under: As nodir: but all subdirectories are also prevented from being overriden. allow-subdirs: If a caller asks to export a directory under a directory-name where: 1. No le or directory exists under that name. 2. The caller has no permission to create this directory. 3. Overriding that directory-name is not forbidden (eg. by nodir: or nodir_under:) and the named-directory or any of its ancesstor-directories appears with the allow-subdirs: keyword, then the given directory will be specially created for the caller (it will be empty and with "root" ownership). For example: allow-subdirs: /tmp allow-subdirs: /var/tmp uids: A list of which guest-users may run MOSRC jobs here. This list may include a combination of the following (in any order): {username} | {userID} A user that may run here, with their original UID. {userID}={username|userID} A user that may run here, with the given user-ID. -{username|userID} A user that may not run here. All users that are otherwise unmentioned may run here with their original user-ID.

MOSIX

January 2012

MOSIX (M7)

MOSIX Description

MOSIX (M7)

={username|userID} All users that are otherwise unmentioned may run here with the given user-ID. The following example allows all users to run with their own User-IDs, except "root" that runs as "nobody" and "badguy" that may not run at all: uids: root=nobody -badguy gids: A list of which guest user-groups may run MOSRC jobs here. This list may include a combination of the following (in any order): {groupname} | {groupID} A group that may run here, with their original group-ID. {groupID}={groupname|groupID} A group that may run here, with the given group-ID. -{groupname|groupID} A group that may not run here. All groups that are otherwise unmentioned may run here with their original group-ID. ={groupname|groupID} All groups that are otherwise unmentioned may run here with the given group-ID. The following example allows all groups to run, but with group-ID "nogroup", except that groups "wheel" and "root" run as group "wheel": gids: =nogroup wheel root=wheel /etc/mosix/mosrc_users If this le exists, it grants special permissions for certain users to present their MOSRC jobs to target nodes as if they are of a different user. Each line contains a colon-terminated user-name or a numeric user-ID, followed by a space-separated list of other user-names or user-IDs which they can present their jobs as. Numeric user-IDs are preferrable where possible. For example: user1: user2 user3 1500: 1540 1560 1522 /etc/mosix/mosrun_admin This le, if it exists, contains a line with a combination of letters that control the running of mosrun(1). The applicable letters are: q Queue all jobs by default, as if mosrun -q was specied, unless the user explicitly uses the mosrun -qn argument Q Queue all jobs even if the user did not specify mosrun -q. Only the Super-User is exempt from this restriction and may run mosrun without going through the queue. b Select the best node to start on by default, even if the user did not specify where to run their job using the mosrun -b argument (but did not specify any other location option).

January 2012

MOSIX

MOSIX (M7)

MOSIX Description

MOSIX (M7)

B Always select the best node to start on. Do not allow users (except the Super-User) to select other location options. e Always imply the mosrun -e argument, unless the user explicitly uses the mosrun u ag. w Always imply the mosrun -w argument, unless the user explicitly uses the mosrun -u or mosrun -e ags. This option overrides the e option. M Force the user to specify the maximum memory requirement using the mosrun -m{mb} argument. The Super-User is exempt. /etc/mosix/mrc_groups If this le exists, it grants special permissions for certain user-groups to present their MOSRC jobs to target nodes as if they are of a different group. Each line contains a colon-terminated group-name or a numeric group-ID, followed by a space-separated list of other group-names or group-IDs which they can present their jobs as. Numeric group-IDs are preferrable where possible. For example: group1: group2 group3 100: 102 103 104 105 /etc/mosix/retainpri This le contains an integer, specifying a delay in seconds: how long after all MOSIX processes of a certain priority (priorities are dened as above in /etc/mosix/partners/ and the current priority can be seen in /proc/mosix/priority) nish (or leave) to allow processes of lower priority (higher numbers) to start. When this le is absent, there is no delay and processes with lower priority may arrive as soon as there are no processes with a higher priority. /etc/mosix/speed If this le exists, it should contain a positive integer (1-10,000,000), providing the relative speed of the processor: the bigger the faster, where 10,000 units of speed are equivalent to a 3GHz Intel Core 2 Duo E6850. Normally this le is not necessary because the speed of the processor is automatically detected by the kernel when it boots. There are however two cases when you should consider using this option: 1. When you have a heterogeneous cluster and always use MOSIX to run a specic program (or programs) that perform better on certain processor-types than on others. 2. On Virtual-Machines that run over a hosting operating-system: in this case, the speed that the kernel detects is unreliable and can vary signicantly depending on the load of the underlying operating-systems when it boots. /etc/mosix/maxguests If this le exists, it should contain an integer limit on the number of simultaneous guest-processes from other clusters. Otherwise, the maximum number of guest-processes from other clusters is set to the default of 8 times the number of processors. /etc/mosix/.log_mosrun When this le is present, information about invocations of mosrun(1) and process migrations will be recorded in the system-log (by default "/var/log/messages" on most Linux distributions).

MOSIX

January 2012

MOSIX (M7)

MOSIX Description

MOSIX (M7)

KERNEL Sometimes a MOSIX release provides patches for more than one Linux kernel version. Also, special kernelpatches are released from time to time to support particular Linux distributions (such as openSUSE): it is ne to mix different such kernels within the same cluster. It is even OK to mix older or newer kernels from other MOSIX releases, so long as the rst two numbers in their MOSIX version (run cat /proc/mosix/version to view the version) are identical to the rst two numbers of your MOSIX release. The MOSIX kernel patch is required for fully operational MOSIX systems with process-migration. A limited number of functions, such as batch jobs, queuing and viewing the loads, still works over any Linux kernel, even without the MOSIX kernel patch (or when the kernel is incompatible with the current MOSIX version). It is not recommended to have mixed clusters where some nodes have the MOSIX kernel-patch and others do not, but if you do so anyway, you should observe the following rules regarding job-queuing: On each "mixed" cluster, you may queue either migratable jobs or batch jobs, but not both. If you choose to queue migratable jobs, then you should select a node with the MOSIX kernel-patch as the queue-manager. If you choose to queue batch jobs, then you should select a node without the MOSIX kernel-patch as the queue-manager (see above the section about conguring /etc/mosix/queue.conf). INTERFACE FOR PROGRAMS The following interface is provided for programs running under mosrun(1) that wish to interface with their MOSIX run-time environment: All access to MOSIX is performed via the "open" system call, but the use of "open" is incidental and does not involve actual opening of les. If the program were to run as a regular Linux program, those "open" calls would fail, returning -1, since the quoted les never exist, and errno(3) would be set to ENOENT. open("/proc/self/{special}", 0) reads a value from the MOSIX run-time environment. open("/proc/self/{special}", 1|O_CREAT, newval) writes a value to the MOSIX run-time environment. open("/proc/self/{special}", 2|O_CREAT, newval) both writes a new value and return the previous value. (the O_CREAT ag is only required when your program is compiled with the 64-bit le-size option, but is harmless otherwise). Some "les" are read-only, some are write-only and some can do both (rw). The "les" are as follows: /proc/self/migrate writing a 0 migrates back home; writing -1 causes a migration consideration; writing the unsigned value of an IP address or a logical node number, attempts to migrate there. Successful migration returns 0, failure returns -1 (write only) /proc/self/lock When locked (1), no automatic migration may occur (except when running on the current node is no longer allowed); when unlocked (0), automatic migration can occur. (rw) /proc/self/whereami reads where the program is running: 0 if at home, otherwise usually an unsigned IP address, but if possible, its corresponding logical node number. (read only)

January 2012

MOSIX

MOSIX (M7)

MOSIX Description

MOSIX (M7)

/proc/self/nmigs reads the total number of migrations performed by this process and its MOSRUN ancesstors before it was born. (read only) /proc/self/sigmig Reads/sets a signal number (1-64 or 0 to cancel) to be received after each migration. (rw) /proc/self/glob Reads/modies the process class. Processes of class 0 are not allowed to migrate outside the local cluster. Classes can also affect the automatic-freezing policy. (rw) /proc/self/needmem Reads/modies the processs memory requirement in Megabytes, so it does not automatically migrate to nodes with less free memory. Acceptable values are 0-262143. (rw) /proc/self/unsupportok when 0, unsupported system-calls cause the process to be killed; when 1 or 2, unsupported systemcalls return -1 with errno set to ENOSYS; when 2, an appropriate error-message will also be written to stderr. (rw) /proc/self/clear clears process statistics. (write only) /proc/self/cpujob Normally when 0, system-calls and I/O are taken into account for migration considerations. When set to 1, they are ignored. (rw) /proc/self/localtime When 0, gettimeofday(2) is always performed on the home node. When 1, the date/time is taken from where the process is running. (rw) /proc/self/decayrate Reads/modies the decay-rate per second (0-10000): programs can alternate between periods of intensive CPU and periods of demanding I/O. Decisions to migrate should be based neither on momentary program behaviour nor on extremely long term behaviour, so a balance must be struck, where old process statistics gradually decay in favour of newer statistics. The lesser the decay rate, the more weight is given to new information. The higher the decay rate, the more weight is given to older information. This option is provided for users who know well the cyclic behavior of their program. (rw) /proc/self/checkpoint When writing (any value) - perform a checkpoint. When only reading - return the version number of the next checkpoint to be made. When reading and writing - perform a checkpoint and return its version. Returns -1 if the checkpoint fails, 0 if writing only and checkpoint is successful. (rw) /proc/self/checkpointfile The third argument (newval) is a pointer to a le-name to be used as the basis for future checkpoints (see mosrun(1)). (write only) /proc/self/checkpointlimit Reads/modies the maximal number of checkpoint les to create before recycling the checkpoint version number. A value of 0 unlimits the number of checkpoints les. The maximal value allowed is 10000000. /proc/self/checkpointinterval When writing, sets the interval in minutes for automatic checkpoints (see mosrun(1)). A value of 0 cancels automatic checkpoints. The maximal value allowed is 10000000. Note that writing has a side effect of reseting the time left to the next checkpoint. Thus, writing too frequently is not recom-

MOSIX

January 2012

MOSIX (M7)

MOSIX Description

MOSIX (M7)

mended. (rw) open("/proc/self/in_cluster", O_CREAT, node); return 1 if the given node is in the same cluster, 0 otherwise. The node can be either an unsigned, host-order IP address, or a node-number (listed in /etc/mosix/userview.map). More functions are available through the direct_communication(7) feature. The following information is available via the /proc le system for everyone to read (not just within the MOSIX run-time environment): /proc/{pid}/from The IP address (a.b.c.d) of the process home-node ("0" if a local process). /proc/{pid}/where The IP address (a.b.c.d) where the process is running ("0" if running here). /proc/{pid}/class The class of the process. /proc/{pid}/origipid The original PID of the process on its home-node ("0" if a local process). /proc/{pid}/freezer Whether and why the process was frozen: 0 1 2 3 -66 Not frozen Frozen automatically due to high load. Frozen by the evacuation policy, to prevent ooding by arriving processes when clusters are disconnected. Frozen due to manual request. This is a guest process from another home-mode (freezing is always on the home-node, hence not applicable here).

Attempting to read the above for non-MOSIX processes returns the string "-3". STARTING MOSIX To start MOSIX, run /etc/init.d/mosix start. Alternately, run mosd. SECURITY All nodes within a MOSIX cluster and multi-cluster must trust each others super-user(s) - otherwise the security of the whole cluster or multi-cluster is compromized. Hostile computers must not be allowed physical access to the internal MOSIX network where they could masquerade as having IP addresses of trusted nodes. SEE ALSO mosrun(1), mosbatch(1), mosctl(1), mosmigrate(1), mossetpe(1), mosmon(1), mosps(1), mostimeof(1), moskillall(1), mosq(1), mosbestnode(1), mospipe(1), mosrc(1), direct_communication(7).

January 2012

MOSIX

MOSCTL (M1)

MOSIX Commands

MOSCTL (M1)

NAME MOSCTL - Miscellaneous MOSIX functions SYNOPSIS mosctl mosctl mosctl mosctl mosctl mosctl mosctl mosctl mosctl mosctl mosctl mosctl mosctl mosctl mosctl mosctl mosctl mosctl mosctl mosctl mosctl

stay nostay lstay nolstay block noblock logmap nologmap expel bring shutdown isolate rejoin [{maxguests}] maxguests [{maxguests}] openmulti [{maxguests}] closemulti cngpri {partner} {newpri} [{partner2} {newpri2}]..., whois [{node_number}|IP-address|hostname] status [{node_number}|IP-address|hostname] localstatus rstatus [{node_number}|IP-address|hostname]

DESCRIPTION Most Mosctl functions are for MOSIX administration and are available only to the Super-User. The exceptions are the whois, status and rstatus functions which provide information to all users. mosctl stay prevents processes from migrating away automatically: mosctl nostay cancels this state. mosctl lstay prevents local processes from migrating away automatically, but still allows guest processes to leave: mosctl nolstay cancels this state. mosctl block prevents guest processes from moving in: mosctl noblock cancels this state. mosctl logmap tells the kernel to log the MOSIX map of nodes to the console (and/or the Linux kernellogging facility) whenever it changes (this is the default). mosctl nologmap stops logging such changes. mosctl expel expels all guest processes. It does not return until all guest processes are moved away (it can be interrupted, in which case there is no guarantee that all guest processes were expelled). mosctl bring brings back all processes whose home-node is here. It does not return until all these processes arrive back (it can be interrupted, in which case there is no guarantee that all the processes arrived back). mosctl shutdown shuts down MOSIX. All guest processes are expelled and all processes whose homenode is here are brought back, then the MOSIX conguration is turned off. mosctl isolate disconnects the cluster from the multi-cluster, bringing back all migrated processes whose home-node is in the disconnecting cluster and sending away all guest processes from other clusters. To actually disconnect a cluster, this command must be issued on all the nodes of that cluster.

January 2012

MOSIX

MOSCTL (M1)

MOSIX Commands

MOSCTL (M1)

mosctl rejoin cancels the effect of mosctl isolate: an optional argument sets the number of guest processes that are allowed to move to this node or run here from outside the local cluster. When this argument is missing, no guest processes from outside the cluster will be accepted. mosctl maxguests prints the maximum number of guests that are allowed to migrate to this node from other clusters. mosctl maxguests arg, with a numeric argument arg, sets that maximum. mosctl openmulti sets the maximum number of guest processes from outside the local cluster to its argument. If no further argument is provided, that value is taken from /etc/mosix/maxguests and in the absence of that le, it is set to 8 times the number of processors. mosctl closemulti sets that maximum to 0 - preventing processes from other clusters to run on this node. mosctl cngpri modies the priority of one or more multi-cluster partners in /etc/mosix/partners (See mosix(7)). While it is also possible to simply edit the les in /etc/mosix/partners, using mosctl cngpri is easier and the changes take effect immediately, whereas when editing those les manually, the changes may take up to 20 seconds. mosctl whois, depending on its argument, converts host-names and IP addresses to node numbers or vice-versa. mosctl status outputs useful and user-friendly information about a given node. When the last argument is omitted, the information is about the local node. mosctl localstatus is like status, but adds more information that is only available locally. mosctl rstatus output raw information about a given node. When the last argument is omitted, the information is about the local node. This information consists of 11 integers: 1. status: a bit-map, where bits have the following meaning: 1 2 4 8 16 64 The node is currently part of our MOSIX conguration. Information is available about the node. The node is in "stay" mode (see above). The node is in "lstay" mode (see above). The node is in "block" mode (see above). The node may accept processes from here. Reasons for this bit to NOT be set include: We do not appear in that nodes map. That node is congured to block migration of processes from us. Our conguration does not allow sending processes to that node. That node is currently running higher-priority MOSIX processes. That node is currently running MOSIX processes with the same priority as our processes, but is not in our cluster and already reached its maximum number of allowed guest-processes. That node is blocked. 512 The information is not too old. 1024 The node prefers processes from here over its current guests. 8192 The node has a correct MOSIX kernel.

2. 3.

load: a value of 100 represents a standard load unit. availability: The lower the value the more available that node is: in the extremes, 65535 means that the node is available to all while 0 means that generally it is only available for processes from its own cluster.

MOSIX

January 2012

MOSCTL (M1)

MOSIX Commands

MOSCTL (M1)

4. 5. 6. 7. 8. 9. 10. 11. 12.

13.

speed: a value of 10000 represents a standard processor (Pentium-IV at 3GHz). ncpus: number of processors. frozen: number of frozen processes. utilizability: a percentage - less than 100% means that the node is under-utilized due to swapping activity. available memory: in pages. total memory: in pages. free swap-space: in 0.1GB units. total seap-space in 0.1GB units. privileged memory: in pages - pages that are currently taken by less privileged guests, but could be used by clusters of higher privilege (including this node when "1024" is included in the status above). number of processes: only MOSIX processes are counted and this count could differ from the load because it includes inactive processes.