optimize the interpreter

The main loop of the interpreter looks like this:

```go
loop:
	for {
		thread.Steps++
		if thread.Steps >= thread.maxSteps {
			if thread.OnMaxSteps != nil {
				thread.OnMaxSteps(thread)
			} else {
				thread.Cancel("too many steps")
			}
		}
		if reason := atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&thread.cancelReason))); reason != nil {
			err = fmt.Errorf("Starlark computation cancelled: %s", *(*string)(reason))
			break loop
		}

		fr.pc = pc

		op := compile.Opcode(code[pc])
		pc++
		var arg uint32
		if op >= compile.OpcodeArgMin {
			// TODO(adonovan): opt: profile this.
			// Perhaps compiling big endian would be less work to decode?
			for s := uint(0); ; s += 7 {
				b := code[pc]
				pc++
				arg |= uint32(b&0x7f) << s
				if b < 0x80 {
					break
				}
			}
		}
		if vmdebug {
			fmt.Fprintln(os.Stderr, stack[:sp]) // very verbose!
			compile.PrintOp(f, fr.pc, op, arg)
		}

		switch op { ... }
	}
```
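
The operand decoding in the middle of that loop reads a little-endian base-128 varint: each byte carries 7 payload bits, and the high bit is set on every byte except the last. As a minimal standalone sketch (the helper names `encodeArg` and `decodeArg` are made up for illustration and are not part of the interpreter), the scheme looks roughly like this:

```go
package main

import "fmt"

// encodeArg appends arg to buf in the format the decoding loop expects:
// 7 payload bits per byte, least significant group first, with the high
// bit set on every byte except the last.
// (Hypothetical helper, written only for this illustration.)
func encodeArg(buf []byte, arg uint32) []byte {
	for arg >= 0x80 {
		buf = append(buf, byte(arg)|0x80)
		arg >>= 7
	}
	return append(buf, byte(arg))
}

// decodeArg mirrors the loop in the interpreter: it reads bytes starting
// at code[pc] and returns the decoded operand and the updated pc.
func decodeArg(code []byte, pc uint32) (arg, next uint32) {
	for s := uint(0); ; s += 7 {
		b := code[pc]
		pc++
		arg |= uint32(b&0x7f) << s
		if b < 0x80 {
			break
		}
	}
	return arg, pc
}

func main() {
	code := encodeArg(nil, 300)
	fmt.Printf("% x\n", code) // prints "ac 02"
	arg, pc := decodeArg(code, 0)
	fmt.Println(arg, pc) // prints "300 2"
}
```

The number of iterations depends on the operand's magnitude, which is why the loop's exit branch is hard to predict.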

The Go compiler now translates the final `switch` to a jump table, which is probably more efficient than its previous control-tree based translation. We should experiment with the following ideas to reduce the overhead of each loop iteration:

- The `thread.Steps` increment and check needn't occur on every loop iteration. It would suffice to perform them only at back edges (JMP and CJMP instructions with a negative displacement) and perhaps at CALL instructions (since they may take unbounded time). If we make this change, we should increase the size of the increment so that the typical growth rate of the step counter remains the same and users don't need to adjust the numeric constants of their step limits.

- The `cancelReason` check could similarly be performed only at back edges and calls.

- The argument decoding, and in particular its unpredictable loop, could be made more efficient by encoding the number of arg bytes into the opcode. For example, the switch could have distinct cases for CONSTANT1, CONSTANT2, CONSTANT4 (1-byte arg, 2-byte arg, 4-byte arg), factored something like this:
 
```go
switch op {
// ...
case CONSTANT4:
	arg = arg<<8 | uint32(code[pc])
	pc++
	arg = arg<<8 | uint32(code[pc])
	pc++
	fallthrough
case CONSTANT2:
	arg = arg<<8 | uint32(code[pc])
	pc++
	fallthrough
case CONSTANT1:
	arg = arg<<8 | uint32(code[pc])
	pc++
	/*...common impl...*/
}
```
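
To make the width-in-opcode idea concrete, here is a small self-contained sketch. The opcode names, their numbering, the big-endian operand layout, and the helpers `emitConstant` and `decodeConstant` are all assumptions made for illustration; the real compiler and interpreter would need matching changes of their own.

```go
package main

import "fmt"

// Hypothetical opcodes: one variant of CONSTANT per operand width.
const (
	CONSTANT1 byte = iota // 1-byte big-endian operand
	CONSTANT2             // 2-byte big-endian operand
	CONSTANT4             // 4-byte big-endian operand
)

// emitConstant picks the narrowest variant that can hold arg and appends
// the opcode followed by its big-endian operand bytes.
func emitConstant(code []byte, arg uint32) []byte {
	switch {
	case arg < 1<<8:
		return append(code, CONSTANT1, byte(arg))
	case arg < 1<<16:
		return append(code, CONSTANT2, byte(arg>>8), byte(arg))
	default:
		return append(code, CONSTANT4,
			byte(arg>>24), byte(arg>>16), byte(arg>>8), byte(arg))
	}
}

// decodeConstant decodes the instruction at code[pc]. The operand width is
// implied by the opcode, so there is no data-dependent loop: CONSTANT4 reads
// two bytes and falls through to CONSTANT2, which reads one byte and falls
// through to CONSTANT1, which reads the last byte.
func decodeConstant(code []byte, pc int) (arg uint32, next int) {
	op := code[pc]
	pc++
	switch op {
	case CONSTANT4:
		arg = arg<<8 | uint32(code[pc])
		pc++
		arg = arg<<8 | uint32(code[pc])
		pc++
		fallthrough
	case CONSTANT2:
		arg = arg<<8 | uint32(code[pc])
		pc++
		fallthrough
	case CONSTANT1:
		arg = arg<<8 | uint32(code[pc])
		pc++
	}
	return arg, pc
}

func main() {
	for _, v := range []uint32{7, 300, 70000} {
		code := emitConstant(nil, v)
		arg, next := decodeConstant(code, 0)
		fmt.Println(len(code), arg, next)
	}
}
```

The trade-off in this sketch is opcode space: every instruction that takes an operand would need up to three variants, so the jump table grows accordingly.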

If all three of these optimizations were implemented, then the loop would contain just `for { switch code[pc++] { ... } }`. At that point it would be interesting to coordinate with the Go compiler folks and evaluate whether it is profitable for the compiler to recognize the pattern of a threaded interpreter and compile the `switch code[pc++]` logic (`MOVD (table)(op<<3), R27; JMP (R27)`) into the end of every case.
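
As a rough illustration of what such a loop might look like, here is a toy stack machine whose dispatch loop does nothing but fetch and switch, with the step accounting moved into the taken back edge. Everything in it is hypothetical: the four opcodes, the `thread` struct, and the `backEdgeCost` weight are invented for this sketch and do not correspond to the real Starlark bytecode. A cancellation check would sit next to the step check in the same branch.

```go
package main

import (
	"errors"
	"fmt"
)

// Toy instruction set, invented for this sketch.
const (
	opPush byte = iota // operand: 1-byte constant to push
	opDec              // decrement the top of the stack
	opJnz              // operand: signed 1-byte displacement; jump if top != 0
	opHalt             // return the top of the stack
)

// backEdgeCost is a made-up weight that stands in for the instructions
// executed between two back edges, so that step limits keep roughly
// their old meaning.
const backEdgeCost = 8

type thread struct {
	steps, maxSteps uint64
}

var errTooManySteps = errors.New("too many steps")

func run(th *thread, code []byte) (int, error) {
	var stack [16]int
	sp, pc := 0, 0
	for {
		op := code[pc]
		pc++
		switch op {
		case opPush:
			stack[sp] = int(code[pc])
			sp++
			pc++
		case opDec:
			stack[sp-1]--
		case opJnz:
			disp := int(int8(code[pc]))
			pc++
			if stack[sp-1] != 0 {
				if disp < 0 { // taken back edge: the only place a loop can spin
					th.steps += backEdgeCost
					if th.steps >= th.maxSteps {
						return 0, errTooManySteps
					}
				}
				pc += disp
			}
		case opHalt:
			return stack[sp-1], nil
		}
	}
}

func main() {
	// PUSH 5; loop: DEC; JNZ loop; HALT. 0xFD is -3 as a signed byte.
	prog := []byte{opPush, 5, opDec, opJnz, 0xFD, opHalt}
	th := &thread{maxSteps: 1000}
	v, err := run(th, prog)
	fmt.Println(v, err, th.steps) // prints "0 <nil> 32"
}
```

Straight-line code pays nothing for step accounting in this sketch; only the four taken back edges of the countdown loop touch the counter.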

Two more important opportunities for optimization would be:
- to somehow eliminate all the write barriers for `stack[sp++] = x`. See https://github.com/golang/go/issues/20533 for one approach;
- to eliminate all the bounds checks for access to the code array (`code[pc++]`) and stack (`stack[sp++]`). Of course, both of these would cause Go's memory safety to depend on the correctness of the Starlark bytecode.
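
Short of removing the checks entirely, a memory-safe partial measure (not what is proposed above, but related) is to hoist one bounds check ahead of a group of accesses, the same hint idiom used in the standard library's `encoding/binary` package. The function name `readUint32` and the big-endian layout are assumptions for this sketch:

```go
package main

import "fmt"

// readUint32 reads a 4-byte big-endian operand starting at code[pc].
// The early access to b[3] is a bounds-check hint: once it has succeeded,
// the accesses to b[0], b[1] and b[2] are known to be in range, which
// gives the compiler the opportunity to drop their individual checks.
func readUint32(code []byte, pc int) uint32 {
	b := code[pc:]
	_ = b[3] // panics here if fewer than 4 bytes remain
	return uint32(b[0])<<24 | uint32(b[1])<<16 | uint32(b[2])<<8 | uint32(b[3])
}

func main() {
	code := []byte{0xff, 0x00, 0x01, 0x11, 0x70}
	fmt.Println(readUint32(code, 1)) // prints "70000"
}
```

This keeps one check per operand rather than none, so it does not reach the fully unchecked loop described above, but it also does not tie memory safety to the correctness of the bytecode.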

optimize the interpreter #575