Skip to content

optimize the interpreter #575

@adonovan

Description

@adonovan

The main loop of the interpreter looks like this:

	for {
		thread.Steps++
		if thread.Steps >= thread.maxSteps {
			if thread.OnMaxSteps != nil {
				thread.OnMaxSteps(thread)
			} else {
				thread.Cancel("too many steps")
			}
		}
		if reason := atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&thread.cancelReason))); reason != nil {
			err = fmt.Errorf("Starlark computation cancelled: %s", *(*string)(reason))
			break loop
		}

		fr.pc = pc

		op := compile.Opcode(code[pc])
		pc++
		var arg uint32
		if op >= compile.OpcodeArgMin {
			// TODO(adonovan): opt: profile this.
			// Perhaps compiling big endian would be less work to decode?
			for s := uint(0); ; s += 7 {
				b := code[pc]
				pc++
				arg |= uint32(b&0x7f) << s
				if b < 0x80 {
					break
				}
			}
		}
		if vmdebug {
			fmt.Fprintln(os.Stderr, stack[:sp]) // very verbose!
			compile.PrintOp(f, fr.pc, op, arg)
		}

		switch op { ... }
}

The Go compiler now translates the final switch to a jump table, which is probably more efficient than its previous control-tree based translation. We should experiment with the following ideas to reduce the overhead of each loop iteration:

  • The thread.Steps increment and check needn't occur on every loop iteration. It would suffice to increment and check it only on at back edges (JMP and CJMP instructions with a negative displacement) and perhaps at CALL instructions (since they may take unbounded time). (If we make this change, we should increase the amount of the increment so that typical average growth rate of the step counter remains the same, so that users don't need to adjust the numeric constants of their step limits).

  • The cancelReason check could similarly be checked only at back edges and calls.

  • The argument decoding, and in particular its unpredictable loop, could be made more efficient by encoding the number of arg bytes into the opcode. For example, the switch could have distinct cases for CONSTANT1, CONSTANT2, CONSTANT4 (1-byte arg, 2-byte arg, 4-byte arg), factored something like this:

switch op {
...
  case CONSTANT4:
     arg = arg<<8 | op[pc++]
     arg = arg<<8 | op[pc++]
     fallthrough
  case CONSTANT2:
     arg = arg<<8 | op[pc++]
     fallthrough
  case CONSTANT1:
     arg = arg<<8 | op[pc++]
     /*...common impl...*/

If all three of these optimizations were implemented, then loop would contain just for { switch code[pc++] { ... } }. At that point it would be interesting to coordinate with the Go compiler folks and evaluate whether it is profitable for the compiler to recognize the pattern of a threaded interpreter and compile the switch code[pc++] logic (MOVD (table)(op<<3), R27; JMP (R27)) into the end of every case.

Two more important opportunities for optimization would be:

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions