cmd/compile: BCE is better with reslicing than index variables

Take this piece of code:

```
$ cat f.go
package p

func slice(p []byte) {
        for len(p) > 4 {
                // zero bounds checks.
                _ = p[0]
                _ = p[1]
                _ = p[2]
                _ = p[3]

                p = p[4:] // reslicing is expensive.
        }
}

func index(p []byte) {
        i := 0
        for i < len(p) {
                _ = p[i+3] // BCE hint; bounds check

                _ = p[i]   // unexpected bounds check
                _ = p[i+1] // unexpected bounds check
                _ = p[i+2] // unexpected bounds check
                _ = p[i+3]

                i += 4 // incrementing i is cheap.
        }
}
$ go version
go version devel +6d5caf38e3 Thu Nov 22 02:59:55 2018 +0000 linux/amd64
$ go build -gcflags=-d=ssa/check_bce/debug=1 f.go
# command-line-arguments
./f.go:18:8: Found IsInBounds
./f.go:20:8: Found IsInBounds
./f.go:21:8: Found IsInBounds
./f.go:22:8: Found IsInBounds
```

It's easy to see why the first variant has zero bounds checks. However, reslicing can be expensive in a hot loop, so sometimes the code is rewritten to use indexes.

This is what the second variant does. I do realise that the two loops aren't equivalent - for example, if `len(p) == 5`, the second loop will panic since the length is not a multiple of 4. So I understand why the compiler needs to insert one bounds check.

Still, it seems to me like it should insert one, not four, since I've added the BCE hint. My first thought was that it couldn't prove that `i >= 0`, but changing the index to be unsigned still doesn't remove all the bounds checks that I'd expect.

I encountered this in the base64 encoder:

```
di, si := 0, 0
n := (len(src) / 3) * 3
for si < n {
        // Convert 3x 8bit source bytes into 4 bytes
        val := uint(src[si+0])<<16 | uint(src[si+1])<<8 | uint(src[si+2]) // 3 bounds checks

        dst[di+0] = enc.encode[val>>18&0x3F] // bounds check
        dst[di+1] = enc.encode[val>>12&0x3F] // bounds check
        dst[di+2] = enc.encode[val>>6&0x3F]  // bounds check
        dst[di+3] = enc.encode[val&0x3F]     // bounds check

        si += 3
        di += 4
}
```

Rewriting the loop to use reslicing like `for len(src) >= 3 { ...; src = src[3:] }` does remove the bounds checks, but slows down the encoder noticeably. Just like in my simple example above, BCE hints like `_ = src[si+2]` and unsigned indexes didn't help either.

I think there's probably a way to rewrite the loop index logic to trick BCE, but I think the compiler could be made smarter here. I'm not sure whether that should be an enhancement in the prove pass, or in the bounds check elimination pass.

/cc @aclements @rasky @josharian 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

cmd/compile: BCE is better with reslicing than index variables #28942

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

cmd/compile: BCE is better with reslicing than index variables #28942

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions