Go Optimizations 101 (2024-03-16)
Tapir Liu
Contents
0.1 Acknowledgments

3 Memory Allocations
3.1 Memory blocks
3.2 Memory allocation places
3.3 Memory allocation scenarios
3.4 Memory wasting caused by allocated memory blocks larger than needed
3.5 Reduce memory allocations and save memory
3.6 Avoid unnecessary allocations by allocating enough in advance
3.7 Avoid allocations if possible
3.8 Save memory and reduce allocations by combining memory blocks
3.9 Use value cache pool to avoid some allocations
4.6.3 Before Go toolchain version 1.21, a reflect.ValueOf function call makes the values referenced by its argument escape to heap
4.6.4 A call to the fmt.Print function makes the values referenced by its arguments escape to heap
4.6.5 The values referenced by function return results will escape
4.7 Function inlining might affect escape analysis results
4.7.1 Function inlining is not always helpful for escape analysis
4.8 Control memory block allocation places
4.8.1 Ensure a value is allocated on heap
4.8.2 Use explicit value copies to help compilers detect some values don't escape
4.8.3 Memory size thresholds used by the compiler to make allocation placement decisions
4.8.4 Use smaller thresholds
4.8.5 Allocate the backing array of a slice on stack even if its size is larger than or equal to 64K (but not larger than 10M)
4.8.6 Allocate the backing array of a slice with an arbitrary length on stack
4.8.7 More tricks to allocate arbitrary-size values on stack
4.9 Grow the stack fewer times
5 Garbage Collection
5.1 GC pacer
5.2 Automatic GC might affect Go program execution performance
5.3 How to reduce GC pressure?
5.4 Memory fragments
5.5 Memory wasting caused by sharing memory blocks
5.6 Try to generate fewer short-lived memory blocks to lower automatic GC frequency
5.7 Use the new heap memory percentage strategy to control automatic GC frequency
5.8 Since Go toolchain 1.18, the larger the GC roots, the larger the GC cycle intervals
5.9 Use memory ballasts to avoid frequent GC cycles
5.10 Use the memory limit strategy introduced in Go toolchain 1.19 to avoid frequent GC cycles

6 Pointers
6.1 Avoid unnecessary nil array pointer checks in a loop
6.1.1 The case in which an array pointer is a struct field
6.2 Avoid unnecessary pointer dereferences in a loop

7 Structs
7.1 Avoid accessing fields of a struct in a loop through pointers to the struct
7.2 Small-size structs are optimized specially
7.3 Make struct sizes smaller by adjusting field orders
8.10 Don't use the second iteration variable in a for-range loop if high performance is demanded
8.11 Reset all elements of an array or slice
8.12 Specify capacity explicitly in subslice expressions
8.13 Use index tables to save some comparisons

11 Maps
11.1 Clear map entries
11.2 aMap[key]++ is more efficient than aMap[key] = aMap[key] + 1
11.3 Pointers in maps
11.4 Use byte arrays instead of short strings as keys
11.5 Lower map element modification frequency
11.6 Try to grow a map in one step
11.7 Use index tables instead of maps whose key types have only a small set of possible values

12 Channels
12.1 Programming with channels is fun but channels are not the most performant way for some use cases
12.2 Use one channel instead of several ones to avoid using select blocks
12.3 Try-send and try-receive select code blocks are specially optimized

13 Functions
13.1 Function inlining
13.1.1 Which functions are inline-able?
13.1.2 A call to a function value is not inline-able if the value is hard to determine at compile time
13.1.3 The go:noinline comment directive
13.1.4 Write code in ways which are less costly to inline
13.1.5 Make hot paths inline-able
13.1.6 Manual inlining is often more performant than auto-inlining
13.1.7 Inlining might have a negative impact on performance
13.2 Pointer parameters/results vs. non-pointer parameters/results
13.3 Named results vs. anonymous results
13.4 Try to store intermediate calculation results in local variables with sizes not larger than a native word
13.5 Avoid using deferred calls in loops
13.6 Avoid using deferred calls if extremely high performance is demanded
13.7 The arguments of a function call are always evaluated when the call is invoked
13.8 Try to make fewer values escape to heap in hot paths

14 Interfaces
14.1 Box values into and unbox values from interfaces
14.2 Try to avoid memory allocations by assigning interfaces to interfaces
14.3 Calling interface methods needs a little extra cost
14.4 Avoid using interface parameters and results in small functions which are called frequently
0.1 Acknowledgments
Firstly, thanks to the entire Go community. An active and responsive community ensured this book
was finished on time.
In particular, I want to thank the following people, who helped me understand some details in the official standard compiler and runtime implementations: Keith Randall, Ian Lance Taylor, Axel Wagner, Cuong Manh Le, Michael Pratt, Jan Mercl, Matthew Dempsky, Martin Möhrmann, etc.
I'm sorry if I forgot to mention somebody in the above list. There are so many kind and creative gophers in the Go community that I have surely missed someone.
I would also like to thank all gophers who have influenced this book, directly or indirectly, intentionally or unintentionally.
Thanks to Olexandr Shalakhin for the permission to use one of the wonderful gopher icon designs
as the cover image. And thanks to Renee French for designing the lovely gopher cartoon character.
Thanks to the authors of the following open source software and libraries used in building this book:
• golang, https://go.dev/
• gomarkdown, https://github.com/gomarkdown/markdown
• goini, https://github.com/zieckey/goini
• go-epub, https://github.com/bmaupin/go-epub
• pandoc, https://pandoc.org
• calibre, https://calibre-ebook.com/
• GIMP, https://www.gimp.org
Thanks to the gophers who reported mistakes in this book or made corrections to it: yingzewen, ivanburak, cortes-, skeeto@reddit, Yang Yang, DashJay, Stephan, etc.
Chapter 1
This book offers practical tricks, tips, and suggestions to optimize Go code performance. Its insights
are grounded in the official Go compiler and runtime implementation.
Life is full of trade-offs, and so is the programming world. In programming, we constantly balance
trade-offs between code readability, maintainability, development efficiency, and performance, and
even within each of these areas. For example, optimizing for performance often involves trade-offs
between memory savings, execution speed, and implementation complexity.
In real-world projects, most code sections don't demand peak performance. Prioritizing maintainability and readability generally outweighs shaving every byte or microsecond. This book focuses on optimizing critical sections where performance truly matters. Be aware that some suggestions might lead to more verbose code or only exhibit significant gains in specific scenarios.
The contents in this book include:
• how to consume less CPU resources.
• how to consume less memory.
• how to make fewer memory allocations.
• how to control memory allocation places.
• how to reduce garbage collection pressure.
This book neither explains how to use performance analysis tools, such as pprof, nor studies compiler and runtime implementation details in depth. The book also doesn't introduce how to use profile-guided optimization. None of the contents provided in this book make use of unsafe pointers or cgo. And the book doesn't talk about algorithms. In other words, this book tries to provide optimization suggestions in a way which is clear and easy to understand, for daily general Go programming.
Unless otherwise indicated, the code examples provided in this book are tested and run on a notebook with the following environment setup:
go version go1.22.1 linux/amd64
goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
Some timing information is removed from benchmark results to keep the benchmark lines short.
Please note that:
• some of the suggestions discussed in this book work on any platform and for any CPU model, but others might only work on specific platforms and CPU models. So please benchmark them in the same environments as your production environments before adopting any of them.
• some implementation details of the official standard Go compiler and runtime might change from version to version, which means some of the discussed suggestions might not work for future Go toolchain versions.
• the book will be open sourced eventually, chapter by chapter.
1.3 Feedback
You are welcome to improve this book by submitting corrections to the Go 101 issue list (https://github.com/go101/go101) for all kinds of mistakes, such as typos, grammar errors, wording inaccuracies, wrong explanations, description flaws, code bugs, etc.
You are also welcome to send feedback to the Go 101 Twitter account: @go100and1 (https://twitter.com/go100and1).
Chapter 2
…used in the Go community. Personally, I think the terminology is convenient for making some explanations.)
2.3 Detailed type sizes
The following table lists the sizes (used in the official standard Go compiler) of all the 26 kinds of
types in Go. In the table, one word means one native word (4 bytes on 32-bit architectures and 8
bytes on 64-bit architectures).
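type kind                   size
bool                        1 byte
int8, uint8                 1 byte
int16, uint16               2 bytes
int32, uint32, float32      4 bytes
int64, uint64, float64      8 bytes
complex64                   8 bytes
complex128                  16 bytes
int, uint, uintptr          1 word
pointer, unsafe.Pointer     1 word
map, channel, function      1 word
string                      2 words
interface                   2 words
slice                       3 words
struct                      (sum of all field sizes) + (number of padding bytes)
array                       (element value size) * (array length)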
For example, on 64-bit architectures, the fields of the following struct type are padded as the comments show:

type T2 struct {
    a int8
    // 1 byte is padded here.
    c int16
    // 4 bytes are padded here.
    b int64
}
We can use the unsafe.Sizeof function to get value/type sizes. For example:
package main

import "unsafe"

type T1 struct {
    a int8
    b int64
    c int16
}

type T2 struct {
    a int8
    c int16
    b int64
}

func main() {
    // The printed values are for 64-bit architectures.
    println(unsafe.Sizeof(T1{})) // 24
    println(unsafe.Sizeof(T2{})) // 16
}
We can view the padding bytes as a form of memory wasting, the result of a trade-off between program performance, code readability and memory saving.
In practice, generally, we should make related fields adjacent to get good readability, and only order fields in the most memory-saving way when it is really needed.
In the current official standard Go compiler implementation, except for large-size struct and array types, all other types in Go can be viewed as small-size types.
What are small-size struct and array values? There is no formal definition, either. The official standard Go compiler tweaks some implementation details from version to version. However, in practice, we can view struct types with no more than 4 native-word-size fields and array types with no more than 4 native-word-size elements as small-size types, such as struct{a, b, c, d int}, struct{element *T; len int; cap int} and [4]uint.
For the official standard Go compiler 1.22 versions, a copy cost leap happens between copying 9-element arrays and copying 10-element arrays (when the element size is one native word). The same is true for copying 9-field structs and copying 10-field structs (when each field's size is one native word).
The proof:
package copycost
import "testing"
const N = 1024
type Element = uint64
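The benchmark functions themselves are not included in the fragment above; the following is a minimal sketch consistent with the results below (the copy-loop bodies are assumptions):

type Array9 [9]Element
type Array10 [10]Element
type Struct9 struct{ f1, f2, f3, f4, f5, f6, f7, f8, f9 Element }
type Struct10 struct{ f1, f2, f3, f4, f5, f6, f7, f8, f9, f10 Element }

var a9 [N]Array9
var a10 [N]Array10
var s9 [N]Struct9
var s10 [N]Struct10

func Benchmark_CopyArray_9_elements(b *testing.B) {
    for i := 0; i < b.N; i++ {
        for k := 1; k < N; k++ {
            a9[k] = a9[k-1] // copy a 9-element array
        }
    }
}

func Benchmark_CopyArray_10_elements(b *testing.B) {
    for i := 0; i < b.N; i++ {
        for k := 1; k < N; k++ {
            a10[k] = a10[k-1] // copy a 10-element array
        }
    }
}

func Benchmark_CopyStruct_9_fields(b *testing.B) {
    for i := 0; i < b.N; i++ {
        for k := 1; k < N; k++ {
            s9[k] = s9[k-1] // copy a 9-field struct
        }
    }
}

func Benchmark_CopyStruct_10_fields(b *testing.B) {
    for i := 0; i < b.N; i++ {
        for k := 1; k < N; k++ {
            s10[k] = s10[k-1] // copy a 10-field struct
        }
    }
}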
The benchmark results:
Benchmark_CopyArray_9_elements-4 3974 ns/op
Benchmark_CopyArray_10_elements-4 8896 ns/op
Benchmark_CopyStruct_9_fields-4 2970 ns/op
Benchmark_CopyStruct_10_fields-4 8471 ns/op
These results indicate that copying arrays with fewer than 10 elements and structs with fewer than 10 fields might be specially optimized.
The official standard Go compiler might use different criteria in other scenarios to determine which struct and array types are small. For example, in the following benchmark code, the Add4 function consumes much less CPU time than the Add5 function (with the official standard Go compiler 1.22 versions).
package copycost

import "testing"

//go:noinline
func Add4(x, y T4) (z T4) {
    z.a = x.a + y.a
    z.b = x.b + y.b
    z.c = x.c + y.c
    z.d = x.d + y.d
    return
}

//go:noinline
func Add5(x, y T5) (z T5) {
    z.a = x.a + y.a
    z.b = x.b + y.b
    z.c = x.c + y.c
    z.d = x.d + y.d
    z.e = x.e + y.e
    return
}
type T4 struct{ a, b, c, d int }
type T5 struct{ a, b, c, d, e int }

var t4 T4
var t5 T5

func Benchmark_Add4(b *testing.B) {
    for i := 0; i < b.N; i++ {
        var x, y T4
        t4 = Add4(x, y)
    }
}

func Benchmark_Add5(b *testing.B) {
    for i := 0; i < b.N; i++ {
        var x, y T5
        t5 = Add5(x, y)
    }
}
The benchmark results:
Benchmark_Add4-4 2.649 ns/op
Benchmark_Add5-4 19.15 ns/op
The //go:noinline compiler directives used here are to prevent the calls to the two functions from being inlined. If the directives are removed, the Add4 function will become even more performant.
import "testing"
const N = 1024
//go:noinline
func Sum_RangeArray(a [N]int) (r int) {
    for _, v := range a {
        r += v
    }
    return
}

//go:noinline
func Sum_RangeArrayPtr1(a *[N]int) (r int) {
    for _, v := range *a {
        r += v
    }
    return
}

//go:noinline
func Sum_RangeArrayPtr2(a *[N]int) (r int) {
    for _, v := range a {
        r += v
    }
    return
}

//go:noinline
func Sum_RangeSlice(a []int) (r int) {
    for _, v := range a {
        r += v
    }
    return
}
//===========
var r [128]int
var a [N]int

func Benchmark_Sum_RangeArray(b *testing.B) {
    for i := 0; i < b.N; i++ {
        r[i&127] = Sum_RangeArray(a)
    }
}

func Benchmark_Sum_RangeArrayPtr1(b *testing.B) {
    for i := 0; i < b.N; i++ {
        r[i&127] = Sum_RangeArrayPtr1(&a)
    }
}

func Benchmark_Sum_RangeArrayPtr2(b *testing.B) {
    for i := 0; i < b.N; i++ {
        r[i&127] = Sum_RangeArrayPtr2(&a)
    }
}

func Benchmark_Sum_RangeSlice(b *testing.B) {
    for i := 0; i < b.N; i++ {
        r[i&127] = Sum_RangeSlice(a[:])
    }
}
The benchmark results:
Benchmark_Sum_RangeArray-4 897.6 ns/op
Benchmark_Sum_RangeArrayPtr1-4 799.3 ns/op
Benchmark_Sum_RangeArrayPtr2-4 555.3 ns/op
Benchmark_Sum_RangeSlice-4 561.7 ns/op
From the results, we can see that the Sum_RangeArray function is the slowest one. This is not surprising, because the array value is copied twice in calling this function. One copy happens when passing the array as the argument (arguments are passed by copy in Go), the other happens when ranging over the array parameter (the direct part of the container following the range keyword is copied if the second iteration variable is used).
The Sum_RangeArrayPtr1 function is faster than Sum_RangeArray, because the array value is only copied once in calling this function. The copy happens when ranging over the dereferenced array.
No array copying happens in the calls to the remaining two functions, so those two functions are the fastest ones.
Example 2:
package copycost

import "testing"

//go:noinline
func Sum_PlainForLoop(s []Element) (r int64) {
    for i := 0; i < len(s); i++ {
        r += s[i][0]
    }
    return
}

//go:noinline
func Sum_OneIterationVar(s []Element) (r int64) {
    for i := range s {
        r += s[i][0]
    }
    return
}

//go:noinline
func Sum_UseSecondIterationVar(s []Element) (r int64) {
    for _, v := range s {
        r += v[0]
    }
    return
}
//===================
var r [128]int64
import "testing"
//go:noinline
func sum_UseSecondIterationVar(s []S) int {
    var sum int
    for _, v := range s {
        sum += v.c
        sum += v.d
        sum += v.e
    }
    return sum
}

//go:noinline
func sum_OneIterationVar_Index(s []S) int {
    var sum int
    for i := range s {
        sum += s[i].c
        sum += s[i].d
        sum += s[i].e
    }
    return sum
}

//go:noinline
func sum_OneIterationVar_Ptr(s []S) int {
    var sum int
    for i := range s {
        v := &s[i]
        sum += v.c
        sum += v.d
        sum += v.e
    }
    return sum
}
func Benchmark_OneIterationVar_Index(b *testing.B) {
    for i := 0; i < b.N; i++ {
        r[i&127] = sum_OneIterationVar_Index(s)
    }
}
// demo-largesize-loop-var.go
package main

import (
    "fmt"
    "time"
)

// The Large type and the readOnly function are assumed to be
// declared like the following (their exact definitions are not
// shown; Large just needs to be a large-size array type):
type Large [1 << 12]byte

func readOnly(a *Large, i int) byte {
    return a[i]
}

func foo() {
    for a, i := (Large{}), 0; i < len(a); i++ {
        readOnly(&a, i)
    }
}
func main() {
    bench := func() time.Duration {
        start := time.Now()
        foo()
        return time.Since(start)
    }
    fmt.Println("elapsed time:", bench())
}
Run it with different Go toolchain versions:
$ gotv 1.21. run demo-largesize-loop-var.go
[Run]: $HOME/.cache/gotv/tag_go1.21.8/bin/go run demo-largesize-loop-var.go
elapsed time: 1.829µs
$ gotv 1.22. run demo-largesize-loop-var.go
[Run]: $HOME/.cache/gotv/tag_go1.22.1/bin/go run demo-largesize-loop-var.go
elapsed time: 989.507µs
From the outputs, we can see that the semantic changes made in Go 1.22 cause a significant performance regression for the above code. So, when using Go toolchain 1.22+ versions, try not to declare large-size values as loop variables.
Chapter 3
Memory Allocations
As heap allocations are much more expensive, only heap memory allocations contribute to the allocation metrics in Go code benchmark results. But please note that allocating on stack still has a cost, though it is often comparatively much smaller.
The escape analysis module of a Go compiler can detect that some value parts will only be used by one goroutine, and it tries to let those value parts be allocated on stack at run time if certain extra conditions are satisfied. Stack memory allocations and escape analysis will be explained in more detail in the next chapter.
In other words, memory blocks are often larger than needed. The strategies are adopted to manage memory easily and efficiently, but they might sometimes cause a bit of memory wasting (yes, a trade-off).
This can be demonstrated by the following program:
package main

import "testing"
import "unsafe"

var t *[5]int64
var s []byte

func main() {
    println(unsafe.Sizeof(*t)) // 40
    rf := testing.Benchmark(f)
    println(rf.AllocedBytesPerOp()) // 48
    rg := testing.Benchmark(g)
    println(rg.AllocedBytesPerOp()) // 40960
}
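The f and g benchmark functions are not included in the listing above; sketches consistent with the printed numbers (the exact requested allocation sizes are assumptions):

func f(b *testing.B) {
    for i := 0; i < b.N; i++ {
        t = new([5]int64) // 40 bytes are needed; 48 are allocated
    }
}

func g(b *testing.B) {
    for i := 0; i < b.N; i++ {
        s = make([]byte, 40000) // 40000 bytes are needed; 40960 are allocated
    }
}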
Another example:
package main

import "testing"

func main() {
    br := testing.Benchmark(Concat)
    println(br.AllocsPerOp())       // 3
    println(br.AllocedBytesPerOp()) // 176
}
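The Concat function is not included in the listing above; a sketch matching the analysis that follows (the 33-byte input and the package-level sink are assumptions):

var s = make([]byte, 33)
var result string

func Concat(b *testing.B) {
    for i := 0; i < b.N; i++ {
        // Two []byte-to-string conversions plus one string
        // concatenation: 3 heap allocations per operation.
        result = string(s) + string(s)
    }
}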
There are 3 allocations made within the Concat function. Two of them are caused by the byte slice to string conversions string(s), and the sizes of the two memory blocks carrying the underlying bytes of the two result strings are both 48 (the smallest size class not smaller than 33). The third allocation is caused by the string concatenation, and the size of the result memory block is 80 (the smallest size class not smaller than 66). The three allocations allocate 176 (48+48+80) bytes in total. In the final result memory block, 14 bytes are wasted. And 44 (15 + 15 + 14) bytes are wasted in total during the execution of the Concat function.
In the above example, the results of the string(s) conversions are used temporarily in the string concatenation operation. In the current official standard Go compiler/runtime implementation (1.22 versions), the string bytes are allocated on heap (see below sections for details). After the concatenation is done, the memory blocks carrying the string bytes become memory garbage and will eventually be collected.
import "testing"
26
}
}
import "fmt"
27
return [][]int{
{1, 2},
{9, 10, 11},
{6, 2, 3, 7},
{11, 5, 7, 12, 16},
{8, 5, 6},
}
}
func main() {
    MergeWithOneLoop(getData())
}
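The MergeWithOneLoop function is not included in the listing above; a sketch which would produce the outputs below (the capacity-reporting code is an assumption):

func MergeWithOneLoop(data [][]int) []int {
    var merged []int
    for i, s := range data {
        oldCap := cap(merged)
        merged = append(merged, s...)
        if cap(merged) != oldCap {
            fmt.Printf("Allocate from %d to %d (when append slice#%d).\n",
                oldCap, cap(merged), i)
        }
    }
    return merged
}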
The outputs (for the official standard Go compiler v1.22.1):
Allocate from 0 to 2 (when append slice#0).
Allocate from 2 to 6 (when append slice#1).
Allocate from 6 to 12 (when append slice#2).
Allocate from 12 to 24 (when append slice#3).
From the outputs, we can see that only the last append call doesn't allocate.
In fact, the Merge_TwoLoops function could be faster in theory. As of the official standard Go compiler version 1.22, the make call in the Merge_TwoLoops function zeroes all the just-created elements, which is actually unnecessary. Compiler optimizations in future versions might avoid the zeroing operation.
BTW, the above implementation of the Merge_TwoLoops function has an imperfection: it doesn't handle the integer overflow case. The following is a better implementation.
func Merge_TwoLoops(data [][]int) []int {
    n := 0
    for _, s := range data {
        if k := n + len(s); k < n {
            panic("slice length overflows")
        } else {
            n = k
        }
    }
    r := make([]int, 0, n)
    ...
}
import "testing"
29
b.ResetTimer()
for i := 0; i < b.N; i++ {
_ = FilterOneAllocation(data)
}
}
import "testing"
const N = 100
//go:noinline
func CreateBooksOnOneLargeBlock(n int) []*Book {
    books := make([]Book, n)
    pbooks := make([]*Book, n)
    for i := range pbooks {
        pbooks[i] = &books[i]
    }
    return pbooks
}

//go:noinline
func CreateBooksOnManySmallBlocks(n int) []*Book {
    books := make([]*Book, n)
    for i := range books {
        books[i] = new(Book)
    }
    return books
}
Although it sometimes wastes more memory, generally speaking, allocating many small value parts on one large memory block is comparatively better than allocating each of them on a separate memory block. This is especially true when the lifetimes of the small value parts are almost the same, in which case allocating many small value parts on one large memory block can often effectively avoid memory fragmentation.
3.9 Use value cache pool to avoid some allocations

    defer npcPool.Unlock()
    if npcPool.Len() == 0 {
        return &NPC{}
    }
    return npcPool.Remove(npcPool.Front()).(*NPC)
}
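The fragment above lacks its surrounding declarations; a self-contained sketch of such a value cache pool (the NPC type, the pool variable, and the function names are assumptions consistent with the fragment):

package main

import (
    "container/list"
    "sync"
)

// The exact NPC type declaration is an assumption.
type NPC struct {
    properties [16]int32
}

// A value cache pool protected by a mutex.
var npcPool = struct {
    sync.Mutex
    *list.List
}{List: list.New()}

func getNPCFromPool() *NPC {
    npcPool.Lock()
    defer npcPool.Unlock()
    if npcPool.Len() == 0 {
        return &NPC{}
    }
    return npcPool.Remove(npcPool.Front()).(*NPC)
}

func putNPCBackToPool(npc *NPC) {
    *npc = NPC{} // reset the value before caching it
    npcPool.Lock()
    defer npcPool.Unlock()
    npcPool.PushBack(npc)
}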
Chapter 4

Stacks and Escape Analysis
The basic escape analysis units are functions. Only local values are escape analyzed. All package-level variables are allocated on heap for sure.
Value parts allocated on heap may be referenced by value parts allocated on either heap or stack, but value parts allocated on a stack may only be referenced by value parts allocated on the same stack. So if a value part is referenced by another value part allocated on heap, then the former (the referenced one) must also be allocated on heap. This means value parts referenced by package-level variables must be heap allocated.
// escape.go
package main

func main() {
    var (
        a = 1 // moved to heap: a
        b = false
        c = make(chan struct{})
    )
    go func() {
        if b {
            a++
        }
        close(c)
    }()
    <-c
    println(a, b) // 1 false
}
Run it with the -m compiler option:
$ go run -gcflags=-m escape.go
# command-line-arguments
./escape.go:10:5: can inline main.func1
./escape.go:6:3: moved to heap: a
./escape.go:10:5: func literal escapes to heap
1 false
From the outputs, we know that the variable a escapes to heap but the variable b is allocated on stack. What about the variable c? The direct part of channel c is allocated on stack. The indirect parts of channels are always allocated on heap, so escape messages for channel indirect parts are always omitted.
Why is the variable b allocated on stack while the variable a escapes? Aren't they both used in two goroutines? The reason is that the escape analysis module is smart enough to detect that the variable b is never modified, so it decides to use a (hidden, implicit) copy of the variable b in the new goroutine.
Let’s add one new line b = !b before the print line and run it again.
// escape.go
package main

func main() {
    var (
        a = 1     // moved to heap: a
        b = false // moved to heap: b
        c = make(chan struct{})
    )
    go func() {
        if b {
            a++
        }
        close(c)
    }()
    <-c
    b = !b
    println(a, b) // 1 true
}
The outputs:
./escape.go:10:5: can inline main.func1
./escape.go:6:3: moved to heap: a
./escape.go:7:3: moved to heap: b
./escape.go:10:5: func literal escapes to heap
1 true
Now both the variable a and the variable b escape. In fact, for this specific example, the compiler could still use a copy of variable b in the new goroutine. But it would be too expensive to let the escape analysis module analyze the concurrency synchronizations used in code.
For a similar reason, the escape analysis module also doesn't try to check whether or not the variable a will really be modified. If we change b to a constant, then the variable a will be allocated on stack, because the line a++ will be optimized away.
// frame.go
package main

import (
    "fmt"
    "math/rand"
)
        a[i*4+0] = byte(v >> 0)
        a[i*4+1] = byte(v >> 8)
        a[i*4+2] = byte(v >> 16)
        a[i*4+3] = byte(v >> 24)
    }
    var v = a[0]
    for i := 1; i < len(a); i++ {
        r += v ^ a[i]
        v = a[i]
    }
    return
}
func main() {
    x := foo(123)
    fmt.Println(x)
    duck()
}

var v interface{}

//go:noinline
func duck() {
    if v != nil {
        v = [16000]byte{}
        panic("unreachable")
    }
}
Run it with the -S compiler option, and we will get the following outputs (some text is omitted):
$ go run -gcflags=-S frame.go
...
... TEXT "".bar(SB), ABIInternal, $5024-32
...
... TEXT "".foo(SB), ABIInternal, $10056-8
...
... TEXT "".main(SB), ABIInternal, $64-0
...
... TEXT "".duck(SB), ABIInternal, $16024-0
...
From the outputs, we can see that
• the frame size of the bar function is 5024 bytes.
• the frame size of the foo function is 10056 bytes.
• the frame size of the main function is 64 bytes.
• the frame size of the duck function is 16024 bytes. Please note that, although the duck function is a de facto dummy function, its frame size is not zero. This fact will be made use of in a code optimization trick shown later.
At run time, before entering the execution of a function call, Go runtime will mark out a memory
segment on the current stack for the call (to allocate stack memory blocks). The memory segment
is called the stack frame of the function call and its size is the stack frame size of the called function.
As mentioned above, the frame size is calculated at compile time.
When a value (part) within a function is determined to be allocated on stack, its memory address
offset (relative to the start of the stack frame of any call to the function) is also determined, at
compile time. At run time, once the stack frame of a call to the function is marked out, the mem-
ory addresses of all value parts allocated on the stack within the function call are all determined
consequently, which is why allocating memory blocks on stack is much faster than on heap.
import "runtime"
//go:noinline
func f(i int) byte {
var a [1<<13]byte // allocated on stack and make stack grow
return a[i]
}
func main(){
var x int
println(&x) // <address 1>
f(1) // (make stack grow)
println(&x) // <address 2>
runtime.GC() // (make stack shrink)
println(&x) // <address 3>
runtime.GC() // (make stack shrink)
println(&x) // <address 4>
runtime.GC() // (stack does not shrink)
println(&x) // <address 4>
}
Note that each of the first two manual runtime.GC calls causes a stack shrinkage, but the last one doesn't.
Let's analyze how the stack of the main goroutine grows and shrinks during the execution of the program.
1. At the start, the initial stack size is 2KiB.
2. The frame size of the f function is 8216 bytes. Before entering the f function call, the stack grows to 16KiB, which is the smallest power of two larger than the demand (10KiB or so).
3. After the f function call fully exits, the stack will not shrink immediately, until a garbage collection cycle happens.
4. The first runtime.GC call shrinks the stack to 8KiB.
5. The second runtime.GC call shrinks the stack to 4KiB.
6. The third runtime.GC call doesn't shrink the stack, though 2KiB would be sufficient now. The reason is that the current official standard runtime implementation doesn't shrink a stack to a size less than 4 times the demand.
The reason why the array a is allocated on stack will be explained in a later section of this chapter.
Though it happens rarely in practice, we should try to avoid making stacks grow and shrink frequently. That is, for some rare cases, allocating some value parts on stack might not be a good idea.
There is a global limit on the stack size each goroutine may reach. If a goroutine exceeds the limit while growing its stack, the whole program crashes. As of Go toolchain 1.22 versions, the default maximum stack size is 1 GB (not GiB) on 64-bit systems, and 250 MB (not MiB) on 32-bit systems. Please note that the actual max stack size allowed by the current stack growth implementation is about half of the max stack size setting (512 MiB on 64-bit systems and 128 MiB on 32-bit systems).
The following is an example program which will crash because its stack exceeds the limit.
package main

// The exact value of the constant N is an assumption; it needs
// to be large enough for 50 recursive calls to overflow the
// actual 512 MiB stack limit.
const N = 1024 * 1024 * 10
func f(v [N]byte, n int) {
    if n > 0 {
        f(v, n-1)
    }
}

func main() {
    var x [N]byte
    f(x, 50)
}
If the call f(x, 50) is changed to f(x, 48), then the program will exit without crashing (on 64-bit systems).
We can call the runtime/debug.SetMaxStack function to change the global maximum stack size setting. There is no formal way to control the initial stack size of a goroutine, though an informal way will be provided in a later section of this chapter.
4.6 For all kinds of reasons, a value (part) will escape to heap even if it is only used in one goroutine

The following sub-sections will introduce some such cases (not a full list).
package main

func main() {
    var x *int
    for {
        var n = 1 // moved to heap: n
        x = &n
        break
    }
    _ = x
}
4.6.2 The value parts referenced by an argument will escape to heap if the argument is passed in an interface method call

For example, in the following code, the value x will be allocated on stack, but the value y will be allocated on heap.
package main

type I interface {
    M(*int)
}

type T struct{}

func (T) M(*int) {}

var t T
var i I = t

func main() {
    var x int // does not escape
    t.M(&x)
    var y int // moved to heap: y
    i.M(&y)
}
It is often impossible or too expensive for compilers to determine the dynamic value (and therefore the concrete method) of an interface value, so the official standard compiler gives up doing so in most cases. Potentially, the concrete method of the interface value could pass its arguments to some other goroutines. So, for safety, the official standard compiler conservatively lets the value parts referenced by the arguments escape to heap.
For some cases, the compiler can determine the dynamic value (and therefore the concrete method) of an interface value at compile time. If the compiler finds that the concrete method doesn't pass an argument to other goroutines, then it will let the value parts referenced by the argument not escape to heap. For example, in the following code, the values x and y are both allocated on stack. The reason why the value y doesn't escape is that the method call i.M(&y) is de-virtualized to t.M(&y) at compile time.
package main

type I interface {
    M(*int)
}

type T struct{}

func (T) M(*int) {}

func main() {
    var t T
    var i I = t
    var x int
    t.M(&x)
    var y int
    i.M(&y)
}
import "reflect"
var x reflect.Value
func main() {
var n = 100000 // line 9
_ = reflect.ValueOf(&n)
var k = 100000
_ = reflect.ValueOf(k) // line 13
var q = 100000
x = reflect.ValueOf(q) // line 16
}
The outputs when using different Go toolchain versions:
$ gotv 1.20. run -gcflags=-m reflect-value-escape-analysis.go
[Run]: $HOME/.cache/gotv/tag_go1.20.9/bin/go run -gcflags=-m reflect-value-escape-analysis.go
...
./reflect-value-escape-analysis.go:9:6: moved to heap: n
./reflect-value-escape-analysis.go:13:22: k escapes to heap
./reflect-value-escape-analysis.go:16:22: q escapes to heap
4.6.4 A call to the fmt.Print function makes the values referenced by its arguments escape to heap

Such a call will always make an allocation on heap to create a copy of each of its arguments (if that argument is not an interface) and makes the values referenced by its arguments escape to heap.
For example, in the following code, the variable x will escape to heap, but the variable y doesn't. And a copy of z is allocated on heap.
package main

import "fmt"

func main() {
    var x = 1 << 20 // moved to heap: x
    fmt.Println(&x)
    var y = 2 << 20 // y does not escape
    println(&y)
    var z = 3 << 20
    fmt.Println(z) // z escapes to heap
}
The same situation happens for other fmt.Print-like functions.
4.6.5 The values referenced by function return results will escape

For example, in the following program, the variable n escapes to heap, because it is referenced by a function return result.

package main

//go:noinline
func f(x *int) *int {
    var n = *x + 1 // moved to heap: n
    return &n
}

func main() {
    var t = 1 // does not escape
    var p = f(&t)
    println(*p) // 2
    println(&t) // 0xc000034758
    println(p)  // 0xc0000140c0
}
4.7 Function inlining might affect escape analysis results

By inlining some function calls, the compiler might let some value parts which would originally escape to heap be allocated on stack instead.
Still using the example in the last section, but with the //go:noinline line removed to make the function f inline-able, the value *p will be allocated on stack.
package main

func f(x *int) *int {
    var n = *x + 1 // moved to heap: n
    return &n
}

func main() {
    var t = 1
    var p = f(&t)
    println(*p) // 2
    println(&t) // 0xc000034760
    println(p)  // 0xc000034768
}
The two printed addresses show that the distance between the value t and the value *p is the size of *p, which indicates the two values are both allocated on stack. Please note that the message "moved to heap: n" is still reported, but it is for f function calls which are not inlined (there are no such calls in this tiny program).
The following is the rewritten code (by the compiler) after inlining:

package main

func main() {
    var t = 1
    var s = &t
    var n = *s + 1
    var p = &n
    println(*p)
    println(&t) // 0xc000034760
    println(p)  // 0xc000034768
}
After the rewrite, the compiler easily knows that the n variable is only used in the current goroutine, so it lets n not escape.
4.7.1 Function inlining is not always helpful for escape analysis

In the following program, the createSlice(32) call is inlined, but the make call within it still allocates on heap, whereas the direct make([]byte, 32) call doesn't:

// constinline.go
package main

func createSlice(n int) []byte {
    return make([]byte, n)
}

func main() {
    var x = createSlice(32)  // line 9
    var y = make([]byte, 32) // line 10
    _, _ = x, y
}
Run it:
$ go run -gcflags="-m" constinline.go
# command-line-arguments
./constinline.go:4:6: can inline createSlice
./constinline.go:8:6: can inline main
./constinline.go:9:21: inlining call to createSlice
./constinline.go:5:13: make([]byte, n) escapes to heap
./constinline.go:9:21: make([]byte, n) escapes to heap
./constinline.go:10:14: make([]byte, 32) does not escape
Future versions of the official standard Go compiler might make improvements to avoid more unnecessary escapes.
4.8 Control memory block allocation places

4.8.1 Ensure a value is allocated on heap

To ensure a value is allocated on heap, we can let an already heap-allocated value (such as a package-level variable) reference it:

var sink interface{}

//go:noinline
func escape(x interface{}) {
    sink = x
    sink = nil
}
The following is an example which uses the trick:
package main

var sink interface{}

//go:noinline
func escape(x interface{}) {
    sink = x
    sink = nil
}

func main() {
    var a = 1    // moved to heap: a
    var b = true // moved to heap: b
    escape(&a)
    escape(&b)
    println(a, b)
}
As the //go:noinline directive is only intended to be used in toolchain and standard package development, in user code the trick could be modified as the following. The official standard Go compiler will not inline calls to a function which potentially calls itself.

var sink interface{}

// A package-level variable which is always false at run time
// (the exact never-true guard used in the original is an assumption).
var neverTrue bool

func escape(x interface{}) {
    if neverTrue {
        escape(x) // the potential self-call prevents inlining
    }
    sink = x
    sink = nil
}
Surely, if we know some values will be allocated on heap anyway, then we can just let those heap-allocated values reference the values we expect to escape.
4.8.2 Use explicit value copies to help compilers detect some values don't escape

Let's revisit the example used previously (with a small modification: b is initialized as true now):
package main

func main() {
    var (
        a = 1    // moved to heap: a
        b = true // moved to heap: b
        c = make(chan struct{})
    )
    go func() {
        if b {
            a++
        }
        close(c)
    }()
    <-c
    b = !b
    println(a, b) // 2 false
}
In this example, variables a and b both escape. If we modify the example by making copies of a
and b (in two different ways) like the following, then both a and b will not escape.
package main

func main() {
    var (
        a = 1    // doesn't escape
        b = true // doesn't escape
        c = make(chan int)
    )
    b1 := b
    go func(a int) {
        if b1 {
            a++
            c <- a
        }
    }(a)
    a = <-c
    b = !b
    println(a, b) // 2 false
}
For this specific example, whether or not variables a and b escape has a very small effect on overall program performance. But the tip introduced here might be helpful elsewhere.
4.8.3 Memory size thresholds used by the compiler to make allocation placement decisions

To avoid crashes caused by goroutine stacks exceeding the maximum stack size, even when the compiler makes sure that a value part is used by only one goroutine, it will still let the value part be allocated on heap if the size of the value part is larger than a threshold. There are several such thresholds used by the official standard Go compiler (v1.22.n):
• in a conversion between string and byte slice, if the result string or byte slice contains no more than 32 bytes, then its indirect part (its underlying bytes) will be allocated on stack; otherwise, on heap. However, the threshold for converting a constant string (to byte slice) is relaxed to 64K (64 * 1024) bytes.
• if the size of a type T is larger than 64K (64 * 1024) bytes, then T values allocated by new(T) and &T{} will be allocated on heap.
• if the size of the backing array [N]T of the result of a make([]T, N) call is larger than 64K (64 * 1024) bytes, then the backing array will be allocated on heap. Here, N is a constant, so that the compiler can make decisions at compile time. If the length is not a constant and is larger than zero, then the backing array will always be allocated on heap.
• in a variable declaration, if the size of the direct part of the variable is larger than 10M (10 * 1024 * 1024) bytes, then the direct part will be allocated on heap.
The following are some examples to show the effects of these thresholds.
Example 1:
package main

import "testing"

func h() {
    _ = []byte(S) // ([]byte)(S) does not escape
}

func main() {
    stat := func(f func()) int {
        allocs := testing.AllocsPerRun(10, f)
        return int(allocs)
    }
    println(stat(f(bs33)))      // 1 (heap allocation)
    println(stat(f(bs33[:32]))) // 0 (heap allocations)
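The declarations of S, bs33, and the f function are not included in the fragment above; sketches consistent with the printed results (the exact lengths and contents are assumptions):

// A constant string longer than 32 bytes but far smaller than
// 64K, so that []byte(S) may be allocated on stack.
const S = "0123456789012345678901234567890123456789"

// A 33-byte slice; converting it to a string needs one heap
// allocation, whereas converting its 32-byte subslice doesn't.
var bs33 = make([]byte, 33)

// f returns a function performing the conversion, so that the
// stat helper can measure the conversion's allocations.
func f(s []byte) func() {
    return func() {
        _ = string(s)
    }
}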
Example 2:
package main

import "testing"

func main() {
    stat := func(f func() byte) int {
        allocs := testing.AllocsPerRun(10, func() {
            f()
        })
        return int(allocs)
    }
    println(stat(new65537))    // 1 (heap allocation)
    println(stat(new65535))    // 0 (heap allocations)
    println(stat(comptr65537)) // 1 (heap allocation)
    println(stat(comptr65535)) // 0 (heap allocations)
}
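The four checked functions are not included in the listing above; sketches consistent with the stated 64K threshold (the exact bodies are assumptions):

func new65537() byte {
    t := new([65537]byte) // size > 64K: allocated on heap
    return t[0]
}

func new65535() byte {
    t := new([65535]byte) // size < 64K: allocated on stack
    return t[0]
}

func comptr65537() byte {
    t := &[65537]byte{} // size > 64K: allocated on heap
    return t[0]
}

func comptr65535() byte {
    t := &[65535]byte{} // size < 64K: allocated on stack
    return t[0]
}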
From the outputs of example 2, we can affirm that the T values created by new(T) and &T{} will always be allocated on heap if the size of type T is larger than 64K (65536) bytes.
(Please note that we deliberately ignore the case of size 65536 here. Before Go toolchain 1.17, T values were allocated on heap if the size of T was exactly 65536 bytes. Go toolchain v1.17 changed this.)
Example 3:
package main

import "testing"

func main() {
    stat := func(f func() bool) int {
        allocs := testing.AllocsPerRun(10, func() {
            f()
        })
        return int(allocs)
    }
    println(stat(makeSlice65537))   // 1 (heap allocation)
    println(stat(makeSlice65535))   // 0 (heap allocations)
    println(stat(makeSliceVarSize)) // 1 (heap allocation)
}
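The three checked functions are not included in the listing above; sketches consistent with the thresholds (the exact bodies are assumptions):

func makeSlice65537() bool {
    s := make([]byte, 65537) // backing array size > 64K: heap
    return s[0] == 0
}

func makeSlice65535() bool {
    s := make([]byte, 65535) // backing array size < 64K: stack
    return s[0] == 0
}

var n = 65535 // a non-constant length

func makeSliceVarSize() bool {
    s := make([]byte, n) // non-constant length: always heap
    return s[0] == 0
}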
From the outputs of example 3, we can affirm that:
• if N is a constant and the size of the backing array [N]T of the result of the make([]T, N) call is larger than 64K (64 * 1024) bytes, then the backing array will be allocated on heap.
• if n is not a constant and is larger than zero, then the backing array will always be allocated on heap, for compilers can't determine the backing array size of the result slice at compile time.
(Again, please note that, if the constant N equals 65536, the elements of make([]T, N) were allocated on heap before Go toolchain v1.17, but have been allocated on stack since Go toolchain v1.17.)
Example 4:
package main

import "testing"

func main() {
    stat := func(f func() byte) int {
        allocs := testing.AllocsPerRun(10, func() {
            f()
        })
        return int(allocs)
    }
    println(stat(declare10M))        // 0
    println(stat(declare10Mplus1))   // 1
    println(stat(redeclare10M))      // 0
    println(stat(redeclare10Mplus1)) // 1
}
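The checked functions are not included in the listing above; sketches consistent with the 10M threshold (the bodies and the exact declaration forms are assumptions):

func declare10M() byte {
    var a [10 << 20]byte // exactly 10M: allocated on stack
    return a[len(a)-1]
}

func declare10Mplus1() byte {
    var a [10<<20 + 1]byte // larger than 10M: allocated on heap
    return a[len(a)-1]
}

func redeclare10M() byte {
    a := [10 << 20]byte{}
    return a[len(a)-1]
}

func redeclare10Mplus1() byte {
    a := [10<<20 + 1]byte{}
    return a[len(a)-1]
}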
From the execution results of example 4, we can affirm that, in a variable declaration, if the size of the direct part of the variable is larger than 10M (10 * 1024 * 1024) bytes, then the direct part will be allocated on heap.
4.8.4 Use smaller thresholds
Please note, the above-mentioned thresholds 64K and 10M become much smaller (16K and 128K, respectively) if the -smallframes compiler option is specified. For example, if we use the command go run -gcflags='-smallframes' main.go to run the following program, then each of the checked functions will make one allocation on heap.
// main.go
package main

import "testing"

func main() {
    stat := func(f func() byte) int {
        allocs := testing.AllocsPerRun(10, func() {
            f()
        })
        return int(allocs)
    }
    println(stat(new16384))
    println(stat(comptr16384))
    println(stat(make16384))
    println(stat(declare131073))
}
If we use the command go run main.go to run the program, then none of the checked functions
will make allocations on heap.
4.8.5 Allocate the backing array of a slice on stack even if its size is larger than or equal to 64K (but not larger than 10M)

We have learned that the largest slice backing array which can be allocated on stack is 65536 bytes (or 65535 bytes before Go toolchain v1.17). But there is a tip to raise the limit to 10M: derive the slice from a stack-allocated array. For example, the elements of the slice s created in the following program are allocated on stack. The length of s is 10M, which is far larger than 65536.
package main

import "testing"

func main() {
    stat := func(f func() byte) int {
        allocs := testing.AllocsPerRun(10, func() {
            f()
        })
        return int(allocs)
    }
    println(stat(f)) // 0
}
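The f function is not included in the listing above; a sketch of the tip (the exact body is an assumption):

func f() byte {
    var a [10 << 20]byte // a 10M array, allocated on stack
    s := a[:]            // derive the slice from the stack-allocated array
    return s[len(s)-1]
}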
As of Go toolchain 1.22 versions, this tip still works.
4.8.6 Allocate the backing array of a slice with an arbitrary length on stack
It looks like the escape analysis module in the current standard Go compiler implementation (v1.22.n) misses the case of using composite literals to create slices. When using the composite literal way to create a slice, the elements of the slice will always be allocated on stack (as long as the compiler makes sure that the elements of the slice will be used by only one goroutine), regardless of the length of the slice and the size of the elements. For example, the following example program will allocate the 500M elements of a byte slice on stack.
package main

import "testing"

func main() {
    stat := func(f func() byte) int {
        allocs := testing.AllocsPerRun(10, func() {
            f()
        })
        return int(allocs)
    }
    println(stat(createSlice)) // 0 (heap allocations)
}
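The createSlice function is not included in the listing above; a sketch using an index-style composite literal (the exact body is an assumption):

func createSlice() byte {
    // A composite literal creating a slice with 500M elements.
    s := []byte{500<<20 - 1: 0}
    return s[len(s)-1]
}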
Future Go toolchain versions might change the implementation so that this tip might not work later.
But up to now (Go toolchain 1.22.n), this tip still works.
4.8.7 More tricks to allocate arbitrary-size values on stack

Arbitrary-size values may also be allocated on stack in some other scenarios, as the following code snippet shows (N may be an arbitrarily large constant):

//go:noinline
func boxLargeSizeValue() {
    var x interface{} = [N]byte{} // 1
    println(x != nil)
}

//go:noinline
func largeSizeParameter(x [N]byte) { // 2
}

//go:noinline
func largeSizeElement() {
    var m map[int][N]byte
    m[0] = [N]byte{} // 3 (note: calling this function would panic, for m is nil)
}
However, as mentioned in the chapter before last, to avoid large value copy costs, generally we should not
• box large-size values into interfaces.
• use large-size types as function parameter types.
• use large-size types as map key and element types.
That is why the three cases shown in the above code snippet are corner cases. Generally, we should not write such code in practice.
import "fmt"
import "time"
55
// Prevent this anonymous function being inlined.
recover()
}
}(nil)
demo(8192)
c <- time.Since(start)
}
func main() {
    var c = make(chan time.Duration, 1)
    go foo(c)
    fmt.Println("foo:", <-c)
    go bar(c)
    fmt.Println("bar:", <-c)
}
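The demo function and the heads of the foo and bar functions are only partially shown above; a sketch consistent with the explanation that follows (the recursion scheme of demo and the body of the anonymous function are assumptions):

//go:noinline
func demo(depth int) byte {
    var a [128]byte // consume some stack space at each recursion level
    a[0] = byte(depth)
    if depth == 0 {
        return a[0]
    }
    return demo(depth-1) + a[len(a)-1]
}

func foo(c chan time.Duration) {
    start := time.Now()
    demo(8192) // the stack grows (and is copied) many times
    c <- time.Since(start)
}

func bar(c chan time.Duration) {
    start := time.Now()
    // The huge composite literal assigned below makes the frame
    // of this anonymous function huge (a bit over 64 MiB), so the
    // stack grows to its peak size in one single step.
    func(x *[1024 * 1024 * 64]byte) {
        if x != nil {
            *x = [1024 * 1024 * 64]byte{}
            // Prevent this anonymous function being inlined.
            recover()
        }
    }(nil)
    demo(8192)
    c <- time.Since(start)
}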
Run it:
$ go run init-stack-size.go
foo: 42.051369ms
bar: 4.740523ms
From the outputs, we can see that the bar goroutine is more efficient than the foo goroutine. The reason is simple: only one stack growth happens in the lifetime of the bar goroutine, whereas more than 10 stack growths happen in the lifetime of the foo goroutine.
The official standard Go compiler decides to let a [1024 * 1024 * 64]byte value be allocated on the stack of the bar goroutine (in fact, this allocation never actually happens) and calculates the frame size of the dummy anonymous function as 67108888 bytes (larger than 64 MiB), so the invocation of the anonymous function makes the stack of the bar goroutine grow to the peak size (128 MiB) before calling the demo function. Assuming no garbage collections happen during the lifetime of the goroutine, the stack doesn't need to grow any more.
At the beginning of a goroutine's lifetime, the size of the used part of the stack is still small, so the stack copy cost is small. As the used part becomes larger and larger, each stack growth copies more and more memory. So the sum of the stack copy costs in the foo goroutine is much larger than the single stack copy cost in the bar goroutine.
Chapter 5
Garbage Collection
This chapter doesn’t plan to explain the Garbage Collection (GC) implementation made by the
official standard Go runtime in detail. Only several facts in the GC implementation will be touched.
Within a GC cycle, there is a scan+mark phase and a sweep phase.
During the scan+mark phase of a GC cycle, the garbage collector will scan all pointers in the already-
known alive value parts to find more alive value parts (referenced by those already-known alive
ones), until no more alive value parts need to be scanned. In the process, all heap memory blocks
hosting alive value parts are marked as non-garbage memory blocks.
During the sweep phase of a GC cycle, the heap memory blocks which are not marked as non-
garbage will be viewed as garbage and collected.
So, generally speaking, the more pointers are used, the more pressure is put on the GC (for there is more scan work to do).
5.1 GC pacer
At run time, a new GC cycle will start automatically when certain conditions are reached. The
current automatic GC pacer design includes the following scheduling strategies:
• When the scan+mark phase of a GC cycle is just done, Go runtime will calculate a target heap
size by using the configured GOGC percentage value. After the GC cycle, when the size of the
heap (approximately) exceeds the target heap size later, the next GC cycle will automatically
start. This strategy (called new heap memory percentage strategy below) will be described
with more details in a later section.
• A new GC cycle will also automatically start if the last GC cycle ended at least two minutes ago. This is important for some value finalizers to get run and for some goroutine stacks to get shrunk in time.
• Go toolchain 1.19 introduced a new scheduling strategy: the memory limit strategy. When
the total amount of memory Go runtime uses (approximately) surpasses the (soft) limit, a
new GC cycle will start.
(Note: the above descriptions are rough. The official standard Go runtime also considers some other factors in the GC pacer implementation, which is why the above descriptions use the "approximately" wording.)
The three strategies may take effect at the same time.
The second strategy is an auxiliary one; this book will not talk more about it. With the other two strategies in play, the more frequently memory blocks are allocated on heap, the more frequently GC cycles start.
Please note that the current GC pacer design is not promised to be perfect for every use case. It will be improved constantly in future official standard Go compiler/runtime versions.
This book doesn’t provide suggestions to avoid memory fragments.
5.5 Memory wasting caused by sharing memory blocks

package main

import (
    "log"
    "runtime"
    "time"
)

func main() {
    log.SetFlags(0)
    var p = &s[999]
    runtime.GC()
    // log.Println(*p) // 999
    _ = p
    time.Sleep(time.Second)
}
Run this program, and it will print:
element 999 is collected
element 998 is collected
element 997 is collected
...
element 1 is collected
element 0 is collected
The output indicates that a new GC cycle (triggered by the manual runtime.GC call) makes the memory block carrying the slice elements collected; otherwise, the finalizers of these elements would not get executed.
Let's turn on the log.Println(*p) line and run the program again; then it will merely print 999, which indicates the memory block carrying the slice elements is still not collected when the manual GC cycle ends. Yes, the fact that the last element of the slice is still being used prevents the whole memory block from being collected.
We may copy the long-lived tiny value part (so that it will be carried on a small memory block) to let the old larger memory block be collected. For example, we may replace the following line in the above program (with the log.Println(*p) line turned on):

var p = &s[999]

with

var v = s[999] // make a copy
var p = &v

Then the manual GC will make the memory block carrying the slice elements collected.
The return results of some functions in the standard strings (and bytes) packages are substrings (or subslices) of some arguments passed to these functions. Such functions include the Fields, Split and Trim functions. Similarly, we should duplicate such a result substring (or subslice) if it is long-lived but short, while the corresponding argument is short-lived but long; otherwise, the memory block carrying the underlying bytes of both the result and the argument will not get collected in time.
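For example, a sketch of duplicating small substrings returned by strings.Split so that the large input's bytes can be collected (the helper function here is hypothetical):

package main

import "strings"

var keep []string // long-lived small substrings

func extractFirstWords(hugeText string) {
    // Assume hugeText contains at least 3 words.
    words := strings.Split(hugeText, " ")
    for _, w := range words[:3] {
        // strings.Clone copies the bytes, so keeping the clones
        // doesn't prevent the bytes of hugeText from being collected.
        keep = append(keep, strings.Clone(w))
    }
}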
5.7 Use the new heap memory percentage strategy to control automatic GC frequency

GC roots include the value parts allocated on stack and the direct parts of global (package-level) variables. To simplify the implementation, the official Go runtime views all the value parts allocated on stack as roots, including the value parts which don't contain pointers.
The memory blocks hosting roots are called root memory blocks. For the official standard Go runtime before version 1.18, roots have no impact on the new heap memory percentage strategy; since version 1.18, they have.
The new heap memory percentage strategy is configured through a GOGC value, which may be set
via the GOGC environment variable or modified by calling the runtime/debug.SetGCPercent
function. The default value of the GOGC value is 100.
We may set the GOGC environment variable to off or call the runtime/debug.SetGCPercent function with a negative argument to disable the new heap memory percentage strategy. Note: doing so will also disable the two-minute auto GC strategy. To disable the new heap memory percentage strategy alone, we may set the GOGC value to math.MaxInt64.
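A sketch of setting these values programmatically:

package main

import (
    "math"
    "runtime/debug"
)

func main() {
    // Change the GOGC value from the default 100 to 1000.
    debug.SetGCPercent(1000)

    // Effectively disable the percentage strategy alone
    // (the two-minute strategy still works; 64-bit systems only).
    debug.SetGCPercent(math.MaxInt64)

    // Disable both the percentage strategy and the
    // two-minute auto GC strategy.
    debug.SetGCPercent(-1)
}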
If the new heap memory percentage strategy is enabled (the GOGC value is non-negative), when the scan+mark phase of a GC cycle is just done, the official standard Go runtime (version 1.18+) will calculate the target heap size for the next garbage collection cycle from the non-garbage heap memory total size (called live heap here) and the root memory block total size (called GC roots here), according to the following formula:

Target heap size = Live heap + (Live heap + GC roots) * GOGC / 100

For versions before 1.18, the formula is:

Target heap size = Live heap + (Live heap) * GOGC / 100

When the heap memory total size (approximately) exceeds the calculated target heap size, the next GC cycle will start automatically.
Note: the minimum target heap size is (GOGC * 4 / 100) MB, which is also the target heap size for the first GC cycle.
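For example, with a live heap of 150 MiB, negligible GC roots, and the default GOGC value of 100, the target heap size is 150 + (150 + 0) * 100 / 100 = 300 MiB, so the next GC cycle starts when the heap grows to about 300 MiB.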
We can use the gctrace=1 GODEBUG environment variable option to output a summary log line for each GC cycle. Each GC cycle summary log line is formatted like the following text (in which each # represents a number and ... means messages ignored by this book).

gc # @#s #%: ..., #->#-># MB, # MB goal, # MB stacks, # MB globals, ...

The meaning of each field:

gc #          the GC number, incremented at each GC
@#s           elapsed time in seconds since program start
#%            percentage of time spent in GC since program start
#->#-># MB    heap sizes at GC start, scan end, and sweep end
# MB goal     target heap size
# MB stacks   estimated scannable stack size
# MB globals  scannable global size

Note: in the following outputs of examples, to keep each output line short, not all of these fields will be shown.
Let's use a contrived program as an example to show how to use this option.
// gctrace.go
package main

import (
    "math/rand"
    "time"
)

var x [512][]*int

func garbageProducer() {
    rand.Seed(time.Now().UnixNano())
    for i := 0; ; i++ {
        n := 6 + rand.Intn(6)
        for j := range x {
            x[j] = make([]*int, 1<<n)
            for k := range x[j] {
                x[j][k] = new(int)
            }
        }
        time.Sleep(time.Second / 1000)
    }
}

func main() {
    garbageProducer() // never exits
}
Run it with the gctrace=1 GODEBUG environment variable option (several unrelated starting lines are omitted from the outputs).
$ GODEBUG=gctrace=1 go run gctrace.go
...
gc 1 @0.017s 8%: ..., 3->4->4 MB, 4 MB goal, ...
gc 2 @0.037s 8%: ..., 7->8->4 MB, 8 MB goal, ...
gc 3 @0.064s 13%: ..., 8->9->4 MB, 9 MB goal, ...
gc 4 @0.108s 10%: ..., 9->9->0 MB, 10 MB goal, ...
gc 5 @0.127s 10%: ..., 3->4->3 MB, 4 MB goal, ...
gc 6 @0.155s 10%: ..., 6->7->2 MB, 8 MB goal, ...
gc 7 @0.175s 10%: ..., 5->5->4 MB, 5 MB goal, ...
gc 8 @0.206s 10%: ..., 8->8->3 MB, 9 MB goal, ...
gc 9 @0.232s 10%: ..., 5->6->4 MB, 6 MB goal, ...
gc 10 @0.269s 12%: ..., 8->10->10 MB, 9 MB goal, ...
...
(On Windows, the DOS command should be set "GODEBUG=gctrace=1" & "go run
gctrace.go".)
Here, the #->#-># MB and # MB goal fields are what we are most interested in. In a #->#-># MB field,
• the last number is the non-garbage heap memory total size (a.k.a. the live heap).
• the first number is the heap size at a GC cycle start, which should be approximately equal to the target heap size (the number in the # MB goal field of the same line).
From the outputs, we can find that
• the live heap sizes stagger much (yes, the above program is deliberately designed as such), so the GC cycle intervals also stagger much.
• the GC cycle frequency is so high that the percentage of time spent on GC is too high. The above outputs show the percentage varies from 8% to 12%.
The reason for these findings is that the live heap size is small (staying roughly under 5MiB) and staggers much, while the root memory block total size is almost zero.
One way to reduce the time spent on GC is to increase the GOGC value:
$ GOGC=1000 GODEBUG=gctrace=1 go run gctrace.go
...
gc 1 @0.074s 2%: ..., 38->43->15 MB, 40 MB goal, ...
gc 2 @0.810s 1%: ..., 160->163->9 MB, 167 MB goal, ...
gc 3 @1.285s 1%: ..., 105->107->11 MB, 109 MB goal, ...
gc 4 @1.835s 1%: ..., 125->128->10 MB, 129 MB goal, ...
gc 5 @2.331s 1%: ..., 114->117->18 MB, 118 MB goal, ...
gc 6 @3.250s 1%: ..., 199->201->8 MB, 204 MB goal, ...
gc 7 @3.703s 1%: ..., 96->98->18 MB, 100 MB goal, ...
gc 8 @4.580s 1%: ..., 201->204->10 MB, 207 MB goal, ...
gc 9 @5.111s 1%: ..., 118->119->3 MB, 122 MB goal, ...
gc 10 @5.306s 1%: ..., 43->43->4 MB, 44 MB goal, ....
...
(On Windows, the DOS command should be set "GOGC=1000" & set "GODEBUG=gctrace=1"
& "go run gctrace.go".)
From the above outputs, we can find that, after increasing the GOGC value to 1000, GC cycle intervals and the heap sizes at GC cycle beginnings both become much larger. But GC cycle intervals and live heap sizes still stagger much, which might be a problem for some programs. The following sections will introduce some ways to solve these problems.
5.8 Since Go toolchain 1.18, the larger the GC roots, the larger the GC cycle intervals

The following program creates a goroutine with a huge stack, so that the root memory block total size of the program is huge.

// bigstacks.go
package main

import (
    "math/rand"
    "time"
)
var x [512][]*int

func garbageProducer() {
    rand.Seed(time.Now().UnixNano())
    for i := 0; ; i++ {
        n := 6 + rand.Intn(6)
        for j := range x {
            x[j] = make([]*int, 1<<n)
            for k := range x[j] {
                x[j][k] = new(int)
            }
        }
        time.Sleep(time.Second / 1000)
    }
}
    return s[v]
}
func main() {
    go bigStack(nil, 123)
    garbageProducer() // never exits
}
Run it and get the following outputs:
$ go version
go version go1.21.2 linux/amd64
$ GODEBUG=gctrace=1 go run bigstacks.go
...
gc 1 @0.015s ..., 4 MB goal, 256 MB stacks, ...
gc 2 @1.597s ..., 263 MB goal, 256 MB stacks, ...
gc 3 @2.772s ..., 258 MB goal, 256 MB stacks, ...
gc 4 @3.945s ..., 258 MB goal, 256 MB stacks, ...
gc 5 @5.239s ..., 288 MB goal, 256 MB stacks, ...
gc 6 @6.122s ..., 264 MB goal, 256 MB stacks, ...
gc 7 @7.384s ..., 279 MB goal, 256 MB stacks, ...
gc 8 @8.796s ..., 288 MB goal, 256 MB stacks, ...
gc 9 @10.102s ..., 291 MB goal, 256 MB stacks, ...
gc 10 @10.951s ..., 262 MB goal, 256 MB stacks, ...
...
The following program uses a huge package-level variable which contains a pointer, so that the root memory block total size of the program is also huge.
// bigglobals.go
package main
import (
"math/rand"
"time"
)
var x [512][]*int
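// Note: the declaration of bigGlobal is omitted in this excerpt.
// Presumably it is a huge struct value containing a pointer,
// matching the "150 MB globals" outputs below; something like:
var bigGlobal struct {
	a [150 << 20]byte
	p *int
}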
func garbageProducer() {
rand.Seed(time.Now().UnixNano())
for i := 0; ; i++ {
n := 6 + rand.Intn(6)
for j := range x {
x[j] = make([]*int, 1<<n)
for k := range x[j] {
x[j][k] = new(int)
}
}
time.Sleep(time.Second / 1000)
}
}
func main() {
garbageProducer() // never exit
println(bigGlobal.p) // unreachable
}
Run it to get the following outputs:
$ GODEBUG=gctrace=1 go run bigglobals.go
...
gc 1 @0.266s 0%: ..., 150 MB goal, ..., 150 MB globals, ...
gc 2 @0.885s 0%: ..., 154 MB goal, ..., 150 MB globals, ...
gc 3 @1.616s 0%: ..., 168 MB goal, ..., 150 MB globals, ...
gc 4 @2.327s 0%: ..., 165 MB goal, ..., 150 MB globals, ...
gc 5 @3.011s 0%: ..., 168 MB goal, ..., 150 MB globals, ...
gc 6 @3.493s 0%: ..., 155 MB goal, ..., 150 MB globals, ...
gc 7 @4.208s 0%: ..., 167 MB goal, ..., 150 MB globals, ...
gc 8 @4.897s 0%: ..., 162 MB goal, ..., 150 MB globals, ...
gc 9 @5.618s 0%: ..., 165 MB goal, ..., 150 MB globals, ...
gc 10 @6.338s 0%: ..., 169 MB goal, ..., 150 MB globals, ...
...
Similarly, the outputs show that GC cycle intervals become larger and fluctuate much less (with Go toolchain version 1.18+).
The examples in the current section show that root memory blocks effectively act as memory ballasts.
The next section will introduce a memory ballast trick which works with Go toolchain version 1.18-.
func main() {
// ballastSize is a value much larger than the
// maximum possible live heap size of the program.
ballast := make([]byte, ballastSize)
programRun()
runtime.KeepAlive(&ballast)
}
The trick allocates a slice with a large total element size. This size contributes to the non-garbage heap size.
Let’s modify the gctrace example shown above as:
// gcballast.go
package main
import (
"math/rand"
"runtime"
"time"
)
var x [512][]*int
func garbageProducer() {
rand.Seed(time.Now().UnixNano())
for i := 0; ; i++ {
n := 6 + rand.Intn(6)
for j := range x {
x[j] = make([]*int, 1<<n)
for k := range x[j] {
x[j][k] = new(int)
}
}
time.Sleep(time.Second / 1000)
}
}
func main() {
const ballastSize = 150 << 20 // 150 MiB
ballast := make([]byte, ballastSize)
garbageProducer()
runtime.KeepAlive(&ballast)
}
This program uses a 150MiB memory ballast, so that the non-garbage heap size (live heap) of the program stays at about 150-160MiB. Consequently, the target heap size of the program stays at a bit over 300MiB (assuming the GOGC value is 100). This makes GC cycle intervals become larger and fluctuate much less.
Run it to verify the effect:
$ GODEBUG=gctrace=1 go run gcballast.go
...
gc 1 @0.005s 0%: ..., 150->150->150 MB, 4 MB goal, ...
gc 2 @0.333s 5%: ..., 255->261->171 MB, 300 MB goal, ...
gc 3 @0.989s 2%: ..., 293->296->161 MB, 344 MB goal, ...
gc 4 @1.554s 1%: ..., 276->276->152 MB, 324 MB goal, ...
gc 5 @2.065s 1%: ..., 260->261->159 MB, 305 MB goal, ...
gc 6 @2.595s 1%: ..., 271->271->152 MB, 318 MB goal, ...
gc 7 @3.082s 1%: ..., 259->267->173 MB, 305 MB goal, ...
gc 8 @3.708s 1%: ..., 295->296->155 MB, 347 MB goal, ...
gc 9 @4.214s 1%: ..., 265->266->153 MB, 311 MB goal, ...
gc 10 @4.731s 1%: ..., 262->262->156 MB, 307 MB goal, ...
...
Note: the elements of the local slice are never used, so the memory block for the elements is only allocated virtually, not physically (at least on Linux). This means the elements of the slice don't consume physical memory, which is an advantage over using root memory blocks as memory ballasts.
5.10 Use Go toolchain 1.19 introduced memory limit strategy
to avoid frequent GC cycles
The official standard Go toolchain version 1.19 introduced a new GC pacing strategy: the memory limit strategy. The strategy may be configured either via the GOMEMLIMIT environment variable or through the runtime/debug.SetMemoryLimit function. The memory limit sets a maximum on the total amount of memory that the Go runtime should use. In other words, if the total amount of memory the Go runtime uses (approximately) surpasses the limit, a new garbage collection cycle will start. The limit is soft: a Go program will not exit when the limit is exceeded. The default value of the memory limit is math.MaxInt64, which effectively disables this strategy.
The value of the GOMEMLIMIT environment variable may have an optional unit suffix. The supported
suffixes include B, KiB, MiB, GiB, and TiB. A value without a unit means B (bytes).
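For example, a program may also set the limit programmatically at startup. A minimal sketch (the 512 MiB value here is an arbitrary choice, not from the original example):

package main

import "runtime/debug"

func main() {
	// Set a soft memory limit of 512 MiB. The runtime will start
	// GC cycles more aggressively as its total memory use
	// approaches this limit.
	debug.SetMemoryLimit(512 << 20)

	// ... the program logic ...
}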
The memory limit strategy and the new heap memory percentage strategy may take effect together. For demonstration purposes, let's disable the new heap memory percentage strategy (by setting GOGC=off) and enable the memory limit strategy, then run the gctrace example program shown above again. Please make sure to run the program with Go toolchain v1.19+; otherwise, the GOMEMLIMIT environment variable will not be recognized, so that automatic garbage collection will be turned off totally.
$ go version
go version go1.19 linux/amd64
Chapter 6
Pointers
import "testing"
const N = 1000
var a [N]int
//go:noinline
func g0(a *[N]int) {
for i := range a {
a[i] = i // line 12
}
}
//go:noinline
func g1(a *[N]int) {
_ = *a // line 18
for i := range a {
a[i] = i // line 20
}
}
Let’s run the benchmarks with the -S compiler option, the following outputs are got (uninterested
texts are omitted):
$ go test -bench=. -gcflags=-S unnecessary-checks.go
...
0x0004 00004 (unnecessary-checks.go:12) TESTB AL, (AX)
0x0006 00006 (unnecessary-checks.go:12) MOVQ CX, (AX)(CX*8)
...
0x0000 00000 (unnecessary-checks.go:18) TESTB AL, (AX)
0x0002 00002 (unnecessary-checks.go:18) XORL CX, CX
0x0004 00004 (unnecessary-checks.go:19) JMP 13
0x0006 00006 (unnecessary-checks.go:20) MOVQ CX, (AX)(CX*8)
...
Benchmark_g0-4 494.8 ns/op
Benchmark_g1-4 399.3 ns/op
From the outputs, we could find that the g1 implementation is more performant than the g0 implementation, even though the g1 implementation contains one more code line (line 18). Why? The question is answered by the output assembly instructions.
In the g0 implementation, the TESTB instruction is generated within the loop, whereas in the g1 implementation, the TESTB instruction is generated outside of the loop. The TESTB instruction is used to check whether or not the argument a is a nil pointer. For this specific case, checking once is enough. The extra code line works around this flaw in the compiler implementation.
There is a third implementation which is as performant as the g1 implementation. The third imple-
mentation uses a slice derived from the array pointer argument.
//go:noinline
func g2(x *[N]int) {
a := x[:]
for i := range a {
a[i] = i
}
}
Please note that the flaw might be fixed in future compiler versions.
And please note that, if the three implementation functions are inline-able, the benchmark results will change much. That is the reason why the //go:noinline compiler directives are used here. (Before Go toolchain v1.18, the //go:noinline compiler directives are actually unnecessary here, because Go toolchain v1.18- never inlines a function containing a for-range loop.)
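The declaration of the type T used below is omitted in this excerpt; presumably it is a struct type with an array pointer field, something like:

type T struct {
	a *[N]int
}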
//go:noinline
func f0(t *T) {
for i := range t.a {
t.a[i] = i
}
}
//go:noinline
func f1(t *T) {
_ = *t.a
for i := range t.a {
t.a[i] = i
}
}
To move the nil array pointer checks out of the loop, we should copy the t.a field to a local variable,
then adopt the trick introduced above:
//go:noinline
func f3(t *T) {
a := t.a
_ = *a
for i := range a {
a[i] = i
}
}
Or simply derive a slice from the array pointer field:
//go:noinline
func f4(t *T) {
a := t.a[:]
for i := range a {
a[i] = i
}
}
The benchmark results:
Benchmark_f0-4 622.9 ns/op
Benchmark_f1-4 637.4 ns/op
Benchmark_f2-4 511.3 ns/op
Benchmark_f3-4 390.1 ns/op
Benchmark_f4-4 387.6 ns/op
The results verify our previous conclusions.
Note, the f2 function mentioned in the benchmark results is declared as
//go:noinline
func f2(t *T) {
a := t.a
for i := range a {
a[i] = i
}
}
The f2 implementation is not as fast as the f3 and f4 implementations, but it is faster than the f0 and f1 implementations. However, that is another story.
If the elements of an array pointer field are not modified (only read) in the loop, then the f1 way is as performant as the f3 and f4 ways.
Personally, for most cases, I think we should try to use the slice way (the f4 way) to get the best
performance, because generally slices are optimized better than arrays by the official standard Go
compiler.
import "testing"
//go:noinline
func f(sum *int, s []int) {
for _, v := range s { // line 8
*sum += v // line 9
}
}
//go:noinline
func g(sum *int, s []int) {
var n = *sum
for _, v := range s { // line 16
n += v // line 17
}
*sum = n
}
...
0x0009 00009 (avoid-indirects_test.go:9) MOVQ (AX), SI
0x000c 00012 (avoid-indirects_test.go:9) ADDQ (BX)(DX*8), SI
0x0010 00016 (avoid-indirects_test.go:9) MOVQ SI, (AX)
0x0013 00019 (avoid-indirects_test.go:8) INCQ DX
0x0016 00022 (avoid-indirects_test.go:8) CMPQ CX, DX
0x0019 00025 (avoid-indirects_test.go:8) JGT 9
...
0x000b 00011 (avoid-indirects_test.go:16) MOVQ (BX)(DX*8), DI
0x000f 00015 (avoid-indirects_test.go:16) INCQ DX
0x0012 00018 (avoid-indirects_test.go:17) ADDQ DI, SI
0x0015 00021 (avoid-indirects_test.go:16) CMPQ CX, DX
0x0018 00024 (avoid-indirects_test.go:16) JGT 11
...
Benchmark_f-4 3024 ns/op
Benchmark_g-4 566.6 ns/op
The output assembly instructions show that the pointer sum is dereferenced within the loop in the f function. A dereference operation is a memory operation. For the g function, the dereference operations happen outside of the loop, and the instructions generated for the loop only process registers. It is much faster for CPU instructions to process registers than memory, which is why the g function is much more performant than the f function.
This is not a compiler flaw. In fact, the f and g functions are not equivalent (though for most use cases in practice, their results are the same). For example, if they are called as the following code shows, then they produce different results (thanks to skeeto@reddit for making this correction).
{
var s = []int{1, 1, 1}
var sum = &s[2]
f(sum, s)
println(*sum) // 6
}
{
var s = []int{1, 1, 1}
var sum = &s[2]
g(sum, s)
println(*sum) // 4
}
Another performant implementation for this specific case is to move the pointer parameter out of the function body (again, it is not totally equivalent to either the f or g function):
//go:noinline
func h(s []int) int {
var n = 0
for _, v := range s {
n += v
}
return n
}
*sum += h(s)
...
}
Chapter 7
Structs
import "testing"
const N = 1000
type T struct {
x int
}
//go:noinline
func f(t *T) {
t.x = 0
for i := 0; i < N; i++ {
t.x += i
}
}
//go:noinline
func g(t *T) {
var x = 0
for i := 0; i < N; i++ {
x += i
}
t.x = x
}
var t = &T{}
Chapter 8
import "testing"
type T [1000]byte
var x T
var r bool
}
The benchmark results:
Benchmark_CompareWithLiteral-4 21214032 52.18 ns/op
Benchmark_CompareWithGlobalVar-4 36417091 31.03 ns/op
By using the -S compiler option, we could find that the compiler generates fewer instructions for the function CompareWithGlobalVar than for the function CompareWithLiteral. That is why the function CompareWithGlobalVar is more performant.
For small-size arrays, the performance difference between the two functions is small.
Please note that future compiler versions might be improved to remove the performance difference
between the two functions.
// case 1:
y = make([]T, len(s)) // works
copy(y, s)
// case 2:
y = make([]T, len(s)) // doesn't work
_ = copy(y, s)
// case 3:
y = make([]T, len(s)) // doesn't work
f(copy(y, s))
// case 4:
y = make([]T, len(s), len(s)) // doesn't work
copy(y, s)
// case 5:
var a = [1][]T{s}
y = make([]T, len(a[0])) // doesn't work
copy(y, a[0])
// case 6:
type SS struct {x []T}
var ss = SS{x: s}
y = make([]T, len(ss.x)) // doesn't work
copy(y, ss.x)
The capacity of the result of a make call is exactly the (implicit or explicit) capacity argument passed to the make call. For example, cap(make([]T, n)) == n and cap(make([]T, n, m)) == m. This means some bytes might be wasted in the memory block hosting the elements of the result.
If an append call needs to allocate, then the capacity of the result slice of the append call is unspecified. The capacity is often larger than the length of the result slice. Assuming the result of the append call is assigned to a slice s, the elements within s[len(s):cap(s)] will get zeroed in the append call. The other elements will be overwritten by the elements of the argument slices. For example, in the following code, the elements within s[len(x)+len(y):] will get zeroed in the append call.
s = append(x, y...)
If the append call in the new = append(old, values...) statement allocates, then the capacity of the result slice new is determined by the following algorithm (assuming elementSize is not zero) in Go 1.17:
var newcap int
var required = old.len + values.len
if required > old.cap * 2 {
newcap = required
} else {
if old.cap < 1024 {
newcap = old.cap * 2
} else {
newcap = old.cap
for 0 < newcap && newcap < required {
newcap += newcap / 4
}
// Avoid overflowing.
if newcap <= 0 {
newcap = required
}
}
}
func main() {
x1 := make([]int, 897)
x2 := make([]int, 1024)
y := make([]int, 100)
println(cap(append(x1, y...)))
println(cap(append(x2, y...)))
}
With Go toolchain v1.17, the above example prints:
2048
1280
With Go toolchain v1.18+, the above example prints:
1360
1536
That is because Go 1.18 removes this drawback (growing a shorter slice could result in a larger capacity than growing a longer one) by tweaking the algorithm a bit:
var newcap int
var required = old.len + values.len
if required > old.cap * 2 {
newcap = required
} else {
const threshold = 256
if old.cap < threshold {
newcap = old.cap * 2
} else {
newcap = old.cap
for 0 < newcap && newcap < required {
newcap += (newcap + 3*threshold) / 4
}
// Avoid overflowing.
if newcap <= 0 {
newcap = required
}
}
}
The new algorithm in Go 1.18+ often allocates less memory than the old one in Go 1.17.
Please note, each slice growth needs one memory allocation, so we should try to grow slices in fewer steps when programming.
Another subtle difference (up to Go toolchain 1.22) between the copy and append functions is that
the copy function will not copy elements when it detects that the addresses of the first elements
of its two slice parameters are identical, yet the append function never performs such detections.
This means, in the following code, the copy call is much more efficient than the append call.
var x = make([]T, 10000)
copy(x, x)
_ = append(x[:0], x...)
A concrete example showing this difference in practice: in the two DeleteSliceElements functions shown below, when i == j, the implementation using copy is much more performant than the implementation using append.
func DeleteSliceElements(s []T, i, j int) []T {
_ = s[i:j] // bounds check
return append(s[:i], s[j:]...)
}
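The copy-based implementation is not shown above; presumably it looks like the following sketch (the 2 suffix in the function name is an assumption):

func DeleteSliceElements2(s []T, i, j int) []T {
	_ = s[i:j] // bounds check
	// When i == j, the two argument slices of the copy call share
	// the same first-element address, so no elements are copied.
	n := copy(s[i:], s[j:])
	return s[:i+n]
}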
func main() {
x := make([]byte, 100, 500)
y := make([]byte, 500)
a := append(x, y...)
b := append(x[:len(x):len(x)], y...)
println(cap(a)) // 1024
println(cap(b)) // 640
}
The outputs shown as comments are for Go 1.17. For Go 1.18, instead, the above program prints:
896
640
Of course, if we know for sure that the free capacity of the first argument slice of an append call is enough to hold all the appended elements, then we should not clip the first argument.
8.4 Grow slices (enlarge slice capacities)
There are two ways to grow the capacity of a slice x to c if the backing array of the slice needs to be re-allocated during the growth.
// way 1
func Grow_MakeCopy(x []T, c int) []T {
r := make([]T, c)
copy(r, x)
return r[:len(x)]
}
// way 2
func Grow_Oneline(x []T, c int) []T {
return append(x, make([]T, c - len(x))...)[:len(x)]
}
Both ways are specially optimized by the official standard Go compiler. As mentioned above, the make call in way 1 doesn't reset the elements within r[:len(x)]. In way 2, the make call doesn't make any allocation at all.
In theory, with the two optimizations, the two ways have comparable performance. But benchmark results often show that way 1 is a little more performant.
Note that, before the official standard Go compiler implementation v1.20, the optimization for way
2 doesn’t work if the type of the first argument slice of the append call is a named type. For exam-
ple, the following Grow_Oneline_Named function is much slower than the above Grow_Oneline
function.
type S []T
sCloned = make([]T, len(s))
copy(sCloned, s)
For many cases, the make+copy way is a little faster than the following append way because, as mentioned above, an append call might allocate and zero some extra elements.
sCloned = append([]T(nil), s...)
For example, in the following code, 8191 extra elements are allocated and zeroed.
x := make([]byte, 1<<15+1)
y := append([]byte(nil), x...)
println(cap(y) - len(x)) // 8191
func main() {
x := make([]int, 98)
y := make([]int, 666)
a := append(x, y...)
b := append(y, x...)
println(cap(a)) // 768
println(cap(b)) // 1360
}
The outputs shown as comments are for Go 1.17. For Go 1.18, instead, the above program prints:
768
1024
If the free element slots in slice x are enough to hold all elements of slice y and it is allowed to let
the result slice and x share elements, then append(x, y...) is the most performant way, for it
doesn’t allocate.
If insertion operations are performed frequently, please consider using an insertion-friendly data structure (such as a linked list) instead.
import "testing"
//go:noinline
func sum_forrange1(s []int) int {
var n = 0
for i := range s {
n += s[i]
}
return n
}
//go:noinline
func sum_forrange2(s []int) int {
var n = 0
for _, v := range s {
n += v
}
return n
}
//go:noinline
func sum_plainfor(s []int) int {
var n = 0
for i := 0; i < len(s); i++ {
n += s[i]
}
return n
}
for i := range aSliceOrArray {
aSliceOrArray[i] = v0
}
This optimization also works if two iteration variables are present but the second one is the blank identifier _.
For most cases, the above code is more performant than the following code:
for i := 0; i < len(aSliceOrArray); i++ {
aSliceOrArray[i] = v0
}
On my machine, the memclr way is slower only if the length of the array or slice is smaller than 6 (with byte element type).
Before Go toolchain v1.19, the ranged container must be an array or slice to make this optimization
work. Since Go toolchain v1.19, it may be also a pointer to an array.
In fact, this optimization is more meaningful for slices than for arrays and array pointers, as there is a simpler (and sometimes more performant) way to reset array elements:
anArray = ArrayType{}
*anArrayPointer = ArrayType{}
Note: Go 1.21 added a new built-in function, clear, which may be used to reset all the ele-
ments in a slice. So since Go 1.21, we should try to use the clear function instead of relying
on the memclr optimization to reset slice elements.
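For example (assuming Go 1.21+):

s := make([]int, 128)
clear(s) // reset all elements of s to zero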
import "testing"
const N = 1 << 10
var s = make([]byte, N)
var r = make([]byte, N/4)
j++
}
}
Chapter 9
_, _ = i, b
}
}
import t "testing"
const x = "abcdefghijklmnopqrstuvwxyz0123456789"
var y = "abcdefghijklmnopqrstuvwxyz0123456789"
func rangeNonConstant() {
for range []byte(y) {}
}
func convertConstant() {
_ = []byte(x)
}
func convertNonConstant() {
_ = []byte(y)
}
func main() {
stat := func(f func()) int {
allocs := t.AllocsPerRun(10, f)
return int(allocs)
}
println(
stat(rangeNonConstant),
stat(convertConstant),
stat(convertNonConstant),
)
}
The outputs with different toolchain versions:
$ gotv 1.6. run string-2-bytes.go
[Run]: $HOME/.cache/gotv/tag_go1.6.4/bin/go run string-2-bytes.go
1 1 1
$ gotv 1.7. run string-2-bytes.go
[Run]: $HOME/.cache/gotv/tag_go1.7.6/bin/go run string-2-bytes.go
0 1 1
$ gotv 1.11. run string-2-bytes.go
[Run]: $HOME/.cache/gotv/tag_go1.11.13/bin/go run string-2-bytes.go
0 1 1
$ gotv 1.12. run string-2-bytes.go
[Run]: $HOME/.cache/gotv/tag_go1.12.17/bin/go run string-2-bytes.go
0 0 1
$ gotv 1.21. run string-2-bytes.go
[Run]: $HOME/.cache/gotv/tag_go1.21.8/bin/go run string-2-bytes.go
0 0 1
$ gotv 1.22. run string-2-bytes.go
[Run]: $HOME/.cache/gotv/tag_go1.22.1/bin/go run string-2-bytes.go
0 0 0
Sadly, the latest compiler (v1.22.n) is still not smart enough to remove the byte duplication in the
conversions shown in the following code:
package main
import "bytes"
import t "testing"
var y = "abcdefghijklmnopqrstuvwxyz0123456789"
var s = []byte(y)
func compareNonConstants() {
_ = bytes.Compare([]byte(y), []byte(y))
}
func concatStringAndBytes() {
_ = append([]byte(y), s...)
}
func main() {
stat := func(f func()) int {
allocs := t.AllocsPerRun(10, f)
return int(allocs)
}
println(stat(compareNonConstants)) // 2
println(stat(concatStringAndBytes)) // 2
}
}
This optimization leads to the verbose function being more efficient than the clean function shown in the following code (as of the official standard Go compiler v1.22.n):
package main
import t "testing"
func main() {
x := []byte{1023: 'x'}
y := []byte{1023: 'y'}
z := []byte{1023: 'z'}
stat := func(f func(x, y, z []byte)) int {
allocs := t.AllocsPerRun(10, func() {
f(x, y, z)
})
return int(allocs)
}
println(stat(verbose)) // 0
println(stat(clean)) // 3
}
From the outputs, we could see that the verbose function doesn't make allocations but the clean function makes three, which is exactly why the former is more performant.
The performance difference between the two functions might be removed since a future Go
toolchain version.
We could also use the bytes.Compare function to compare two byte slices. The bytes.Compare way is often more performant for the cases in which three-way comparisons (as the following code shows) are needed.
// Note, two branches are enough
// to form a three-way comparison.
func doSomething(x, y []byte) {
switch bytes.Compare(x, y) {
case -1:
// ... do something 1
case 1:
// ... do something 2
default:
// ... do something 3
}
}
Don’t use the bytes.Compare function in simple (one-way) byte slice comparisons, as the follow-
ing code shows. It is slower for such cases.
func doSomething(x, y []byte) {
if bytes.Compare(x, y) == 0 {
... // do something
}
}
import t "testing"
var m = map[string]int{}
var key = []byte{'k', 'e', 'y'}
var n int
func get() {
n = m[string(key)]
}
func inc() {
m[string(key)]++
}
func set() {
m[string(key)] = 123
}
func main() {
stat := func(f func()) int {
allocs := t.AllocsPerRun(10, f)
return int(allocs)
}
println(stat(get)) // 0
println(stat(set)) // 1
println(stat(inc)) // 1
}
This optimization also works if the key is presented in a struct or array composite literal form: T1{... Tn{..., string(key), ...} ...}, where each Tx is either a struct type or an array type. For example, the conversion string(key) in the following code doesn't make a duplication, either.
package main
import t "testing"
type T struct {
a int
b bool
k [2]string
}
var m = map[T]int{}
var key = []byte{'k', 'e', 'y', 99: 'z'}
var n int
func get() {
n = m[T{k: [2]string{1: string(key)}}]
}
func main() {
print(int(t.AllocsPerRun(10, get))) // 0
}
This optimization leads to an interesting case. In the following code snippet, the function modify1 makes one allocation but the function modify2 makes none, so the function modify2 is more performant than the function modify1. The reason can easily be found out from their respective equivalent forms. The string(key) used in the function modify2 only appears in a map element retrieval expression, whereas the string(key) used in the function modify1 should be thought of as appearing in a map element modification statement.
package main
import t "testing"
var m1 = map[string]int{"key": 0}
func modify1() {
m1[string(key)]++
// (logically) equivalent to:
// m1[string(key)] = m1[string(key)] + 1
}
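// Note: the modify2 function and its map are omitted in this
// excerpt; presumably the map element type is a pointer type,
// along these lines:
var m2 = map[string]*int{"key": new(int)}

func modify2() {
	// Here, string(key) only appears in a map element retrieval
	// expression, so the conversion makes no duplication.
	*m2[string(key)]++
}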
func main() {
stat := func(f func()) int {
allocs := t.AllocsPerRun(10, f)
return int(allocs)
}
println(stat(modify1)) // 1
println(stat(modify2)) // 0
}
So if the entries of a map are seldom deleted but the elements of the map are modified frequently,
it is best to use a pointer type as the map element type.
import "testing"
var x string
func stat(add func() string) int {
c := func() {
x = add()
}
allocs := testing.AllocsPerRun(10, c)
return int(allocs)
}
func main() {
println(stat(f)) // 1
println(stat(g)) // 3
}
Please note that, currently (Go toolchain 1.22 versions), this optimization is only useful for byte slices with lengths larger than 32. If we change the length of the byte slice s to 32 (by declaring it with var s = []byte{31: 'x'}), then the performance difference between the functions f and g will become negligible. Please read the next section for the reason.
The a-bit-verbose way actually has a drawback: it wastes at least one more byte of memory. If, at coding time, we know the byte value at a specific index of one operand, then this drawback can be avoided. For example, assuming we know the first byte of the first operand is always '$', we could modify the a-bit-verbose way as the following code shows, to avoid wasting memory.
func f() string {
return "$" + string(s[1:]) + string(s)
}
Please note that this optimization is somewhat unintended. It might not be supported any more in future Go toolchain versions.
9.1.6 If the result of an operation is a string or byte slice, and the length of the result is larger than 32, then the byte elements of the result will always be allocated on heap
In fact, recall that the example in the last section uses a byte slice with 33 bytes; the reason is to avoid allocating the byte elements of the string concatenation operands on stack.
In the following program, the function g needs 3 heap allocations, but the function f needs none. The only differences between the two functions are the lengths of the involved byte slice and strings. The function f actually makes 3 stack allocations, but the function testing.AllocsPerRun only counts heap allocations.
package main
import "testing"
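// Note: the declarations of str and stat are omitted in this
// excerpt; presumably something like:
var str = "0123456789abcdef" // len(str) == 16

func stat(f func()) int {
	return int(testing.AllocsPerRun(10, f))
}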
func f() {
x := str + str // does not escape
y := []byte(x) // does not escape
println(len(y), cap(y)) // 32 32
z := string(y) // does not escape
println(len(x), len(z)) // 32 32
}
func g() {
x := str + str + "x" // does not escape
y := []byte(x) // does not escape
println(len(y), cap(y)) // 33 48
z := string(y) // does not escape
println(len(x), len(z)) // 33 33
}
func main() {
println(stat(f)) // 0
println(stat(g)) // 3
}
In the following benchmark code, the concat_splited way is more performant than the normal
concat way, because the conversions in the former way don’t make heap allocations.
package bytes
import "testing"
package bytes
import "testing"
import "strings"
var s1 = strings.Repeat("a", M)
var s2 = strings.Repeat("a", N)
var s3 = strings.Repeat("a", K)
var r1, r2 string
func init() {
println("======", M, N, K)
}
//go:noinline
func Concat_WithPlus(a, b, c string) string {
return a + b + c
}
//go:noinline
func Concat_WithBuilder(ss ...string) string {
var b strings.Builder
var n = 0
for _, s := range ss {
n += len(s)
}
b.Grow(n)
for _, s := range ss {
b.WriteString(s)
}
return b.String()
}
Benchmark_Concat_WithBuilder-4 196.8 ns/op
So if it is possible to do the whole concatenation in one statement, generally, we should use the +
operator to concatenate strings for its simplicity and great performance.
import "testing"
//go:noinline
func Concat_WithPlus(a, b, c, d string) string {
return a + b + c + d
}
//go:noinline
func Concat_WithBytes(ss ...string) string {
var n = 0
for _, s := range ss {
n += len(s)
}
var bs []byte
if n > 64 {
bs = make([]byte, 0, n) // escapes to heap
} else {
bs = make([]byte, 0, 64) // does not escape
}
for _, s := range ss {
bs = append(bs, s...)
}
return string(bs)
}
9.3 Merge a string and a byte slice into a new byte slice
Sometimes, we need to merge a string (str) and a byte slice (bs) into a new byte slice. There are two ways to achieve this goal.
Way 1 (the one-line way):
var newByteSlice = append([]byte(str), bs...)
Way 2 (the verbose way):
var newByteSlice = make([]byte, len(str) + len(bs))
copy(newByteSlice, str)
copy(newByteSlice[len(str):], bs)
Generally, if the length of the string is much larger than that of the byte slice, then the verbose way is more performant. On the contrary, if the length of the byte slice is much larger than that of the string, then the one-line way is more performant.
Sadly, currently (Go toolchain 1.22 versions), there is no extremely performant way to do the merge.
• In the one-line way, the conversion []byte(str) will duplicate the underlying bytes of the string, which is unnecessary.
• In the verbose way, the elements within newByteSlice[len(str):] will be unnecessarily zeroed in the make call.
As the strings.Compare function is not very efficient now, some people use comparison operators to do the job instead. But please note that, if the lengths of the comparison operands are often not equal, then we should not handle the "equal" cases in the default branch. For example, in the following code, the function f3 is often less performant than the other two functions. The reason is that the comparison x == y is much faster than x < y and x > y when the lengths of the comparison operands are not equal.
func f1(x, y string) {
switch {
case x == y: // ... handle 1
case x < y: // ... handle 2
default: // ... handle 3
}
}
func f(a, b, c string) {
abc := a + b + c
ab := abc[:len(abc)-len(c)]
...
}
import "testing"
var ma = make(map[[2]string]struct{})
var ms = make(map[string]struct{})
The benchmark result:
Benchmark_array_key-4 147.0 ns/op 0 B/op 0 allocs/op
Benchmark_string_key-4 507.9 ns/op 40 B/op 3 allocs/op
We could also use struct values as the map keys, which should be as performant as using array keys.
The third example shows the performance difference between two ways of comparing strings case-insensitively.
package bytes
import "testing"
import "strings"
var ss = []string {
"AbcDefghijklmnOpQrStUvwxYz1234567890",
"abcDefghijklmnopQRSTuvwXYZ1234567890",
"aBcDefgHIjklMNOPQRSTuvwxyz1234567890",
}
import "testing"
import "io"
w.Write([]byte(s))
}
}
The benchmark results:
Benchmark_BytesWriter-4 21.03 ns/op 0 B/op 0 allocs/op
Benchmark_GeneralWriter-4 390.3 ns/op 512 B/op 1 allocs/op
From the benchmark results, we could find that the BytesWriter way is much more performant than the general io.Writer way, because the former doesn't allocate (except for the single buffer allocation).
Please note, there is a type in the standard bufio package, bufio.Writer, which acts like the BytesWriter type. Generally, we should use that type instead.
Chapter 10
10.1 Example 1
A simple example:
// example1.go
package main
func f1c(a [5]int) {
_ = a[0]
_ = a[4]
}
func main() {}
Let’s run it with the -d=ssa/check_bce compiler option:
$ go run -gcflags="-d=ssa/check_bce" example1.go
./example1.go:5:7: Found IsInBounds
./example1.go:12:3: Found IsInBounds
The outputs show that only two code lines need bound checks in the above example code.
Note that:
• Go toolchains with versions smaller than 1.19 fail to remove the unnecessary bound check in the f1e function.
• Go toolchains with versions smaller than 1.21 fail to remove the unnecessary bound check in the f1g function.
And note that, up to now (Go toolchain v1.22.n), the official standard compiler doesn’t check BCE
for an operation in a generic function if the operation involves type parameters and the generic func-
tion is never instantiated. For example, the command go run -gcflags=-d=ssa/check_bce
bar.go will report nothing.
// bar.go
package bar
// var _ = foo[bool]
However, if the variable declaration line is enabled, then the compiler will report:
./bar.go:5:7: Found IsInBounds
./bar.go:6:7: Found IsInBounds
./bar.go:7:7: Found IsInBounds
./bar.go:4:6: Found IsInBounds
10.2 Example 2
All the bound checks in the slice element indexing and subslice operations shown in the following
example are eliminated.
// example2.go
package main
}
func main() {}
Run it and we will find that nothing is output. Yes, the official standard Go compiler is so clever that it finds that all bound checks may be removed in the above example code.
$ go run -gcflags="-d=ssa/check_bce" example2.go
There are still some small imperfections. If we modify the f2g and f2h functions as shown in the following code, then the compiler (v1.22.n) fails to remove the bound checks for the two subslice operations.
// example2b.go
package main
func main() {}
Run it and we will get the following output:
$ go run -gcflags="-d=ssa/check_bce" example2b.go
./example2b.go:7:8: Found IsSliceInBounds
./example2b.go:14:8: Found IsSliceInBounds
We may give the compiler some hints by turning on the comment lines to remove these bound
checks.
10.3 Example 3
We should try to evaluate the element indexing or subslice operation with the largest index as early as possible to reduce the number of bound checks.
In the following example, if the expression s[3] is evaluated without panicking, then the bound
checks for s[0], s[1] and s[2] could be eliminated.
// example3.go
package main
_ = s[3] // Found IsInBounds
}
func main() {
}
Run it, we get:
./example3.go:5:10: Found IsInBounds
./example3.go:6:4: Found IsInBounds
./example3.go:7:4: Found IsInBounds
./example3.go:8:4: Found IsInBounds
./example3.go:12:10: Found IsInBounds
From the output, we could learn that there are 4 bound checks in the f3a function, but only one in
the f3b function.
10.4 Example 4
Since Go toolchain v1.19, the bound check in the f5a function is successfully removed:
func f5a(isa []int, isb []int) {
if len(isa) > 0xFFF {
for _, n := range isb {
_ = isa[n & 0xFFF]
}
}
}
However, before Go toolchain v1.19, the check is not removed. Compilers before version 1.19 need a hint to remove it, as shown in the f5b function:
func f5b(isa []int, isb []int) {
if len(isa) > 0xFFF {
// A successful hint (for v1.18- compilers)
isa = isa[:0xFFF+1]
for _, n := range isb {
_ = isa[n & 0xFFF] // BCEed!
}
}
}
_ = isa[n & 0xFFF] // Found IsInBounds
}
}
}
func f4d(is []int, bs []byte) {
if len(is) >= 256 {
_ = is[255] // a non-workable hint
for _, n := range bs {
_ = is[n] // Found IsInBounds
}
}
}
Please note that, as of Go toolchain v1.22.n, the two hints used in the f4c and f4d functions are not workable (but they should be).
In the following example, by adding a redundant if code block in the function NumSameBytes_2,
all bound checks in the loop are eliminated.
type T = string
// a successful hint
if len(x) > len(y) {
panic("unreachable")
}
x, y = y, x
}
_ = s[1]
_ = s[0]
}
However, please note that there might be some other factors which affect program performance. On my machine (Intel i5-4210U CPU @ 1.70GHz, Linux/amd64), among the above 3 functions, the function f7b is actually the least performant one.
Benchmark_f7a-4 3861 ns/op
Benchmark_f7b-4 4223 ns/op
Benchmark_f7c-4 3477 ns/op
In practice, it is encouraged to use the three-index subslice form (f7c).
In the following example, benchmark results show that
• the f8z function is the most performant one (in line with expectation)
• but the f8y function is as performant as the f8x function (unexpected).
func f8x(s []byte) {
var n = len(s)
s = s[:n]
for i := 0; i <= n - 4; i += 4 {
_ = s[i+3] // Found IsInBounds
_ = s[i+2] // Found IsInBounds
_ = s[i+1] // Found IsInBounds
_ = s[i]
}
}
s2 := s[i:i+4] // Found IsInBounds
_ = s2[3]
_ = s2[2]
_ = s2[1]
_ = s2[0]
}
}
func f0a(x [16]byte) (r [4]byte){
for i := 0; i < 4; i++ {
r[i] =
x[i*4+3] ^ // Found IsInBounds
x[i*4+2] ^ // Found IsInBounds
x[i*4+1] ^ // Found IsInBounds
x[i*4] // Found IsInBounds
}
return
}
func fa0() {
for i := range s {
s[i] = i // Found IsInBounds
}
}
func fa1() {
s := s
for i := range s {
s[i] = i
}
}
func fb2() int {
return a[100]
}
Chapter 11
Maps
In Go, the capacity of a map is unlimited in theory; it is only limited by available memory. That is why the built-in cap function doesn't apply to maps.
In the official standard Go runtime implementation, maps are implemented as hashtables internally. Each map/hashtable maintains a backing array to store map entries (key-value pairs). As more and more entries are put into a map, the backing array might become too small to store more entries, in which case a new larger backing array will be allocated and the current entries (in the old backing array) will be moved to it; then the old backing array will be discarded.
In the official standard Go runtime implementation, the backing array of a map never shrinks, even if all entries are deleted from the map. This is a form of memory wasting. But in practice, this is seldom a problem and is actually often good for program performance.
11.2 aMap[key]++ is more efficient than aMap[key] = aMap[key]
+ 1
In the statement aMap[key] = aMap[key] + 1, the key is hashed twice, but in the statement aMap[key]++, it is only hashed once.
Similarly, aMap[key] += value is more efficient than aMap[key] = aMap[key] + value.
These could be proved by the following benchmark code:
package maps
import "testing"
var m = map[int]int{}
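The benchmark function bodies are omitted in this excerpt; presumably they look like the following sketch (the key 99 is an arbitrary choice):

func Benchmark_increment(b *testing.B) {
	for i := 0; i < b.N; i++ {
		m[99]++ // the key is hashed once
	}
}

func Benchmark_addition(b *testing.B) {
	for i := 0; i < b.N; i++ {
		m[99] = m[99] + 1 // the key is hashed twice
	}
}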
If we can make sure that the string values used in the entries of a map have a max length and the
max length is small, then we could use the array type [N]byte to replace the string types (where N
is the max string length). Doing this will save much garbage collection scanning time if the number
of the entries in the map is very large.
For example, in the following code, the entries of mapB contain no pointers, but the (string) keys of mapA contain pointers. So the garbage collector will skip mapB during the scan phase of a GC cycle.
var mapA = make(map[string]int, 1 << 16)
var mapB = make(map[[32]byte]int, 1 << 16)
And please note that, the official standard compiler makes special optimizations on hashing map
keys whose sizes are 4 or 8 bytes. So, from the point of view of saving CPU, it is better to
use map[[8]byte]V instead of map[[5]byte]V, and it is better to use map[int32]V instead of
map[int16]V.
import "testing"
}
}
}
11.6 Try to grow a map in one step
If we can predict, at coding time, the maximum number of entries that will be put into a map, we should create the map with the make function and pass the maximum number as the size argument of the make call, to avoid growing the map in multiple steps later. A minimal sketch follows.
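For example (the keys variable here is an assumption):

// Allocate enough capacity for all predicted entries in one step.
m := make(map[string]int, len(keys))
for i, k := range keys {
	m[k] = i
}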
11.7 Use index tables instead of maps which key types have only
a small set of possible values
Some programmers like to use a map with a bool key type to reduce verbose if-else code blocks.
For example, the following code
// Within a function ...
var condition bool
condition = evaluateCondition()
...
if condition {
counter++
} else {
counter--
}
...
if condition {
f()
} else {
g()
}
...
could be replaced with
// Package-level maps.
var boolToInt = map[bool]int{true: 1, false: 0}
var boolToFunc = map[bool]func(){true: f, false: g}
import "testing"
//go:noinline
func f() {}
//go:noinline
func g() {}
func Benchmark_BoolMap(b *testing.B) {
for i := 0; i < b.N; i++ {
boolMap[b2i(true)]()
boolMap[b2i(false)]()
}
}
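The tiny b2i function and the boolMap index table are omitted in this excerpt; presumably they look like the following sketch:

func b2i(b bool) (r int) {
	if b {
		r = 1
	}
	return
}

// The index table: entry 0 is for false, entry 1 is for true.
var boolMap = [2]func(){g, f}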
From the above code, we could find that the index table way is almost as clean as the map-switch way, though an extra tiny b2i function is needed. And from the following benchmark results, we know that the index table way is as performant as the if-else block way.
Benchmark_IfElse-4 4.155 ns/op
Benchmark_MapSwitch-4 47.46 ns/op
Benchmark_BoolMap-4 4.135 ns/op
Chapter 12
Channels
import (
"sync"
"sync/atomic"
"testing"
)
var g int32
var m sync.Mutex
func Benchmark_Mutex(b *testing.B) {
for i := 0; i < b.N; i++ {
m.Lock()
g++
m.Unlock()
}
}
import "testing"
<-ch2
}
}
}
The benchmark results:
Benchmark_Select_OneCase-4 58.90 ns/op
Benchmark_Select_TwoCases-4 115.3 ns/op
So we should try to limit the number of case branches within a select code block.
The official standard Go compiler treats a select code block with only one case branch (and without a default branch) as a simple general channel operation.
For some cases, we could merge multiple channels into one, to avoid the performance loss of executing multi-case select code blocks. We could use an interface type or a struct type as the channel element type to achieve this goal. If the channel element type is an interface, then we can use a type switch to distinguish message kinds. If the channel element type is a struct, then we can check which field is set to distinguish message kinds. The following benchmark code shows the performance differences between these ways (a sketch of the struct-element-type approach is given after the benchmark code).
package channels
import "testing"
var vx int
var vy string
}
}
}
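As an illustration of merging message kinds by using a struct channel element type, here is a minimal sketch (the type and field names are assumptions):

type message struct {
	x *int    // set for x-kind messages
	y *string // set for y-kind messages
}

var ch = make(chan message, 128)

func consume() {
	for m := range ch {
		switch {
		case m.x != nil:
			vx = *m.x // handle an x-kind message
		case m.y != nil:
			vy = *m.y // handle a y-kind message
		}
	}
}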
import "testing"
for i := 0; i < b.N; i++ {
select {
case c <- struct{}{}:
default:
}
}
}
The benchmark results:
Benchmark_TryReceive-4 5.646 ns/op
Benchmark_TrySend-4 5.293 ns/op
From the above results and the results shown in the first section of the current chapter, we could conclude that a try-send or try-receive code block is much less CPU consuming than a normal channel send or receive operation.
Chapter 13
Functions
var c = a*a - b*b + 2 * (a - b)
var d = b*b - a*a + 2 * (b - a)
return c*c + d*d
}
After the flattening, some stack operations that originally happen when calling the bar functions are saved, so that code execution performance gets improved.
Inlining makes generated Go binaries larger, so compilers only inline calls to small functions.
func main() {
println(sumSquares(5))
}
Besides the above rules, for various reasons, currently (v1.22.n), the official standard Go compiler
never inlines functions containing:
• built-in recover function calls
• defer calls
• go calls
For example, in the following code, the official standard Go compiler (v1.22.n) thinks all of the fN
functions are inline-able but none of the gN functions are.
func f1(s []int) int {
return cap(s) - len(s)
}
func g1(s []int) int {
recover()
return cap(s) - len(s)
}
}
}
func main() {
println(plusSquare(5))
}
13.1.3 The go:noinline comment directive
Sometimes, we might want calls to a function to never get inlined, for study and testing purposes,
or to make a caller function of the function inline-able (see below for an example), etc. Besides the
several ways introduced above, we could also use the go:noinline comment directive to achieve
this goal. For example, the compiler will not inline the call to the add function in the following
code, even if the add function is very simple.
package main
//go:noinline
func add(x, y int) int {
return x + y
}
func main() {
println(add(1, 2))
}
However, please note that this is not a formal way to avoid inlining. It is mainly intended to be used in standard package and Go toolchain development. But personally, I think this directive will be supported for a long time.
13.1.4 Write code in ways which are less inline costly
Generally, we should try to make more functions inline-able, to get better program execution performance.
Besides the rules introduced above, we should know that different code implementation ways might have different inline costs, even if the code differences are subtle. We could make use of this fact to try different implementation ways to find out which one has the lowest inline cost.
Let’s use the first example shown above again.
// inline2.go
package inline
return c*c + d*d
}
Build it with double -m options:
$ go build -gcflags="-m -m" inline2.go
...
./inline2.go:8:6: cannot inline foo: function too complex: cost 96 exceeds budget 80
...
./inline2.go:16:6: can inline foo2 with cost 76 as: ...
From the outputs, we could learn that although the compiler thinks the foo function is not inline-able, it thinks the manually-flattened version (the foo2 function) is inline-able, because the inline cost of the foo2 function calculated by the compiler is 76, which doesn't exceed the inline threshold (80). Yes, manual inlining is often less costly than compiler auto-inlining. And, in practice, manually inlined code is indeed often comparatively more performant (see below for an example).
Another example:
// sum.go
package inline
• local variable declarations contribute to inline costs.
• bare return statements are less inline costly than non-bare return statements.
(Please note that code inline costs don't reflect code execution costs. In fact, the official standard Go compiler generates identical assembly instructions for the above sumN functions.)
Note, since v1.18, the official standard Go compiler thinks the inline cost of a for-range loop is smaller than that of a plain for loop. For example, the compiler thinks the inline cost of the following sum4 function is 11, which is much smaller than those of the above plain for loops.
func sum4(s []int) (r int) {
for i := range s {
r += s[i]
}
return
}
The third example:
// branches.go
package inline
# command-line-arguments
./branches.go:4:6: can inline foo with cost 18 as: ...
./branches.go:12:6: can inline foo2 with cost 14 as: ...
./branches.go:20:6: can inline bar with cost 12 as: ...
./branches.go:27:6: can inline bar2 with cost 11 as: ...
The 4th example:
// funcvalue.go
package inline
For example, the concat function in the following code is not inline-able, for its inline cost is 85
(larger than the threshold 80).
func concat(bss ...[]byte) []byte {
n := len(bss)
if n == 0 {
return nil
} else if n == 1 {
return bss[0]
} else if n == 2 {
return append(bss[0], bss[1]...)
}
var m = 0
for i := 0; i < len(bss); i++ {
m += len(bss[i])
}
var r = make([]byte, 0, m)
for i := 0; i < len(bss); i++ {
r = append(r, bss[i]...)
}
return r
}
If, in practice, most cases are concatenating two byte slices, then we could rewrite the above code as shown below. Now the inline cost of the concat function becomes 74, so it is inline-able. That means the hot path will always be inlined.
func concat(bss ...[]byte) []byte {
if len(bss) == 2 {
return append(bss[0], bss[1]...)
}
return concatSlow(bss...)
}
//go:noinline
func concatSlow(bss ...[]byte) []byte {
if len(bss) == 0 {
return nil
} else if len(bss) == 1 {
return bss[0]
}
var m = 0
for i := 0; i < len(bss); i++ {
m += len(bss[i])
}
var r = make([]byte, 0, m)
for i := 0; i < len(bss); i++ {
r = append(r, bss[i]...)
}
return r
}
If the inline cost of the function wrapping the slow code path doesn't exceed the inline threshold, then we should use the avoid-being-inlined ways introduced above to prevent that function from being inline-able. Otherwise, the rewritten concat function is still not inline-able, for the wrapped part would be automatically flattened back into the rewritten function. That is why the go:noinline comment directive is put before the concatSlow function.
Please note that, currently (Go toolchain v1.22.n), the inline cost of a non-inlined function call is
59. That means a function is not inline-able if it contains 2+ non-inlined calls.
And please note that, since Go toolchain v1.18, if we replace the two plain for loops within the
original concat function with two for-range loops, then the original function will become inline-
able already. Here, for demo purpose, we use two plain for loops.
import "testing"
const N = 100
//====================
The official standard Go compiler might be improved in future versions so that automatic inlining will become smarter.
import "testing"
type T [1<<8]byte
var r, s T
//go:noinline
func not_inline_able(x1, y1 *T) {
x, y := x1[:], y1[:]
for k := 0; k < len(T{}); k++ {
x[k] = y[k]
}
}
Benchmark_not_inlined-4 127.9 ns/op
Benchmark_auto_inlined-4 196.4 ns/op
Benchmark_manual_inlined-4 196.4 ns/op
The implementation flaw (in the official standard Go compiler v1.22.n) is present when the manipulated values are global (package-level) arrays.
Future official standard compiler versions might fix the flaw.
import "testing"
//go:noinline
func Add5_TT_T(x, y T5) (z T5) {
z.a = x.a + y.a
z.b = x.b + y.b
z.c = x.c + y.c
z.d = x.d + y.d
z.e = x.e + y.e
return
}
//go:noinline
func Add5_PPP(z, x, y *T5) {
z.a = x.a + y.a
z.b = x.b + y.b
z.c = x.c + y.c
z.d = x.d + y.d
z.e = x.e + y.e
}
t5 = z
}
}
import "testing"
//go:noinline
func Add4_TT_T(x, y T4) (z T4) {
z.a = x.a + y.a
z.b = x.b + y.b
z.c = x.c + y.c
z.d = x.d + y.d
return
}
//go:noinline
func Add4_PPP(z, x, y *T4) {
z.a = x.a + y.a
z.b = x.b + y.b
z.c = x.c + y.c
z.d = x.d + y.d
}
func Benchmark_Add4_PPP(b *testing.B) {
for i := 0; i < b.N; i++ {
var x, y, z T4
Add4_PPP(&z, &x, &y)
t4 = z
}
}
The new benchmark results:
Benchmark_Add4_TT_T-4 2.716 ns/op
Benchmark_Add4_PPP-4 9.006 ns/op
import "testing"
const N = 1<<12
var buf = make([]byte, N)
var r [128][N]byte
Benchmark_ConvertToArray_Unnamed-4 332.9 ns/op
From the results, we could find that the function with a named result performs slower. It looks like this is a problem related to code inlining. If the two if b == nil {...} lines are enabled (to prevent the calls to the two functions from being inlined), then there is no performance difference between the two functions. Future compiler versions might remove the performance difference when the two functions are both inline-able.
The following two CopyToArray implementations show the opposite result: the one with anonymous results is slower than the one with named results, whether or not the two functions are inlined.
package functions
import "testing"
const N = 1<<12
var buf = make([]byte, N)
var r [128][N]byte
13.4 Try to store intermediate calculation results in local vari-
ables with sizes not larger than a native word
Storing intermediate calculation results in local variables no larger than a native word can signifi-
cantly improve performance due to their higher chance of being allocated to registers.
An example:
package functions
import "testing"
h(s)
}
}
The benchmark results:
Benchmark_f-4 2802 ns/op
Benchmark_g-4 555.5 ns/op
Benchmark_h-4 2730 ns/op
import "testing"
var n int
func inc() {
n++
}
}
}
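The f and g functions are only partially shown above; presumably they are along these lines (a sketch; the loop count is an assumption):

func f() {
	for i := 0; i < 1000; i++ {
		defer inc() // directly enclosed in a loop: not optimized
	}
}

func g() {
	for i := 0; i < 1000; i++ {
		func() {
			defer inc() // not directly enclosed in the outer loop
		}()
	}
}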
The benchmark results:
Benchmark_f-4 33232 ns/op 2 B/op 0 allocs/op
Benchmark_g-4 5237 ns/op 0 B/op 0 allocs/op
The reason why the function g is much more performant than the function f is that deferred calls
which are not directly in loops are specially optimized by the official standard Go compiler. The
function g wraps the code in the loop into an anonymous function call so that the deferred call is
not directly enclosed in the loop.
Please note that, the two functions are not equivalent to each other in logic. If this is a problem,
then the anonymous function call trick should not be used.
import (
"log"
"testing"
)
func main() {
stat := func(f func()) int {
allocs := testing.AllocsPerRun(10, f)
return int(allocs)
}
var h, w = "hello ", "world!"
var n = stat(func(){
debugPrint(h + w)
})
println(n) // 1
}
One way to avoid the unnecessary argument evaluation is to change the debugOn value into a constant. But some programs might need to change the value on the fly, so this way is not always feasible. We could let the debugPrint function return a bool result and call the function in a boolean-and operation, as the following code shows:
package main
import (
"log"
"testing"
)
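// Note: these declarations are omitted in this excerpt;
// presumably something like:
var debugOn = false

var h, w = "hello ", "world!"

// debugPrint returns a bool result so that it can be used as
// the right operand of a && operation.
func debugPrint(s string) bool {
	log.Println(s)
	return true
}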
func main() {
stat := func(f func()) int {
allocs := testing.AllocsPerRun(10, f)
return int(allocs)
}
var n = stat(func(){
_ = debugOn && debugPrint(h + w)
})
println(n) // 0
}
In the above code, the string concatenation expression h + w is not evaluated, because debugOn is false and thus the debugPrint function is not invoked at all.
13.8 Try to make less values escape to heap in the hot paths
Assuming most calls to the function f shown in the following code return from the if code block (most arguments are in the range [0, 9]), the implementation of the function f is not very efficient, because the argument x will escape to heap.
package main
import "strconv"
return g(&x)
}
//go:noinline
func escape(x interface{}) {
sink = x
sink = nil
}
func main() {
var a = f(100)
println(a)
}
By making use of the trick introduced in the stack and escape analysis article, we could rewrite the function f as the following code shows, to prevent the argument x from escaping to heap.
func f(x int) string {
if x >= 0 && x < 10 {
return "0123456789"[x:x+1]
}
x2 := x // x2 escapes to heap
return g(&x2)
}
Chapter 14
Interfaces
import "testing"
var r interface{}
}
var s = "Go"
func Benchmark_BoxString(b *testing.B) {
for i := 0; i < b.N; i++ { r = s }
}
var x = []int{1, 2, 3}
func Benchmark_BoxSlice(b *testing.B) {
for i := 0; i < b.N; i++ { r = x }
}
var a = [100]int{}
func Benchmark_BoxArray(b *testing.B) {
for i := 0; i < b.N; i++ { r = a }
}
The benchmark results:
Benchmark_BoxInt16-4 16.62 ns/op 2 B/op 1 allocs/op
Benchmark_BoxInt32-4 16.36 ns/op 4 B/op 1 allocs/op
Benchmark_BoxInt64-4 19.51 ns/op 8 B/op 1 allocs/op
Benchmark_BoxFloat64-4 20.05 ns/op 8 B/op 1 allocs/op
Benchmark_BoxString-4 56.57 ns/op 16 B/op 1 allocs/op
Benchmark_BoxSlice-4 48.89 ns/op 24 B/op 1 allocs/op
Benchmark_BoxArray-4 247.2 ns/op 896 B/op 1 allocs/op
From the above benchmark results, we could get that each value boxing operation generally needs
one allocation, and the size of the allocated memory block is the same as the size of the boxed value.
The official standard Go compiler makes some optimizations so that the general rule mentioned
above is not always obeyed. One optimization made by the official standard Go compiler is that no
allocations are made when boxing zero-size values, boolean values and 8-bit integer values.
package interfaces
import "testing"
var r interface{}
var v0 struct{}
func Benchmark_BoxZeroSize1(b *testing.B) {
for i := 0; i < b.N; i++ { r = v0 }
}
var a0 [0]int64
func Benchmark_BoxZeroSize2(b *testing.B) {
for i := 0; i < b.N; i++ { r = a0 }
}
var b bool
// (the parameter is named tb here so that the global bool b is
// boxed, rather than the *testing.B parameter shadowing it)
func Benchmark_BoxBool(tb *testing.B) {
for i := 0; i < tb.N; i++ { r = b }
}
var n int8 = -100
func Benchmark_BoxInt8(b *testing.B) {
for i := 0; i < b.N; i++ { r = n }
}
The benchmark results:
Benchmark_BoxZeroSize1-4 1.133 ns/op 0 B/op 0 allocs/op
Benchmark_BoxZeroSize2-4 1.180 ns/op 0 B/op 0 allocs/op
Benchmark_BoxBool-4 1.180 ns/op 0 B/op 0 allocs/op
Benchmark_BoxInt8-4 1.822 ns/op 0 B/op 0 allocs/op
From the results, we could get that boxing zero-size values, boolean values and 8-bit integer values
doesn’t make memory allocations, which is one reason why such boxing operations are much faster.
Another optimization made by the official standard Go compiler is that no allocations are made
when boxing pointer values into interfaces. Thus, boxing pointer values is often much faster than
boxing non-pointer values.
The official standard Go compiler represents (the direct parts of) maps, channels and functions as pointers internally, so boxing such values is also as fast as boxing pointers.
This could be proved by the following code:
package interfaces
import "testing"
var r interface{}
var p = new([100]int)
func Benchmark_BoxPointer(b *testing.B) {
for i := 0; i < b.N; i++ { r = p }
}
Benchmark_BoxFunction-4 1.126 ns/op 0 B/op 0 allocs/op
From the above results, we could see that boxing pointer values is very fast and doesn't make memory allocations. This explains why declaring a method for *T is often more efficient than for T if we intend to let the method implement an interface method.
By making use of this optimization, for some use cases, we could use a lookup table to convert some non-pointer values in a small set into pointer values. For example, in the following code, we use an array to convert uint16 values into pointers to get much lower value boxing costs.
package interfaces
import "testing"
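// Note: the declaration of the lookup table is omitted in this
// excerpt; presumably it is an array like:
var values [65536]uint16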
func init() {
for i := range values {
values[i] = uint16(i)
}
}
var r interface{}
Benchmark_Box_Lookup-4 1.137 ns/op 0 B/op 0 allocs/op
Benchmark_Unbox_Normal-4 0.7535 ns/op 0 B/op 0 allocs/op
Benchmark_Unbox_Lookup-4 0.7510 ns/op 0 B/op 0 allocs/op
(Please note that the results show the Box_Normal function makes zero allocations, which is not true. The actual value is about 0.99 allocations per operation, which gets truncated to zero. The reason is that boxing values within [0, 255] doesn't allocate, which will be mentioned below.)
The same optimization is also applied to boxing constant values:
package interfaces
import "testing"
var r interface{}
const S = "Go"
func Benchmark_BoxConstString(b *testing.B) {
for i := 0; i < b.N; i++ { r = S }
}
The benchmark results:
Benchmark_BoxInt64-4 1.136 ns/op 0 B/op 0 allocs/op
Benchmark_BoxFloat64-4 1.196 ns/op 0 B/op 0 allocs/op
Benchmark_BoxConstString-4 1.211 ns/op 0 B/op 0 allocs/op
In fact, the official standard Go compiler also makes similar optimizations for boxing the following values:
• non-constant small integer values (in the range [0, 255]) of any integer type (except for 8-bit ones, which have been covered in the first optimization mentioned above).
• non-constant zero values of floating-point/string/slice types.
Box non-constant small integer values:
package interfaces
import "testing"
var r interface{}
func Benchmark_BoxSmallInt32(b *testing.B) {
for i := 0; i < b.N; i++ { r = int32(i&255) }
}
import "testing"
var r interface{}
import "testing"
var r interface{}
• Boxing a non-constant not-small integer value (out of the range [0, 255]) or a non-zero floating-point value is about (or more than) 20 times slower than boxing a pointer value.
• Boxing non-nil slices or non-blank non-constant string values is about (or more than) 50 times slower than boxing a pointer value.
• Boxing a struct (array) value with only one field (element) which is a small integer or a zero bool/numeric/string/slice/pointer value is as fast as boxing that field (element).
So, if value boxing operations are made frequently on the hot paths of code execution, it is recom-
mended to box values with small boxing costs.
import "testing"
var v = 9999999
var x, y interface{}
it is cheaper to box a value once and reuse the boxed result as multiple arguments. For example, in the following code, the second fmt.Fprint call is more performant than the first one, because it saves two allocations.
package main
import (
"fmt"
"io"
"testing"
)
func main() {
stat := func(f func()) int {
allocs := testing.AllocsPerRun(100, f)
return int(allocs)
}
var x = "aaa"
var n = stat(func(){
// 3 allocations
fmt.Fprint(io.Discard, x, x, x)
})
println(n) // 3
var m = stat(func(){
var i interface{} = x // 1 allocation
// No allocations
fmt.Fprint(io.Discard, i, i, i)
})
println(m) // 1
}
import "testing"
//go:noinline
func (a Add) Do_NotInlined(x, y float64) float64 {
return x+y
}
For example, in the standard image package, there are many At(x, y int) color.Color
and Set(x, y int, c color.Color) methods, which are declared to implement the
image/draw.Image interface. The type color.Color is an interface type:
type Color interface {
RGBA() (r, g, b, a uint32)
}
Calling these At and Set methods causes short-lived memory allocations (for boxing non-interface values), and calling the Color.RGBA interface method consumes a bit of extra CPU resources (an interface method call needs a virtual table lookup and is not inline-able). These methods are very likely called massively in image processing applications, which leads to very bad code execution efficiency. To alleviate the problem, Go 1.17 introduced a new interface type, image/draw.RGBA64Image:
type RGBA64Image interface {
image.Image
RGBA64At(x, y int) color.RGBA64
SetRGBA64(x, y int, c color.RGBA64)
}
The new added method RGBA64At returns a non-interface type color.RGBA64, and the new added
method SetRGBA64 accepts a color.RGBA64 argument.
type RGBA64 struct {
R, G, B, A uint16
}
By using the RGBA64At and SetRGBA64 methods, many memory allocations and CPU resources are saved, so that code execution efficiency is much improved.