It’s pretty common for Go developers to mix up slice length and capacity or not to understand them thoroughly. Assimilating these two concepts is essential for efficiently handling core operations such as slice initialization and adding elements with append, copying, or slicing. This misunderstanding can lead to using slices suboptimally or even to memory leaks.
+
In Go, a slice is backed by an array. That means the slice’s data is stored contiguously in an array data structure. A slice also handles the logic of adding an element if the backing array is full or shrinking the backing array if it’s almost empty.
+
Internally, a slice holds a pointer to the backing array plus a length and a capacity. The length is the number of elements the slice contains, whereas the capacity is the number of elements in the backing array, counting from the first element in the slice. Let’s go through a few examples to make things clearer. First, let’s initialize a slice with a given length and capacity:
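s := make([]int, 3, 6) // Three-length, six-capacity slice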
The first argument, representing the length, is mandatory. However, the second argument representing the capacity is optional. Figure 1 shows the result of this code in memory.
+
+
In this case, make creates an array of six elements (the capacity). But because the length was set to 3, Go initializes only the first three elements. Also, because the slice is an []int type, the first three elements are initialized to the zeroed value of an int: 0. The grayed elements are allocated but not yet used.
+
If we print this slice, we get the elements within the range of the length, [0 0 0]. If we set s[1] to 1, the second element of the slice updates without impacting its length or capacity. Figure 2 illustrates this.
+
+
However, accessing an element outside the length range is forbidden, even though it’s already allocated in memory. For example, s[4] = 0 would lead to the following panic:
+
panic: runtime error: index out of range [4] with length 3
+
+
How can we use the remaining space of the slice? By using the append built-in function:
+
s = append(s, 2)
+
+
This code appends a new element to the existing s slice. It uses the first grayed element (which was allocated but not yet used) to store element 2, as figure 3 shows.
+
+
The length of the slice is updated from 3 to 4 because the slice now contains four elements. Now, what happens if we add three more elements so that the backing array isn’t large enough?
+
s = append(s, 3, 4, 5)
fmt.Println(s)
+
+
If we run this code, we see that the slice was able to cope with our request:
+
[0 1 0 2 3 4 5]
+
+
Because an array is a fixed-size structure, it can store the new elements until element 4. When we want to insert element 5, the array is already full: Go internally creates another array by doubling the capacity, copying all the elements, and then inserting element 5. Figure 4 shows this process.
+
+
The slice now references the new backing array. What will happen to the previous backing array? If it’s no longer referenced, it’s eventually freed by the garbage collector (GC) if allocated on the heap. (We discuss heap memory in mistake #95, “Not understanding stack vs. heap,” and we look at how the GC works in mistake #99, “Not understanding how the GC works.”)
+
What happens with slicing? Slicing is an operation done on an array or a slice, providing a half-open range; the first index is included, whereas the second is excluded. The following example shows the impact, and figure 5 displays the result in memory:
+
s1 := make([]int, 3, 6) // Three-length, six-capacity slice
s2 := s1[1:3]           // Slicing from indices 1 to 3
+
+
+
First, s1 is created as a three-length, six-capacity slice. When s2 is created by slicing s1, both slices reference the same backing array. However, s2 starts from a different index, 1. Therefore, its length and capacity (a two-length, five-capacity slice) differ from s1. If we update s1[1] or s2[0], the change is made to the same array, hence, visible in both slices, as figure 6 shows.
+
+
Now, what happens if we append an element to s2? Does the following code change s1 as well?
+
s2 = append(s2, 2)
+
+
The shared backing array is modified, but only the length of s2 changes. Figure 7 shows the result of appending an element to s2.
+
+
s1 remains a three-length, six-capacity slice. Therefore, if we print s1 and s2, the added element is only visible for s2:
+
s1=[0 1 0], s2=[1 0 2]
+
+
It’s important to understand this behavior so that we don’t make wrong assumptions while using append.
+
Note
+
In these examples, the backing array is internal and not available directly to the Go developer. The only exception is when a slice is created from slicing an existing array.
+
+
One last thing to note: what if we keep appending elements to s2 until the backing array is full? What will the state be, memory-wise? Let’s add three more elements so that the backing array will not have enough capacity:
+
s2 = append(s2, 3)
s2 = append(s2, 4) // At this stage, the backing array is already full
s2 = append(s2, 5)
+
+
This code leads to creating another backing array. Figure 8 displays the results in memory.
+
+
s1 and s2 now reference two different arrays. As s1 is still a three-length, six-capacity slice, it still has some available buffer, so it keeps referencing the initial array. Also, the new backing array was made by copying the initial one from the first index of s2. That’s why the new array starts with element 1, not 0.
+
To summarize, the slice length is the number of available elements in the slice, whereas the slice capacity is the number of elements in the backing array. Adding an element to a full slice (length == capacity) leads to creating a new backing array with a new capacity, copying all the elements from the previous array, and updating the slice pointer to the new array.
When working with maps in Go, we need to understand some important characteristics of how a map grows and shrinks. Let’s delve into this to prevent issues that can cause memory leaks.
+
First, to view a concrete example of this problem, let’s design a scenario where we will work with the following map:
+
m := make(map[int][128]byte)
+
+
Each value of m is an array of 128 bytes. We will do the following:
+
+
Allocate an empty map.
+
Add 1 million elements.
+
Remove all the elements, and run a Garbage Collection (GC).
+
+
After each step, we want to print the size of the heap (using a printAlloc utility function). This shows us how this example behaves memory-wise:
+
func main() {
    n := 1_000_000
    m := make(map[int][128]byte)
    printAlloc()

    for i := 0; i < n; i++ { // Adds 1 million elements
        m[i] = [128]byte{}
    }
    printAlloc()

    for i := 0; i < n; i++ { // Deletes 1 million elements
        delete(m, i)
    }

    runtime.GC() // Triggers a manual GC
    printAlloc()
    runtime.KeepAlive(m) // Keeps a reference to m so that the map isn’t collected
}

func printAlloc() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    fmt.Printf("%d MB\n", m.Alloc/(1024*1024))
}
+
+
We allocate an empty map, add 1 million elements, remove 1 million elements, and then run a GC. We also make sure to keep a reference to the map using runtime.KeepAlive so that the map isn’t collected as well. Let’s run this example:
+
0 MB   <-- After m is allocated
461 MB <-- After we add 1 million elements
293 MB <-- After we remove 1 million elements
+
+
What can we observe? At first, the heap size is minimal. Then it grows significantly after we add 1 million elements to the map. But if we expected the heap size to decrease after removing all the elements, that isn’t how maps work in Go. In the end, even though the GC has collected all the elements, the heap size is still 293 MB. The memory shrank, but not as much as we might have expected. What’s the rationale? We need to delve into how a map works in Go.
+
A map provides an unordered collection of key-value pairs in which all the keys are distinct. In Go, a map is based on the hash table data structure: an array where each element is a pointer to a bucket of key-value pairs, as shown in figure 1.
+
+
Each bucket is a fixed-size array of eight elements. In the case of an insertion into a bucket that is already full (a bucket overflow), Go creates another bucket of eight elements and links the previous one to it. Figure 2 shows an example:
+
+
Under the hood, a Go map is a pointer to a runtime.hmap struct. This struct contains multiple fields, including a B field, giving the number of buckets in the map:
+
type hmap struct {
    B uint8 // log_2 of # of buckets
            // (can hold up to loadFactor * 2^B items)
    // ...
}
+
+
After adding 1 million elements, the value of B equals 18, which means 2¹⁸ = 262,144 buckets. When we remove 1 million elements, what’s the value of B? Still 18. Hence, the map still contains the same number of buckets.
+
The reason is that the number of buckets in a map cannot shrink. Therefore, removing elements from a map doesn’t impact the number of existing buckets; it just zeroes the slots in the buckets. A map can only grow and have more buckets; it never shrinks.
+
In the previous example, we went from 461 MB to 293 MB because the elements were collected, but running the GC didn’t impact the map itself. Even the number of extra buckets (the buckets created because of overflows) remains the same.
+
Let’s take a step back and discuss when the fact that a map cannot shrink becomes a problem. Imagine building a cache using a map[int][128]byte. This map holds, per customer ID (the int), a sequence of 128 bytes. Now, suppose we want to save the last 1,000 customers. The map size will remain constant, so we shouldn’t worry about the fact that a map cannot shrink.
+
However, let’s say we want to store one hour of data. Meanwhile, our company has decided to have a big promotion for Black Friday: in one hour, we may have millions of customers connected to our system. But a few days after Black Friday, our map will contain the same number of buckets as during the peak time. This explains why we can experience high memory consumption that doesn’t significantly decrease in such a scenario.
+
What are the solutions if we don’t want to manually restart our service to clean the amount of memory consumed by the map? One solution could be to re-create a copy of the current map at a regular pace. For example, every hour, we can build a new map, copy all the elements, and release the previous one. The main drawback of this option is that following the copy and until the next garbage collection, we may consume twice the current memory for a short period.
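As an illustration, here is a minimal sketch of that rebuild step (the rebuild helper and its signature are hypothetical, shown only to make the idea concrete):

// rebuild copies the current entries into a freshly allocated map so that
// the previous map, and its buckets, can eventually be garbage collected.
func rebuild(old map[int][128]byte) map[int][128]byte {
    m := make(map[int][128]byte, len(old))
    for k, v := range old {
        m[k] = v
    }
    return m
}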
+
Another solution would be to change the map type to store an array pointer: map[int]*[128]byte. It doesn’t solve the fact that we will have a significant number of buckets; however, each bucket entry will reserve the size of a pointer for the value instead of 128 bytes (8 bytes on 64-bit systems and 4 bytes on 32-bit systems).
+
Coming back to the original scenario, let’s compare the memory consumption for each map type following each step. The following table shows the comparison.
+
Step                                   map[int][128]byte   map[int]*[128]byte
Allocate an empty map                  0 MB                0 MB
Add 1 million elements                 461 MB              182 MB
Remove all the elements and run a GC   293 MB              38 MB
Note
+
If a key or a value is over 128 bytes, Go won’t store it directly in the map bucket. Instead, Go stores a pointer to reference the key or the value.
+
+
As we have seen, adding n elements to a map and then deleting them all means keeping the same number of buckets in memory. So we must remember that because a Go map can only grow in size, its memory consumption can only grow as well; there is no automated strategy to shrink it. If this leads to high memory consumption, we can try different options, such as forcing Go to re-create the map or using pointers for the values to check whether memory usage can be optimized.
Interfaces are one of the cornerstones of the Go language when designing and structuring our code. However, like many tools or concepts, abusing them is generally not a good idea. Interface pollution is about overwhelming our code with unnecessary abstractions, making it harder to understand. It’s a common mistake made by developers coming from another language with different habits. Before delving into the topic, let’s refresh our minds about Go’s interfaces. Then, we will see when it’s appropriate to use interfaces and when it may be considered pollution.
+
Concepts
+
An interface provides a way to specify the behavior of an object. We use interfaces to create common abstractions that multiple objects can implement. What makes Go interfaces so different is that they are satisfied implicitly. There is no explicit keyword like implements to mark that an object X implements interface Y.
+
To understand what makes interfaces so powerful, we will dig into two popular ones from the standard library: io.Reader and io.Writer. The io package provides abstractions for I/O primitives. Among these abstractions, io.Reader relates to reading data from a data source and io.Writer to writing data to a target, as represented in the next figure:
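io.Reader exposes a single method, Read:

type Reader interface {
    Read(p []byte) (n int, err error)
}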
Custom implementations of the io.Reader interface should accept a slice of bytes, filling it with its data and returning either the number of bytes read or an error.
+
On the other hand, io.Writer defines a single method, Write:
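type Writer interface {
    Write(p []byte) (n int, err error)
}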
Custom implementations of io.Writer should write the data coming from a slice to a target and return either the number of bytes written or an error. Therefore, both interfaces provide fundamental abstractions:
+
+
io.Reader reads data from a source
+
io.Writer writes data to a target
+
+
What is the rationale for having these two interfaces in the language? What is the point of creating these abstractions?
+
Let’s assume we need to implement a function that should copy the content of one file to another. We could create a specific function that would take as input two *os.File. Or, we can choose to create a more generic function using io.Reader and io.Writer abstractions:
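A minimal sketch of such a function could simply delegate to io.Copy (the body here is illustrative; what matters is the signature based on the two abstractions):

func copySourceToDest(source io.Reader, dest io.Writer) error {
    _, err := io.Copy(dest, source)
    return err
}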
This function would work with *os.File parameters (as *os.File implements both io.Reader and io.Writer) and any other type that would implement these interfaces. For example, we could create our own io.Writer that writes to a database, and the code would remain the same. It increases the genericity of the function; hence, its reusability.
+
Furthermore, writing a unit test for this function is easier because, instead of having to handle files, we can use the strings and bytes packages that provide helpful implementations:
+
func TestCopySourceToDest(t *testing.T) {
    const input = "foo"
    source := strings.NewReader(input)       // Creates an io.Reader
    dest := bytes.NewBuffer(make([]byte, 0)) // Creates an io.Writer

    err := copySourceToDest(source, dest) // Calls copySourceToDest from a *strings.Reader and a *bytes.Buffer
    if err != nil {
        t.FailNow()
    }

    got := dest.String()
    if got != input {
        t.Errorf("expected: %s, got: %s", input, got)
    }
}
+
+
In the example, source is a *strings.Reader, whereas dest is a *bytes.Buffer. Here, we test the behavior of copySourceToDest without creating any files.
+
While designing interfaces, the granularity (how many methods the interface contains) is also something to keep in mind. A known proverb in Go relates to how big an interface should be:
+
+
Rob Pike
+
The bigger the interface, the weaker the abstraction.
+
+
Indeed, adding methods to an interface can decrease its level of reusability. io.Reader and io.Writer are powerful abstractions because they cannot get any simpler. Furthermore, we can also combine fine-grained interfaces to create higher-level abstractions. This is the case with io.ReadWriter, which combines the reader and writer behaviors:
+
type ReadWriter interface {
    Reader
    Writer
}
+
+
Note
+
As Einstein said, “Everything should be made as simple as possible, but no simpler.” Applied to interfaces, this denotes that finding the perfect granularity for an interface isn’t necessarily a straightforward process.
+
+
Let’s now discuss common cases where interfaces are recommended.
+
When to use interfaces
+
When should we create interfaces in Go? Let’s look at three concrete use cases where interfaces are usually considered to bring value. Note that the goal isn’t to be exhaustive because the more cases we add, the more they would depend on the context. However, these three cases should give us a general idea:
+
+
Common behavior
+
Decoupling
+
Restricting behavior
+
+
Common behavior
+
The first option we will discuss is to use interfaces when multiple types implement a common behavior. In such a case, we can factor out the behavior inside an interface. If we look at the standard library, we can find many examples of such a use case. For example, sorting a collection can be factored out via three methods:
+
+
Retrieving the number of elements in the collection
+
Reporting whether one element must be sorted before another
+
Swapping two elements
+
+
Hence, the following interface was added to the sort package:
+
type Interface interface {
    Len() int           // Number of elements
    Less(i, j int) bool // Checks two elements
    Swap(i, j int)      // Swaps two elements
}
+
+
This interface has a strong potential for reusability because it encompasses the common behavior to sort any collection that is index-based.
+
Throughout the sort package, we can find dozens of implementations. If at some point we compute a collection of integers, for example, and we want to sort it, are we necessarily interested in the implementation type? Is it important whether the sorting algorithm is a merge sort or a quicksort? In many cases, we don’t care. Hence, the sorting behavior can be abstracted, and we can depend on the sort.Interface.
+
Finding the right abstraction to factor out a behavior can also bring many benefits. For example, the sort package provides utility functions that also rely on sort.Interface, such as checking whether a collection is already sorted. For instance:
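// sort.IsSorted (essentially as implemented in the standard library)
// depends only on the sort.Interface methods.
func IsSorted(data Interface) bool {
    n := data.Len()
    for i := n - 1; i > 0; i-- {
        if data.Less(i, i-1) {
            return false
        }
    }
    return true
}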
Because sort.Interface is the right level of abstraction, it’s highly valuable.
+
Let’s now see another main use case when using interfaces.
+
Decoupling
+
Another important use case is about decoupling our code from an implementation. If we rely on an abstraction instead of a concrete implementation, the implementation itself can be replaced with another without even having to change our code. This is the Liskov Substitution Principle (the L in Robert C. Martin’s SOLID design principles).
+
One benefit of decoupling can be related to unit testing. Let’s assume we want to implement a CreateNewCustomer method that creates a new customer and stores it. We decide to rely on the concrete implementation directly (let’s say a mysql.Store struct):
+
type CustomerService struct {
    store mysql.Store // Depends on the concrete implementation
}

func (cs CustomerService) CreateNewCustomer(id string) error {
    customer := Customer{id: id}
    return cs.store.StoreCustomer(customer)
}
+
+
Now, what if we want to test this method? Because customerService relies on the actual implementation to store a Customer, we are obliged to test it through integration tests, which requires spinning up a MySQL instance (unless we use an alternative technique such as go-sqlmock, but this isn’t the scope of this section). Although integration tests are helpful, that’s not always what we want to do. To give us more flexibility, we should decouple CustomerService from the actual implementation, which can be done via an interface like so:
+
type customerStorer interface { // Creates a storage abstraction
    StoreCustomer(Customer) error
}

type CustomerService struct {
    storer customerStorer // Decouples CustomerService from the actual implementation
}

func (cs CustomerService) CreateNewCustomer(id string) error {
    customer := Customer{id: id}
    return cs.storer.StoreCustomer(customer)
}
+
+
Because storing a customer is now done via an interface, this gives us more flexibility in how we want to test the method. For instance, we can:
+
+
Use the concrete implementation via integration tests
+
Use a mock (or any kind of test double) via unit tests
+
Or both
+
+
Let’s now discuss another use case: to restrict a behavior.
+
Restricting behavior
+
The last use case we will discuss can be pretty counterintuitive at first sight. It’s about restricting a type to a specific behavior. Let’s imagine we implement a custom configuration package to deal with dynamic configuration. We create a specific container for int configurations via an IntConfig struct that also exposes two methods: Get and Set. Here’s how that code would look:
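A minimal sketch of such a container could look like this (omitting concerns such as concurrent access):

type IntConfig struct {
    value int
}

func (c *IntConfig) Get() int {
    return c.value
}

func (c *IntConfig) Set(value int) {
    c.value = value
}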
Now, suppose we receive an IntConfig that holds some specific configuration, such as a threshold. Yet, in our code, we are only interested in retrieving the configuration value, and we want to prevent updating it. How can we enforce that, semantically, this configuration is read-only, if we don’t want to change our configuration package? By creating an abstraction that restricts the behavior to retrieving only a config value:
+
type intConfigGetter interface {
    Get() int
}
+
+
Then, in our code, we can rely on intConfigGetter instead of the concrete implementation:
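For example, a client struct could receive the getter through its factory method (the Foo, NewFoo, and Bar names follow the description below; the threshold field is illustrative):

type Foo struct {
    threshold intConfigGetter
}

func NewFoo(threshold intConfigGetter) Foo { // Injects the configuration getter
    return Foo{threshold: threshold}
}

func (f Foo) Bar() {
    threshold := f.threshold.Get() // Reads the configuration value
    // ...
    _ = threshold
}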
In this example, the configuration getter is injected into the NewFoo factory method. It doesn’t impact a client of this function because it can still pass an IntConfig struct as it implements intConfigGetter. Then, we can only read the configuration in the Bar method, not modify it. Therefore, we can also use interfaces to restrict a type to a specific behavior for various reasons, such as semantics enforcement.
+
In this section, we saw three potential use cases where interfaces are generally considered as bringing value: factoring out a common behavior, creating some decoupling, and restricting a type to a certain behavior. Again, this list isn’t exhaustive, but it should give us a general understanding of when interfaces are helpful in Go.
+
Now, let’s finish this section and discuss the problems with interface pollution.
+
Interface pollution
+
It’s fairly common to see interfaces being overused in Go projects. Perhaps the developer’s background was C# or Java, and they found it natural to create interfaces before concrete types. However, this isn’t how things should work in Go.
+
As we discussed, interfaces are made to create abstractions. And the main caveat when programming meets abstractions is remembering that abstractions should be discovered, not created. What does this mean? It means we shouldn’t start creating abstractions in our code if there is no immediate reason to do so. We shouldn’t design with interfaces but wait for a concrete need. Said differently, we should create an interface when we need it, not when we foresee that we could need it.
+
What’s the main problem if we overuse interfaces? The answer is that they make the code flow more complex. Adding a useless level of indirection doesn’t bring any value; it creates a worthless abstraction making the code more difficult to read, understand, and reason about. If we don’t have a strong reason for adding an interface and it’s unclear how an interface makes a code better, we should challenge this interface’s purpose. Why not call the implementation directly?
+
Note
+
We may also experience performance overhead when calling a method through an interface. It requires a lookup in a hash table’s data structure to find the concrete type an interface points to. But this isn’t an issue in many contexts as the overhead is minimal.
+
+
In summary, we should be cautious when creating abstractions in our code—abstractions should be discovered, not created. It’s common for us, software developers, to overengineer our code by trying to guess what the perfect level of abstraction is, based on what we think we might need later. This process should be avoided because, in most cases, it pollutes our code with unnecessary abstractions, making it more complex to read.
+
+
Rob Pike
+
Don’t design with interfaces, discover them.
+
+
Let’s not try to solve a problem abstractly but solve what has to be solved now. Last, but not least, if it’s unclear how an interface makes the code better, we should probably consider removing it to make our code simpler.
A misconception among many developers is believing that a concurrent solution is always faster than a sequential one. This couldn’t be more wrong. The overall performance of a solution depends on many factors, such as the efficiency of our code structure (concurrency), which parts can be tackled in parallel, and the level of contention among the computation units. This post reminds us about some fundamental knowledge of concurrency in Go; then we will see a concrete example where a concurrent solution isn’t necessarily faster.
+
Go Scheduling
+
A thread is the smallest unit of processing that an OS can perform. If a process wants to execute multiple actions simultaneously, it spins up multiple threads. These threads can be:
+
+
Concurrent — Two or more threads can start, run, and complete in overlapping time periods.
+
Parallel — The same task can be executed multiple times at once.
+
+
The OS is responsible for scheduling the thread’s processes optimally so that:
+
+
All the threads can consume CPU cycles without being starved for too much time.
+
The workload is distributed as evenly as possible among the different CPU cores.
+
+
Note
+
The word thread can also have a different meaning at a CPU level. Each physical core can be composed of multiple logical cores (the concept of hyper-threading), and a logical core is also called a thread. In this post, when we use the word thread, we mean the unit of processing, not a logical core.
+
+
A CPU core executes different threads. When it switches from one thread to another, it executes an operation called context switching. The active thread consuming CPU cycles was in an executing state and moves to a runnable state, meaning it’s ready to be executed pending an available core. Context switching is considered an expensive operation because the OS needs to save the current execution state of a thread before the switch (such as the current register values).
+
As Go developers, we can’t create threads directly, but we can create goroutines, which can be thought of as application-level threads. However, whereas an OS thread is context-switched on and off a CPU core by the OS, a goroutine is context-switched on and off an OS thread by the Go runtime. Also, compared to an OS thread, a goroutine has a smaller memory footprint: 2 KB for goroutines from Go 1.4. An OS thread depends on the OS, but, for example, on Linux/x86–32, the default size is 2 MB (see https://man7.org/linux/man-pages/man3/pthread_create.3.html). Having a smaller size makes context switching faster.
+
Note
+
Context switching a goroutine versus a thread is about 80% to 90% faster, depending on the architecture.
+
+
Let’s now discuss how the Go scheduler works to overview how goroutines are handled. Internally, the Go scheduler uses the following terminology (see proc.go):
+
+
G — Goroutine
+
M — OS thread (stands for machine)
+
P — CPU core (stands for processor)
+
+
Each OS thread (M) is assigned to a CPU core (P) by the OS scheduler. Then, each goroutine (G) runs on an M. The GOMAXPROCS variable defines the limit of Ms in charge of executing user-level code simultaneously. But if a thread is blocked in a system call (for example, I/O), the scheduler can spin up more Ms. As of Go 1.5, GOMAXPROCS is by default equal to the number of available CPU cores.
+
A goroutine has a simpler lifecycle than an OS thread. It can be doing one of the following:
+
+
Executing — The goroutine is scheduled on an M and executing its instructions.
+
Runnable — The goroutine is waiting to be in an executing state.
+
Waiting — The goroutine is stopped and pending something completing, such as a system call or a synchronization operation (such as acquiring a mutex).
+
+
There’s one last stage to understand about the implementation of Go scheduling: when a goroutine is created but cannot be executed yet; for example, all the other Ms are already executing a G. In this scenario, what will the Go runtime do about it? The answer is queuing. The Go runtime handles two kinds of queues: one local queue per P and a global queue shared among all the Ps.
+
Figure 1 shows a given scheduling situation on a four-core machine with GOMAXPROCS equal to 4, with its logical cores (Ps), goroutines (Gs), OS threads (Ms), local queues, and global queue:
+
+
First, we can see five Ms, whereas GOMAXPROCS is set to 4. But as we mentioned, if needed, the Go runtime can create more OS threads than the GOMAXPROCS value.
+
P0, P1, and P3 are currently busy executing Go runtime threads. But P2 is presently idle as M3 is switched off P2, and there’s no goroutine to be executed. This isn’t a good situation because six runnable goroutines are pending being executed, some in the global queue and some in other local queues. How will the Go runtime handle this situation? Here’s the scheduling implementation in pseudocode (see proc.go):
+
runtime.schedule() {
    // Only 1/61 of the time, check the global runnable queue for a G.
    // If not found, check the local queue.
    // If not found,
    //     Try to steal from other Ps.
    //     If not, check the global runnable queue.
    //     If not found, poll network.
}
+
+
Every sixty-first execution, the Go scheduler will check whether goroutines from the global queue are available. If not, it will check its local queue. Meanwhile, if both the global and local queues are empty, the Go scheduler can pick up goroutines from other local queues. This principle in scheduling is called work stealing, and it allows an underutilized processor to actively look for another processor’s goroutines and steal some.
+
One last important thing to mention: prior to Go 1.14, the scheduler was cooperative, which meant a goroutine could be context-switched off a thread only in specific blocking cases (for example, channel send or receive, I/O, waiting to acquire a mutex). Since Go 1.14, the Go scheduler is now preemptive: when a goroutine is running for a specific amount of time (10 ms), it will be marked preemptible and can be context-switched off to be replaced by another goroutine. This allows a long-running job to be forced to share CPU time.
+
Now that we understand the fundamentals of scheduling in Go, let’s look at a concrete example: implementing a merge sort in a parallel manner.
+
Parallel Merge Sort
+
First, let’s briefly review how the merge sort algorithm works. Then we will implement a parallel version. Note that the objective isn’t to implement the most efficient version but to support a concrete example showing why concurrency isn’t always faster.
+
The merge sort algorithm works by breaking a list repeatedly into two sublists until each sublist consists of a single element and then merging these sublists so that the result is a sorted list (see figure 2). Each split operation splits the list into two sublists, whereas the merge operation merges two sublists into a sorted list.
+
+
Here is the sequential implementation of this algorithm. We don’t include all of the code as it’s not the main point of this section:
+
func sequentialMergesort(s []int) {
    if len(s) <= 1 {
        return
    }

    middle := len(s) / 2
    sequentialMergesort(s[:middle]) // First half
    sequentialMergesort(s[middle:]) // Second half
    merge(s, middle)                // Merges the two halves
}

func merge(s []int, middle int) {
    // ...
}
+
+
This algorithm has a structure that makes it open to concurrency. Indeed, as each sequentialMergesort operation works on an independent set of data that doesn’t need to be fully copied (here, an independent view of the underlying array using slicing), we could distribute this workload among the CPU cores by spinning up each sequentialMergesort operation in a different goroutine. Let’s write a first parallel implementation:
+
func parallelMergesortV1(s []int) {
    if len(s) <= 1 {
        return
    }

    middle := len(s) / 2

    var wg sync.WaitGroup
    wg.Add(2)

    go func() { // Spins up the first half of the work in a goroutine
        defer wg.Done()
        parallelMergesortV1(s[:middle])
    }()

    go func() { // Spins up the second half of the work in a goroutine
        defer wg.Done()
        parallelMergesortV1(s[middle:])
    }()

    wg.Wait()
    merge(s, middle) // Merges the halves
}
+
+
In this version, each half of the workload is handled in a separate goroutine. The parent goroutine waits for both parts by using sync.WaitGroup. Hence, we call the Wait method before the merge operation.
+
We now have a parallel version of the merge sort algorithm. Therefore, if we run a benchmark to compare this version against the sequential one, the parallel version should be faster, correct? Let’s run it on a four-core machine with 10,000 elements:
Surprisingly, the parallel version is almost an order of magnitude slower. How can we explain this result? How is it possible that a parallel version that distributes a workload across four cores is slower than a sequential version running on a single machine? Let’s analyze the problem.
+
If we have a slice of, say, 1,024 elements, the parent goroutine will spin up two goroutines, each in charge of handling a half consisting of 512 elements. Each of these goroutines will spin up two new goroutines in charge of handling 256 elements, then 128, and so on, until we spin up a goroutine to compute a single element.
+
If the workload that we want to parallelize is too small, meaning we’re going to compute it too fast, the benefit of distributing a job across cores is destroyed: the time it takes to create a goroutine and have the scheduler execute it is much too high compared to directly merging a tiny number of items in the current goroutine. Although goroutines are lightweight and faster to start than threads, we can still face cases where a workload is too small.
+
So what can we conclude from this result? Does it mean the merge sort algorithm cannot be parallelized? Wait, not so fast.
+
Let’s try another approach. Because merging a tiny number of elements within a new goroutine isn’t efficient, let’s define a threshold. This threshold will represent how many elements a half should contain in order to be handled in a parallel manner. If the number of elements in the half is fewer than this value, we will handle it sequentially. Here’s a new version:
+
const max = 2048 // Defines the threshold

func parallelMergesortV2(s []int) {
    if len(s) <= 1 {
        return
    }

    if len(s) <= max {
        sequentialMergesort(s) // Calls our initial sequential version
    } else { // If bigger than the threshold, keeps the parallel version
        middle := len(s) / 2

        var wg sync.WaitGroup
        wg.Add(2)

        go func() {
            defer wg.Done()
            parallelMergesortV2(s[:middle])
        }()

        go func() {
            defer wg.Done()
            parallelMergesortV2(s[middle:])
        }()

        wg.Wait()
        merge(s, middle)
    }
}
+
+
If the number of elements in the s slice is smaller than max, we call the sequential version. Otherwise, we keep calling our parallel implementation. Does this approach impact the result? Yes, it does:
Our v2 parallel implementation is more than 40% faster than the sequential one, thanks to this idea of defining a threshold to indicate when parallel should be more efficient than sequential.
+
Note
+
Why did I set the threshold to 2,048? Because it was the optimal value for this specific workload on my machine. In general, such magic values should be defined carefully with benchmarks (running on an execution environment similar to production). It’s also pretty interesting to note that running the same algorithm in a programming language that doesn’t implement the concept of goroutines has an impact on the value. For example, running the same example in Java using threads means an optimal value closer to 8,192. This tends to illustrate how goroutines are more efficient than threads.
+
+
Conclusion
+
We have seen throughout this post the fundamental concepts of scheduling in Go: the differences between a thread and a goroutine and how the Go runtime schedules goroutines. Meanwhile, using the parallel merge sort example, we illustrated that concurrency isn’t always necessarily faster. As we have seen, spinning up goroutines to handle minimal workloads (merging only a small set of elements) demolishes the benefit we could get from parallelism.
+
So, where should we go from here? We must keep in mind that concurrency isn’t always faster and shouldn’t be considered the default way to go for all problems. First, it makes things more complex. Also, modern CPUs have become incredibly efficient at executing sequential and predictable code. For example, a superscalar processor can parallelize instruction execution over a single core with high efficiency.
+
Does this mean we shouldn’t use concurrency? Of course not. However, it’s essential to keep these conclusions in mind. If we’re not sure that a parallel version will be faster, the right approach may be to start with a simple sequential version and build from there using profiling (mistake #98, “Not using Go diagnostics tooling”) and benchmarks (mistake #89, “Writing inaccurate benchmarks”), for example. It can be the only way to ensure that a concurrent implementation is worth it.
In general, we should never guess about performance. When writing optimizations, so many factors may come into play that even if we have a strong opinion about the results, it’s rarely a bad idea to test them. However, writing benchmarks isn’t straightforward. It can be pretty simple to write inaccurate benchmarks and make wrong assumptions based on them. The goal of this post is to examine four common and concrete traps leading to inaccuracy:
+
+
Not resetting or pausing the timer
+
Making wrong assumptions about micro-benchmarks
+
Not being careful about compiler optimizations
+
Being fooled by the observer effect
+
+
General concepts
+
Before discussing these traps, let’s briefly review how benchmarks work in Go. The skeleton of a benchmark is as follows:
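func BenchmarkFoo(b *testing.B) {
    for i := 0; i < b.N; i++ {
        foo()
    }
}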
The function name starts with the Benchmark prefix. The function under test (foo) is called within the for loop. b.N represents a variable number of iterations. When running a benchmark, Go tries to make it match the requested benchmark time. The benchmark time is set by default to 1 second and can be changed with the -benchtime flag. b.N starts at 1; if the benchmark completes in under 1 second, b.N is increased, and the benchmark runs again until b.N roughly matches benchtime:
+
$ go test -bench=.
cpu: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
BenchmarkFoo-4                73          16511228 ns/op
+
+
Here, the benchmark took about 1 second, and foo was executed 73 times, for an average execution time of 16,511,228 nanoseconds. We can change the benchmark time using -benchtime:
+
$ go test -bench=. -benchtime=2s
BenchmarkFoo-4               150          15832169 ns/op
+
+
foo was executed roughly twice as many times as during the previous benchmark.
+
Next, let’s look at some common traps.
+
Not resetting or pausing the timer
+
In some cases, we need to perform operations before the benchmark loop. These operations may take quite a while (for example, generating a large slice of data) and may significantly impact the benchmark results:
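func BenchmarkFoo(b *testing.B) {
    expensiveSetup()
    b.ResetTimer() // Resets the benchmark timer after the costly setup
    for i := 0; i < b.N; i++ {
        functionUnderTest()
    }
}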
Calling ResetTimer zeroes the elapsed benchmark time and memory allocation counters since the beginning of the test. This way, an expensive setup can be discarded from the test results.
+
What if we have to perform an expensive setup not just once but within each loop iteration?
We can’t reset the timer, because that would be executed during each loop iteration. But we can stop and resume the benchmark timer, surrounding the call to expensiveSetup:
+
func BenchmarkFoo(b *testing.B) {
    for i := 0; i < b.N; i++ {
        b.StopTimer() // Pauses the benchmark timer
        expensiveSetup()
        b.StartTimer() // Resumes the benchmark timer
        functionUnderTest()
    }
}
+
+
Here, we pause the benchmark timer to perform the expensive setup and then resume the timer.
+
Note
+
There’s one catch to remember about this approach: if the function under test is too fast to execute compared to the setup function, the benchmark may take too long to complete. The reason is that it would take much longer than 1 second to reach benchtime. Calculating the benchmark time is based solely on the execution time of functionUnderTest. So, if we wait a significant time in each loop iteration, the benchmark will be much slower than 1 second. If we want to keep the benchmark, one possible mitigation is to decrease benchtime.
+
+
We must be sure to use the timer methods to preserve the accuracy of a benchmark.
+
Making wrong assumptions about micro-benchmarks
+
A micro-benchmark measures a tiny computation unit, and it can be extremely easy to make wrong assumptions about it. Let’s say, for example, that we aren’t sure whether to use atomic.StoreInt32 or atomic.StoreInt64 (assuming that the values we handle will always fit in 32 bits). We want to write a benchmark to compare both functions:
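A sketch of the two benchmarks could look like this (the stored value itself doesn’t matter):

func BenchmarkAtomicStoreInt32(b *testing.B) {
    var v int32
    for i := 0; i < b.N; i++ {
        atomic.StoreInt32(&v, 1)
    }
}

func BenchmarkAtomicStoreInt64(b *testing.B) {
    var v int64
    for i := 0; i < b.N; i++ {
        atomic.StoreInt64(&v, 1)
    }
}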
We could easily take this benchmark for granted and decide to use atomic.StoreInt64 because it appears to be faster. Now, for the sake of doing a fair benchmark, we reverse the order and test atomic.StoreInt64 first, followed by atomic.StoreInt32. Here is some example output:
This time, atomic.StoreInt32 has better results. What happened?
+
In the case of micro-benchmarks, many factors can impact the results, such as machine activity while running the benchmarks, power management, thermal scaling, and better cache alignment of a sequence of instructions. We must remember that many factors, even outside the scope of our Go project, can impact the results.
+
Note
+
We should make sure the machine executing the benchmark is idle. However, external processes may run in the background, which may affect benchmark results. For that reason, tools such as perflock can limit how much CPU a benchmark can consume. For example, we can run a benchmark with 70% of the total available CPU, giving 30% to the OS and other processes and reducing the impact of the machine activity factor on the results.
+
+
One option is to increase the benchmark time using the -benchtime option. Similar to the law of large numbers in probability theory, if we run a benchmark a large number of times, it should tend to approach its expected value (assuming we omit the benefits of instructions caching and similar mechanics).
+
Another option is to use external tools on top of the classic benchmark tooling. For instance, the benchstat tool, which is part of the golang.org/x/perf repository, allows us to compute and compare statistics about benchmark executions.
+
Let’s run the benchmark 10 times using the -count option and pipe the output to a specific file:
+
$ go test -bench=. -count=10 | tee stats.txt
cpu: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
BenchmarkAtomicStoreInt32-4    234935682    5.124 ns/op
BenchmarkAtomicStoreInt32-4    235307204    5.112 ns/op
// ...
BenchmarkAtomicStoreInt64-4    235548591    5.107 ns/op
BenchmarkAtomicStoreInt64-4    235210292    5.090 ns/op
// ...
+
The results are the same: both functions take on average 5.10 nanoseconds to complete. We also see the percent variation between the executions of a given benchmark: ± 1%. This metric tells us that both benchmarks are stable, giving us more confidence in the computed average results. Therefore, instead of concluding that atomic.StoreInt32 is faster or slower, we can conclude that its execution time is similar to that of atomic.StoreInt64 for the usage we tested (in a specific Go version on a particular machine).
+
In general, we should be cautious about micro-benchmarks. Many factors can significantly impact the results and potentially lead to wrong assumptions. Increasing the benchmark time or repeating the benchmark executions and computing stats with tools such as benchstat can be an efficient way to limit external factors and get more accurate results, leading to better conclusions.
+
Let’s also highlight that we should be careful about using the results of a micro-benchmark executed on a given machine if another system ends up running the application. The production system may act quite differently from the one on which we ran the micro-benchmark.
+
Not being careful about compiler optimizations
+
Another common mistake related to writing benchmarks is being fooled by compiler optimizations, which can also lead to wrong benchmark assumptions. In this section, we look at Go issue 14813 (https://github.com/golang/go/issues/14813, also discussed by Go project member Dave Cheney) with a population count function (a function that counts the number of bits set to 1):
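The function and its naive benchmark look roughly like the following (a reconstruction; the constants implement the classic bit-counting trick):

const (
    m1  = 0x5555555555555555
    m2  = 0x3333333333333333
    m4  = 0x0f0f0f0f0f0f0f0f
    h01 = 0x0101010101010101
)

func popcnt(x uint64) uint64 {
    x -= (x >> 1) & m1
    x = (x & m2) + ((x >> 2) & m2)
    x = (x + (x >> 4)) & m4
    return (x * h01) >> 56
}

func BenchmarkPopcnt1(b *testing.B) {
    for i := 0; i < b.N; i++ {
        popcnt(uint64(i)) // The result is never used
    }
}

Running this benchmark reports around 0.28 ns per operation.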
A duration of 0.28 nanoseconds is roughly one clock cycle, so this number is unreasonably low. The problem is that the developer wasn’t careful enough about compiler optimizations. In this case, the function under test is simple enough to be a candidate for inlining: an optimization that replaces a function call with the body of the called function and lets us prevent a function call, which has a small footprint. Once the function is inlined, the compiler notices that the call has no side effects and replaces it with the following benchmark:
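func BenchmarkPopcnt1(b *testing.B) {
    // After inlining and dead-code elimination, nothing is left in the loop body
    for i := 0; i < b.N; i++ {
    }
}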
The benchmark is now empty — which is why we got a result close to one clock cycle. To prevent this from happening, a best practice is to follow this pattern:
+
+
During each loop iteration, assign the result to a local variable (local in the context of the benchmark function).
+
Assign the latest result to a global variable.
+
+
In our case, we write the following benchmark:
+
var global uint64 // Define a global variable

func BenchmarkPopcnt2(b *testing.B) {
    var v uint64 // Define a local variable
    for i := 0; i < b.N; i++ {
        v = popcnt(uint64(i)) // Assign the result to the local variable
    }
    global = v // Assign the result to the global variable
}
+
+
global is a global variable, whereas v is a local variable whose scope is the benchmark function. During each loop iteration, we assign the result of popcnt to the local variable. Then we assign the latest result to the global variable.
+
Note
+
Why not assign the result of the popcnt call directly to global to simplify the test? Writing to a global variable is slower than writing to a local variable (these concepts are discussed in 100 Go Mistakes, mistake #95: “Not understanding stack vs. heap”). Therefore, we should write each result to a local variable to limit the footprint during each loop iteration.
+
+
If we run these two benchmarks, we now get a significant difference in the results:
BenchmarkPopcnt2 is the accurate version of the benchmark. It guarantees that compiler optimizations such as inlining and dead-code elimination don’t artificially lower the execution time or remove the call to the function under test entirely. Relying on the results of BenchmarkPopcnt1 could have led to wrong assumptions.
+
Let’s remember the pattern to avoid compiler optimizations fooling benchmark results: assign the result of the function under test to a local variable, and then assign the latest result to a global variable. This best practice also prevents us from making incorrect assumptions.
+
Being fooled by the observer effect
+
In physics, the observer effect is the disturbance of an observed system by the act of observation. This effect can also be seen in benchmarks and can lead to wrong assumptions about results. Let’s look at a concrete example and then try to mitigate it.
+
We want to implement a function receiving a matrix of int64 elements. This matrix has a fixed number of 512 columns, and we want to compute the total sum of the first eight columns, as shown in figure 1.
+
+
For the sake of optimizations, we also want to determine whether varying the number of columns has an impact, so we also implement a second function with 513 columns. The implementation is the following:
+
func calculateSum512(s [][512]int64) int64 {
    var sum int64
    for i := 0; i < len(s); i++ { // Iterate over each row
        for j := 0; j < 8; j++ { // Iterate over the first eight columns
            sum += s[i][j] // Increment sum
        }
    }
    return sum
}

func calculateSum513(s [][513]int64) int64 {
    // Same implementation as calculateSum512
}
+
+
We iterate over each row and then over the first eight columns, and we increment a sum variable that we return. The implementation in calculateSum513 remains the same.
+
We want to benchmark these functions to decide which one is the most performant given a fixed number of rows:
+
const rows = 1000

var res int64

func BenchmarkCalculateSum512(b *testing.B) {
    var sum int64
    s := createMatrix512(rows) // Create a matrix of 512 columns
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        sum = calculateSum512(s) // Calculate the sum
    }
    res = sum
}

func BenchmarkCalculateSum513(b *testing.B) {
    var sum int64
    s := createMatrix513(rows) // Create a matrix of 513 columns
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        sum = calculateSum513(s) // Calculate the sum
    }
    res = sum
}
+
+
We want to create the matrix only once, to limit the footprint on the results. Therefore, we call createMatrix512 and createMatrix513 outside of the loop. We may expect the results to be similar as again we only want to iterate on the first eight columns, but this isn’t the case (on my machine):
The second benchmark with 513 columns is about 50% faster. Again, because we iterate only over the first eight columns, this result is quite surprising.
+
To understand this difference, we need to understand the basics of CPU caches. In a nutshell, a CPU is composed of different caches (usually L1, L2, and L3). These caches reduce the average cost of accessing data from the main memory. In some conditions, the CPU can fetch data from the main memory and copy it to L1. In this case, the CPU tries to fetch into L1 the matrix’s subset that calculateSum is interested in (the first eight columns of each row). However, this subset fits nicely in the cache in one case (513 columns) but not in the other case (512 columns).
+
Note
+
This isn’t in the scope of this post to explain why, but we look at this problem in 100 Go Mistakes, mistake #91: “Not understanding CPU caches.”
+
+
Coming back to the benchmark, the main issue is that we keep reusing the same matrix in both cases. Because the function is repeated thousands of times, we don’t measure the function’s execution when it receives a plain new matrix. Instead, we measure a function that gets a matrix that already has a subset of the cells present in the cache. Therefore, because calculateSum513 leads to fewer cache misses, it has a better execution time.
+
This is an example of the observer effect. Because we keep observing a repeatedly called CPU-bound function, CPU caching may come into play and significantly affect the results. In this example, to prevent this effect, we should create a matrix during each test instead of reusing one:
+
func BenchmarkCalculateSum512(b *testing.B) {
    var sum int64
    for i := 0; i < b.N; i++ {
        b.StopTimer()
        s := createMatrix512(rows) // Create a new matrix during each loop iteration
        b.StartTimer()
        sum = calculateSum512(s)
    }
    res = sum
}
+
+
A new matrix is now created during each loop iteration. If we run the benchmark again (and adjust benchtime — otherwise, it takes too long to execute), the results are closer to each other:
Instead of making the incorrect assumption that calculateSum513 is faster, we see that both benchmarks lead to similar results when receiving a new matrix.
+
As we have seen in this post, because we were reusing the same matrix, CPU caches significantly impacted the results. To prevent this, we had to create a new matrix during each loop iteration. In general, we should remember that observing a function under test may lead to significant differences in results, especially in the context of micro-benchmarks of CPU-bound functions where low-level optimizations matter. Forcing a benchmark to re-create data during each iteration can be a good way to prevent this effect.
Generics is a fresh addition to the language. In a nutshell, it allows writing code with types that can be specified later and instantiated when needed. However, it can be pretty easy to be confused about when to use generics and when not to. Throughout this post, we will describe the concept of generics in Go and then delve into common use and misuses.
+
Concepts
+
Consider the following function that extracts all the keys from a map[string]int type:
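func getKeys(m map[string]int) []string {
    var keys []string
    for k := range m {
        keys = append(keys, k)
    }
    return keys
}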
What if we would like to use a similar feature for another map type such as a map[int]string? Before generics, Go developers had a couple of options: using code generation, reflection, or duplicating code.
+
For example, we could write two functions, one for each map type, or even try to extend getKeys to accept different map types:
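A sketch of such an extended version could be the following (the supported map types are illustrative):

func getKeys(m any) ([]any, error) {
    switch t := m.(type) {
    default:
        return nil, fmt.Errorf("unknown type: %T", t)
    case map[string]int:
        var keys []any
        for k := range t {
            keys = append(keys, k)
        }
        return keys, nil
    case map[int]string:
        var keys []any
        for k := range t {
            keys = append(keys, k)
        }
        return keys, nil
    }
}

This version has several drawbacks.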
First, it increases boilerplate code. Indeed, whenever we want to add a case, it will require duplicating the range loop.
+
Meanwhile, the function now accepts an empty interface, which means we are losing some of the benefits of Go being a typed language. Indeed, checking whether a type is supported is done at runtime instead of compile-time. Hence, we also need to return an error if the provided type is unknown.
+
Last but not least, as the key type can be either int or string, we are obliged to return a slice of empty interfaces to factor out key types. This approach increases the effort on the caller-side as the client may also have to perform a type check of the keys or extra conversion.
+
+
Thanks to generics, we can now refactor this code using type parameters.
+
Type parameters are generic types we can use with functions and types. For example, the following function accepts a type parameter:
+
func foo[T any](t T) {
    // ...
}
+
+
When calling foo, we pass a type argument of any type. Passing a type argument is called instantiation; the work is done at compile time, which keeps type safety as part of the core language features and avoids runtime overhead.
+
Let’s get back to the getKeys function and use type parameters to write a generic version that would accept any kind of map:
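func getKeys[K comparable, V any](m map[K]V) []K {
    var keys []K
    for k := range m {
        keys = append(keys, k)
    }
    return keys
}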
To handle the map, we defined two kinds of type parameters. First, the values can be of any type: V any. However, in Go, the map keys can’t be of any type. For example, we cannot use slices:
+
var m map[[]byte]int
+
+
This code leads to a compilation error: invalid map key type []byte. Therefore, instead of accepting any key type, we are obliged to restrict type arguments so that the key type meets specific requirements. Here, being comparable (we can use == or !=). Hence, we defined K as comparable instead of any.
+
Restricting type arguments to match specific requirements is called a constraint. A constraint is an interface type that can contain:
+
+
A set of behaviors (methods)
+
But also arbitrary types
+
+
Let’s see a concrete example for the latter. Imagine we don’t want to accept any comparable type for map key type. For instance, we would like to restrict it to either int or string types. We can define a custom constraint this way:
+
type customConstraint interface {
    ~int | ~string // Define a custom constraint that restricts types to int and string
}

// Change the type parameter K to be a customConstraint
func getKeys[K customConstraint, V any](m map[K]V) []K {
    // Same implementation
}
+
+
First, we define a customConstraint interface to restrict the types to be either int or string using the union operator | (we will discuss the use of ~ a bit later). Then, K is now a customConstraint instead of a comparable as before.
+
Now, the signature of getKeys enforces that we can call it with a map of any value type, but the key type has to be an int or a string. For example, on the caller-side:
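m := map[string]int{ // The key type is string, which satisfies customConstraint
    "one":   1,
    "two":   2,
    "three": 3,
}
keys := getKeys(m)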
Note that Go can infer that getKeys is called with a string type argument. The previous call was similar to this:
+
keys := getKeys[string](m)
+
+
Note
+
What’s the difference between a constraint using ~int or int? Using int restricts it to that type, whereas ~int restricts all the types whose underlying type is an int.
+
To illustrate it, let’s imagine a constraint where we would like to restrict a type to any int type implementing the String() string method:
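type customConstraint interface {
    ~int
    String() string // Requires both an int underlying type and the String method
}

type customInt int

// One possible implementation of the String method
func (i customInt) String() string {
    return strconv.Itoa(int(i))
}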
As customInt is an int and implements the String() string method, the customInt type satisfies the constraint defined.
+
However, if we change the constraint to contain an int instead of an ~int, using customInt would lead to a compilation error because the int type doesn’t implement String() string.
+
+
Let’s also note that the constraints package (golang.org/x/exp/constraints) contains a set of common constraints, such as Signed, which includes all the signed integer types. Let’s ensure that a constraint doesn’t already exist in this package before creating a new one.
+
So far, we have discussed examples using generics for functions. However, we can also use generics with data structures.
+
For example, we will create a linked list containing values of any type. Meanwhile, we will write an Add method to append a node:
+
type Node[T any] struct { // Use type parameter
    Val  T
    next *Node[T]
}

func (n *Node[T]) Add(next *Node[T]) { // Instantiate type receiver
    n.next = next
}
+
+
We use type parameters to define T and use both fields in Node. Regarding the method, the receiver is instantiated. Indeed, because Node is generic, it has to follow also the type parameter defined.
+
One last thing to note about type parameters: they can’t be used with method arguments, only with function arguments or method receivers. For example, the following method wouldn’t compile:
+
type Foo struct{}

func (Foo) bar[T any](t T) {}
+
+
./main.go:29:15: methods cannot have type parameters
+
+
Now, let’s delve into concrete cases where we should and shouldn’t use generics.
+
Common uses and misuses
+
So when are generics useful? Let’s discuss a couple of common uses where generics are recommended:
+
+
Data structures. For example, we can use generics to factor out the element type if we implement a binary tree, a linked list, or a heap.
+
Functions working with slices, maps, and channels of any type. For example, a function to merge two channels would work with any channel type. Hence, we could use type parameters to factor out the channel type:
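For instance, a sketch of such a signature might look like this (the body shown is only one possible implementation):

func merge[T any](ch1, ch2 <-chan T) <-chan T {
	out := make(chan T)
	go func() {
		defer close(out)
		for ch1 != nil || ch2 != nil {
			select {
			case v, ok := <-ch1:
				if !ok {
					ch1 = nil // A nil channel blocks forever, so select ignores it
					continue
				}
				out <- v
			case v, ok := <-ch2:
				if !ok {
					ch2 = nil
					continue
				}
				out <- v
			}
		}
	}()
	return out
}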
Similarly, instead of factoring out a type, we can factor out behaviors. For example, the sort package contains functions to sort different slice types, such as sort.Ints or sort.Float64s. Using type parameters, we can factor out the sorting behaviors that rely on three methods, Len, Less, and Swap:
+
+
type sliceFn[T any] struct { // Use type parameter
	s       []T
	compare func(T, T) bool // Compare two T elements
}

func (s sliceFn[T]) Len() int           { return len(s.s) }
func (s sliceFn[T]) Less(i, j int) bool { return s.compare(s.s[i], s.s[j]) }
func (s sliceFn[T]) Swap(i, j int)      { s.s[i], s.s[j] = s.s[j], s.s[i] }
+
+
Conversely, when is it recommended not to use generics?
+
+
When just calling a method of the type argument. For example, consider a function that receives an io.Writer and calls the Write method:
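A sketch of such a function (getBytes is a hypothetical helper):

func foo[T io.Writer](w T) {
	b := getBytes() // Hypothetical helper returning a []byte
	_, _ = w.Write(b)
}

In this case, the type parameter brings no value; accepting an io.Writer argument directly is simpler.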
When it makes our code more complex. Generics are never mandatory, and as Go developers, we were able to live without them for more than a decade. If, when writing generic functions or structures, we figure out that they don’t make our code clearer, we should probably reconsider our decision for that particular use case.
+
+
Conclusion
+
Though generics can be very helpful in particular conditions, we should be cautious about when to use them and when not to.
+
In general, when we want to answer when not to use generics, we can find similarities with when not to use interfaces. Indeed, generics introduce a form of abstraction, and we have to remember that unnecessary abstractions introduce complexity.
+
Let’s not pollute our code with needless abstractions, and let’s focus on solving concrete problems for now. It means that we shouldn’t use type parameters prematurely. Let’s wait until we are about to write boilerplate code to consider using generics.
Writing concurrent code that leads to false sharing
+
+
In previous sections, we have discussed the fundamental concepts of CPU caching. We have seen that some specific caches (typically, L1 and L2) aren’t shared among all the logical cores but are specific to a physical core. This specificity has concrete impacts on concurrency, including the concept of false sharing, which can lead to a significant performance decrease. Let’s look at what false sharing is via an example and then see how to prevent it.
+
In this example, we use two structs, Input and Result:
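Here is a minimal sketch of what this setup might look like (the count function name and the exact field names are assumptions):

type Input struct {
	a int64
	b int64
}

type Result struct {
	sumA int64
	sumB int64
}

// count sums the a fields into sumA and the b fields into sumB,
// each in its own goroutine (uses sync.WaitGroup).
func count(inputs []Input) Result {
	wg := sync.WaitGroup{}
	wg.Add(2)

	result := Result{}

	go func() {
		for i := 0; i < len(inputs); i++ {
			result.sumA += inputs[i].a
		}
		wg.Done()
	}()

	go func() {
		for i := 0; i < len(inputs); i++ {
			result.sumB += inputs[i].b
		}
		wg.Done()
	}()

	wg.Wait()
	return result
}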
We spin up two goroutines: one that iterates over each a field and another that iterates over each b field. This example is fine from a concurrency perspective. For instance, it doesn’t lead to a data race, because each goroutine increments its own variable. But this example illustrates the false sharing concept that degrades expected performance.
+
Let’s look at main memory. Because sumA and sumB are allocated contiguously and a 64-byte memory block holds eight 8-byte int64 slots, in most cases (seven out of eight) both variables end up in the same memory block:
+
+
Now, let’s assume that the machine contains two cores. In most cases, we should eventually have two threads scheduled on different cores. So if the CPU decides to copy this memory block to a cache line, it is copied twice:
+
+
Both cache lines are replicated because L1D (L1 data) is per core. Recall that in our example, each goroutine updates its own variable: sumA on one side, and sumB on the other side:
+
+
Because these cache lines are replicated, one of the goals of the CPU is to guarantee cache coherency. For example, if one goroutine updates sumA and another reads sumA (after some synchronization), we expect our application to get the latest value.
+
However, our example doesn’t do exactly this. Both goroutines access their own variables, not a shared one. We might expect the CPU to know about this and understand that it isn’t a conflict, but this isn’t the case. When we write a variable that’s in a cache, the granularity tracked by the CPU isn’t the variable: it’s the cache line.
+
When a cache line is shared across multiple cores and at least one goroutine is a writer, the entire cache line is invalidated. This happens even if the updates are logically independent (for example, sumA and sumB). This is the problem of false sharing, and it degrades performance.
+
+Note
+
Internally, a CPU uses the MESI protocol to guarantee cache coherency. It tracks each cache line, marking it modified, exclusive, shared, or invalid (MESI).
+
+
One of the most important aspects to understand about memory and caching is that sharing memory across cores isn’t real—it’s an illusion. This understanding comes from the fact that we don’t consider a machine a black box; instead, we try to have mechanical sympathy with underlying levels.
+
So how do we solve false sharing? There are two main solutions.
+
The first solution is to use the same approach we’ve shown but ensure that sumA and sumB aren’t part of the same cache line. For example, we can update the Result struct to add padding between the fields. Padding is a technique to allocate extra, unused memory. Because an int64 requires an 8-byte allocation and a cache line is 64 bytes long, we need 64 – 8 = 56 bytes of padding:
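A sketch of the padded struct, assuming int64 fields and 64-byte cache lines:

type Result struct {
	sumA int64
	_    [56]byte // Padding so that sumB lands on a different cache line
	sumB int64
}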
The next figure shows a possible memory allocation. Using padding, sumA and sumB will always be part of different memory blocks and hence different cache lines.
+
+
If we benchmark both solutions (with and without padding), we see that the padding solution is significantly faster (about 40% on my machine). This is an important improvement that results from the addition of padding between the two fields to prevent false sharing.
+
The second solution is to rework the structure of the algorithm. For example, instead of having both goroutines share the same struct, we can make them communicate their local results via channels. The resulting benchmark is roughly the same as with the padding approach.
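Here is a sketch of what this reworked version might look like, with each goroutine accumulating into a local variable and sending its result over a dedicated channel:

func countViaChannels(inputs []Input) Result {
	chA := make(chan int64)
	chB := make(chan int64)

	go func() {
		var sum int64 // Local variable: no shared cache line
		for i := 0; i < len(inputs); i++ {
			sum += inputs[i].a
		}
		chA <- sum
	}()

	go func() {
		var sum int64
		for i := 0; i < len(inputs); i++ {
			sum += inputs[i].b
		}
		chB <- sum
	}()

	return Result{sumA: <-chA, sumB: <-chB}
}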
+
In summary, we must remember that sharing memory across goroutines is an illusion at the lowest memory levels. False sharing occurs when a cache line is shared across two cores when at least one goroutine is a writer. If we need to optimize an application that relies on concurrency, we should check whether false sharing applies, because this pattern is known to degrade application performance. We can prevent false sharing with either padding or communication.
Go diagnostics tooling

Go offers a few excellent diagnostics tools to help us get insights into how an application performs. This post focuses on the most important ones: profiling and the execution tracer. Both tools are so important that they should be part of the core toolset of any Go developer who is interested in optimization. First, let’s discuss profiling.
+
Profiling
+
Profiling provides insights into the execution of an application. It allows us to resolve performance issues, detect contention, locate memory leaks, and more. These insights can be collected via several profiles:
+
+
CPU— Determines where an application spends its time
+
Goroutine— Reports the stack traces of the ongoing goroutines
+
Heap— Reports heap memory allocation to monitor current memory usage and check for possible memory leaks
+
Mutex— Reports lock contentions to see the behaviors of the mutexes used in our code and whether an application spends too much time in locking calls
+
Block— Shows where goroutines block waiting on synchronization primitives
+
+
Profiling is achieved via instrumentation using a tool called a profiler; in Go, this profiler is pprof. First, let’s understand how and when to enable pprof; then, we discuss the most critical profile types.
+
Enabling pprof
+
There are several ways to enable pprof. For example, we can use the net/http/pprof package to serve the profiling data via HTTP:
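A minimal sketch (the listening port is arbitrary):

package main

import (
	"fmt"
	"log"
	"net/http"
	_ "net/http/pprof" // Blank import registers the /debug/pprof handlers
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "hello")
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}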
Importing net/http/pprof leads to a side effect that allows us to reach the pprof URL: http://host/debug/pprof. Note that enabling pprof is safe even in production (https://go.dev/doc/diagnostics#profiling). The profiles that impact performance, such as CPU profiling, aren’t enabled by default, nor do they run continuously: they are activated only for a specific period.
+
Now that we have seen how to expose a pprof endpoint, let’s discuss the most common profiles.
+
CPU Profiling
+
The CPU profiler relies on the OS and signaling. When it is activated, the application asks the OS to interrupt it every 10 ms by default via a SIGPROF signal. When the application receives a SIGPROF, it suspends the current activity and transfers the execution to the profiler. The profiler collects data such as the current goroutine activity and aggregates execution statistics that we can retrieve. Then it stops, and the execution resumes until the next SIGPROF.
+
We can access the /debug/pprof/profile endpoint to activate CPU profiling. Accessing this endpoint executes CPU profiling for 30 seconds by default. For 30 seconds, our application is interrupted every 10 ms. Note that we can change these two default values: we can use the seconds parameter to pass to the endpoint how long the profiling should last (for example, /debug/pprof/profile?seconds=15), and we can change the interruption rate (even to less than 10 ms). But in most cases, 10 ms should be enough, and when decreasing this value (meaning increasing the rate), we should be careful not to harm performance. After 30 seconds, we download the results of the CPU profiler.
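For example, assuming the application listens on localhost:8080 (a placeholder), downloading a 15-second CPU profile could look like this:

$ curl -o profile.out 'http://localhost:8080/debug/pprof/profile?seconds=15'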
+
+Note
+
We can also enable the CPU profiler using the -cpuprofile flag, such as when running a benchmark. For example, the following command produces the same type of file that can be downloaded via /debug/pprof/profile.
+
$ go test -bench=. -cpuprofile profile.out
+
+
+
From this file, we can navigate to the results using go tool:
+
$ go tool pprof -http=:8080 <file>
+
+
This command opens a web UI showing the call graph. The next figure shows an example taken from an application. The larger the arrow, the hotter the path. We can then navigate into this graph and get execution insights.
+
+
For example, the graph in the next figure tells us that during 30 seconds, 0.06 seconds were spent in the decode method (*FetchResponse receiver). Of these 0.06 seconds, 0.02 were spent in RecordBatch.decode and 0.01 in makemap (creating a map).
+
+
We can also access this kind of information from the web UI with different representations. For example, the Top view sorts the functions per execution time, and Flame Graph visualizes the execution time hierarchy. The UI can even display the expensive parts of the source code line by line.
+
+Note
+
We can also delve into profiling data via a command line. However, we focus on the web UI in this post.
+
+
Thanks to this data, we can get a general idea of how an application behaves:
+
+
Too many calls to runtime.mallocgc can mean an excessive number of small heap allocations that we can try to minimize.
+
Too much time spent in channel operations or mutex locks can indicate excessive contention that is harming the application’s performance.
+
Too much time spent on syscall.Read or syscall.Write means the application spends a significant amount of time in Kernel mode. Working on I/O buffering may be an avenue for improvement.
+
+
These are the kinds of insights we can get from the CPU profiler. It’s valuable for understanding the hottest code paths and identifying bottlenecks. But it won’t resolve events at a granularity finer than the sampling interval, because the CPU profiler runs at a fixed pace (by default, every 10 ms). To get finer-grained insights, we should use tracing, which we discuss later in this post.
+
+Note
+
We can also attach labels to the different functions. For example, imagine a common function called from different clients. To track the time spent for both clients, we can use pprof.Labels.
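A minimal sketch using runtime/pprof (handleRequest and the client label key are hypothetical):

import (
	"context"
	"runtime/pprof"
)

func process(ctx context.Context, clientName string) {
	// Samples collected inside the closure are attributed to the client label.
	pprof.Do(ctx, pprof.Labels("client", clientName), func(ctx context.Context) {
		handleRequest(ctx) // Hypothetical shared function called by different clients
	})
}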
+
+
Heap Profiling
+
Heap profiling allows us to get statistics about the current heap usage. Like CPU profiling, heap profiling is sample-based. We can change this rate (via runtime.MemProfileRate), but we shouldn’t make it too granular, because the more we decrease the rate, the more work heap profiling requires to collect data. By default, samples are profiled at a rate of one allocation for every 512 KB of heap allocation.
+
If we reach /debug/pprof/heap/, we get raw data that can be hard to read. However, we can download a heap profile using /debug/pprof/heap/?debug=0 and then open it with go tool (the same command as in the previous section) to navigate into the data using the web UI.
+
The next figure shows an example of a heap graph. Calling the MetadataResponse.decode method leads to allocating 1536 KB of heap data (which represents 6.32% of the total heap). However, 0 out of these 1536 KB were allocated by this function directly, so we need to inspect the second call. The TopicMetadata.decode method allocated 512 KB out of the 1536 KB; the rest — 1024 KB — were allocated in another method.
+
+
This is how we can navigate the call chain to understand what part of an application is responsible for most of the heap allocations. We can also look at different sample types:
+
+
alloc_objects— Total number of objects allocated
+
alloc_space— Total amount of memory allocated
+
inuse_objects— Number of objects allocated and not yet released
+
inuse_space— Amount of memory allocated and not yet released
+
+
Another very helpful capability with heap profiling is tracking memory leaks. With a GC-based language, the usual procedure is the following:
+
+
Trigger a GC.
+
Download heap data.
+
Wait for a few seconds/minutes.
+
Trigger another GC.
+
Download heap data again.
+
Compare.
+
+
Forcing a GC before downloading data is a way to prevent false assumptions. For example, if we see a peak of retained objects without running a GC first, we cannot be sure whether it’s a leak or objects that the next GC will collect.
+
Using pprof, we can download a heap profile and force a GC in the meantime. The procedure in Go is the following:
+
+
Go to /debug/pprof/heap?gc=1 (trigger the GC and download the heap profile).
+
Wait for a few seconds/minutes.
+
Go to /debug/pprof/heap?gc=1 again.
+
Use go tool to compare both heap profiles:
+
+
$ go tool pprof -http=:8080 -diff_base <file2> <file1>
+
+
The next figure shows the kind of data we can access. For example, the amount of heap memory held by the newTopicProducer method (top left) has decreased (–513 KB). In contrast, the amount held by updateMetadata (bottom right) has increased (+512 KB). Slow increases are normal. The second heap profile may have been calculated in the middle of a service call, for example. We can repeat this process or wait longer; the important part is to track steady increases in allocations of a specific object.
+
+
+Note
+
Another type of profiling related to the heap is allocs, which reports allocations. Heap profiling shows the current state of the heap memory. To get insights about past memory allocations since the application started, we can use allocations profiling. As discussed, because stack allocations are cheap, they aren’t part of this profiling, which only focuses on the heap.
+
+
Goroutine Profiling
+
The goroutine profile reports the stack trace of all the current goroutines in an application. We can download a file using /debug/pprof/goroutine/?debug=0 and use go tool again. The next figure shows the kind of information we can get.
+
+
We can see the current state of the application and how many goroutines were created per function. In this case, withRecover has created 296 ongoing goroutines (63%), and 29 were related to a call to responseFeeder.
+
This kind of information is also beneficial if we suspect goroutine leaks. We can look at goroutine profiler data to know which part of a system is the suspect.
+
Block Profiling
+
The block profile reports where ongoing goroutines block waiting on synchronization primitives. Possibilities include
+
+
Sending or receiving on an unbuffered channel
+
Sending to a full channel
+
Receiving from an empty channel
+
Mutex contention
+
Network or filesystem waits
+
+
Block profiling also records the amount of time a goroutine has been waiting and is accessible via /debug/pprof/block. This profile can be extremely helpful if we suspect that performance is being harmed by blocking calls.
+
The block profile isn’t enabled by default: we have to call runtime.SetBlockProfileRate to enable it. This function controls the fraction of goroutine blocking events that are reported. Once enabled, the profiler will keep collecting data in the background even if we don’t call the /debug/pprof/block endpoint. Let’s be cautious if we want to set a high rate so we don’t harm performance.
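For example, a sketch of enabling it at startup (the rate value is a tuning choice, not a recommendation):

import "runtime"

func init() {
	// Sample on average one blocking event per rate nanoseconds spent blocked;
	// a rate of 1 reports every blocking event (highest overhead).
	runtime.SetBlockProfileRate(1)
}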
+
+Note
+
If we face a deadlock or suspect that goroutines are in a blocked state, the full goroutine stack dump (/debug/pprof/goroutine/?debug=2) creates a dump of all the current goroutine stack traces. This can be helpful as a first analysis step. For example, such a dump once showed a Sarama goroutine blocked for 1,420 minutes on a channel-receive operation.
Mutex Profiling

The last profile type is related to blocking but only regarding mutexes. If we suspect that our application spends significant time waiting for locking mutexes, thus harming execution, we can use mutex profiling. It’s accessible via /debug/pprof/mutex.
+
This profile works in a manner similar to that for blocking. It’s disabled by default: we have to enable it using runtime.SetMutexProfileFraction, which controls the fraction of mutex contention events reported.
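Again, a sketch of enabling it (the fraction is a tuning choice):

import "runtime"

func init() {
	// Report on average 1 out of every 100 mutex contention events.
	runtime.SetMutexProfileFraction(100)
}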
+
Following are a few additional notes about profiling:
Be sure to enable only one profiler at a time: for example, do not enable CPU and heap profiling simultaneously. Doing so can lead to erroneous observations.
+
pprof is extensible, and we can create our own custom profiles using pprof.Profile.
+
+
We have seen the most important profiles that we can enable to help us understand how an application performs and possible avenues for optimization. In general, enabling pprof is recommended, even in production, because in most cases it offers an excellent balance between its footprint and the amount of insight we can get from it. Some profiles, such as the CPU profile, lead to performance penalties but only during the time they are enabled.
+
Let’s now look at the execution tracer.
+
Execution Tracer
+
The execution tracer is a tool that captures a wide range of runtime events and makes them available for visualization with go tool. It is helpful for the following:
+
+
Understanding runtime events such as how the GC performs
+
Understanding how goroutines execute
+
Identifying poorly parallelized execution
+
+
Let’s try it with an example from the Concurrency Isn’t Always Faster in Go section. We discussed two parallel versions of the merge sort algorithm. The issue with the first version was poor parallelization, leading to the creation of too many goroutines. Let’s see how the tracer can help us validate this statement.
+
We will write a benchmark for the first version and execute it with the -trace flag to enable the execution tracer:
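Such a benchmark might look like this sketch (random and parallelMergesortV1 are the hypothetical helpers from that section, and the code lives in a _test.go file importing testing):

func BenchmarkMergesortV1(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s := random(10_000_000) // Build a fresh random slice each iteration
		parallelMergesortV1(s)
	}
}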
+
$ go test -bench=. -v -trace=trace.out
+
+
+Note
+
We can also download a remote trace file using the /debug/pprof/trace?debug=0 pprof endpoint.
+
+
This command creates a trace.out file that we can open using go tool:
+
$ go tool trace trace.out
+2021/11/26 21:36:03 Parsing trace...
+2021/11/26 21:36:31 Splitting trace...
+2021/11/26 21:37:00 Opening browser. Trace viewer is listening on
+ http://127.0.0.1:54518
+
+
The web browser opens, and we can click View Trace to see all the traces during a specific timeframe, as shown in the next figure. This figure represents about 150 ms. We can see multiple helpful metrics, such as the goroutine count and the heap size. The heap size grows steadily until a GC is triggered. We can also observe the activity of the Go application per CPU core. The timeframe starts with user-level code; then a “stop the world” is executed, which occupies the four CPU cores for approximately 40 ms.
+
+
Regarding concurrency, we can see that this version uses all the available CPU cores on the machine. However, the next figure zooms in on a portion of 1 ms. Each bar corresponds to a single goroutine execution. Having too many small bars doesn’t look right: it means execution that is poorly parallelized.
+
+
The next figure zooms even closer to see how these goroutines are orchestrated. Roughly 50% of the CPU time isn’t spent executing application code. The white spaces represent the time the Go runtime takes to spin up and orchestrate new goroutines.
+
+
Let’s compare this with the second parallel implementation, which was about an order of magnitude faster. The next figure again zooms to a 1 ms timeframe.
+
+
Each goroutine takes more time to execute, and the number of white spaces has been significantly reduced. Hence, the CPU is much more occupied executing application code than it was in the first version. Each millisecond of CPU time is spent more efficiently, explaining the benchmark differences.
+
Note that the granularity of the traces is per goroutine, not per function like CPU profiling. However, it’s possible to define user-level tasks to get insights per function or group of functions using the runtime/trace package.
+
For example, imagine a function that computes a Fibonacci number and then writes it to a global variable using atomic. We can define two different tasks:
+
var v int64
// Create a fibonacci task
ctx, fibTask := trace.NewTask(context.Background(), "fibonacci")
trace.WithRegion(ctx, "main", func() {
	v = fibonacci(10)
})
fibTask.End()

// Create a store task
ctx, fibStore := trace.NewTask(ctx, "store")
trace.WithRegion(ctx, "main", func() {
	atomic.StoreInt64(&result, v) // result is the global int64 variable mentioned above
})
fibStore.End()
+
+
Using go tool, we can get more precise information about how these two tasks perform. In the previous trace UI, we can see the boundaries for each task per goroutine. In User-Defined Tasks, we can follow the duration distribution:
+
+
We see that in most cases, the fibonacci task is executed in less than 15 microseconds, whereas the store task takes less than 6309 nanoseconds.
+
In the previous section, we discussed the kinds of information we can get from CPU profiling. What are the main differences compared to the data we can get from user-level traces?
+
+
CPU profiling:
+
Sample-based
+
Per function
+
Doesn’t go below the sampling rate (10 ms by default)
+
+
+
User-level traces:
+
Not sample-based
+
Per-goroutine execution (unless we use the runtime/trace package)
+
Execution times aren’t bound by any rate
+
+
+
+
In summary, the execution tracer is a powerful tool for understanding how an application performs. As we have seen with the merge sort example, we can identify poorly parallelized execution. However, unlike a CPU profile, the tracer’s granularity remains per goroutine unless we manually use runtime/trace. We can use both profiling and the execution tracer to get the most out of the standard Go diagnostics tools when optimizing an application.