So far we've seen how to verify the correctness of our programs, i.e. that our programs do what they're supposed to do.
To do so we've used golint, go vet, and go test.
Correctness is necessary, but not sufficient. We also need our programs to be fast. And how do we measure that?
Let's work again with the web server program we used before for debugging. There is a slightly modified version of the code in the webserver directory.
package main

import (
	"fmt"
	"log"
	"net/http"
	"regexp"
)

func main() {
	http.HandleFunc("/", handler)
	log.Printf("listening on localhost:8080")
	log.Fatal(http.ListenAndServe("localhost:8080", nil))
}

func handler(w http.ResponseWriter, r *http.Request) {
	if name, ok := isGopher(r.URL.Path[1:]); ok {
		fmt.Fprintf(w, "hello, %s", name)
		return
	}
	fmt.Fprintln(w, "hello, stranger")
}

func isGopher(email string) (string, bool) {
	re := regexp.MustCompile("^([[:alpha:]]+)@itu.dk$")
	match := re.FindStringSubmatch(email)
	if len(match) == 2 {
		return match[1], true
	}
	return "", false
}
A pretty straightforward way to measure the performance of any job is to run it and track the time it takes to finish. It is not a perfect measure, but it is a useful one to start with.
In the case of a program that runs for a limited time (unlike servers, which run indefinitely), I recommend using the time command.
$ go get github.com/golang/example/hello
$ time $GOPATH/bin/hello
Hello, Go examples!
$GOPATH/bin/hello 0.00s user 0.00s system 15% cpu 0.016 total
Unfortunately, this doesn't work for web servers, where we would need to track how long it takes to answer a single request.
There are many tools that provide this functionality, such as Apache Bench. In this workshop we will use go-wrk, which can be installed with go get:
$ go get github.com/tsliwowicz/go-wrk
Start running the webserver binary, then run go-wrk to get an analysis of the server's performance.
$ cd 3-dynamic-analysis/3-profiling/webserver
$ go run main.go
Now run go-wrk in a different terminal:
$ go-wrk -d 5 http://localhost:8080
- How many requests can the server handle per second? What is the mean response time? How long did the slowest request take?
With the previous exercise we were able to put a number on how fast, or slow, our server performs. But this doesn't necessarily help us figure out which piece of code we should change. We need a deeper analysis.
How can we measure the performance of specific pieces of code? We want something like unit testing, but for performance analysis. That is exactly what benchmarks are.
In Go, benchmarks are very similar to unit tests. Simply write a function named BenchmarkXXX, where XXX refers to what you're benchmarking, with the signature func(*testing.B).
How fast is the function strings.Fields? We can write a benchmark for that:
var res []string

func BenchmarkFields(b *testing.B) {
	for i := 0; i < b.N; i++ {
		res = strings.Fields("hello, dear friend")
	}
}
You can then run this benchmark with go test, providing the -bench flag and a regular expression so that only the benchmarks whose names match will be executed. To execute all of them, you can use -bench=. since the pattern . matches every name.
Note that we repeat the operation we want to benchmark (in this case strings.Fields) b.N times. This value is adjusted by go test until the resulting times are statistically significant.
$ go test -bench=.
BenchmarkFields-8 5000000 259 ns/op
PASS
ok _/Users/jofu/Dropbox/ITU/teaching/Scalability_of_Web_Systems/scalable_web_systems_develop.git/session-05/3-profiling/fields 1.570s
The number of memory allocations is also a good indicator of how well a function might perform at scale, and you can obtain it by adding -benchmem.
$ go test -bench=. -benchmem
BenchmarkFields-8 5000000 253 ns/op 48 B/op 1 allocs/op
PASS
ok _/Users/jofu/Dropbox/ITU/teaching/Scalability_of_Web_Systems/scalable_web_systems_develop.git/session-05/3-profiling/fields 1.537s
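If you prefer not to pass -benchmem every time, a benchmark can also request allocation statistics itself by calling b.ReportAllocs(). Here is a minimal sketch based on the Fields benchmark above:

func BenchmarkFields(b *testing.B) {
	b.ReportAllocs() // report allocation stats for this benchmark even without -benchmem
	for i := 0; i < b.N; i++ {
		res = strings.Fields("hello, dear friend")
	}
}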
- What happens when you don't store the result of strings.Fields in res? Why?
Sometimes your benchmarks might be way faster than you would expect. When that happens, don't be too happy just yet: the speedup might be caused by a compiler optimization that simply removed the function call you thought you were measuring.
For instance, consider this code:
func div(a, b int) int {
	return int(float64(a) / float64(b))
}

func BenchmarkDiv(b *testing.B) {
	for i := 0; i < b.N; i++ {
		div(2, 3)
	}
}
If we run the benchmark we will get a surprising result:
$ go test -bench=Div
PASS
ok _/Users/jofu/Dropbox/ITU/teaching/Scalability_of_Web_Systems/scalable_web_systems_develop.git/session-05/3-profiling/div 0.006s
This seems a bit too fast, even for a piece of code as simple as the one we're benchmarking. Also, do you notice something missing from the output?
We can see the generated code by adding -gcflags "-S":
$ go test benchmarks_test.go -bench=Div -gcflags "-S" 2> out.text
BenchmarkDiv-8 2000000000 0.28 ns/op
PASS
ok command-line-arguments 0.612s
$ cat out.text
# omitted content
"".BenchmarkDiv STEXT nosplit size=25 args=0x8 locals=0x0
0x0000 00000 (benchmarks_test.go:23) TEXT "".BenchmarkDiv(SB), NOSPLIT, $0-8
0x0000 00000 (benchmarks_test.go:23) FUNCDATA $0, gclocals·a36216b97439c93dafebe03e7f0808b5(SB)
0x0000 00000 (benchmarks_test.go:23) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (benchmarks_test.go:23) MOVQ "".b+8(SP), AX
0x0005 00005 (benchmarks_test.go:23) MOVL $0, CX
0x0007 00007 (benchmarks_test.go:24) JMP 12
0x0009 00009 (benchmarks_test.go:24) INCQ CX
0x000c 00012 (benchmarks_test.go:24) MOVQ 240(AX), DX
0x0013 00019 (benchmarks_test.go:24) CMPQ CX, DX
0x0016 00022 (benchmarks_test.go:24) JLT 9
0x0018 00024 (benchmarks_test.go:27) RET
# omitted content
You don't need to understand the whole thing, but the point is that there's no function call, and no division, anywhere in the benchmark loop.
The problem here is that the function call to div
has been inlined, and the inlined code
has been deleted as it can be proven that its result is never used.
- You can see when a function call is inlined by using -gcflags "-m". Try it.
A possible fix would be to use the //go:noinline pragma to avoid having div inlined. This would fix the problem, but it would also alter the results of our performance analysis.
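For reference, this is roughly what that would look like; the directive must sit right above the function declaration:

// A sketch of the noinline approach: the directive below tells the compiler
// not to inline div, so the call survives in the benchmark loop. It also adds
// call overhead that the original, inlinable code doesn't pay.
//go:noinline
func div(a, b int) int {
	return int(float64(a) / float64(b))
}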
A better idea is to store the result of the function call in a package variable. This will force the compiler to keep the call.
var s int

func BenchmarkDiv_Escape(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s = div(2, 3)
	}
}
The generated assembly now contains a division instruction, DIVSD, showing that the function was still inlined but its code is no longer removed.
$ go test benchmarks_test.go -bench=Div_Escape -gcflags "-S" 2> out.text
BenchmarkDiv_Escape-8 500000000 3.75 ns/op
PASS
ok command-line-arguments 2.277s
$ cat out.text
# omitted content
"".BenchmarkDiv_Escape STEXT nosplit size=67 args=0x8 locals=0x0
0x0000 00000 (benchmarks_test.go:30) TEXT "".BenchmarkDiv_Escape(SB), NOSPLIT, $0-8
0x0000 00000 (benchmarks_test.go:30) FUNCDATA $0, gclocals·a36216b97439c93dafebe03e7f0808b5(SB)
0x0000 00000 (benchmarks_test.go:30) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (benchmarks_test.go:30) MOVQ "".b+8(SP), AX
0x0005 00005 (benchmarks_test.go:30) MOVL $0, CX
0x0007 00007 (benchmarks_test.go:31) JMP 54
0x0009 00009 (benchmarks_test.go:32) MOVL $2, DX
0x000e 00014 (benchmarks_test.go:32) XORPS X0, X0
0x0011 00017 (benchmarks_test.go:32) CVTSQ2SD DX, X0
0x0016 00022 (benchmarks_test.go:32) MOVL $3, BX
0x001b 00027 (benchmarks_test.go:32) XORPS X1, X1
0x001e 00030 (benchmarks_test.go:32) CVTSQ2SD BX, X1
0x0023 00035 (benchmarks_test.go:32) DIVSD X1, X0 # a division
0x0027 00039 (benchmarks_test.go:32) CVTTSD2SQ X0, SI
0x002c 00044 (benchmarks_test.go:32) MOVQ SI, "".s(SB)
0x0033 00051 (benchmarks_test.go:31) INCQ CX
0x0036 00054 (benchmarks_test.go:31) MOVQ 240(AX), DX
0x003d 00061 (benchmarks_test.go:31) CMPQ CX, DX
0x0040 00064 (benchmarks_test.go:31) JLT 9
0x0042 00066 (benchmarks_test.go:34) RET
# omitted content
Write a benchmark for the isGopher
function in webserver/main.go.
Do not change the implementation of isGopher
just yet.
How fast is it? How many allocations per operation does it require?
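If you need a starting point, a benchmark for isGopher could look something like this sketch; it assumes a file such as main_test.go next to webserver/main.go, and the input string and sink variable names are just examples:

package main

import "testing"

// Package-level sinks so the compiler cannot discard the calls to isGopher.
var (
	benchName string
	benchOK   bool
)

func BenchmarkIsGopher(b *testing.B) {
	for i := 0; i < b.N; i++ {
		benchName, benchOK = isGopher("gopher@itu.dk")
	}
}

Run it with go test -bench=IsGopher -benchmem to see both the time and the allocations per operation.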
Write a benchmark for the sum package we tested before.
Try writing an alternative implementation of Sum
that uses iteration rather
than recursion. Which one is faster? By how much?
Is there a big difference in performance between for _, v := range vs and for i := 0; i < len(vs); i++? Why do you think that is?
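As a starting point for these last exercises, the sketch below shows an iterative alternative written both with range and with an explicit index, plus benchmarks that store the results in a package variable. It assumes Sum takes a []int and returns an int; adjust the signatures and names to whatever the sum package actually uses.

// A possible sum_test.go sketch.
package sum

import "testing"

// This sketch assumes the existing recursive Sum has the signature func Sum(vs []int) int.

// sumIter is a hypothetical iterative alternative using range.
func sumIter(vs []int) int {
	total := 0
	for _, v := range vs {
		total += v
	}
	return total
}

// sumIterIndexed is the same loop written with an explicit index.
func sumIterIndexed(vs []int) int {
	total := 0
	for i := 0; i < len(vs); i++ {
		total += vs[i]
	}
	return total
}

var sink int // package variable so the calls aren't optimized away

func BenchmarkSum(b *testing.B) {
	vs := make([]int, 1024)
	for i := 0; i < b.N; i++ {
		sink = Sum(vs)
	}
}

func BenchmarkSumIter(b *testing.B) {
	vs := make([]int, 1024)
	for i := 0; i < b.N; i++ {
		sink = sumIter(vs)
	}
}

func BenchmarkSumIterIndexed(b *testing.B) {
	vs := make([]int, 1024)
	for i := 0; i < b.N; i++ {
		sink = sumIterIndexed(vs)
	}
}

Run all three with go test -bench=Sum -benchmem and compare the results.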
You just learned how to find out how well a server performs, and how to measure specific parts of your code via benchmarks.
This is good, but it still doesn't tell us which piece of code we should optimize first! To do so, we'll learn how to use pprof in the next chapter.