Introduction

One of Go’s strengths is its rich toolchain, which integrates many features such as a build system, a package manager, an automated test system, a profiler, and much more.

The race detector is one of the most advanced features of the Go toolchain and, as we shall see, is very useful for debugging both concurrency and locking problems. As you are probably already aware, Go is known for its powerful concurrent programming support (based on writing “blocking”-style code that automatically becomes asynchronous thanks to runtime-managed coroutines), and as a result many programs written in Go take full advantage of it, spawning dozens or even thousands of goroutines. The race detector is designed to make debugging concurrent software easier, helping you identify the race conditions that arise from typical bugs such as a missing mutex.

Concurrency and shared state

Go uses a shared memory model for goroutines, just as C++ or Python do for threads; this means that each goroutine has access to all the memory of the process in which it runs, and therefore some caution is needed when accessing and modifying shared state. Typically, this means using synchronisation primitives such as semaphores or mutexes, or taking advantage of the special atomic memory access instructions available on most processors.
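
To make the problem concrete, here is a minimal sketch (a standalone program of mine, not part of the example that follows) of many goroutines racing on a shared counter:

// race_sketch.go: a minimal, self-contained illustration of a data race.
package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    counter := 0 // shared state, touched by every goroutine below

    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            counter++ // unsynchronised read-modify-write: a data race
        }()
    }

    wg.Wait()
    fmt.Println(counter) // often prints less than 1000: increments got lost
}

Running it with go run -race race_sketch.go reports the race immediately.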

Forgetting to lock in the right place is one of the most insidious sources of bugs. The program can apparently work correctly during development or in the first few tests in production, but then risks exhibiting unpredictable and hard-to-reproduce behaviours, causing very annoying heisenbugs. Unfortunately, in the vast majority of cases, developers have no tools available to help them notice these problems, and the correctness of the program is therefore entrusted to the skill and attention of whoever writes the code and whoever later modifies it. And I can assure you that I’ve seen such bugs in code written by highly experienced programmers!

The problem is undoubtedly insidious in itself, and in a sense it is even exacerbated by a language with concurrency support as powerful and fast as Go’s. In Go it is so easy and efficient to write concurrent code that it is normal to use far more of it than in other languages, and this risks triggering a negative spiral that takes the code further and further from correctness… except that the Go authors thought to help programmers by providing a powerful race detector just a few keystrokes away.

Example: shared counter

Let’s take as an example a simple Go program, counter.go, that exposes a TCP server and counts the number of clients that connect. The code is a little richer than the bare minimum, because I want to show a realistic case: I implemented a Server type with a blocking Serve method, and a handleClient method that is called in a separate goroutine for each client that connects.

// counter.go: simple race detection example
package main

import (
    "fmt"
    "io"
    "log"
    "net"
)

type Server struct {
    conn       net.Listener
    numClients int
}

// NewServer creates a new Server that will listen on the specified proto/addr combo.
// See net.Dial for documentation on proto and addr.
func NewServer(proto, addr string) (*Server, error) {
    conn, err := net.Listen(proto, addr)
    if err != nil {
        return nil, err
    }

    return &Server{conn: conn}, nil
}

// Serve makes Server listen for incoming connections, and spawns a goroutine calling handleClient
// for each new connection.
func (srv *Server) Serve() {
    for {
        conn, err := srv.conn.Accept()
        if err != nil {
            log.Print(err)
            return
        }

        srv.numClients += 1
        go srv.handleClient(conn)
    }
}

// handleClient manages the communication with a single client.
// In this example, we just send a predefined message and close the connection.
func (srv *Server) handleClient(conn net.Conn) {
    io.WriteString(conn, fmt.Sprintf("Hello, you are client #%d connecting to me\n", srv.numClients))
    conn.Close()
}
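
// Close closes the listener; the Accept loop in Serve then returns with an error.
// (Added here because counter_test.go below calls srv.Close().)
func (srv *Server) Close() error {
    return srv.conn.Close()
}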

func main() {
    srv, err := NewServer("tcp", "localhost:2380")
    if err != nil {
        log.Fatal(err)
    }

    srv.Serve()
}

To run and test this program, in one terminal we run go run counter.go, while in another we execute telnet localhost 2380 several times. We should see something like this:

$ telnet localhost 2380
Trying ::1...
Connected to localhost.
Escape character is '^]'.
Hello, you are client #1 connecting to me
Connection closed by foreign host.

$ telnet localhost 2380
Trying ::1...
Connected to localhost.
Escape character is '^]'.
Hello, you are client #2 connecting to me
Connection closed by foreign host.

$ telnet localhost 2380
Trying ::1...
Connected to localhost.
Escape character is '^]'.
Hello, you are client #3 connecting to me
Connection closed by foreign host.

As you can see, the program apparently works correctly. But is this really the case? At this point, let’s try running the program with the race detector active: it is enough to pass the -race option to go run, that is: go run -race counter.go. If we now try to connect with telnet, the first time everything will be fine, but the second time we will suddenly see this output appear in the terminal where the server is running:

$ go run -race counter.go
==================
WARNING: DATA RACE
Write at 0x00c420086190 by main goroutine:
    main.(*Server).Serve()
        /Users/rasky/Sources/develer/e4daef8b5f9770c38439bf2310bc7b5d/counter.go:37 +0xae
    main.main()
        /Users/rasky/Sources/develer/e4daef8b5f9770c38439bf2310bc7b5d/counter.go:55 +0x86

Previous read at 0x00c420086190 by goroutine 7:
    runtime.convT2E()
        /usr/local/Cellar/go/1.7/libexec/src/runtime/iface.go:155 +0x0
    main.(*Server).handleClient()
        /Users/rasky/Sources/develer/e4daef8b5f9770c38439bf2310bc7b5d/counter.go:45 +0x69

Goroutine 7 (finished) created at:
    main.(*Server).Serve()
        /Users/rasky/Sources/develer/e4daef8b5f9770c38439bf2310bc7b5d/counter.go:38 +0xf0
    main.main()
        /Users/rasky/Sources/develer/e4daef8b5f9770c38439bf2310bc7b5d/counter.go:55 +0x86
==================

As you can see, the race detector has identified a data race: it noticed that two goroutines performed a write and a read at the same memory location (in this case: 0x00c420086190) without any explicit synchronisation between them. The report shows us the stack trace of each read/write, plus the ID and the creation stack trace of each goroutine involved.

In this case, to paraphrase the above, we can say that:

- the main goroutine wrote to the memory location at counter.go line 37 (the increment of numClients inside Serve);
- goroutine 7 had previously read the same location at counter.go line 45 (the fmt.Sprintf inside handleClient);
- goroutine 7 (by then finished) had been created by Serve at counter.go line 38.

If we then look at the code, we see that the race detector is warning us that the increment of the numClients variable and the read performed to print its value are in potential conflict with each other. In fact, there is no synchronisation between these two statements.

It is important to note that the race detector noticed the problem even though our telnet connections were completely sequential, not parallel. In other words, the race detector is able to identify concurrency problems without them actually occurring. It is therefore not necessary to hope and pray that the problem manifests itself while the race detector is active: it is sufficient to exercise the code under test in semi-realistic conditions, and the race detector will do its job anyway.

Data race detection via the testsuite

Manually testing a TCP server can be a tedious task and, as you know, programmers are some of the laziest professionals on this planet. It is therefore always advisable to have an automated testsuite available, and Go helps us by providing convenient libraries and test support integrated within the toolchain.

So, with the aim of debugging and resolving the data race more comfortably, let’s see how to write a simple test for our server: this is the content of counter_test.go.

// counter_test.go
package main

import (
    "bufio"
    "fmt"
    "net"
    "strings"
    "testing"
)

func TestServer(t *testing.T) {
    srv, err := NewServer("tcp", "localhost:2380")
    if err != nil {
        t.Fatal(err)
    }

    go srv.Serve()
    defer srv.Close()

    for i := 0; i < 5; i++ {
        c, err := net.Dial("tcp", "localhost:2380")
        if err != nil {
            t.Error(err)
            return
        }
        defer c.Close()

        line, err := bufio.NewReader(c).ReadString('\n')
        if err != nil || !strings.Contains(line, fmt.Sprintf("#%d ", i+1)) {
            t.Errorf("invalid text received: %q (err:%v)", line, err)
            return
        }
    }
}

To run this test, it is sufficient to launch go test in the project directory. Obviously, the version with the data race will seem to work perfectly:

$ go test
PASS
ok         _/Users/rasky/Sources/develer/e4daef8b5f9770c38439bf2310bc7b5d  0.016s

But even here, it is sufficient to add -race to notice the problem:

$ go test -race
==================
WARNING: DATA RACE
Write at 0x00c420082190 by goroutine 8:
    _/Users/rasky/Sources/develer/e4daef8b5f9770c38439bf2310bc7b5d.(*Server).Serve()
        /Users/rasky/Sources/develer/e4daef8b5f9770c38439bf2310bc7b5d/counter.go:37 +0xae

Previous read at 0x00c420082190 by goroutine 10:
    runtime.convT2E()
        /usr/local/Cellar/go/1.7/libexec/src/runtime/iface.go:155 +0x0
    _/Users/rasky/Sources/develer/e4daef8b5f9770c38439bf2310bc7b5d.(*Server).handleClient()
        /Users/rasky/Sources/develer/e4daef8b5f9770c38439bf2310bc7b5d/counter.go:49 +0x69

Goroutine 8 (running) created at:
    _/Users/rasky/Sources/develer/e4daef8b5f9770c38439bf2310bc7b5d.TestServer()
        /Users/rasky/Sources/develer/e4daef8b5f9770c38439bf2310bc7b5d/counter_test.go:18 +0xe4
    testing.tRunner()
        /usr/local/Cellar/go/1.7/libexec/src/testing/testing.go:610 +0xc9

Goroutine 10 (finished) created at:
    _/Users/rasky/Sources/develer/e4daef8b5f9770c38439bf2310bc7b5d.(*Server).Serve()
        /Users/rasky/Sources/develer/e4daef8b5f9770c38439bf2310bc7b5d/counter.go:38 +0xf0
==================
PASS
2016/08/20 17:30:32 accept tcp 127.0.0.1:2380: use of closed network connection
Found 1 data race(s)
exit status 66
FAIL       _/Users/rasky/Sources/develer/e4daef8b5f9770c38439bf2310bc7b5d  1.029s

As you can see, the test itself passes (PASS), because no function of the testing library (such as t.Error) was invoked to flag a failure; however, the whole testsuite is marked as FAIL because a data race was found during execution. Even though the data race did not itself cause a malfunction severe enough to make the test fail, Go suggests that there are still important problems to fix.

It is good practice to use a continuous integration system such as Travis CI or Circle CI to run the testsuite on each commit made by the team. For Go software, the CI should always run the testsuite via go test -race, in order to notice concurrency problems as soon as possible.
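
As a reference, here is a minimal Travis CI configuration sketch (illustrative: the Go version matches the one visible in the logs above, and you should adapt it to your project):

# .travis.yml: run the testsuite with the race detector on every commit
language: go

go:
  - "1.7"

script:
  - go test -race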

How to solve a data race

How do we solve the problem identified by the race detector? A first approach is to introduce a mutex to synchronise the accesses. This extract from counter_mutex.go shows how:

[...]

type Server struct {
    conn          net.Listener
    numClientLock sync.Mutex
    numClients    int
}

[...]

    srv.numClientLock.Lock()
    srv.numClients += 1
    srv.numClientLock.Unlock()

[...]

func (srv *Server) handleClient(conn net.Conn) {
    srv.numClientLock.Lock()
    nc := srv.numClients
    srv.numClientLock.Unlock()
    io.WriteString(conn, fmt.Sprintf("Hello, you are client #%d connecting to me\n", nc))

[...]

If you now run the program via go run -race counter_mutex.go and try a few more connections, you will see that the race detector no longer complains. For further information on using mutexes, you can read the sync.Mutex documentation. Other synchronisation primitives are also available, such as sync.RWMutex or sync.Once.
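
To give an idea of the first of these: sync.RWMutex allows multiple concurrent readers while keeping writers exclusive, which pays off when shared state is read far more often than it is written. Here is a minimal sketch of mine (the Stats type and its methods are illustrative names, not part of counter.go):

// rwmutex_sketch.go: read-heavy shared state protected by sync.RWMutex.
package main

import (
    "fmt"
    "sync"
)

type Stats struct {
    mu    sync.RWMutex
    count int
}

// Get takes a shared (read) lock: many readers may hold it at the same time.
func (s *Stats) Get() int {
    s.mu.RLock()
    defer s.mu.RUnlock()
    return s.count
}

// Inc takes an exclusive (write) lock, stalling readers and other writers.
func (s *Stats) Inc() {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.count++
}

func main() {
    var s Stats
    s.Inc()
    fmt.Println(s.Get()) // prints 1
}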

A small suggestion on this topic: it is always good to hold locks for as short a time as possible. This is in fact why I isolated the read of the shared state into a separate statement, instead of locking around the io.WriteString call, which would keep the mutex held for the entire duration of the network I/O.
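
For contrast, this is the shape I avoided (an illustrative rewrite of handleClient from counter_mutex.go, not code from the article's sources):

// Anti-pattern (illustrative): holding the mutex across network I/O.
// Every other goroutine that touches numClients stalls until this
// client's write completes, including the Accept loop in Serve.
func (srv *Server) handleClient(conn net.Conn) {
    srv.numClientLock.Lock()
    io.WriteString(conn, fmt.Sprintf("Hello, you are client #%d connecting to me\n", srv.numClients))
    srv.numClientLock.Unlock()
    conn.Close()
}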

Another possible approach in this specific case, since the contention is on a simple integer variable, is to use the processor’s atomic instructions. This is the extract from counter_atomic.go that shows how to proceed:

[...]

type Server struct {
    conn       net.Listener
    numClients int64
}

[...]

    atomic.AddInt64(&srv.numClients, 1)

[...]

func (srv *Server) handleClient(conn net.Conn) {
    nc := atomic.LoadInt64(&srv.numClients)
    io.WriteString(conn, fmt.Sprintf("Hello, you are client #%d connecting to me\n", nc))

[...]

In this case, we used the atomic.AddInt64 function to perform an atomic increment, while the atomic read is performed with atomic.LoadInt64. Atomic accesses are an interesting alternative to mutexes; they are much faster because they do not cause context switches. However, these primitives are somewhat tricky to use correctly, so it is better to reserve them for cases where an actual performance problem has been measured (often a rare condition); for further information, you can read the sync/atomic package documentation.

It is also interesting to note the power of the race detector in this case: if we keep the atomic.AddInt64 but remove the atomic.LoadInt64, a data race is reported anyway. This may seem obvious at first, but in reality it is not at all: on x86-64, while atomic.AddInt64 is implemented through a special assembly instruction (LOCK XADDQ), atomic.LoadInt64 is nothing but a normal memory access, because the x86-64 architecture guarantees that aligned 64-bit reads from memory are already atomic. As a result, the race detector is not only signalling a potential data race, but a data race that can only occur on architectures other than the one on which it is run, such as ARM32, where reading a 64-bit variable must necessarily take place with two separate memory accesses, and therefore non-atomically.

Conclusion

The race detector is, therefore, a very important weapon in every programmer’s arsenal, and Go programmers can sleep peacefully knowing they have such a powerful one perfectly integrated into the standard toolchain and available at all times.

The race detector is currently only available on 64-bit architectures; if you use Go to compile for 32-bit embedded systems, such as ARM, you will unfortunately not be able to run the race detector natively on the target device. In this case, my advice is to preserve, from the beginning of development, the possibility of running the program (or at least a significant part of it) on your development system, so that you can benefit from this and numerous other features without having to move execution to the target each time.
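
In practice, the workflow might look like this (a sketch; the GOOS/GOARCH values are just one example of a 32-bit ARM Linux target):

# Run the race-enabled testsuite on the 64-bit development machine...
$ go test -race

# ...then cross-compile for the 32-bit target, where -race is not supported.
$ GOOS=linux GOARCH=arm go build counter.go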

For further information on the race detector, you can continue reading in the official documentation.