awnumar/memguard

open enclave dead lock

wodadehencou opened this issue · 6 comments

Describe the bug
Opening an enclave can deadlock on an RWMutex, caused by recursively read-locking the RWMutex, which is not permitted in Go.

From src/sync/rwmutex.go in the Go standard library:

// If a goroutine holds a RWMutex for reading and another goroutine might
// call Lock, no goroutine should expect to be able to acquire a read lock
// until the initial read lock is released. In particular, this prohibits
// recursive read locking. This is to ensure that the lock eventually becomes
// available; a blocked Lock call excludes new readers from acquiring the
// lock.
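
To make the failure mode concrete, here is a minimal standalone sketch (not memguard code; the names are only illustrative) of the pattern the Go comment describes: a goroutine holding a read lock tries to read-lock again while a writer is blocked in Lock(); the blocked writer bars new readers, so the second RLock() never returns.

package main

import (
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	mu.RLock() // outer read lock held by this goroutine
	go func() {
		mu.Lock() // writer parks behind the outer read lock and bars new readers
		mu.Unlock()
	}()
	time.Sleep(100 * time.Millisecond) // give the writer time to park inside Lock

	mu.RLock() // recursive read lock: blocks forever behind the parked writer
	mu.RUnlock()
	mu.RUnlock()
}

With many goroutines opening enclaves concurrently, this interleaving eventually occurs, which matches the few-minute window described below.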

To Reproduce
Steps to reproduce the behaviour:

  1. Create a test case that opens enclaves in parallel, e.g. from 20 goroutines.
  2. The process deadlocks within about 5 minutes, blocked in RLock() inside func (s *Coffer) Destroyed() bool.
  3. The problem occurs on both macOS and Linux.

Expected behaviour
All goroutines keep running normally.

System:

  • OS and Kernel Versions: macOS 10.15
  • Memguard Version: v0.19.1 & v0.22.1
  • Go Version: go1.12 & go1.14

Additional context
The test case below reproduces the issue. CPU usage starts at about 100%, then drops to 0%; at that point the process is deadlocked.

package main

import (
	"bytes"
	"context"
	"crypto/rand"
	"errors"
	"fmt"
	"io"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/awnumar/memguard"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	go func() {
		<-sigs
		cancel()
	}()
	OpenEnclave(ctx)
}

func OpenEnclave(ctx context.Context) {
	n := 10
	data := make([][]byte, n)
	enclaves := make([]*memguard.Enclave, n)
	for i := range data {
		data[i] = make([]byte, 32)
		buf := make([]byte, 32)
		io.ReadFull(rand.Reader, buf)
		copy(data[i], buf)
		enclaves[i] = memguard.NewEnclave(buf)
	}

	threads := 20
	for i := 0; i < threads; i++ {
		j := 0
		go func(ctx context.Context) {
			for {
				select {
				case <-ctx.Done():
					return
				default:
					{
						// fmt.Printf("open enclave %d \n", j)
						immediateOpen(ctx, enclaves[j], data[j])
						j = (j + 1) % n
					}
				}
			}
		}(ctx)
	}
	<-ctx.Done()
	time.Sleep(time.Second)
}

func openVerify(lock *memguard.Enclave, exp []byte) error {
	lb, err := lock.Open()
	if err != nil {
		return err
	}
	defer lb.Destroy()
	if !bytes.Equal(lb.Bytes(), exp) {
		return errors.New("open verify fail")
	}
	return nil
}

func immediateOpen(ctx context.Context, lock *memguard.Enclave, exp []byte) {
	start := time.Now()
	c1 := make(chan error, 1)
	go func() {
		err := openVerify(lock, exp)
		c1 <- err
	}()
	var dur time.Duration
	select {
	case err := <-c1:
		{
			dur = time.Since(start)
			if err != nil {
				fmt.Printf("### open fail: %s \n", err)
			}
		}
	case <-ctx.Done():
		{
			dur = time.Since(start)
			fmt.Printf("### timeout \n")
		}
	}
	// fmt.Printf("%d, %d \n", start.UnixNano(), dur.Nanoseconds())
	_ = dur
}

Thanks for reporting. 👍

After applying the patch the behaviour seems identical. A bunch of timeouts after a period:

...
1587048652522845416, 36642766 
1587048652555571711, 3972773 
1587048652544642401, 14974528 
1587048652555173078, 4526586 
^C### timeout 
1587048652559703689, 27796593682 
### timeout 
1587048652559624289, 27796714889 
### timeout 
1587048652559552655, 27796801999 
...

I cannot reproduce the problem on my platform (go1.14, macOS). @awnumar, could you please add two lines of code at the end of the OpenEnclave function to print the stack information?

         // ...
	<-ctx.Done()
	time.Sleep(time.Second)

	// dump all goroutine stacks (requires importing "runtime")
	buf := make([]byte, 1<<20)
	fmt.Println(string(buf[:runtime.Stack(buf, true)]))
}

I have read through all of the memguard code; there are some other places with the same bug. I will open a PR as soon as possible.

Here's the output: trace.txt
go version go1.14.2 linux/amd64

This is caused by the Purge() function. The operation flow is:
enclave.Open -> Key.View --(acquires RLock)--> NewBuffer -> memcall.Alloc [fails] -> Panic -> Purge -> key.Lock (requests the write lock)

This is another deadlock case, but I don't know why memcall.Alloc failed on your system, @awnumar.
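
Roughly, the flow can be sketched like this (a standalone sketch, not memguard's actual code; the names only approximate the internals): the reader still holds the key's read lock when the allocation failure panics, and the recovery path calls a purge that wants the write lock on the same mutex, so the goroutine deadlocks against itself.

package main

import "sync"

type key struct{ sync.RWMutex }

func (k *key) view() {
	k.RLock()
	defer k.RUnlock()
	newBuffer(k) // panics if the allocation fails
}

func newBuffer(k *key) {
	defer func() {
		if r := recover(); r != nil {
			purge(k) // cleanup runs while view() still holds the read lock
			panic(r)
		}
	}()
	panic("memcall.Alloc failed") // stand-in for the allocation failure
}

func purge(k *key) {
	k.Lock() // write lock requested by the goroutine that holds the read lock
	defer k.Unlock()
	// wipe key material, rekey, ...
}

func main() {
	(&key{}).view() // fatal error: all goroutines are asleep - deadlock!
}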

A simple solution for all of these deadlock cases is to arrange all of the Lock() and Unlock() calls carefully. It will be slightly less efficient, though, because a short sequence of operations that currently runs under a single lock acquisition will then need more than one.
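
For example, one possible arrangement (an illustrative sketch only, not the actual patch) is to release the read lock before anything that needs the write lock runs, and to return the allocation failure as an error instead of panicking inside the locked region. The price is exactly what is noted above: one logical operation now acquires the lock twice.

package main

import (
	"errors"
	"sync"
)

type coffer struct{ mu sync.RWMutex }

// view reads under the read lock but postpones the cleanup that needs the
// write lock until after RUnlock, instead of triggering it while the read
// lock is still held.
func (c *coffer) view(alloc func() error) error {
	c.mu.RLock()
	err := alloc() // a failure is returned, not panicked, while locked
	c.mu.RUnlock()

	if err != nil {
		c.purge() // safe: the read lock is no longer held
		return err
	}
	return nil
}

func (c *coffer) purge() {
	c.mu.Lock() // second lock acquisition for the same logical operation
	defer c.mu.Unlock()
	// wipe key material, rekey, ...
}

func main() {
	c := &coffer{}
	_ = c.view(func() error { return errors.New("alloc failed") })
}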

Thanks for looking into this and providing a patch. I have merged your PR and added you to the authors file.

I can no longer reproduce the deadlock on my system. Your PoC does however produce some panics that are very interesting. I have added it to the examples submodule to investigate when I have more time.

Thanks again, and keep safe :)