moby/datakit

"git commit" seems to break the transaction interface

djs55 opened this issue · 4 comments

djs55 commented

If I create an empty database:

mkdir database
cd database
git init .

then start datakit:

com.docker.db --git <path>/database  -url file:///var/tmp/foo

and then run the Go unit tests:

cd go-datakit
go test

-- everything passes.

If I make a commit to the database directory with git:

cd database
echo foo > bar
git add bar
git commit -m test

then (whether I restart the server or not) running the tests hangs committing a transaction:

$ go test
2017/09/27 22:17:38 Testing the client interface
2017/09/27 22:17:38 Dialling unix /var/tmp/foo
2017/09/27 22:17:38 Remove [branch master transactions foo rw a b c]
2017/09/27 22:17:38 Testing the configuration interface
2017/09/27 22:17:38 Dialling unix /var/tmp/foo

The test which triggers this seems to be:

$ go test -run TestTransaction
2017/09/27 22:18:00 Testing the transaction interface
2017/09/27 22:18:00 Dialling unix /var/tmp/foo

If I add some printfs:

$ git diff .
diff --git a/api/go-datakit/transaction_test.go b/api/go-datakit/transaction_test.go
index f59aaef..b49912f 100644
--- a/api/go-datakit/transaction_test.go
+++ b/api/go-datakit/transaction_test.go
@@ -24,8 +24,10 @@ func TestTransaction(t *testing.T) {
        if err != nil {
                t.Fatalf("Transaction.Write failed: %v", err)
        }
+       log.Println("Commit")
        err = trans.Commit(ctx, "Test transaction")
        if err != nil {
                t.Fatalf("Transaction.Commit failed: %v", err)
        }
+       log.Println("Done")
 }

then I get

$ go test -run TestTransaction
2017/09/27 22:19:09 Testing the transaction interface
2017/09/27 22:19:09 Dialling unix /var/tmp/foo
2017/09/27 22:19:09 Commit

which means that the Commit operation is hanging.

djs55 commented

This used to work with the very old version 6f04213 but seems broken with 97b3d23

Could you let me know what you think, @samoht @talex5?

djs55 commented

I've written a test script which demonstrates the problem. The script assumes that

  • your $GOPATH is $HOME/go and you have a checkout of github.com/moby/datakit there. Modify the top of the script to change this assumption
  • you run go get github.com/docker/go-p9p

The script is:

cat <<EOT > test-datakit.sh
#!/bin/sh

DB=${DB:-/Applications/Docker.app/Contents/MacOS/com.docker.db}
DATAKIT=${DATAKIT:-${HOME}/go/src/github.com/moby/datakit/api/go-datakit}

if [ ! -e "${DB}" ]; then
  echo "Datakit binary not found at ${DB}: please set DB environment variable"
  exit 1
fi
if [ ! -e "${DATAKIT}" ]; then
  echo "Datakit source tree not found at "${DATAKIT}". Note it has to be on the GOPATH".
  exit 1
fi

echo Using datakit from: "${DB}"
echo Using test cases from source tree: "${DATAKIT}"

database=$(mktemp -d)
echo Created temporary database dir "${database}"

DB_PID=0
TEST_PID=0
clean_up () {
    rm -rf "${database}"
    kill "${DB_PID}" 2> /dev/null
    kill "${TEST_PID}" 2> /dev/null
}

trap clean_up EXIT

pushd "${database}"
git init .
echo foo > bar
git add bar
git commit bar -m test
popd

"${DB}" --git "${database}" --url file:///var/tmp/foo&
DB_PID=$!

sleep 1 # easier than performing fd passing in a shell script

cd "${DATAKIT}"
go test -run TestTransaction 2>/dev/null >/dev/null &
TEST_PID=$!

retries=5
while [ "${retries}" -ne 0 ]; do
  if kill -0 "${TEST_PID}" 2> /dev/null; then
    echo Test is still running
    retries=$(( retries - 1 ))
    sleep 1
  else
    echo Test has finished
    echo Success
    exit 0
  fi
done

echo Test has timed out
echo Failure
exit 1
EOT

If you have a Docker for Mac install with a working (old) datakit, the test passes like so:

$ ./test-datakit.sh 
Using datakit from: /Applications/Docker.app/Contents/MacOS/com.docker.db
Using test cases from source tree: /Users/djs/go/src/github.com/moby/datakit/api/go-datakit
Created temporary database dir /var/folders/5q/xqlk9dzn5jz7_hn2j31brc0r0000gn/T/tmp.M04AzKw9
/var/folders/5q/xqlk9dzn5jz7_hn2j31brc0r0000gn/T/tmp.M04AzKw9 ~
Initialized empty Git repository in /private/var/folders/5q/xqlk9dzn5jz7_hn2j31brc0r0000gn/T/tmp.M04AzKw9/.git/
[master (root-commit) af2828f] test
 1 file changed, 1 insertion(+)
 create mode 100644 bar
~
Starting com.docker.db 0.7.0-34-g6f04213 ...
Test is still running
com.docker.db: [INFO] accepted a new connection on file:///var/tmp/foo
com.docker.db: [INFO] Using protocol TwoThousand msize 16384
Test has finished
Success

If you have the latest master datakit then it fails like so:

DB=~/go/src/github.com/docker/pinata/mac/build/Docker.app/Contents/MacOS/com.docker.db ./test-datakit.sh 
Using datakit from: /Users/djs/go/src/github.com/docker/pinata/mac/build/Docker.app/Contents/MacOS/com.docker.db
Using test cases from source tree: /Users/djs/go/src/github.com/moby/datakit/api/go-datakit
Created temporary database dir /var/folders/5q/xqlk9dzn5jz7_hn2j31brc0r0000gn/T/tmp.mRISPXGH
/var/folders/5q/xqlk9dzn5jz7_hn2j31brc0r0000gn/T/tmp.mRISPXGH ~
Initialized empty Git repository in /private/var/folders/5q/xqlk9dzn5jz7_hn2j31brc0r0000gn/T/tmp.mRISPXGH/.git/
[master (root-commit) 692e305] test
 1 file changed, 1 insertion(+)
 create mode 100644 bar
~
Starting com.docker.db %%VERSION%% ...
Test is still running
com.docker.db: [INFO] accepted a new connection on file:///var/tmp/foo
com.docker.db: [INFO] Using protocol TwoThousand msize 16384
Test is still running
Test is still running
Test is still running
Test is still running
Test has timed out
Failure
com.docker.db: [ERROR] Uncaught exception handling tag 72 Write(fid: 126, offset: 0, len(data): 6): 
                       Exception: (Sys_error
  "/var/folders/5q/xqlk9dzn5jz7_hn2j31brc0r0000gn/T/tmp.mRISPXGH/.git/lock/refs/heads: No such file or directory")
                       No backtrace available.
djs55 commented

I did a simple bisection of the binary PR builds from CircleCI and have:

My guess is that it was introduced in #529

I have identified the cause: there is a regression in the way we compare the contents of reference files. For some reasons, we do not trim the contents anymore, so <hash>\n -- which is created by git commit -- is in conflict with <hash> -- as understood by Datakit/Irmin: thus the transaction can never be resolved and enter in a loop.

I am trying to understand where does the regression come from exactly and will push a fix shortly. Thanks a lot for the nice reproduction case, it helped a lot to track that down!