Basic performance testing
jacobsa opened this issue · 14 comments
We should do some performance testing to identify completely obvious bottlenecks.
- Run on GCE, on a mid-range machine.
- Generate random data of various sizes (1 KiB, 1 MiB, 10 MiB, 100 MiB, 1 GiB, 10 GiB?)
- Use
time cp
to measure wall time taken to copy from local disk to GCS. - Find some way to measure CPU time taken by the gcsfuse process during the operation, too.
- Look at throughput in terms of bytes per wall second and bytes per CPU second.
- As a baseline, compare both to
gsutil cp
. (Both can be measured here by just usingtime gsutil cp
, I think.)
Setting up in us-central1-a
on n1-standard-4
with image backports-debian-7-wheezy-v20150320
and standard persistent disk.
Setup notes
Create an instance with a built-in key that can access GCS:
gcloud compute instances create test --scopes storage-full --image debian-7-backports --machine-type n1-standard-4 --project github-jacobsa-gcloud --zone us-central1-a
Copy in key file:
gcloud compute --project github-jacobsa-gcloud copy-files --zone us-central1-a '/Users/jacobsa/Downloads/GCloud GitHub project-bc05fbd617c4.json' test:
ssh in and set up the shell:
gcloud compute --project github-jacobsa-gcloud ssh --zone us-central1-a test
set -o vi
Install and set up fuse:
sudo apt-get update && sudo apt-get install fuse
sudo adduser $USER fuse
sudo chmod g+rw /dev/fuse
sudo chgrp fuse /dev/fuse
Log out and back in so the group change takes effect:
exit
[...]
gcloud compute --project github-jacobsa-gcloud ssh --zone us-central1-a test
set -o vi
Install Go:
wget https://storage.googleapis.com/golang/go1.4.2.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf go1.4.2.linux-amd64.tar.gz
export GOPATH=$HOME/go
export PATH=$PATH:/usr/local/go/bin
export PATH=$PATH:$GOPATH/bin
Install Git:
sudo apt-get update && sudo apt-get -y install git-core
Install gcsfuse:
go get -v github.com/googlecloudplatform/gcsfuse
Test notes
Generate some data:
dd if=/dev/urandom of=foo bs=1048576 count=1 # 1 MiB
dd if=/dev/urandom of=foo bs=1048576 count=100 # 100 MiB
dd if=/dev/urandom of=foo bs=1048576 count=1024 # 1 GiB
gcsfuse
Mount gcsfuse:
mkdir -p mp
gcsfuse --bucket jacobsa-gcs-integration-test --key_file GCloud\ GitHub\ project-bc05fbd617c4.json mp &
Copy in the file:
rm -f mp/foo
time cp foo mp/
gsutil
gsutil rm -f gs://jacobsa-gcs-integration-test/foo
time gsutil cp foo gs://jacobsa-gcs-integration-test/
Results from a few runs
100 MiB
gcsfuse
real 0m8.240s
user 0m0.020s
sys 0m0.952s
real 0m6.245s
user 0m0.016s
sys 0m0.732s
real 0m6.353s
user 0m0.000s
sys 0m0.748s
gsutil
real 0m2.620s
user 0m1.548s
sys 0m0.672s
real 0m3.036s
user 0m1.552s
sys 0m0.700s
real 0m2.965s
user 0m1.384s
sys 0m0.684s
1 GiB
gcsfuse
real 1m5.938s
user 0m0.048s
sys 0m9.492s
real 1m1.468s
user 0m0.040s
sys 0m8.384s
gsutil
real 0m16.926s
user 0m6.420s
sys 0m2.832s
real 0m20.855s
user 0m6.408s
sys 0m2.928s
So pretty slow, but not as terribly slow as I had feared. Some things to look at:
- Is it possible
cp
is doing something braindead here? Try usingdd
instead, where we can control the block size. - What proportion of the time spent is writing to the local file vs. uploading vs. everything else? Add logging to find out.
- Confirm that the kernel is sending 4 KiB writes, which is probably terrible, then fix.
Trying with dd:
# <Generate 1 GiB foo>
rm -f mp/foo
time dd if=foo of=mp/foo bs=1048576
which gives:
real 0m58.100s
user 0m0.012s
sys 0m8.796s
which is not much different. Only slightly better with a block size of 10x that, on a single run:
real 0m53.033s
user 0m0.008s
sys 0m7.824s
Confirmed: writes are 4 KiB.
Hacked together a binary that uses bazilfuse.InitBigWrites
in the response to fuse's init request, which appears to result in 32 KiB writes from the kernel.
New results for 1 GiB:
real 0m43.312s
user 0m0.052s
sys 0m2.656s
real 0m37.229s
user 0m0.064s
sys 0m1.980s
Better.
Surprisingly, InitWritebackCache
plus big writes appears to make things much worse:
real 1m42.914s
user 0m0.012s
sys 0m0.220s
Hacked in some logging that shows that with 32 KiB most of the time is spent in ObjectProxy.Sync, which is doing the network write:
2015/03/25 05:05:47.256616 object_proxy.go:244: WriteAt called
2015/03/25 05:05:47.256658 object_proxy.go:412: ensureLocalFile called
2015/03/25 05:05:47.293666 object_proxy.go:423: ensureLocalFile returning
2015/03/25 05:05:47.293758 object_proxy.go:259: WriteAt returning
2015/03/25 05:05:52.133747 object_proxy.go:296: Sync called
2015/03/25 05:06:29.445792 object_proxy.go:342: Sync returning
real 0m42.493s
user 0m0.044s
sys 0m1.908s
Now that's actually a bit odd to me. What is our io.Copy
doing that gsutil is not? Serving from a bytes.Reader
rather than the file doesn't seem to make a difference.
FWIW, here is the CPU profile while calling CreateObject:
flat flat% sum% cum cum%
8900ms 48.79% 48.79% 8900ms 48.79% crypto/cipher.(*gcm).mul
2010ms 11.02% 59.81% 2010ms 11.02% crypto/aes.encryptBlockAsm
1960ms 10.75% 70.56% 1960ms 10.75% runtime.memmove
760ms 4.17% 74.73% 760ms 4.17% crypto/cipher.getUint64
560ms 3.07% 77.80% 10200ms 55.92% crypto/cipher.(*gcm).updateBlocks
540ms 2.96% 80.76% 540ms 2.96% runtime.futex
520ms 2.85% 83.61% 560ms 3.07% syscall.Syscall
480ms 2.63% 86.24% 2640ms 14.47% crypto/aes.(*aesCipher).Encrypt
460ms 2.52% 88.76% 3760ms 20.61% crypto/cipher.(*gcm).counterCrypt
310ms 1.70% 90.46% 310ms 1.70% crypto/cipher.fastXORWords
Although gcsfuse CPU usage hovers at about 50%, so that's probably not the bottleneck.
To do for alpha: Document these numbers, set performance expectations.
Switched to resumable object uploads in jacobsa/gcloud@3b1af22, which seems to make a bit of a difference to upload speed. As far as I can tell, this effect is responsible for most of the remaining delta with gsutil. (Ignoring the unavoidable double-copy overhead when staging the temporary file in gcsfuse.)
To do:
- Wait for jacobsa/gcsfuse#27 (increasing kernel write size).
- Try the test above again.
- Set performance expectations in the docs.
- Call it good enough or not. If good enough, close this bug. Either way remove from alpha milestone.
Issue #27 is fixed.
Tried again with 1 GiB:
real 0m27.226s
user 0m0.064s
sys 0m2.152s
real 0m24.109s
user 0m0.040s
sys 0m1.736s
real 0m25.231s
user 0m0.048s
sys 0m1.616s
Much much better, and proabably good enough for now. Let's document that performance is comparable with gsutil, and wait for specific complaints.