uber/aresdb

CUDA DeviceFree: out of memory error when building AresDB for the first time

alxmrr opened this issue · 8 comments

Describe the issue
When running 'make run_server' to build version 0.0.2, the build fails with a DeviceFree: out of memory error after a few minutes. I am using a new server with no other processes running.

Reproduce the issue
NVIDIA driver version: 390.48
CUDA version: release 9.1, V9.1.85
Go version: 1.13
GCC version: 5.4.0
CMake version: 3.15.4

Follow the instructions to compile AresDB version 0.0.2 via 'make run_server'.

Error message
[ 15%] Built target mem
[100%] Built target algorithm
[100%] Built target lib
[100%] Built target aresd
Using config file:  config/ares.yaml
{"level":"info","msg":"Bootstrapping service","config":{"Port":9374,"DebugPort":43202,"RootPath":"ares-root","TotalMemorySize":161061273600,"SchedulerOff":false,"Version":"","Env":"","Query":{"DeviceMemoryUtilization":0.95,"DeviceChoosingTimeout":10,"TimezoneTable":{"TableName":"api_cities"},"EnableHashReduction":false},"DiskStore":{"WriteSync":true},"HTTP":{"MaxConnections":300,"ReadTimeOutInSeconds":20,"WriteTimeOutInSeconds":300},"RedoLogConfig":{"DiskConfig":{"Disabled":false},"KafkaConfig":{"Enabled":false,"Brokers":null,"TopicSuffix":""},"DiskOnlyForUnsharded":false},"Cluster":{"Enable":false,"Distributed":false,"Namespace":"","InstanceID":"","Controller":{"Address":"localhost:6708","Headers":null,"TimeoutSec":0},"Etcd":{"Zone":"local","Env":"dev","Service":"ares-datanode","CacheDir":"","ETCDClusters":[{"Zone":"local","Endpoints":["127.0.0.1:2379"],"KeepAlive":null,"TLS":null}],"SDConfig":{"InitTimeout":null},"WatchWithRevision":0},"HeartbeatConfig":{"Timeout":10,"Interval":1}}}}
panic: ERROR when calling CUDA functions: DeviceFree: out of memory
 
goroutine 1 [running]:
github.com/uber/aresdb/utils.StackError(0x0, 0x0, 0xc00004e040, 0x3d, 0x0, 0x0, 0x0, 0x0)
        /nvme1n1/go1/src/github.com/uber/aresdb/utils/error.go:61 +0x3f9
github.com/uber/aresdb/cgoutils.DoCGoCall(0xc0005b2e18, 0xc0004a44d0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cgoutils/utils.go:31 +0xa7
github.com/uber/aresdb/cgoutils.doCGoCall(0xc0005b2e48, 0x1)
        /nvme1n1/go1/src/github.com/uber/aresdb/cgoutils/memory.go:188 +0x49
github.com/uber/aresdb/cgoutils.DeviceFree(0x0, 0x0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cgoutils/memory.go:111 +0x5c
github.com/uber/aresdb/cmd/aresd/cmd.start(0x249e, 0xa8c2, 0xc0005660c0, 0x9, 0x2580000000, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
       /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/cmd/cmd.go:103 +0x1c2
github.com/uber/aresdb/cmd/aresd/cmd.Execute.func1(0xc00038e000, 0x1e39648, 0x0, 0x0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/cmd/cmd.go:85 +0x13d
github.com/spf13/cobra.(*Command).execute(0xc00038e000, 0xc00003c1d0, 0x0, 0x0, 0xc00038e000, 0xc00003c1d0)
        /nvme1n1/go1/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830 +0x2aa
github.com/spf13/cobra.(*Command).ExecuteC(0xc00038e000, 0xc0004a2050, 0x5, 0x134fe40)
        /nvme1n1/go1/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914 +0x2fb
github.com/spf13/cobra.(*Command).Execute(...)
        /nvme1n1/go1/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
github.com/uber/aresdb/cmd/aresd/cmd.Execute(0x0, 0x0, 0x0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/cmd/cmd.go:95 +0x229
main.main()
        /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/main.go:20 +0x32
 
goroutine 1 [running]:
github.com/uber/aresdb/cgoutils.DoCGoCall(0xc0005b2e18, 0xc0004a44d0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cgoutils/utils.go:31 +0xc1
github.com/uber/aresdb/cgoutils.doCGoCall(0xc0005b2e48, 0x1)
        /nvme1n1/go1/src/github.com/uber/aresdb/cgoutils/memory.go:188 +0x49
github.com/uber/aresdb/cgoutils.DeviceFree(0x0, 0x0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cgoutils/memory.go:111 +0x5c
github.com/uber/aresdb/cmd/aresd/cmd.start(0x249e, 0xa8c2, 0xc0005660c0, 0x9, 0x2580000000, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/cmd/cmd.go:103 +0x1c2
github.com/uber/aresdb/cmd/aresd/cmd.Execute.func1(0xc00038e000, 0x1e39648, 0x0, 0x0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/cmd/cmd.go:85 +0x13d
github.com/spf13/cobra.(*Command).execute(0xc00038e000, 0xc00003c1d0, 0x0, 0x0, 0xc00038e000, 0xc00003c1d0)
        /nvme1n1/go1/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830 +0x2aa
github.com/spf13/cobra.(*Command).ExecuteC(0xc00038e000, 0xc0004a2050, 0x5, 0x134fe40)
        /nvme1n1/go1/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914 +0x2fb
github.com/spf13/cobra.(*Command).Execute(...)
        /nvme1n1/go1/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
github.com/uber/aresdb/cmd/aresd/cmd.Execute(0x0, 0x0, 0x0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/cmd/cmd.go:95 +0x229
main.main()
        /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/main.go:20 +0x32
CMakeFiles/run_server.dir/build.make:57: recipe for target 'CMakeFiles/run_server' failed
make[3]: *** [CMakeFiles/run_server] Error 2
CMakeFiles/Makefile2:467: recipe for target 'CMakeFiles/run_server.dir/all' failed
make[2]: *** [CMakeFiles/run_server.dir/all] Error 2
CMakeFiles/Makefile2:474: recipe for target 'CMakeFiles/run_server.dir/rule' failed
make[1]: *** [CMakeFiles/run_server.dir/rule] Error 2
Makefile:298: recipe for target 'run_server' failed
make: *** [run_server] Error 2

what's the output of nvidia-smi in your environment?

Output of nvidia-smi:
 
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:15:00.0 Off |                    0 |
| N/A   35C    P0    46W / 300W |      6MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:16:00.0 Off |                    0 |
| N/A   35C    P0    41W / 300W |      6MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:3A:00.0 Off |                    0 |
| N/A   33C    P0    44W / 300W |      6MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   36C    P0    42W / 300W |      6MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   34C    P0    41W / 300W |      6MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   36C    P0    43W / 300W |      6MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                    0 |
| N/A   34C    P0    43W / 300W |      6MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:B3:00.0 Off |                    0 |
| N/A   35C    P0    42W / 300W |      6MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
 
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3413      G   /usr/lib/xorg/Xorg                             5MiB |
|    1      3413      G   /usr/lib/xorg/Xorg                             5MiB |
|    2      3413      G   /usr/lib/xorg/Xorg                             5MiB |
|    3      3413      G   /usr/lib/xorg/Xorg                             5MiB |
|    4      3413      G   /usr/lib/xorg/Xorg                             5MiB |
|    5      3413      G   /usr/lib/xorg/Xorg                             5MiB |
|    6      3413      G   /usr/lib/xorg/Xorg                             5MiB |
|    7      3413      G   /usr/lib/xorg/Xorg                             5MiB |
+-----------------------------------------------------------------------------+

That's weird... the error happens during initialization. AresDB has not copied anything to device memory yet.
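
One thing worth noting: the CUDA runtime creates its device context lazily on the first API call, so an error returned by a free at startup is usually a deferred context-creation failure rather than a real free failure. A minimal standalone check (plain CUDA runtime API, not AresDB code; the file name is just an example) would be:

// context_check.cu -- hypothetical sanity check; build with: nvcc context_check.cu -o context_check
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    // cudaFree(NULL) is a no-op as a free, but it forces CUDA context
    // creation, so any error it returns comes from initialization,
    // not from releasing memory.
    cudaError_t err = cudaFree(NULL);
    printf("cudaFree(NULL): %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}

If even this fails, the problem is in the CUDA setup rather than in anything AresDB allocates.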

Looks like you were running on bare metal (without Docker), but I was not able to reproduce it (I ran the same make rule with the same CUDA and driver versions and was able to start the server properly).

Several things I would try:

  1. See if you can run any sample CUDA app (a minimal per-device sanity check is sketched below).
  2. Maybe try again after killing the Xorg processes (although it shouldn't matter).
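
For the first point, a sanity check that touches every visible device, queries free/total memory, and does a small allocation and free (again plain CUDA, not AresDB code; names are illustrative) could look like:

// gpu_sanity.cu -- hypothetical sanity check; build with: nvcc gpu_sanity.cu -o gpu_sanity
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("found %d device(s)\n", count);

    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);

        // Report free vs. total device memory before touching the allocator.
        size_t freeBytes = 0, totalBytes = 0;
        err = cudaMemGetInfo(&freeBytes, &totalBytes);
        if (err != cudaSuccess) {
            printf("device %d: cudaMemGetInfo: %s\n", dev, cudaGetErrorString(err));
            continue;
        }
        printf("device %d: %zu MiB free / %zu MiB total\n",
               dev, freeBytes >> 20, totalBytes >> 20);

        // Allocate and free a small buffer to exercise the device allocator.
        void *buf = NULL;
        err = cudaMalloc(&buf, 1 << 20);  // 1 MiB
        if (err != cudaSuccess) {
            printf("device %d: cudaMalloc: %s\n", dev, cudaGetErrorString(err));
            continue;
        }
        err = cudaFree(buf);
        printf("device %d: alloc/free: %s\n", dev, cudaGetErrorString(err));
    }
    return 0;
}

If this reports errors on any of the eight cards, the issue is in the CUDA installation or driver rather than in AresDB.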

I tried a number of CUDA samples, including bandwidthTest, deviceQuery, and histogram, all without error.

Linux version: Ubuntu 16.04.6 LTS

Could the trouble have to do with the Linux version? Is there a recommended NVIDIA driver and CUDA version for Ubuntu 16.04.6 LTS?

Not likely. I tested on the same Linux version.

Still having trouble. Has anyone encountered this error from 'make test-cuda'?
 
[----------] 4 tests from UnaryTransformTest
[ RUN      ] UnaryTransformTest.CheckInt
Exception happend when doing UnaryTransform:parallel_for failed: invalid device function
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: invalid device function
Aborted (core dumped)
CMakeFiles/test-cuda.dir/build.make:60: recipe for target 'CMakeFiles/test-cuda' failed
make[3]: *** [CMakeFiles/test-cuda] Error 134
make[3]: Leaving directory '/nvme0n1/go2/src/github.com/uber/aresdb'
CMakeFiles/Makefile2:388: recipe for target 'CMakeFiles/test-cuda.dir/all' failed
make[2]: *** [CMakeFiles/test-cuda.dir/all] Error 2
make[2]: Leaving directory '/nvme0n1/go2/src/github.com/uber/aresdb'
CMakeFiles/Makefile2:395: recipe for target 'CMakeFiles/test-cuda.dir/rule' failed
make[1]: *** [CMakeFiles/test-cuda.dir/rule] Error 2
make[1]: Leaving directory '/nvme0n1/go2/src/github.com/uber/aresdb'
Makefile:262: recipe for target 'test-cuda' failed
make: *** [test-cuda] Error 2
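
For reference, 'invalid device function' from Thrust typically means the binary does not contain device code compiled for the card's compute capability (a Tesla V100 is compute capability 7.0, i.e. sm_70), so it may be worth comparing what the GPUs report against the architecture flags used in the build. A quick standalone check (plain CUDA, not AresDB code; the file name is illustrative):

// arch_check.cu -- hypothetical check; build with: nvcc arch_check.cu -o arch_check
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("no usable CUDA device found\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major/prop.minor is the compute capability that the build's
        // -arch / -gencode flags need to cover (e.g. 7.0 -> sm_70).
        printf("device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}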

I just realized I didn't mention this before, but I am running in GPU mode, so I am running the cmake command 'cmake -DQUERY_MODE=DEVICE'. I am considering the Docker implementation as well, but I saw that its cmake command does not specify QUERY_MODE. Does the Docker version run in GPU or CPU mode?

QUERY_MODE=DEVICE is the one you want to set on a GPU machine.
If QUERY_MODE is missing, the makefile will set it based on whether the build machine has a GPU card (here).
The Docker version should be in GPU mode because it's built on top of nvidia-docker.