dgschwend/zynqnet

Error when running the Train_Caffenet.sh


Hello,

I am using Caffe with a Tesla K80 GPU to train your model. This is what I get every time I run the training script. Any idea what the issue could be? Do I need more than one GPU to train?

I0529 11:25:57.548988 73168 blocking_queue.cpp:49] Waiting for data
I0529 11:28:21.023334 73174 data_layer.cpp:73] Restarting data prefetching from start.
I0529 11:28:23.880893 73168 solver.cpp:418] Test net output #0: accuracy = 0.000996492
I0529 11:28:23.880945 73168 solver.cpp:418] Test net output #1: accuracy_top5 = 0.00476323
I0529 11:28:23.880959 73168 solver.cpp:418] Test net output #2: loss = 6.93271 (* 1 = 6.93271 loss)
F0529 11:28:24.432736 73168 syncedmem.cpp:71] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
@ 0x2aaaac012e6d (unknown)
@ 0x2aaaac014ced (unknown)
@ 0x2aaaac012a5c (unknown)
@ 0x2aaaac01563e (unknown)
@ 0x2aaaaaf41140 caffe::SyncedMemory::mutable_gpu_data()
@ 0x2aaaaade3382 caffe::Blob<>::mutable_gpu_data()
@ 0x2aaaaaf84ef0 caffe::ConvolutionLayer<>::Forward_gpu()
@ 0x2aaaaaf0dfac caffe::Net<>::ForwardFromTo()
@ 0x2aaaaaf0e387 caffe::Net<>::Forward()
@ 0x2aaaaaf2bc4f caffe::Solver<>::Step()
@ 0x2aaaaaf2c44f caffe::Solver<>::Solve()
@ 0x40a727 train()
@ 0x407ebc main
@ 0x2aaabcc76c05 __libc_start_main
@ 0x408703 (unknown)
./examples/imagenet/train_caffenet.sh: line 5: 73168 Aborted (core dumped) ./build/tools/caffe train --solver=/home-new/aup019/zynqnet/_TRAINED_MODEL/solver.prototxt $@

It is a shared machine, but when I checked the status, it said no running processes were found.
I reduced the batch size and got it to run, but I'm guessing that would affect the accuracy?
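
On the accuracy question: Caffe's solver has an `iter_size` setting that accumulates gradients over several forward/backward passes before each weight update, so the effective batch size is `batch_size * iter_size`. The following is a minimal sketch of how a smaller per-pass batch could be paired with an `iter_size` that keeps the effective batch unchanged. The function name and the numbers 64 and 24 are hypothetical, chosen only for illustration; they are not the values used by this model.

```python
def accumulation_plan(target_batch, max_batch_that_fits):
    """Return (batch_size, iter_size) such that batch_size * iter_size == target_batch.

    target_batch: effective batch size of the original solver setup (hypothetical input).
    max_batch_that_fits: largest per-pass batch that fits in GPU memory (found by trial).
    """
    if target_batch <= 0 or max_batch_that_fits <= 0:
        raise ValueError("both arguments must be positive")
    # Walk down from the largest batch that fits until one divides the target evenly.
    for batch_size in range(min(target_batch, max_batch_that_fits), 0, -1):
        if target_batch % batch_size == 0:
            return batch_size, target_batch // batch_size


# Hypothetical example: the original batch of 64 does not fit, but 24 images do.
print(accumulation_plan(64, 24))  # (16, 4)
```

With such a plan, `batch_size` would go into the data layer of the training prototxt and `iter_size` into `solver.prototxt`. Whether this fully preserves the reported accuracy is not confirmed in this thread.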

This is the output from nvidia-smi. I obtained this while training the model.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:06:00.0 Off |                    0 |
| N/A   53C    P0   116W / 149W |   8747MiB / 11439MiB |     91%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:07:00.0 Off |                    0 |
| N/A   47C    P8    32W / 149W |      1MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000000:0A:00.0 Off |                    0 |
| N/A   28C    P8    27W / 149W |      1MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000000:0B:00.0 Off |                    0 |
| N/A   32C    P8    29W / 149W |      1MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 00000000:86:00.0 Off |                    0 |
| N/A   32C    P8    26W / 149W |      1MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 00000000:87:00.0 Off |                    0 |
| N/A   35C    P8    30W / 149W |      1MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 00000000:8A:00.0 Off |                    0 |
| N/A   34C    P8    27W / 149W |      1MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 00000000:8B:00.0 Off |                    0 |
| N/A   35C    P8    29W / 149W |      1MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    115307      C   ./build/tools/caffe                         8734MiB |
+-----------------------------------------------------------------------------+
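
On a shared machine, it can help to check per-GPU memory programmatically before starting a run. The sketch below assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output; the helper names are hypothetical, and the sample string simply mirrors the first three GPUs from the table above.

```python
import subprocess

# nvidia-smi query that prints one "index, used_MiB, total_MiB" line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse the CSV output into a list of (index, used_mib, total_mib) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        gpus.append((index, used, total))
    return gpus


def freest_gpu(gpus):
    """Return the index of the GPU with the most free memory (first one on ties)."""
    return max(gpus, key=lambda gpu: gpu[2] - gpu[1])[0]


def query_gpus():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


# Sample mirroring GPUs 0-2 in the table above.
sample = "0, 8747, 11439\n1, 1, 11439\n2, 1, 11439"
print(freest_gpu(parse_gpu_memory(sample)))  # 1
```

The resulting index could then be passed to Caffe's `-gpu` flag, for example by appending `-gpu 1` to the `caffe train` command in the training script.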

Right! Thank you for your help! :)