cuihenggang/geeps

Running Lenet with geeps on a single node

niketanpansare opened this issue · 1 comment

Hi, I want to try GeePS on a single machine, so I started with the LeNet example. It is likely that I am missing a configuration option or a step. Can you please help me with this?

$ ./data/mnist/get_mnist.sh
$ ./examples/mnist/create_mnist.sh
$ cat machinefile 
127.0.0.1
$ ./build/tools/caffe_geeps -gpu 0 -iterations 100 -machinefile machinefile -solver examples/mnist/lenet_solver.prototxt -ps_config examples/cifar10/2parts/ps_config_inception  train
I1212 20:33:14.118243 25158 caffe_geeps.cpp:184] Use solver examples/mnist/lenet_solver.prototxt.0
I1212 20:33:14.120550 25158 caffe_geeps.cpp:203] Use GPU with device ID 0
I1212 20:33:14.909091 25158 caffe_geeps.cpp:219] Starting Optimization
I1212 20:33:14.909188 25158 solver.cpp:68] Initializing solver from parameters: 
test_iter: 100
test_interval: 500
base_lr: 0.01
display: 100
max_iter: 1000
lr_policy: "inv"
gamma: 0.0001
power: 0.75
momentum: 0.9
weight_decay: 0.0005
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet"
solver_mode: CPU
net: "examples/mnist/lenet_train_test.prototxt"
I1212 20:33:14.909250 25158 solver.cpp:109] Creating training net from net file: examples/mnist/lenet_train_test.prototxt
I1212 20:33:14.912506 25162 db_lmdb.cpp:38] Opened lmdb examples/mnist/mnist_train_lmdb
I1212 20:33:14.912650 25158 data_layer.cpp:41] output data size: 64,1,28,28
I1212 20:33:14.913240 25158 syncedmem.cpp:68] Allocate GPU data for HEAD_AT_CPU with size 200704
I1212 20:33:14.913543 25158 syncedmem.cpp:68] Allocate GPU data for HEAD_AT_CPU with size 256
I1212 20:33:14.913594 25158 syncedmem.cpp:68] Allocate GPU data for HEAD_AT_CPU with size 200704
I1212 20:33:14.913661 25158 syncedmem.cpp:68] Allocate GPU data for HEAD_AT_CPU with size 256
I1212 20:33:14.913697 25158 syncedmem.cpp:68] Allocate GPU data for HEAD_AT_CPU with size 200704
I1212 20:33:14.913761 25158 syncedmem.cpp:68] Allocate GPU data for HEAD_AT_CPU with size 256
I1212 20:33:14.917830 25158 syncedmem.cpp:58] Allocate GPU data for UNINITIALIZED with size 2560
I1212 20:33:14.917896 25158 net.cpp:287] Network initialization done.
I1212 20:33:14.918046 25158 solver.cpp:193] Creating test net (#0) specified by net file: examples/mnist/lenet_train_test.prototxt
I1212 20:33:14.920621 25164 db_lmdb.cpp:38] Opened lmdb examples/mnist/mnist_test_lmdb
I1212 20:33:14.920717 25158 data_layer.cpp:41] output data size: 100,1,28,28
I1212 20:33:14.921351 25158 syncedmem.cpp:68] Allocate GPU data for HEAD_AT_CPU with size 313600
I1212 20:33:14.921653 25158 syncedmem.cpp:68] Allocate GPU data for HEAD_AT_CPU with size 400
I1212 20:33:14.921700 25158 syncedmem.cpp:68] Allocate GPU data for HEAD_AT_CPU with size 313600
I1212 20:33:14.921798 25158 syncedmem.cpp:68] Allocate GPU data for HEAD_AT_CPU with size 400
I1212 20:33:14.921839 25158 syncedmem.cpp:68] Allocate GPU data for HEAD_AT_CPU with size 313600
I1212 20:33:14.921926 25158 syncedmem.cpp:68] Allocate GPU data for HEAD_AT_CPU with size 400
I1212 20:33:14.925576 25158 syncedmem.cpp:58] Allocate GPU data for UNINITIALIZED with size 4000
I1212 20:33:14.925647 25158 net.cpp:287] Network initialization done.
I1212 20:33:14.925689 25158 solver.cpp:78] Solver scaffolding done.
I1212 20:33:14.925915 25158 solver.cpp:344] Blob #8 is an output blob
I1212 20:33:14.925933 25158 solver.cpp:485] Layer mnist doesn't need backward
local_row_keys_gpu.size() = 13606
local_row_keys_cpu.size() = 9460
row_keys_gpu.size() = 3370
row_keys_cpu.size() = 0
thread_cache_size = 8182779
I1212 20:33:19.171286 25158 solver.cpp:775] Virtual iteration done
I1212 20:33:19.173832 25158 solver.cpp:1617] Solving LeNet
I1212 20:33:19.173847 25158 solver.cpp:1618] Learning Rate Policy: inv
I1212 20:33:19.175861 25158 solver.cpp:890] Set initial parameter values done
I1212 20:33:19.177019 25158 solver.cpp:893] Iterations started
I1212 20:33:19.177233 25158 solver.cpp:1654] test_nets_.size() = 1
I1212 20:33:19.177250 25158 solver.cpp:1662] Iteration 0, Testing net (#0)
F1212 20:33:19.183192 25158 math_functions.cpp:92] Check failed: error == cudaSuccess (11 vs. 0)  invalid argument
*** Check failure stack trace: ***
    @     0x2b7eb7dce5cd  google::LogMessage::Fail()
    @     0x2b7eb7dd0433  google::LogMessage::SendToLog()
    @     0x2b7eb7dce15b  google::LogMessage::Flush()
    @     0x2b7eb7dd0e1e  google::LogMessageFatal::~LogMessageFatal()
    @     0x2b7eb7772392  caffe::caffe_copy<>()
    @     0x2b7eb77e6a72  caffe::BasePrefetchingDataLayer<>::Forward_gpu()
    @     0x2b7eb75f91d2  caffe::Layer<>::Forward()
    @     0x2b7eb778e217  caffe::SGDSolver<>::ForwardBackwardUsingPs()
    @     0x2b7eb779a645  caffe::Solver<>::Test()
    @     0x2b7eb779af76  caffe::Solver<>::TestAll()
    @     0x2b7eb77a1d03  caffe::Solver<>::Step()
    @     0x2b7eb77a2842  caffe::Solver<>::Solve()
    @           0x413b24  train()
    @           0x40fc3d  main
    @     0x2b7eb8c4d830  __libc_start_main
    @           0x4103d9  _start
    @              (nil)  (unknown)
Aborted

@niketanpansare Sorry for not getting back to you in time. Do you still need help with this issue?