shuaimu/rococo

The run.py sometimes fails to connect to clients

Opened this issue · 4 comments

Hi Mengxing, can you help solve this problem? I guess it might be a trivial issue, but it is annoying when running benchmarks, and for now I do not have enough time to fix it.

shuai@ubuntu:~/workspace/rococo$ ./run.py -f config.local/tpccd/rcc_1core.xml
Port: 5555
Start reading config file: /home/shuai/workspace/rococo/config.local/tpccd/rcc_1core.xml ...
Done
Checking site info ...
('127.0.0.1', 'node1')
Done
Checking client info ...
127.0.0.1
Done
No taskset, auto scheduling
Starting servers ...
cd /home/shuai/workspace/rococo; nohup ./build/deptran_server -s 0 -f /home/shuai/workspace/rococo/config.local/tpccd/rcc_1core.xml -p 5555 -H /home/shuai/workspace/rococo/config/hosts -t 10 -b 1>"/home/shuai/workspace/rococo/log/site-0.log" 2>"/home/shuai/workspace/rococo/log/site-0.err" &
Servers started ...
Waiting for server init ...
E [client.cc:139] 2015-03-12 17:48:53.113 | rrr::Client: connect(127.0.0.1:5555): Connection refused
E [client.cc:139] 2015-03-12 17:48:53.214 | rrr::Client: connect(127.0.0.1:5555): Connection refused
E [client.cc:139] 2015-03-12 17:48:53.315 | rrr::Client: connect(127.0.0.1:5555): Connection refused
E [client.cc:139] 2015-03-12 17:48:53.416 | rrr::Client: connect(127.0.0.1:5555): Connection refused
E [client.cc:139] 2015-03-12 17:48:53.517 | rrr::Client: connect(127.0.0.1:5555): Connection refused
E [client.cc:139] 2015-03-12 17:48:53.618 | rrr::Client: connect(127.0.0.1:5555): Connection refused
E [client.cc:139] 2015-03-12 17:48:53.719 | rrr::Client: connect(127.0.0.1:5555): Connection refused
E [client.cc:139] 2015-03-12 17:48:53.820 | rrr::Client: connect(127.0.0.1:5555): Connection refused
E [client.cc:139] 2015-03-12 17:48:53.921 | rrr::Client: connect(127.0.0.1:5555): Connection refused
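The retry loop above eventually gives up because the client starts dialing before deptran_server is actually listening. A minimal sketch of a port-wait helper that run.py could use before reporting the server as ready (the function name and timing values are my own, not part of the repo):

```python
import socket
import time

def wait_for_port(host, port, timeout=10.0, interval=0.1):
    """Poll until a TCP connection to (host, port) succeeds.

    Returns True once the server accepts a connection, False if the
    deadline passes first. The ~100 ms retry cadence mirrors the
    rrr::Client log above."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError (e.g. ConnectionRefusedError)
            # while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port("127.0.0.1", 5555)` after launching the server, and only then starting clients, would turn the repeated connection-refused noise into a single bounded wait.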

I will try.
When I run "./run.py -f config.local/tpccd/rcc_1core.xml", it does not work and exits immediately. I found this is because there is no file named "./config/hosts", while there is a "config/hosts-local" instead. After I changed this option, run.py could start the server and client and connect successfully, but there is no benchmark output. I have tried many times and the result is the same:

lmx@lmx-pc:~/codes/rococo$ ./run.py -H config/hosts-local -f config.local/tpccd/rcc_1core.xml
Port: 5555
Start reading config file: /home/lmx/codes/rococo/config.local/tpccd/rcc_1core.xml ...
Done
Checking site info ...
('127.0.0.1', 'node1')
Done
Checking client info ...
127.0.0.1
Done
No taskset, auto scheduling
Starting servers ...
cd /home/lmx/codes/rococo; nohup ./build/deptran_server -s 0 -f /home/lmx/codes/rococo/config.local/tpccd/rcc_1core.xml -p 5555 -H /home/lmx/codes/rococo/config/hosts-local -t 10 -b 1>"/home/lmx/codes/rococo/log/site-0.log" 2>"/home/lmx/codes/rococo/log/site-0.err" &
Servers started ...
Waiting for server init ...
E [client.cc:139] 2015-03-12 20:12:30.876 | rrr::Client: connect(127.0.0.1:5555): Connection refused
D [client.cc:144] 2015-03-12 20:12:30.976 | rrr::Client: connected to 127.0.0.1:5555
All site have finished initialization!
Starting clients ...
cd /home/lmx/codes/rococo; nohup ./build/deptran_client -c 0 -d 60 -f /home/lmx/codes/rococo/config.local/tpccd/rcc_1core.xml -p 5556 -t 5 -H /home/lmx/codes/rococo/config/hosts-local -S 0 -b 1>"/home/lmx/codes/rococo/log/client-0.log" 2>"/home/lmx/codes/rococo/log/client-0.err" &
Clients started ...
E [client.cc:139] 2015-03-12 20:12:33.154 | rrr::Client: connect(127.0.0.1:5556): Connection refused
D [client.cc:144] 2015-03-12 20:12:33.254 | rrr::Client: connected to 127.0.0.1:5556
Clients all ready
Clients started
Shutting down servers ...
SERVERKILLED
Force clients shutdown ...
Shutting down clients ...
Clients shutdown
Clients killed
Benchmark finished
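The missing config/hosts file could be caught before any process is launched, instead of relying on passing -H by hand as above. A hedged sketch of that fallback (resolve_hosts_file is a hypothetical helper, not something in run.py):

```python
import os

def resolve_hosts_file(path):
    """Prefer `path`; fall back to `path + '-local'` (the file that
    actually ships in the repo); raise if neither exists so run.py fails
    with a clear message instead of launching servers with a bad -H."""
    if os.path.exists(path):
        return path
    fallback = path + "-local"
    if os.path.exists(fallback):
        return fallback
    raise FileNotFoundError("no hosts file at %s or %s" % (path, fallback))
```

With a guard like this, the "exits immediately" failure mode would at least name the file it could not find.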

Try changing the mode from "ro6" to "2pl" in the config.xml.

I tried on teaker machines. Indeed, this bug happens in very few cases there, but in my virtual machine it happens a lot. Can you maybe try a VirtualBox virtual machine with a single core and see what happens? @liumx10

The problem is reproducible when deployed on a single machine using "localhost".
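When reproducing this on a single machine, the redirected logs (log/site-0.log, log/site-0.err, log/client-0.log, log/client-0.err in the commands above) are the first place to look for why no benchmark output appears. A small hypothetical helper to dump their tails after a silent run:

```python
import os

def tail_log(path, n=20):
    """Return the last n lines of a log file, or a placeholder if missing."""
    if not os.path.exists(path):
        return ["<missing: %s>\n" % path]
    with open(path, "r", errors="replace") as f:
        return f.readlines()[-n:]

# Dump the files run.py redirects server/client output into.
for name in ("site-0.log", "site-0.err", "client-0.log", "client-0.err"):
    path = os.path.join("log", name)
    print("==> %s <==" % path)
    print("".join(tail_log(path)), end="")
```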