dmlc/ps-lite

Why not using RDMA WRITE( with imm) or RDMA READ to avoid additional memory copy?

zpcalan opened this issue · 9 comments

Hi developers of dmlc.
Recently I've been working with ps-lite. I find out that in the operation of KVWorker::Pull, there maybe some memory copy operations that may cause performance overhead:

template <typename Val>
void KVServer<Val>::Response(const KVMeta& req, const KVPairs<Val>& res) {
  Message msg;
  msg.meta.app_id = obj_->app_id();
  msg.meta.customer_id = req.customer_id;
  msg.meta.request     = false;
  msg.meta.push        = req.push;
  msg.meta.pull        = req.pull;
  msg.meta.head        = req.cmd;
  msg.meta.timestamp   = req.timestamp;
  msg.meta.recver      = req.sender;
  if (res.keys.size()) {
    msg.AddData(res.keys);
    msg.AddData(res.vals);
    if (res.lens.size()) {
      msg.AddData(res.lens);
    }
  }
  Postoffice::Get()->van()->Send(msg);
}

In my opinion, code msg.AddData holds back the communication between woker and server.
If using RDMA WRITE( with imm) or RDMA READ, we can exchange the address by out-of-band communication in the beginning, then do PULL and PUSH on these addresses.
This is just my thought. Please correct me if I'm wrong or I miss some details.

By the way, the function Van::PackMetaPB used by IBVERBS misses pb->set_pull(meta.pull); which is in the function of Van::PackMeta used by ZMQ. This will cause the error when running the test case ps-lite/tests/test_kv_app.cc using IBVERBS.

@zpcalan msg.AddData does not involve extra copy. It only operates on SArray, which is a shared pointer that points to the actual message address. In ibverbs_van.h, we implement RDMA write so the sending is zero copy, like what you said.

@ymjiang Thanks for your reply! I will go deep into the code later. For now I have two questions:
1.Do you mean that the time cost on framework would be very little in both pull and push?(Just to confirm)
2.Do you think it's neccessary to add pb->set_pull(meta.pull); in function Van::PackMetaPB to avoid error when running your test case?

1.Do you mean that the time cost on framework would be very little in both pull and push?(Just to confirm)

The overhead of push/pull is mainly in network transmission (not in ps-lite processing). Networking overhead could be large, depending on your workload. If you use RDMA rather than TCP, the overhead will be lower. You can refer to the comparison between ZeroMQ and the IBverbs implementation here: #124 (comment). In this test, we use a communication-intensive workload to show the effectiveness of RDMA.

2.Do you think it's neccessary to add pb->set_pull(meta.pull); in function Van::PackMetaPB to avoid error when running your test case?

Could you paste the error here? Does it fix the error if you add pb->set_pull(meta.pull);?

Thanks for your tips for question 1!
As for question 2, I just write a simple demo according to your test case test_kv_app.cc. I also did slight modification to launch script local.sh to run on two nodes. The scheduler and server run on Node1.The worker run on Node2. (Both of them use IBVERBS)
test.cc:

#include <cmath>
#include "ps/ps.h"
#include <unistd.h>

using namespace ps;

void StartServer() {
  if (!IsServer()) {
    return;
  }
  auto server = new KVServer<float>(0);
  server->set_request_handle(KVServerDefaultHandle<float>());
  RegisterExitCallback([server](){ delete server; });
}

void RunWorker() {
  if (!IsWorker()) return;
  KVWorker<float> kv(0, 0);

  std::vector<uint64_t> key = {1, 2, 3};
  std::vector<float> val = {1, 1, 1};
  std::vector<float> recv_val;
  kv.Wait(kv.Push(key, val));
  kv.Wait(kv.Pull(key, &recv_val));

}

int main(int argc, char *argv[]) {
  // start system
  Start(0);
  // setup server nodes
  StartServer();
  // run worker nodes
  RunWorker();
  // stop system
  Finalize(0, true);
  return 0;
}

local.sh

!/bin/bash
# set -x
if [ $# -lt 3 ]; then
    echo "usage: $0 num_servers num_workers bin [args..]"
    exit -1;
fi

export DMLC_NUM_SERVER=$1
shift
export DMLC_NUM_WORKER=$1
shift
bin=$1
shift
arg="$@"

export DMLC_PS_ROOT_URI='ib'
export DMLC_PS_ROOT_PORT=8000

if [ "$HOSTNAME" == "node1" ];then
# start the scheduler
export DMLC_ROLE='scheduler'
${bin} ${arg} &

# start servers
export DMLC_ROLE='server'
for ((i=0; i<${DMLC_NUM_SERVER}; ++i)); do
    #export HEAPPROFILE=./S${i}
    ${bin} ${arg} &
done
fi

if [ "$HOSTNAME" == "node2" ];then
# start workers
export DMLC_ROLE='worker'
for ((i=0; i<${DMLC_NUM_WORKER}; ++i)); do
    #export HEAPPROFILE=./W${i}
    ${bin} ${arg} &
done
fi

wait

How to run: export PS_VERBOSE=1;export DMLC_PS_VAN_TYPE=ibverbs;export DMLC_INTERFACE=ib1;./local.sh 1 1 ./build/ps_demo

And I get the error in server side:

ps-lite/include/ps/kv_app.h:447: Check failed: (n) == (req_data.vals.size())

Stack trace returned 10 entries:
[bt] (0) ./build/ps_demo() [0x4086bb]
[bt] (1) ./build/ps_demo() [0x40895d]
[bt] (2) ./build/ps_demo() [0x410ca3]
[bt] (3) ./build/ps_demo() [0x40cd93]
[bt] (4) ./build/ps_demo() [0x410790]
[bt] (5) ./build/ps_demo() [0x40c925]
[bt] (6) ./build/ps_demo() [0x41d703]
[bt] (7) ./build/ps_demo() [0x41af67]
[bt] (8) ./build/ps_demo() [0x417ca5]
[bt] (9) ./build/ps_demo() [0x4145ab]
.........

After I add pb->set_pull(meta.pull), the program runs without error.

@zpcalan Thank you for the reports. Looks like it is a bug. But the problem does not seem to be in IBverbs. If you disable IBverbs, does it also happen or not?

No, there's no problem when I use ZMQ.

By the way, the function Van::PackMetaPB used by IBVERBS misses pb->set_pull(meta.pull); which is in the function of Van::PackMeta used by ZMQ.

@zpcalan I see, sorry for overlooking this. Thank you very much for finding this bug. Would you create a PR to fix PackMetaPB?

@ymjiang You are welcome.
I will create a PR but maybe couple of weeks later because I want to know well about ps-lite before I create this PR in case that I import some other bugs : )