dmlc/rabit

Rabit Architecture Diagram/Ports

mesanders opened this issue · 1 comments

Hello,

I need to set up an environment with a RabitTracker and multiple Rabit slave nodes. It has to run in its own environment, and I have to set up firewall rules for "security" reasons. It's clear which port the RabitTracker uses by default, but which ports need to be open on the slaves? Do the slaves call each other, or do they only call the RabitTracker? This is unclear because there are few docs or architecture diagrams. I looked through the source code, but it's not very clear.

Cheers,

Hi @mesanders, I think yes. AFAIK slaves do call each other, in either tree or ring fashion, during allreduce and broadcast. Each slave also talks to the tracker to get its rank and the links it should connect to. I think this paper is probably the closest thing to a diagram: https://pdfs.semanticscholar.org/cc42/7b070f214ad11f4b8e7e4e0f0a5bfa9d55bf.pdf
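
To make the tree and ring patterns a bit more concrete, here is a rough sketch of which peers a given rank would link to under a plain ring and a plain binary tree rooted at rank 0. This is only an illustration: `RingLinks`, `RingNeighbors` and `TreeNeighbors` are hypothetical names, not Rabit functions, and the real link map is handed out by the tracker and may be laid out differently.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper type for illustration; not part of Rabit's API.
struct RingLinks {
  int prev;  // rank this worker receives from in the ring
  int next;  // rank this worker sends to in the ring
};

// Ring layout: each rank links to its predecessor and successor,
// wrapping around at the ends.
RingLinks RingNeighbors(int rank, int world_size) {
  assert(world_size > 0 && rank >= 0 && rank < world_size);
  return {(rank + world_size - 1) % world_size, (rank + 1) % world_size};
}

// Binary tree layout rooted at rank 0 (an assumption for this sketch):
// each rank links to its parent, if any, and to at most two children.
std::vector<int> TreeNeighbors(int rank, int world_size) {
  assert(world_size > 0 && rank >= 0 && rank < world_size);
  std::vector<int> links;
  if (rank > 0) links.push_back((rank - 1) / 2);  // parent
  for (int child = 2 * rank + 1; child <= 2 * rank + 2; ++child) {
    if (child < world_size) links.push_back(child);  // existing children only
  }
  return links;
}
```

For the firewall question, the practical point is that every worker both accepts connections from and opens connections to other workers, so worker-to-worker traffic has to be allowed in both directions, in addition to worker-to-tracker traffic.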

Regarding port settings, I think the code is fairly self-explanatory; this constructor might be a good place to start.

AllreduceBase::AllreduceBase(void) {
  tracker_uri = "NULL";
  tracker_port = 9000;
  host_uri = "";
  slave_port = 9010;
  nport_trial = 1000;
  rank = 0;
  world_size = -1;
  connect_retry = 5;
  hadoop_mode = 0;
  version_number = 0;
  // 32 K items
  reduce_ring_mincount = 32 << 10;
  // tracker URL
  task_id = "NULL";
  err_link = NULL;
  dmlc_role = "worker";
  this->SetParam("rabit_reduce_buffer", "256MB");
  // set up possible environment variables of interest
  env_vars.push_back("rabit_task_id");
  env_vars.push_back("rabit_num_trial");
  env_vars.push_back("rabit_reduce_buffer");
  env_vars.push_back("rabit_reduce_ring_mincount");
  env_vars.push_back("rabit_tracker_uri");
  env_vars.push_back("rabit_tracker_port");
  // also include dmlc support direct variables
  env_vars.push_back("DMLC_TASK_ID");
  env_vars.push_back("DMLC_ROLE");
  env_vars.push_back("DMLC_NUM_ATTEMPT");
  env_vars.push_back("DMLC_TRACKER_URI");
  env_vars.push_back("DMLC_TRACKER_PORT");
  env_vars.push_back("DMLC_WORKER_CONNECT_RETRY");
}
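
Going by those defaults, a rough way to work out the worker port range for firewall rules might look like the sketch below. It assumes that a worker tries consecutive ports starting at slave_port and gives up after nport_trial candidates; that reading of nport_trial is my assumption and should be checked against the socket-binding code. `PortRange`, `WorkerListenRange` and `WorkerMayListenOn` are hypothetical names used only for this example.

```cpp
#include <cassert>

// Hypothetical helper type for illustration; bounds are inclusive.
struct PortRange {
  int first;
  int last;
};

// Assumption: a worker tries slave_port, slave_port + 1, ... and stops
// after nport_trial candidates. The default arguments mirror the
// constructor defaults quoted above (slave_port = 9010, nport_trial = 1000).
PortRange WorkerListenRange(int slave_port = 9010, int nport_trial = 1000) {
  assert(nport_trial > 0);
  return {slave_port, slave_port + nport_trial - 1};
}

// True if `port` falls inside the range a worker might listen on.
bool WorkerMayListenOn(int port, const PortRange& range) {
  return port >= range.first && port <= range.last;
}
```

Under that assumption, the defaults would mean workers listen somewhere in 9010 to 10009, and the tracker's port (9000 by default) also has to be reachable from every worker.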