kubeflow/pytorch-operator

What is the difference between master and worker?

SeibertronSS opened this issue · 6 comments

Hello everyone,
Recently I have been using PyTorchJob to complete some work, but I have a question as described below. What are the roles of the master and the worker in a PyTorchJob? What is the difference between master and worker?

Ref https://pytorch.org/docs/stable/distributed.html

This method will read the configuration from environment variables, allowing one to fully customize how the information is obtained. The variables to be set are:

MASTER_PORT - required; has to be a free port on machine with rank 0

MASTER_ADDR - required (except for rank 0); address of rank 0 node

WORLD_SIZE - required; can be set either here, or in a call to init function

RANK - required; can be set either here, or in a call to init function
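For context, here is a minimal, self-contained sketch (not from this thread) of how a script uses these four variables with the env:// initialization method; the address, port, and single-process values below are assumptions for illustration only:

```python
import os
import torch.distributed as dist

# Illustrative values; in a real job these are set by the launcher
# (or, with the PyTorch operator, injected into the pod environment).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # address of the rank-0 node
os.environ.setdefault("MASTER_PORT", "23456")      # free port on the rank-0 node
os.environ.setdefault("WORLD_SIZE", "1")           # total number of processes
os.environ.setdefault("RANK", "0")                 # this process's global rank

# init_method="env://" makes init_process_group read the variables above.
dist.init_process_group(backend="gloo", init_method="env://")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} is ready")
dist.destroy_process_group()
```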

When you use torch.distributed directly, you need to specify --master_addr and --master_port:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE \
    --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
    --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
    and all other arguments of your training script)
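To make the multi-node layout concrete, the same command must also be run on the second node with --node_rank=1. A hedged sketch (the 2-node, 2-GPU numbers are assumptions for the example) of how the launcher derives each process's global rank and the world size:

```python
# Illustration of rank layout for a 2-node launch with 2 processes per node.
nnodes = 2
nproc_per_node = 2

world_size = nnodes * nproc_per_node  # total number of processes = 4
for node_rank in range(nnodes):
    for local_rank in range(nproc_per_node):
        # Global rank = node_rank * nproc_per_node + local_rank
        global_rank = node_rank * nproc_per_node + local_rank
        print(f"node {node_rank}, local_rank {local_rank} "
              f"-> RANK {global_rank} / WORLD_SIZE {world_size}")
```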

In the PyTorch operator, we inject the address and port of the master pod into the workers' environment variables.
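So a training script running inside the PyTorchJob pods can rely entirely on those injected variables instead of passing --master_addr/--master_port by hand. A minimal sketch, assuming the MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK variables discussed above are present in the pod environment:

```python
import os
import torch
import torch.distributed as dist

def main():
    # Read the variables injected into the pod by the operator.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    print(f"starting rank {rank}/{world_size}, "
          f"master={os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}")

    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, init_method="env://")
    # ... build the model, wrap it with DistributedDataParallel, train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```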

Thank you very much for your answers.

How should I change this when I want to use multiple nodes? I also find that in Kubeflow, WORLD_SIZE means something different from what it means in PyTorch.

The master will open a TCP port and acts as MASTER_ADDR.

How can I use it? I find that if I use 2 nodes with 2 GPUs each, the task only launches once, and I do not know how to use multiple nodes to train the network. Can you help me? Thanks.

You could open an issue at https://github.com/kubeflow/training-operator describing how you run the job.

The information provided is limited, so I do not know how to help you.