kubeflow/pytorch-operator

What is the difference between master and worker?

SeibertronSS opened this issue · 6 comments

Hello everyone,
Recently I have been using PyTorchJob to complete some work, but I have a question as described below. What are the roles of the master and the worker in a PyTorchJob? What is the difference between master and worker?

Ref https://pytorch.org/docs/stable/distributed.html

This method will read the configuration from environment variables, allowing one to fully customize how the information is obtained. The variables to be set are:

MASTER_PORT - required; has to be a free port on machine with rank 0

MASTER_ADDR - required (except for rank 0); address of rank 0 node

WORLD_SIZE - required; can be set either here, or in a call to init function

RANK - required; can be set either here, or in a call to init function
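For context, here is a minimal, self-contained sketch (not from this thread) of how a script uses these four variables with the env:// initialization method; the address, port, and single-process values below are assumptions for illustration only:

```python
import os
import torch.distributed as dist

# Illustrative values; in a real job these are set by the launcher
# (or, with the PyTorch operator, injected into the pod environment).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # address of the rank-0 node
os.environ.setdefault("MASTER_PORT", "23456")      # free port on the rank-0 node
os.environ.setdefault("WORLD_SIZE", "1")           # total number of processes
os.environ.setdefault("RANK", "0")                 # this process's global rank

# init_method="env://" makes init_process_group read the variables above.
dist.init_process_group(backend="gloo", init_method="env://")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} is ready")
dist.destroy_process_group()
```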

When you use torch.distributed directly, you need to specify --master_addr and --master_port:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE \
    --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
    --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
    and all other arguments of your training script)
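To make the multi-node layout concrete, the same command must also be run on the second node with --node_rank=1. A hedged sketch (the 2-node, 2-GPU numbers are assumptions for the example) of how the launcher derives each process's global rank and the world size:

```python
# Illustration of rank layout for a 2-node launch with 2 processes per node.
nnodes = 2
nproc_per_node = 2

world_size = nnodes * nproc_per_node  # total number of processes = 4
for node_rank in range(nnodes):
    for local_rank in range(nproc_per_node):
        # Global rank = node_rank * nproc_per_node + local_rank
        global_rank = node_rank * nproc_per_node + local_rank
        print(f"node {node_rank}, local_rank {local_rank} "
              f"-> RANK {global_rank} / WORLD_SIZE {world_size}")
```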

In the PyTorch operator, we inject the address and port of the master pod into the workers' environment variables.
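So a training script running inside the PyTorchJob pods can rely entirely on those injected variables instead of passing --master_addr/--master_port by hand. A minimal sketch, assuming the MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK variables discussed above are present in the pod environment:

```python
import os
import torch
import torch.distributed as dist

def main():
    # Read the variables injected into the pod by the operator.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    print(f"starting rank {rank}/{world_size}, "
          f"master={os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}")

    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, init_method="env://")
    # ... build the model, wrap it with DistributedDataParallel, train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```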

Thank you very much for your answers.

How should I change this when I want to use multiple nodes? I also find that in Kubeflow, WORLD_SIZE means something different from what it means in PyTorch.

The master will open a TCP port and acts as MASTER_ADDR.

How can I use it? I find that if I use 2 nodes with 2 GPUs each, the task only launches once, and I do not know how to use multiple nodes to train the network. Can you help me? Thanks.

You could open an issue at https://github.com/kubeflow/training-operator describing how you run the job.

The information provided is limited, so I do not know how to help you.