Some more improvements
vfdev-5 opened this issue · 3 comments
App explanation
Let's either create a tutorial guide showing how to use the app, or simply a button with a message explaining how to use the app, where to start, etc.
Distributed:
- Done
Let's simplify this code if no distributed option is selected:
with idist.Parallel(
backend=config.backend,
nproc_per_node=config.nproc_per_node,
nnodes=config.nnodes,
node_rank=config.node_rank,
master_addr=config.master_addr,
master_port=config.master_port,
) as parallel:
parallel.run(run, config=config)
to
# (no dist)
with idist.Parallel(
backend=config.backend,
) as parallel:
parallel.run(run, config=config)
and
# single node
with idist.Parallel(
backend=config.backend,
nproc_per_node=config.nproc_per_node,
) as parallel:
parallel.run(run, config=config)
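One way to get there (just a sketch with a hypothetical `build_parallel_kwargs` helper, not actual app code; config fields as in the snippets above) is to collect only the distributed options that are actually set, so the generated call stays minimal:

```python
from types import SimpleNamespace

def build_parallel_kwargs(config):
    """Collect only the distributed options that are actually set,
    so the generated idist.Parallel(...) call stays minimal."""
    keys = ("nproc_per_node", "nnodes", "node_rank", "master_addr", "master_port")
    kwargs = {"backend": config.backend}
    for key in keys:
        value = getattr(config, key, None)
        if value is not None:
            kwargs[key] = value
    return kwargs

# No distributed option selected -> only the backend is passed:
config = SimpleNamespace(backend=None)
print(build_parallel_kwargs(config))  # {'backend': None}

# Single node, 2 processes:
config = SimpleNamespace(backend="nccl", nproc_per_node=2)
print(build_parallel_kwargs(config))  # {'backend': 'nccl', 'nproc_per_node': 2}
```

The generated main.py could then simply do `with idist.Parallel(**build_parallel_kwargs(config)) as parallel:` for every variant.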
Readme
- Done
We should be very careful with the distributed button and this suggestion:
python -m torch.distributed.launch \
--nproc_per_node=2 \
--use_env main.py \
--backend="nccl"
as the dist button will add code to spawn processes inside the main process, while dist launch will itself spawn more processes.
Let's do the following:
- add another checkbox with the option: use dist launch or spawn processes
- if user picks "dist launch" -> README.md says to use:
python -m torch.distributed.launch --nproc_per_node=2 ...
and in the code we define config.nproc_per_node=None. Same for multi-node: config.master_addr=None etc., and
python -m torch.distributed.launch --nproc_per_node=2 --master_addr=master --master_port=1234 --nnodes=2 --node_rank=0 ...
- if user picks "spawn" -> README.md says :
python main.py ...
and in the code we define config.nproc_per_node=2.
We can also imagine folks doing other things like here: https://github.com/sdesrozis/why-ignite
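To make the launch-vs-spawn split concrete, here is a rough sketch (hypothetical `launcher_setup` helper, not actual app code) of how the checkbox choice could map to the README command and the config overrides, following the rules above:

```python
def launcher_setup(mode, nproc_per_node=2):
    """Return (readme_command, config_overrides) for the chosen launcher.

    With "launch", torch.distributed.launch spawns the workers itself, so the
    generated code must not spawn again (nproc_per_node=None in config).
    With "spawn", the generated code spawns the workers via idist.Parallel.
    """
    if mode == "launch":
        cmd = (
            f"python -m torch.distributed.launch --nproc_per_node={nproc_per_node} "
            "--use_env main.py"
        )
        overrides = {"nproc_per_node": None}
    elif mode == "spawn":
        cmd = "python main.py"
        overrides = {"nproc_per_node": nproc_per_node}
    else:
        raise ValueError(f"unknown launcher mode: {mode!r}")
    return cmd, overrides
```

The multi-node case would extend the same idea: for "launch", also set config.master_addr=None, config.master_port=None, etc., and put those flags on the command line instead.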
DataLoader
- Done
If the user picks the "spawn" option, we have to update the code like this:
train_dataloader = idist.auto_dataloader(
train_dataset,
batch_size=config.train_batch_size,
num_workers=config.num_workers,
shuffle=True,
persistent_workers=True
)
eval_dataloader = idist.auto_dataloader(
eval_dataset,
batch_size=config.eval_batch_size,
num_workers=config.num_workers,
shuffle=False,
persistent_workers=True
)
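A small sketch (hypothetical `dataloader_kwargs` helper; the assumption here is that persistent_workers=True is only added for the spawn path, as suggested above) of how the template could build the auto_dataloader arguments conditionally:

```python
def dataloader_kwargs(batch_size, num_workers, shuffle, spawn=False):
    """Build keyword arguments for idist.auto_dataloader.

    With the "spawn" option, keep workers alive across epochs
    (persistent_workers=True) so they are not re-created each epoch."""
    kwargs = {
        "batch_size": batch_size,
        "num_workers": num_workers,
        "shuffle": shuffle,
    }
    if spawn:
        kwargs["persistent_workers"] = True
    return kwargs

# Usage in the generated code would look like:
#   train_dataloader = idist.auto_dataloader(
#       train_dataset, **dataloader_kwargs(config.train_batch_size,
#                                          config.num_workers, True, spawn=True))
```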
"Save the best model by eval score" and "Early stop ..."
- Done
It would be better to avoid such messages:
Please make sure to pass argument to metric_name parameter of get_handlers in main.py. Otherwise it can result KeyError.
Let's control what we are doing and configure everything such that we do not need to warn the user like that.
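One way to avoid the warning entirely (sketch only: `resolve_metric_name` is a hypothetical helper, and the fallback-to-first-metric behavior is an assumption) is to validate metric_name when the config is built, so a bad value fails loudly at configuration time instead of raising a KeyError deep inside get_handlers:

```python
def resolve_metric_name(config, available_metrics):
    """Validate config's metric_name up front instead of warning the user.

    Raises a clear error at configuration time rather than letting
    get_handlers hit a KeyError at run time."""
    name = config.get("metric_name")
    if name is None:
        # Assumption: fall back to the first available metric.
        return next(iter(available_metrics))
    if name not in available_metrics:
        raise ValueError(
            f"metric_name={name!r} is not computed; "
            f"available: {sorted(available_metrics)}"
        )
    return name
```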
(Later) AMP mode as option ?
- Done
It would be nice to add an AMP option, for image classification at least.
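Since the app generates code, the AMP option could be a template switch. A rough sketch (hypothetical `render_train_step` helper; the emitted snippet assumes the standard torch.cuda.amp autocast + GradScaler pattern):

```python
def render_train_step(use_amp: bool) -> str:
    """Render the forward/backward part of the generated training step,
    wrapping it in torch.cuda.amp.autocast when AMP is enabled."""
    if use_amp:
        return (
            "with torch.cuda.amp.autocast():\n"
            "    loss = loss_fn(model(x), y)\n"
            "scaler.scale(loss).backward()\n"
            "scaler.step(optimizer)\n"
            "scaler.update()\n"
        )
    return (
        "loss = loss_fn(model(x), y)\n"
        "loss.backward()\n"
        "optimizer.step()\n"
    )
```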
(Later) Optimizer type
- Done
Users would like to choose the optimizer type: Adam, RMSprop, etc.
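A sketch of the selection (hypothetical mapping; the optimizers listed beyond Adam/RMSprop are assumptions): the app could keep a whitelist of torch.optim class paths and reject anything else with a clear error:

```python
# Assumed whitelist of supported torch.optim classes (dotted paths).
OPTIMIZERS = {
    "Adam": "torch.optim.Adam",
    "AdamW": "torch.optim.AdamW",
    "RMSprop": "torch.optim.RMSprop",
    "SGD": "torch.optim.SGD",
}

def optimizer_class_path(name):
    """Map the UI choice to a torch.optim class path, failing loudly
    on anything outside the whitelist."""
    try:
        return OPTIMIZERS[name]
    except KeyError:
        raise ValueError(
            f"unsupported optimizer {name!r}; choose from {sorted(OPTIMIZERS)}"
        ) from None
```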
AMP mode as option ?
AMP is already there. It can be enabled via config.use_amp