setting --num-workers has no effect
protonspring opened this issue · 1 comment
I'm able to train networks now, but every time I train something, I get:
```
Sanity Checking: 0it [00:00, ?it/s]/store/dev/chess/nnue-pytorch/nnue-pytorch/trainer/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
```
My training script is:
```
python3 train.py ./training_data/nodes5000pv2_UHO.binpack ./training_data/nodes5000pv2_UHO.binpack --gpus "0," --threads 4 --num-workers 4 --batch-size 16384 --enable_progress_bar --features=HalfKAv2_hm^ --lambda=1.0 --max_epochs=400 --default_root_dir ./training_data/runs/run_0
```
I printed the value of num-workers inside train.py and it is always 0. No matter what value I pass on the command line, the result is always 0 and I still get the sanity-checking warning.
I also see these comments in train.py:

```python
# num_workers has to be 0 for sparse, and 1 for dense
# it currently cannot work in parallel mode but it shouldn't need to
```
So, I'm generally confused.
Is something broken, or am I doing something wrong?
These warnings appear because we bypass the default data loader path in order to load the data efficiently. In other words, parallelism happens at a different level than pytorch-lightning's workers, and --num-workers controls that custom loader rather than the pytorch-lightning setting. The warnings can be ignored; they refer to parts of pytorch-lightning we don't use. It is basically a spurious warning that we can't suppress.
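
For anyone hitting the same warning, here is a minimal sketch of the pattern being described, not the actual nnue-pytorch code: `ParallelBinpackStream` and its internals are hypothetical stand-ins for the native data loader. The dataset does its own multi-threaded reading, so the outer `DataLoader` is created with `num_workers=0`, which is exactly what triggers pytorch-lightning's `PossibleUserWarning`.

```python
# Sketch only: a dataset that parallelises loading internally, wrapped by a
# DataLoader with num_workers=0. Names here are illustrative, not from train.py.
import torch
from torch.utils.data import DataLoader, IterableDataset


class ParallelBinpackStream(IterableDataset):
    """Hypothetical dataset whose native reader already uses `num_workers`
    threads internally (the role the C++ loader plays in nnue-pytorch)."""

    def __init__(self, path, num_workers, batch_size):
        self.path = path
        self.num_workers = num_workers   # consumed by the native reader, not by PyTorch
        self.batch_size = batch_size

    def __iter__(self):
        # The real project calls into a native library here; this sketch just
        # emits a few dummy batches so it runs end to end.
        for _ in range(4):
            yield torch.zeros(self.batch_size, 8)


dataset = ParallelBinpackStream("data.binpack", num_workers=4, batch_size=16384)

# batch_size=None because the dataset already yields whole batches;
# num_workers=0 because parallelism happens inside the dataset itself.
loader = DataLoader(dataset, batch_size=None, num_workers=0)

for batch in loader:
    print(batch.shape)
```

The warning fires purely because pytorch-lightning sees `num_workers=0` on the wrapper `DataLoader`; it has no way of knowing that the dataset already parallelises its reads internally.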