official-stockfish/nnue-pytorch

setting --num-workers has no effect

protonspring opened this issue · 1 comment

I'm able to train networks now, but every time I train something, I get:

Sanity Checking: 0it [00:00, ?it/s]/store/dev/chess/nnue-pytorch/nnue-pytorch/trainer/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.

My training script is:
python3 train.py ./training_data/nodes5000pv2_UHO.binpack ./training_data/nodes5000pv2_UHO.binpack --gpus "0," --threads 4 --num-workers 4 --batch-size 16384 --enable_progress_bar --features=HalfKAv2_hm^ --lambda=1.0 --max_epochs=400 --default_root_dir ./training_data/runs/run_0

I print num-workers from inside the train.py script and it is always 0. No matter what value I pass on the command line, the result is always 0, and I still get the sanity-checking warning.

I also see these comments in train.py:

num_workers has to be 0 for sparse, and 1 for dense
it currently cannot work in parallel mode but it shouldn't need to

So, I'm generally confused.

Is something broken, or am I doing something wrong?

These warnings appear because we circumvent the default data-loader path in order to load the data efficiently. That is, we do parallelism at a different level than pytorch-lightning's dataloader workers: --num-workers controls our own loader's worker count, not the pytorch-lightning one. The warnings can be ignored; they relate to parts of pytorch-lightning we don't use. Basically it's a spurious warning that we can't suppress.
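To make the layering concrete, here is a minimal sketch of the pattern, not the actual nnue-pytorch code; `NativeBatchDataset`, `make_reader`, and `make_dataloader` are hypothetical names standing in for the repo's native (C++-backed) loader. The dataset yields complete, pre-collated batches from a reader that threads internally, and the `DataLoader` handed to Lightning is created with `num_workers=0` on purpose, which is exactly what trips the warning:

```python
# A sketch of the pattern described above, not the actual nnue-pytorch code.
import torch
from torch.utils.data import DataLoader, IterableDataset

class NativeBatchDataset(IterableDataset):
    """Yields complete, pre-collated batches produced by an external
    reader that does its own multi-threaded decoding."""

    def __init__(self, make_reader, num_workers):
        super().__init__()
        self.make_reader = make_reader
        # Consumed by the native reader, not by PyTorch.
        self.num_workers = num_workers

    def __iter__(self):
        # The reader spawns `num_workers` threads internally (e.g. in C++),
        # so PyTorch itself never needs worker processes.
        return iter(self.make_reader(self.num_workers))

def make_dataloader(make_reader, num_workers):
    dataset = NativeBatchDataset(make_reader, num_workers)
    # num_workers=0 here is deliberate: batches arrive ready-made, and extra
    # PyTorch worker processes would only add copying overhead. This zero is
    # what triggers Lightning's "does not have many workers" warning, even
    # though the real parallelism happens inside the reader.
    return DataLoader(dataset, batch_size=None, num_workers=0)
```

Under a layout like this, any worker count visible from Lightning's side is 0 by design, while --num-workers still changes how many threads the underlying reader uses, which matches what you're seeing.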