reference: pytorch/pytorch#41162. (anonymous namespace)::indexing_backward_kernel time reduced from 30.5s to 3.80s.
this is the same optimization as applied for criteo1tb.
CPU core count = 12. reducing the number of parallel workers from 16 to 8 gave a 7-8s improvement.
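A minimal sketch of the idea behind the worker tuning above: requesting more workers than there are cores oversubscribes the CPU, so the worker count is capped by the cores available. The helper name `cap_workers` and the `reserved` parameter are hypothetical, introduced only for illustration. The value 8 in the note was found by tuning, not by this formula, so treat the result as a starting point for a sweep.

```python
import os
from typing import Optional


def cap_workers(requested: int, cores: Optional[int] = None, reserved: int = 0) -> int:
    """Cap a requested worker count by the cores left after reserving some.

    `reserved` is a hypothetical knob for cores kept free for the main
    process and other threads; it is an assumption, not from the notes.
    """
    if cores is None:
        # os.cpu_count() may return None on some platforms; fall back to 1.
        cores = os.cpu_count() or 1
    # Always keep at least one worker.
    return max(1, min(requested, cores - reserved))


# With 12 cores, a request for 16 workers is capped at 12.
print(cap_workers(16, cores=12))
# Reserving 4 cores (an assumed value) gives the 8 workers used in the notes.
print(cap_workers(16, cores=12, reserved=4))
```

The cap only rules out obvious oversubscription. The best value still has to be measured, for example by timing a fixed number of batches at each candidate worker count.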
before optimization: cudnn::ops::nchwToNhwcKernel took 2.11s (15.4%).
we reduced loss function time with torch.compile; model_fn time was unchanged.
before torch.compile: model_fn 21.9s, loss_fn 7.36s.
after torch.compile: model_fn 21.9s, loss_fn 0.89s.
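A minimal sketch of wrapping a loss function with torch.compile, assuming PyTorch 2.x. The body of `loss_fn` here (binary cross-entropy on logits) is a hypothetical stand-in, since the notes do not show the actual loss. Only the `torch.compile(loss_fn)` call reflects the change described above.

```python
import torch
import torch.nn.functional as F


def loss_fn(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Hypothetical loss body: per-example BCE on logits, then a mean.
    # Several small elementwise ops like these are what torch.compile can fuse.
    per_example = F.binary_cross_entropy_with_logits(
        logits, targets, reduction="none"
    )
    return per_example.sum() / targets.numel()


# torch.compile returns a wrapped callable. Compilation is lazy: it happens
# on the first call, so the first step is slower than the following ones.
compiled_loss_fn = torch.compile(loss_fn)
```

In the training loop, `compiled_loss_fn(logits, targets)` replaces the call to `loss_fn`. Timings should be taken after a few warm-up steps so that compilation time is not counted.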