RuntimeError when executing `python driver.py ...`
When I execute the following command:

```bash
nvidia-docker run -it -v $(dirname $PWD):/workspace --net=host --ipc=host bert /bin/bash -c 'export GLOO_SOCKET_IFNAME=docker0; cp ../runtime/launch.py .; python -m launch --nnodes 1 --node_rank 0 --nproc_per_node 4 main_with_runtime.py --data_dir data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/books_wiki_en_corpus --master_addr localhost --module vgpus=4 --checkpoint_dir output/2023-03-30T02:49:14 --partition vgpus=4/vpipe.json --sync_mode asp --distributed_backend gloo -b 16 --lr 0.050000 --lr_policy polynomial --weight-decay 0.000000 --epochs 40 --print-freq 100 --verbose 0 --num_ranks_in_server 4 --config_path vgpus=4/mp_conf.json 2>&1 | tee output/2023-03-30T02:49:14/output.log.0; rm launch.py'
```

I get the following error:
```
Traceback (most recent call last):
  File "main_with_runtime.py", line 576, in <module>
    main()
  File "main_with_runtime.py", line 324, in main
    train(train_loader, r, optimizer, epoch, lr_scheduler)
  File "main_with_runtime.py", line 455, in train
    pipelining(n, args.print_freq, weight_stash=True)
  File "main_with_runtime.py", line 421, in pipelining
    r.run_backward()
  File "../runtime.py", line 624, in run_backward
    for output_name in outputs]))
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [4096, 1024]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
```
The environment is:
- Python==3.6.9
- PyTorch==1.5.0a0+8f84ded
- CUDA==10.2
Do you know how to solve this problem? I would really appreciate any help.
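As a side note on the hint at the end of the error: anomaly detection can be turned on before training to get a forward-pass traceback for the operation whose saved tensor was modified in place. A minimal sketch; placing it near the top of `main()` in main_with_runtime.py is an assumption, not something the repository prescribes:

```python
# Minimal sketch: enable autograd anomaly detection (as the error hint suggests)
# so that the failing backward op also prints the forward-pass traceback of the
# op that created it. This slows training noticeably; use it for debugging only.
# Assumed placement: near the top of main() in main_with_runtime.py.
import torch

torch.autograd.set_detect_anomaly(True)
```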
Hi, this issue is caused by a behavior change introduced in PyTorch 1.5.0. You can either (1) downgrade to PyTorch 1.4.0, or (2) refer to this issue I opened before:

I temporarily made PipeDream run on the latest PyTorch by removing the version check in unpack() in torch/csrc/autograd/saved_variable.cpp; the runtime errors seem to come from this version check (a really dirty solution). I have not fully understood PipeDream's manipulation of the back-propagated gradients, but I guess the error comes from one more in-place operation on the tensors passed between stages (see the minimal reproduction sketch after the snippet below). I think this may help you solve the problem.
```cpp
Variable SavedVariable::unpack(std::shared_ptr<Node> saved_for) const {
  if (!data_.defined()) {
    if (!was_default_constructed_) {
      throw std::runtime_error(ERR_BACKWARD_TWICE);
    }
    return Variable();
  }

  auto grad_fn = is_inplace_view_ ? weak_grad_fn_.lock() : grad_fn_;
  if (has_grad_fn_ && !grad_fn) {
    if (!saved_for) {
      // If saving the grad_fn would create a circular reference, then it must
      // be passed in to the unpack function.
      throw std::runtime_error("No grad_fn for non-leaf saved variable");
    }
    grad_fn = std::move(saved_for);
  }

  // This is the version check that raises the RuntimeError above; the dirty
  // workaround described in the quoted issue simply removes (or skips) this
  // block.
  if (saved_version_ != version_counter_.current_version()) {
    std::stringstream message;
    message << "one of the variables needed for gradient computation has been "
        "modified by an inplace operation: [" << data_.toString() << " "
        << data_.sizes() << "]";
    if (grad_fn) {
      message << ", which is output " << output_nr_
          << " of " << grad_fn->name() << ",";
    }
    message << " is at version " << version_counter_.current_version()
        << "; expected version " << saved_version_ << " instead.";
    if (!AnomalyMode::is_enabled()) {
      message << " Hint: enable anomaly detection to find the operation "
          "that failed to compute its gradient, with torch.autograd."
          "set_detect_anomaly(True).";
    } else {
      message << " Hint: the backtrace further above shows the operation "
          "that failed to compute its gradient. The variable in question "
          "was changed in there or anywhere later. Good luck!";
    }
    throw std::runtime_error(message.str());
  }
  // ... remainder of unpack() unchanged ...
```
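For illustration (a hedged sketch, not from the original thread): the check above fires whenever a tensor saved for backward is modified in place before the backward pass runs. Updating a weight between a forward pass and its delayed backward reproduces the same error on stock PyTorch >= 1.5, which is similar in spirit to the extra in-place update between stages guessed at above. (It is likely also why 1.4.0 behaves differently: as far as I recall, optimizers there updated parameters through .data, which bypasses this version check, but treat that as an assumption.)

```python
# Standalone sketch (illustrative, not from the original issue): an in-place
# parameter update between a forward pass and its delayed backward pass trips
# the SavedVariable version check shown above.
import torch

layer = torch.nn.Linear(4096, 1024)                 # weight shape: [1024, 4096]
opt = torch.optim.SGD(layer.parameters(), lr=0.05)

# Warm-up iteration so that .grad is populated and opt.step() actually updates.
layer(torch.randn(16, 4096)).sum().backward()

x = torch.randn(16, 4096, requires_grad=True)
loss = layer(x).sum()   # forward saves weight.t() for backward
                        # (output 0 of TBackward, shape [4096, 1024], which
                        # plausibly matches the tensor named in the traceback)

opt.step()              # in-place parameter update bumps the weight's version counter

loss.backward()         # RuntimeError: one of the variables needed for gradient
                        # computation has been modified by an inplace operation ...
```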