CUDA error when training
wzgliang opened this issue · 1 comment
wzgliang commented
Environment: CUDA 11.0, torch 1.5.0
When I start training with
python scripts/train.py with data_splits.train=all_train train_params.save_every_epoch=True train_params.num_epochs=6
the terminal raises a CUDA error (full output below):
WARNING - root - Changed type of config entry "data_splits.train" from list to str
WARNING - train - No observers have been added to this run
INFO - train - Running command 'main'
INFO - train - Started
Configuration (modified, added, typechanged, doc):
add_date = True
ckpt_path = 'trained_models/graph_nets/mot_mpnet_epoch_006.ckpt'
cross_val_split = None
run_id = 'train_w_default_config'
seed = 672080547 # the random seed for this experiment
data_splits:
test = ['mot15_test', 'mot17_test']
train = 'all_train'
val = []
dataset_params:
GT_train_max_iou_containment_thresh = 0.85
GT_train_max_iou_thresh = 0.75
augment = True
det_file_name = 'tracktor_prepr_det'
edge_feats_to_use = ['secs_time_dists',
'norm_feet_x_dists',
'norm_feet_y_dists',
'bb_height_dists',
'bb_width_dists',
'emb_dist']
frames_per_graph = 15
gt_assign_min_iou = 0.5
gt_training_min_vis = 0.2
img_batch_size = 5000
img_size = [128, 64]
max_detects = 500
max_detects_to_drop_perc = 0.3
max_frame_dist = 'max'
max_ids_to_drop_perc = 0.15
min_detects = 25
min_detects_to_drop_perc = 0
min_ids_to_drop_perc = 0
min_iou_bb_wiggling = 0.8
node_embeddings_dir = 'resnet50_conv'
overwrite_processed_data = False
p_change_fps_step = 0.5
precomputed_embeddings = True
reciprocal_k_nns = True
reid_embeddings_dir = 'resnet50_w_fc256'
top_k_nns = 50
target_fps_dict:
moving = 9
static = 6
eval_params:
add_tracktor_detects = True
best_method_criteria = 'idf1'
check_val_every_n_epoch = 9999
log_per_seq_metrics = False
max_dets_per_graph_seq = 40000
metrics_to_log = ['loss', 'precision', 'recall', 'constr_sr']
min_track_len = 2
mot_metrics_to_log = ['mota',
'norm_mota',
'idf1',
'norm_idf1',
'num_switches',
'num_misses',
'num_false_positives',
'num_fragmentations',
'constr_sr']
mot_metrics_to_norm = ['mota', 'idf1']
normalize_mot_metrics = True
rounding_method = 'exact'
set_pruned_edges_to_inactive = False
solver_backend = 'pulp'
tensorboard = False
use_tracktor_start_ends = True
val_percent_check = 0
graph_model_params:
node_agg_fn = 'sum'
num_class_steps = 11
num_enc_steps = 12
reattach_initial_edges = True
reattach_initial_nodes = False
classifier_feats_dict:
dropout_p = 0
edge_fc_dims = [8]
edge_in_dim = 16
edge_out_dim = 1
use_batchnorm = False
cnn_params:
arch = 'resnet50'
model_weights_path:
resnet50 = 'trained_models/reid/resnet50_market_cuhk_duke.tar-232'
edge_model_feats_dict:
dropout_p = 0
fc_dims = [80, 16]
use_batchnorm = False
encoder_feats_dict:
dropout_p = 0
edge_fc_dims = [18, 18]
edge_in_dim = 6
edge_out_dim = 16
node_fc_dims = [128]
node_in_dim = 2048
node_out_dim = 32
use_batchnorm = False
node_model_feats_dict:
dropout_p = 0
fc_dims = [56, 32]
use_batchnorm = False
train_params:
batch_size = 8
num_epochs = 6
num_workers = 6
save_epoch_start = 1
save_every_epoch = True
tensorboard = False
lr_scheduler:
type = None
args:
gamma = 0.5
step_size = 7
optimizer:
type = 'Adam'
args:
lr = 0.001
weight_decay = 0.0001
Successfully loaded pretrained weights from "/root/mot_neural_solver/output/trained_models/reid/resnet50_market_cuhk_duke.tar-232"
** The following layers are discarded due to unmatched keys or layer size: ['classifier.weight', 'classifier.bias']
GPU available: True, used: True
INFO - lightning - GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
WARNING - lightning - No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
INFO - lightning - CUDA_VISIBLE_DEVICES: [0]
Detections for sequence MOT17-02-GT need to be processed. Starting processing
Finished processing detections for seq MOT17-02-GT. Result was stored at /root/mot_neural_solver/data/MOT17Labels/train/MOT17-02-GT/processed_data/det/gt.pkl
Found existing stored node embeddings. Deleting them and replacing them for new ones
Found existing stored reid embeddings. Deleting them and replacing them for new ones
Computing embeddings for 20130 detections
ERROR - train - Failed after 0:00:18!
Traceback (most recent calls WITHOUT Sacred internals):
File "scripts/train.py", line 79, in main
trainer.fit(model)
File "/root/miniconda3/envs/mot_neural_solver/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 859, in fit
self.single_gpu_train(model)
File "/root/miniconda3/envs/mot_neural_solver/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 503, in single_gpu_train
self.run_pretrain_routine(model)
File "/root/miniconda3/envs/mot_neural_solver/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1015, in run_pretrain_routine
self.train()
File "/root/miniconda3/envs/mot_neural_solver/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 308, in train
self.reset_train_dataloader(model)
File "/root/miniconda3/envs/mot_neural_solver/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py", line 156, in reset_train_dataloader
self.train_dataloader = self.request_dataloader(model.train_dataloader)
File "/root/miniconda3/envs/mot_neural_solver/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py", line 280, in request_dataloader
dataloader = dataloader_fx()
File "/root/mot_neural_solver/src/mot_neural_solver/pl_module/pl_module.py", line 73, in train_dataloader
return self._get_data(mode = 'train')
File "/root/mot_neural_solver/src/mot_neural_solver/pl_module/pl_module.py", line 57, in _get_data
logger=None)
File "/root/mot_neural_solver/src/mot_neural_solver/data/mot_graph_dataset.py", line 33, in __init__
self.seq_det_dfs, self.seq_info_dicts, self.seq_names = self._load_seq_dfs(seqs_to_retrieve)
File "/root/mot_neural_solver/src/mot_neural_solver/data/mot_graph_dataset.py", line 82, in _load_seq_dfs
seq_det_df = seq_processor.load_or_process_detections()
File "/root/mot_neural_solver/src/mot_neural_solver/data/seq_processing/seq_processor.py", line 381, in load_or_process_detections
seq_det_df = self.process_detections()
File "/root/mot_neural_solver/src/mot_neural_solver/data/seq_processing/seq_processor.py", line 347, in process_detections
self._store_embeddings()
File "/root/mot_neural_solver/src/mot_neural_solver/data/seq_processing/seq_processor.py", line 307, in _store_embeddings
node_out, reid_out = self.cnn_model(bboxes.cuda())
File "/root/miniconda3/envs/mot_neural_solver/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/mot_neural_solver/src/mot_neural_solver/models/resnet.py", line 272, in forward
f = self.featuremaps(x)
File "/root/mot_neural_solver/src/mot_neural_solver/models/resnet.py", line 263, in featuremaps
x = self.relu(x)
File "/root/miniconda3/envs/mot_neural_solver/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/miniconda3/envs/mot_neural_solver/lib/python3.6/site-packages/torch/nn/modules/activation.py", line 94, in forward
return F.relu(input, inplace=self.inplace)
File "/root/miniconda3/envs/mot_neural_solver/lib/python3.6/site-packages/torch/nn/functional.py", line 1061, in relu
result = torch.relu_(input)
RuntimeError: CUDA error: no kernel image is available for execution on the device
Thanks a lot. I look forward to your reply.
wzgliang commented
I switched to an older GPU and it works. So training fails on the 3080 but runs on the Xp.
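A likely explanation, offered as a sketch rather than a confirmed diagnosis: "no kernel image is available for execution on the device" usually means the installed PyTorch binary contains no kernels compiled for the GPU's compute capability. An RTX 3080 reports capability 8.6, while a Titan Xp (assuming that is the "xp" meant here) reports 6.1, so a build compiled only for older architectures would run on the Xp and fail on the 3080. The helper below, is_arch_supported, is hypothetical and not part of PyTorch or this repository, and the PRE_AMPERE_ARCHS list is an illustrative assumption, not the verified contents of the torch 1.5.0 wheel. torch.cuda.get_arch_list() is only present in newer PyTorch releases, so the script checks for it before calling it.

```python
def is_arch_supported(capability, arch_list):
    """Return True if a build compiled for arch_list can run on a GPU
    with the given (major, minor) compute capability.

    Hypothetical helper. It applies the CUDA compatibility rules:
    an sm_XY binary runs on devices with the same major version and a
    minor version >= Y; compute_XY PTX can be JIT-compiled for any
    device with capability >= X.Y.
    """
    major, minor = capability
    for arch in arch_list:
        kind, _, num = arch.partition("_")
        if len(num) < 2 or not num.isdigit():
            continue  # skip entries this sketch does not understand
        a_major, a_minor = int(num[:-1]), int(num[-1])
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False


# Illustrative assumption: an arch list for a build with no Ampere kernels.
PRE_AMPERE_ARCHS = ["sm_37", "sm_50", "sm_60", "sm_61", "sm_70", "sm_75"]

if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        # get_arch_list is missing in older releases, so look it up defensively.
        get_arch_list = getattr(torch.cuda, "get_arch_list", None)
        archs = get_arch_list() if get_arch_list else []
        print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)
        print("device capability:", cap, "compiled archs:", archs or "unknown")
        if archs:
            print("supported by this build:", is_arch_supported(cap, archs))
```

If the check reports that the device is not supported, the usual remedy is to install a PyTorch build compiled against a CUDA version that targets the card's architecture. Whether mot_neural_solver works unchanged with a newer PyTorch is not something this thread establishes.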