Multi-GPU machines break tracking/Monitor.py at line 36
Here are my notes on the problem and the temporary fix, which is implemented in the cognitivesynergy branch.
On hg2_home
tracker_vision | 2023-05-28 00:14:41,040 - mmcv - INFO - load checkpoint from http path: https://download.openmmlab.com/mmtracking/mot/reid/tracktor_reid_r50_iter25245-a452f51f.pth
tracker_vision | 2023-05-28 00:14:41,092 - mmcv - WARNING - The model and loaded state dict do not match exactly
tracker_vision |
tracker_vision | missing keys in source state_dict: head.bn.weight, head.bn.bias, head.bn.running_mean, head.bn.running_var, head.classifier.weight, head.classifier.bias
tracking/Monitor.py line 36
(base) hmlatapie@hg2:/devcisco/DeepVision$ nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu,utilization.memory,temperature.gpu --format=csv,noheader,nounits
6, 8192, 0, 0, 57
On df01
100%|██████████| 98.4M/98.4M [00:19<00:00, 5.24MB/s]
tracker_vision | 2023-05-28 00:29:00,398 - mmcv - WARNING - The model and loaded state dict do not match exactly
tracker_vision |
tracker_vision | missing keys in source state_dict: head.bn.weight, head.bn.bias, head.bn.running_mean, head.bn.running_var, head.classifier.weight, head.classifier.bias
tracker_vision |
tracker_vision | Warning: The model doesn't have classes
tracker_vision | Traceback (most recent call last):
tracker_vision | File "/mmtracking/tracker.py", line 143, in <module>
tracker_vision | main()
tracker_vision | File "/mmtracking/tracker.py", line 128, in main
tracker_vision | gpu_calculation.add()
tracker_vision | File "/mmtracking/Monitor.py", line 36, in add
tracker_vision | gpu_temp = int(gpu_stats[4])
tracker_vision | ValueError: invalid literal for int() with base 10: '56\n8'
tracker_vision exited with code 1
hmlatapie@df01:~/devcisco/DeepVision/tracking$ nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu,utilization.memory,temperature.gpu --format=csv,noheader,nounits
178, 12288, 0, 0, 45
6, 12288, 0, 0, 43
6, 11264, 0, 0, 42
6, 11264, 0, 0, 36
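The failure can be reproduced without a GPU by feeding the four-line output above through the same `strip().split(', ')` parsing that the fixed `add()` still uses. This is a minimal sketch: `sample` is the df01 output copied from this issue, so the fused value differs from the `'56\n8'` in the traceback, which was captured at a different moment.

```python
# Output of the nvidia-smi query on df01 (four GPUs), copied from above.
sample = (
    "178, 12288, 0, 0, 45\n"
    "6, 12288, 0, 0, 43\n"
    "6, 11264, 0, 0, 42\n"
    "6, 11264, 0, 0, 36\n"
)

# Splitting only on ', ' leaves the newline between rows untouched, so the
# temperature of GPU 0 is fused with the first field of GPU 1.
gpu_stats = sample.strip().split(', ')
fifth_field = gpu_stats[4]

try:
    int(fifth_field)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(repr(fifth_field))
print(error_message)
```

On a single-GPU machine such as hg2 there is only one row, so the fifth field is a clean integer and the bug never shows up.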
The temporary fix
import subprocess


def get_first_line_or_original(input_string):
    # Split the string by newline characters
    lines = input_string.split('\n')
    # Check whether there is more than one line
    if len(lines) > 1:
        return lines[0]  # return the first line if there are multiple lines
    return input_string  # return the original string if there's only one line


def add(self):
    self.count = self.count + 1
    # Only query the GPU on every 15th call
    if self.count % 15 == 0:
        cmd = "nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu,utilization.memory,temperature.gpu --format=csv,noheader,nounits"
        output = subprocess.check_output(cmd, shell=True)
        output = output.decode('utf-8')
        # On multi-GPU machines, keep only the first GPU's row
        output = get_first_line_or_original(output)
        gpu_stats = output.strip().split(', ')
        memory_used = int(gpu_stats[0])
        memory_total = int(gpu_stats[1])
        gpu_utilization = int(gpu_stats[2])
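The temporary fix reports only the first GPU and silently drops the others. One possible alternative, sketched here and not taken from either branch, is to parse every row. `parse_gpu_stats`, `FIELDS`, and the key `memory_utilization` are hypothetical names for illustration, and the sketch assumes every field is a plain integer (a field that nvidia-smi reports as non-numeric would still raise `ValueError`).

```python
from typing import Dict, List

# Field order matches the --query-gpu list used in Monitor.py.
# "memory_utilization" is a hypothetical name; the others follow the issue.
FIELDS = (
    "memory_used",
    "memory_total",
    "gpu_utilization",
    "memory_utilization",
    "gpu_temp",
)


def parse_gpu_stats(output: str) -> List[Dict[str, int]]:
    """Parse csv,noheader,nounits output into one dict per GPU row."""
    gpus = []
    for line in output.strip().splitlines():
        values = [int(field.strip()) for field in line.split(",")]
        gpus.append(dict(zip(FIELDS, values)))
    return gpus


# Sample rows copied from the df01 session in this issue.
sample = (
    "178, 12288, 0, 0, 45\n"
    "6, 12288, 0, 0, 43\n"
    "6, 11264, 0, 0, 42\n"
    "6, 11264, 0, 0, 36\n"
)

stats = parse_gpu_stats(sample)
hottest = max(gpu["gpu_temp"] for gpu in stats)
print(len(stats), hottest)
```

Whether the monitor should then aggregate across GPUs or report only the device the tracker runs on is a design decision this sketch leaves open.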
This bug is fixed in the cognitivesynergy branch.
The problem is resolved in the current main branch.