multi-gpu machines break tracking/Monitor.py at line 36

Question

Closed this issue 2 years ago · 2 comments

Here are my notes on the problem and the temporary fix, which is implemented in the cognitivesynergy branch.

On hg2_home (single GPU)

tracker_vision | 2023-05-28 00:14:41,040 - mmcv - INFO - load checkpoint from http path: https://download.openmmlab.com/mmtracking/mot/reid/tracktor_reid_r50_iter25245-a452f51f.pth
tracker_vision | 2023-05-28 00:14:41,092 - mmcv - WARNING - The model and loaded state dict do not match exactly
tracker_vision |
tracker_vision | missing keys in source state_dict: head.bn.weight, head.bn.bias, head.bn.running_mean, head.bn.running_var, head.classifier.weight, head.classifier.bias

tracking/Monitor.py line 36
(base) hmlatapie@hg2:/devcisco/DeepVision$ nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu,utilization.memory,temperature.gpu --format=csv,noheader,nounits
6, 8192, 0, 0, 57
(base) hmlatapie@hg2:/devcisco/DeepVision$

On df01 (four GPUs)
100%|██████████| 98.4M/98.4M [00:19<00:00, 5.24MB/s]
tracker_vision | 2023-05-28 00:29:00,398 - mmcv - WARNING - The model and loaded state dict do not match exactly
tracker_vision |
tracker_vision | missing keys in source state_dict: head.bn.weight, head.bn.bias, head.bn.running_mean, head.bn.running_var, head.classifier.weight, head.classifier.bias
tracker_vision |
tracker_vision | Warning: The model doesn't have classes
tracker_vision | Traceback (most recent call last):
tracker_vision | File "/mmtracking/tracker.py", line 143, in <module>
tracker_vision | main()
tracker_vision | File "/mmtracking/tracker.py", line 128, in main
tracker_vision | gpu_calculation.add()
tracker_vision | File "/mmtracking/Monitor.py", line 36, in add
tracker_vision | gpu_temp = int(gpu_stats[4])
tracker_vision | ValueError: invalid literal for int() with base 10: '56\n8'
tracker_vision exited with code 1

hmlatapie@df01:~/devcisco/DeepVision/tracking$ nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu,utilization.memory,temperature.gpu --format=csv,noheader,nounits
178, 12288, 0, 0, 45
6, 12288, 0, 0, 43
6, 11264, 0, 0, 42
6, 11264, 0, 0, 36
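
The failure can be reproduced without a GPU by feeding the df01 output above through the same parsing steps. This is a minimal sketch that assumes the original add() parsed the output as the patched version does (strip, then split on ', '), only without the first-line step. The names multi_gpu_output, fused_field and error_message are illustrative and not from the repository.

```python
# nvidia-smi output as captured on df01: one CSV row per GPU.
multi_gpu_output = (
    "178, 12288, 0, 0, 45\n"
    "6, 12288, 0, 0, 43\n"
    "6, 11264, 0, 0, 42\n"
    "6, 11264, 0, 0, 36\n"
)

# Assumed original parsing: strip the whole output, then split on ", ".
gpu_stats = multi_gpu_output.strip().split(", ")

# Rows are separated by "\n", not ", ", so the last field of the first row
# is fused with the first field of the second row.
fused_field = gpu_stats[4]

try:
    int(fused_field)
    error_message = None
except ValueError as exc:
    # Same kind of error as in the traceback above.
    error_message = str(exc)

print(repr(fused_field))
print(error_message)
```

With a single GPU there is only one row, so index 4 holds just the temperature and int() succeeds, which would explain why hg2_home is unaffected.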

The temporary fix
def get_first_line_or_original(input_string):
    # Split the string on newline characters
    lines = input_string.split('\n')

    # If there is more than one line, return only the first
    if len(lines) > 1:
        return lines[0]

    # Otherwise return the original string unchanged
    return input_string


def add(self):
    self.count += 1
    # Query nvidia-smi only on every 15th call
    if self.count % 15 == 0:
        cmd = "nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu,utilization.memory,temperature.gpu --format=csv,noheader,nounits"
        output = subprocess.check_output(cmd, shell=True)
        output = output.decode('utf-8')
        # Keep only the first GPU's row so multi-GPU output still parses
        output = get_first_line_or_original(output)
        gpu_stats = output.strip().split(', ')
        memory_used = int(gpu_stats[0])
        memory_total = int(gpu_stats[1])
        gpu_utilization = int(gpu_stats[2])
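
The temporary fix reports only the first GPU. A possible alternative, sketched here and not taken from either branch, is to parse every row so that each GPU gets its own record. FIELDS and parse_gpu_stats are hypothetical names, and the sketch assumes every field is a plain integer, as in the outputs shown above.

```python
from typing import Dict, List

# Field names in the same order as the --query-gpu list (names are assumptions).
FIELDS = (
    "memory_used",
    "memory_total",
    "gpu_utilization",
    "memory_utilization",
    "gpu_temp",
)


def parse_gpu_stats(output: str) -> List[Dict[str, int]]:
    """Parse csv,noheader,nounits output into one dict per GPU row."""
    stats = []
    for line in output.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on "," and strip each field, so spacing does not matter.
        values = [int(field.strip()) for field in line.split(",")]
        stats.append(dict(zip(FIELDS, values)))
    return stats


# Usage with the first two rows captured on df01.
per_gpu = parse_gpu_stats("178, 12288, 0, 0, 45\n6, 12288, 0, 0, 43\n")
hottest = max(gpu["gpu_temp"] for gpu in per_gpu)
print(per_gpu[0]["memory_used"], hottest)
```

Whether add() should then record the first GPU, the hottest one, or an aggregate depends on what the monitor is meant to track. A field that is not a plain integer would still raise ValueError here.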

Answer 1 · 2023-05-28T02:43:07.000Z

This bug is fixed in the cognitivesynergy branch.

Answer 2 · 2023-06-01T03:19:06.000Z

The problem is resolved in the current main branch.