No memory reduction observed in a simple sparse-dense multiplication
x-zho14 opened this issue · 0 comments
x-zho14 commented
Hi, I experimented with the following code:
import torch
from pytorch_block_sparse import BlockSparseLinear
import time
import sys

iter = int(sys.argv[1])
dsty = float(sys.argv[2])

fc = BlockSparseLinear(1024, 256, density=dsty)
fc_dense = torch.nn.Linear(1024, 256).cuda()
input = torch.ones(3, 1024).cuda()

# time the sparse layer
i = 0
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
t1 = time.time()
while i < iter:
    output = fc(input)
    i += 1
end.record()
t2 = time.time()
torch.cuda.synchronize()
print("cpu time:", t2 - t1)
print(start.elapsed_time(end))
print(torch.cuda.memory_summary())

# time the dense layer
i = 0
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
t1 = time.time()
while i < iter:
    output = fc_dense(input)
    i += 1
end.record()
t2 = time.time()
torch.cuda.synchronize()
print("cpu time:", t2 - t1)
print(start.elapsed_time(end))
print(torch.cuda.memory_summary())
I find that the sparse layer's running time is lower when the iteration count is small, but its memory consumption is not reduced.
sparse:
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 1248 KB | 1254 KB | 7280 KB | 6032 KB |
| from large pool | 0 KB | 0 KB | 0 KB | 0 KB |
| from small pool | 1248 KB | 1254 KB | 7280 KB | 6032 KB |
|---------------------------------------------------------------------------|
| Active memory | 1248 KB | 1254 KB | 7280 KB | 6032 KB |
| from large pool | 0 KB | 0 KB | 0 KB | 0 KB |
| from small pool | 1248 KB | 1254 KB | 7280 KB | 6032 KB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 2048 KB | 2048 KB | 2048 KB | 0 B |
| from large pool | 0 KB | 0 KB | 0 KB | 0 B |
| from small pool | 2048 KB | 2048 KB | 2048 KB | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 800 KB | 2047 KB | 8080 KB | 7280 KB |
| from large pool | 0 KB | 0 KB | 0 KB | 0 KB |
| from small pool | 800 KB | 2047 KB | 8080 KB | 7280 KB |
|---------------------------------------------------------------------------|
| Allocations | 12 | 15 | 2066 | 2054 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 12 | 15 | 2066 | 2054 |
|---------------------------------------------------------------------------|
| Active allocs | 12 | 15 | 2066 | 2054 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 12 | 15 | 2066 | 2054 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 1 | 1 | 1 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 1 | 1 | 1 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 5 | 5 | 1033 | 1028 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 5 | 5 | 1033 | 1028 |
|===========================================================================|
dense:
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 1248 KB | 1251 KB | 4280 KB | 3032 KB |
| from large pool | 0 KB | 0 KB | 0 KB | 0 KB |
| from small pool | 1248 KB | 1251 KB | 4280 KB | 3032 KB |
|---------------------------------------------------------------------------|
| Active memory | 1248 KB | 1251 KB | 4280 KB | 3032 KB |
| from large pool | 0 KB | 0 KB | 0 KB | 0 KB |
| from small pool | 1248 KB | 1251 KB | 4280 KB | 3032 KB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 2048 KB | 2048 KB | 2048 KB | 0 B |
| from large pool | 0 KB | 0 KB | 0 KB | 0 B |
| from small pool | 2048 KB | 2048 KB | 2048 KB | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 800 KB | 2047 KB | 5080 KB | 4280 KB |
| from large pool | 0 KB | 0 KB | 0 KB | 0 KB |
| from small pool | 800 KB | 2047 KB | 5080 KB | 4280 KB |
|---------------------------------------------------------------------------|
| Allocations | 12 | 15 | 1066 | 1054 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 12 | 15 | 1066 | 1054 |
|---------------------------------------------------------------------------|
| Active allocs | 12 | 15 | 1066 | 1054 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 12 | 15 | 1066 | 1054 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 1 | 1 | 1 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 1 | 1 | 1 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 5 | 5 | 533 | 528 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 5 | 5 | 533 | 528 |
|===========================================================================|
Could you please help me find the problem? In fact, the total allocated memory is even higher for the sparse layer. Thanks in advance.
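For reference, here is a back-of-the-envelope sketch of the raw weight sizes involved (the 0.1 density is illustrative, weights are assumed fp32, and any block-index metadata the sparse format keeps is ignored). At this layer size, both footprints are at or below the 2 MB "GPU reserved memory" segment shown in the summaries above, so any savings may simply be lost in the allocator's granularity:

```python
# Back-of-the-envelope weight-memory estimate (fp32 = 4 bytes/element).
# NOTE: the density value and the metadata-free assumption are illustrative;
# a real block-sparse layout also stores block indices.
in_features, out_features = 1024, 256
density = 0.1

dense_bytes = in_features * out_features * 4                   # full weight matrix
sparse_bytes = int(in_features * out_features * density) * 4   # nonzero blocks only

print(f"dense weight:  {dense_bytes / 1024:.0f} KB")
print(f"sparse weight: {sparse_bytes / 1024:.0f} KB")
# Both are small next to the allocator's single 2048 KB reserved segment.
```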