huggingface/pytorch_block_sparse

No memory reduction observed in a simple sparse-dense multiplication

x-zho14 opened this issue · 0 comments

Hi, I experimented with the following code:

import torch
from pytorch_block_sparse import BlockSparseLinear
import time
import sys
# iteration count and density are taken from the command line
iter = int(sys.argv[1])
dsty = float(sys.argv[2])

fc = BlockSparseLinear(1024, 256, density=dsty)  # block-sparse layer (created on CUDA)
fc_dense = torch.nn.Linear(1024, 256).cuda()     # dense reference layer
input = torch.ones(3, 1024).cuda()

# time the block-sparse layer
i = 0
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
t1 = time.time()

while i < iter:
    output = fc(input)
    i += 1
end.record()
t2 = time.time()

torch.cuda.synchronize()
print("cpu time:", t2 - t1)
print(start.elapsed_time(end))
print(torch.cuda.memory_summary())

# time the dense layer
i = 0
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
t1 = time.time()

while i < iter:
    output = fc_dense(input)
    i += 1
end.record()
t2 = time.time()

torch.cuda.synchronize()
print("cpu time:", t2 - t1)
print(start.elapsed_time(end))
print(torch.cuda.memory_summary())

I find that the sparse layer's running time is lower than the dense layer's when the iteration count is small, but its memory consumption is not reduced.
sparse:

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    1248 KB |    1254 KB |    7280 KB |    6032 KB |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 KB |
|       from small pool |    1248 KB |    1254 KB |    7280 KB |    6032 KB |
|---------------------------------------------------------------------------|
| Active memory         |    1248 KB |    1254 KB |    7280 KB |    6032 KB |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 KB |
|       from small pool |    1248 KB |    1254 KB |    7280 KB |    6032 KB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    2048 KB |    2048 KB |    2048 KB |       0 B  |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 B  |
|       from small pool |    2048 KB |    2048 KB |    2048 KB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |     800 KB |    2047 KB |    8080 KB |    7280 KB |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 KB |
|       from small pool |     800 KB |    2047 KB |    8080 KB |    7280 KB |
|---------------------------------------------------------------------------|
| Allocations           |      12    |      15    |    2066    |    2054    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |      12    |      15    |    2066    |    2054    |
|---------------------------------------------------------------------------|
| Active allocs         |      12    |      15    |    2066    |    2054    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |      12    |      15    |    2066    |    2054    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       1    |       1    |       1    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       1    |       1    |       1    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       5    |       5    |    1033    |    1028    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       5    |       5    |    1033    |    1028    |
|===========================================================================|

dense:

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    1248 KB |    1251 KB |    4280 KB |    3032 KB |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 KB |
|       from small pool |    1248 KB |    1251 KB |    4280 KB |    3032 KB |
|---------------------------------------------------------------------------|
| Active memory         |    1248 KB |    1251 KB |    4280 KB |    3032 KB |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 KB |
|       from small pool |    1248 KB |    1251 KB |    4280 KB |    3032 KB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    2048 KB |    2048 KB |    2048 KB |       0 B  |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 B  |
|       from small pool |    2048 KB |    2048 KB |    2048 KB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |     800 KB |    2047 KB |    5080 KB |    4280 KB |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 KB |
|       from small pool |     800 KB |    2047 KB |    5080 KB |    4280 KB |
|---------------------------------------------------------------------------|
| Allocations           |      12    |      15    |    1066    |    1054    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |      12    |      15    |    1066    |    1054    |
|---------------------------------------------------------------------------|
| Active allocs         |      12    |      15    |    1066    |    1054    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |      12    |      15    |    1066    |    1054    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       1    |       1    |       1    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       1    |       1    |       1    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       5    |       5    |     533    |     528    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       5    |       5    |     533    |     528    |
|===========================================================================|

Could you please help me find the problem? The total allocated memory for the sparse layer is actually even higher (7280 KB vs. 4280 KB for the dense layer). Thanks in advance.
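
For reference, here is a minimal sketch of how I could try to isolate the parameter-only footprint of each layer, separate from the per-iteration activation allocations that dominate the summaries above. The helper function, the density value of 0.1, and the parameter-size cross-check are just my assumptions for illustration, not anything taken from the library's docs:

# Sketch: measure the CUDA memory newly allocated when each layer is constructed,
# and cross-check it against the sizes of the registered parameters.
import torch
from pytorch_block_sparse import BlockSparseLinear

def cuda_bytes_for(build):
    # Return the module plus how many bytes torch reports as newly allocated by build().
    torch.cuda.synchronize()
    before = torch.cuda.memory_allocated()
    module = build()
    torch.cuda.synchronize()
    return module, torch.cuda.memory_allocated() - before

fc, sparse_bytes = cuda_bytes_for(lambda: BlockSparseLinear(1024, 256, density=0.1))
fc_dense, dense_bytes = cuda_bytes_for(lambda: torch.nn.Linear(1024, 256).cuda())

# Sizes of the parameters the modules actually register, for comparison.
sparse_param_bytes = sum(p.numel() * p.element_size() for p in fc.parameters())
dense_param_bytes = sum(p.numel() * p.element_size() for p in fc_dense.parameters())

print("sparse layer:", sparse_bytes, "B allocated,", sparse_param_bytes, "B of parameters")
print("dense  layer:", dense_bytes, "B allocated,", dense_param_bytes, "B of parameters")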