Will OperationProfiler underestimate the backward timing?
Closed this issue · 2 comments
Hi @geoffxy,
Thanks for this awesome project!
I noticed that the first element of args in measure_operation_ms (here: https://github.com/skylineprof/skyline/blob/master/cli/skyline/profiler/operation.py#L18) is a torch.Tensor.
Will the backward timing include the computation time for calculating the gradients with respect to this first argument (the input)?
I created the following script to test this. If we don't wrap the inputs as nn.Parameter, the backward time is roughly equal to the forward time, which seems counterintuitive to me. If we wrap the inputs as nn.Parameter, the backward pass takes roughly twice as long as the forward pass, which seems correct.
from skyline.profiler.operation import OperationProfiler
import torch.nn.functional as F
import torch
import torch.nn as nn
import numpy as np


def main():
    bs = 2048
    in_feature = 1024
    out_feature = 1024

    std_dev = np.sqrt(2 / (in_feature + out_feature))
    weights = np.random.normal(0, std_dev, size=(out_feature, in_feature)).astype(np.float32)
    std_dev = np.sqrt(1 / out_feature)
    bias = np.random.normal(0, std_dev, size=out_feature).astype(np.float32)

    weights = nn.Parameter(torch.tensor(weights, device='cuda'), requires_grad=True)
    bias = nn.Parameter(torch.tensor(bias, device='cuda'), requires_grad=True)

    inputs = torch.rand((bs, in_feature)).cuda()
    inputs.requires_grad_ = True
    # inputs = nn.Parameter(inputs, requires_grad=True)

    op_prof = OperationProfiler(warm_up=10, measure_for=20)
    fwdt, bwdt = op_prof.measure_operation_ms(F.linear, (inputs, weights, bias), {})
    print('fwd', fwdt, 'ms', '; bwd', bwdt, 'ms')


if __name__ == "__main__":
    main()
Hi @zarzen!
Thanks for trying Skyline!
I think there might be a problem in your code: the input tensor did not have requires_grad set to True. The line inputs.requires_grad_ = True should actually be inputs.requires_grad_().
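As a quick standalone check of the difference in plain PyTorch (independent of Skyline), assigning to requires_grad_ only shadows the method with an instance attribute and leaves gradient tracking off:

import torch

x = torch.rand(4)
x.requires_grad_ = True   # shadows the method with a plain attribute; autograd stays off
print(x.requires_grad)    # False

y = torch.rand(4)
y.requires_grad_()        # the in-place method actually enables gradient tracking
print(y.requires_grad)    # True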
With this modified script:
from skyline.profiler.operation import OperationProfiler
import torch.nn.functional as F
import torch
import torch.nn as nn
import numpy as np


def main():
    bs = 2048
    in_feature = 1024
    out_feature = 1024

    std_dev = np.sqrt(2 / (in_feature + out_feature))
    weights = np.random.normal(0, std_dev, size=(out_feature, in_feature)).astype(np.float32)
    std_dev = np.sqrt(1 / out_feature)
    bias = np.random.normal(0, std_dev, size=out_feature).astype(np.float32)

    weights = nn.Parameter(torch.tensor(weights, device='cuda'), requires_grad=True)
    bias = nn.Parameter(torch.tensor(bias, device='cuda'), requires_grad=True)

    op_prof = OperationProfiler(warm_up=10, measure_for=20)

    print('PyTorch version:', torch.__version__)
    print('GPU:', torch.cuda.get_device_name())

    print('---')
    print('inputs.requires_grad_()')
    inputs = torch.rand((bs, in_feature)).cuda()
    inputs.requires_grad_()  # <-------------------- This line
    fwdt, bwdt = op_prof.measure_operation_ms(F.linear, (inputs, weights, bias), {})
    print('fwd', fwdt, 'ms', '; bwd', bwdt, 'ms')

    print('---')
    print('inputs = nn.Parameter(inputs, requires_grad=True)')
    inputs = torch.rand((bs, in_feature)).cuda()
    inputs = nn.Parameter(inputs, requires_grad=True)
    fwdt, bwdt = op_prof.measure_operation_ms(F.linear, (inputs, weights, bias), {})
    print('fwd', fwdt, 'ms', '; bwd', bwdt, 'ms')


if __name__ == "__main__":
    main()
I get:
PyTorch version: 1.6.0
GPU: GeForce RTX 2070
---
inputs.requires_grad_()
fwd 0.8831232070922852 ms ; bwd 1.5012399673461914 ms
---
inputs = nn.Parameter(inputs, requires_grad=True)
fwd 0.8678496360778809 ms ; bwd 1.472152042388916 ms
which is what I think you expected to see?
When measuring the backward pass for an output tensor o, the OperationProfiler measures the time it takes to run all* the gradient functions in the backward graph, starting from o.grad_fn down to the leaf tensors. Since inputs.requires_grad_ = True doesn't actually set the inputs tensor to have inputs.requires_grad == True, the backward pass does not propagate the gradient to the inputs tensor. This means there is one fewer matrix multiplication to run, which would explain why you saw a similar run time for the forward and backward passes.
*By default it also excludes any AccumulateGrads in the backward graph, but that would not have been the cause of the discrepancy that you saw.
I see, thanks for your clarification!