huggingface/transformers

TorchScript trace slower with C++ runtime environment.

sukuya opened this issue · 3 comments

I traced the BERT model from the PyTorchTransformers library and am getting the following results for 10 iterations.
a) Using the Python runtime to run the forward pass: 979,292 µs

import time
import torch

model = torch.jit.load('models_backup/2_2.pt')
x = torch.randint(2000, (1, 14), dtype=torch.long, device='cpu')

# All 10 forward passes are timed; there is no separate warm-up run here.
start = time.time()
for i in range(10):
    model(x)
end = time.time()
print((end - start) * 1000000, "µs")

b) Using the C++ runtime to run the forward pass: 3,333,758 µs, which is almost 3x the Python time.

// Snippet from the larger C++ program: `module` is the loaded TorchScript module,
// and `index_max`/`inputsize` are defined elsewhere in the program (presumably
// matching the 2000 and 14 used in the Python snippet).
std::vector<torch::jit::IValue> input;
torch::Tensor x = torch::randint(index_max, {1, inputsize}, torch::dtype(torch::kInt64).device(torch::kCPU));
input.push_back(x);

// Warm-up: execute the model once and turn its output into a tuple.
auto outputs = module->forward(input).toTuple();

auto start = chrono::steady_clock::now();
for (int16_t i = 0; i < 10; ++i)
{
  outputs = module->forward(input).toTuple();
}
auto end = chrono::steady_clock::now();
cout << "Elapsed time in microseconds : "
     << chrono::duration_cast<chrono::microseconds>(end - start).count()
     << " µs" << endl;

@thomwolf any suggestions on what I am missing?

Two possible reasons:

  1. The first time you run forward, some warm-up work happens, so maybe you should exclude the first run.
  2. Try excluding toTuple(). A sketch applying both changes follows below.

In my experience, JIT from Python and from C++ costs almost the same time.
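A minimal sketch of what both suggestions could look like when applied to the loop above (it reuses `module` and `input` from the earlier snippet; the number of warm-up iterations is an illustration, not something the original poster reported trying):

// Assumes the same setup (headers, `module`, `input`) as the snippet above.
// 1) Run a few untimed warm-up passes so any first-run JIT work is excluded.
for (int i = 0; i < 3; ++i) {
  module->forward(input);
}

// 2) Keep the raw IValue in the timed loop and convert it to a tuple once afterwards.
torch::jit::IValue raw;
auto start = std::chrono::steady_clock::now();
for (int i = 0; i < 10; ++i) {
  raw = module->forward(input);
}
auto end = std::chrono::steady_clock::now();
auto outputs = raw.toTuple();

std::cout << "Elapsed time in microseconds : "
          << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
          << " µs" << std::endl;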

@Meteorix forward is called once before the loop; are you talking about something else?
Excluding toTuple doesn't help.

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.