aws-neuron/transformers-neuronx

Very long compilation times for llama2 with batch size 4


With AWS Neuron SDK 2.14.1, I am experiencing very long compilation times for batch_size = 4 with the llama2 7B model.

I am using the following configurations:

| Parameter   | inf2.8xlarge | inf2.48xlarge |
|-------------|--------------|---------------|
| tp_degree   | 2            | 24            |
| n_positions | 2048         | 2048          |
| amp         | f16          | f16           |

With batch_size = 1 or 2, the model compiles in minutes with the -O1 option, but with batch_size = 4 compilation takes more than three hours.
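For reference, here is a minimal sketch of the setup (assuming the standard transformers-neuronx sampling API; the checkpoint path is a placeholder, and the -O1 option is assumed to be passed to the compiler via the NEURON_CC_FLAGS environment variable):

```python
# Minimal sketch of the compilation setup (inf2.8xlarge values shown).
# Assumptions: -O1 is forwarded to neuronx-cc via NEURON_CC_FLAGS, and
# "./llama-2-7b" is a placeholder for a local llama2 7B checkpoint.
import os
import time

from transformers_neuronx.llama.model import LlamaForSampling

os.environ["NEURON_CC_FLAGS"] = "-O1"

model = LlamaForSampling.from_pretrained(
    "./llama-2-7b",   # placeholder checkpoint directory
    batch_size=4,     # 1 and 2 compile in minutes; 4 takes 3+ hours
    tp_degree=2,      # 2 on inf2.8xlarge, 24 on inf2.48xlarge
    n_positions=2048,
    amp="f16",
)

start = time.time()
model.to_neuron()     # triggers neuronx-cc compilation
print(f"compilation took {time.time() - start:.0f}s")
```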

Ack, asking the compiler team to take a look.

Hi dacorvo - We have root-caused this issue with the llama2 batch-size-4 config. The fix will be in the next Neuron SDK release.

This is excellent news! Thanks for the update.

This issue was fixed in Neuron SDK release 2.15.