aws-neuron/transformers-neuronx

Improve Neuron model loading time


This is not a bug, but rather a feature request: even when pre-compiled artifacts are available, loading a model onto Neuron cores can take a very long time.

This seems especially true when loading a model for the first time after an instance has been started, which happens when deploying models through SageMaker.

For instance, it can take up to 10 minutes to load a Llama 7b model when deploying through SageMaker (regardless of the instance type).

Hello,

We have recently made some improvements to weight load times by directly supporting safetensors checkpoints.

When loading Llama 7b (with a pre-populated compilation cache, on trn1.32xlarge), I measure a load time of ~40 seconds using a safetensors checkpoint:

import time
from transformers_neuronx import NeuronAutoModelForCausalLM

begin = time.time()
# Create the model from the safetensors checkpoint.
model = NeuronAutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b', tp_degree=32)
# Compile (or reuse cached artifacts) and load the weights onto the Neuron cores.
model.to_neuron()
end = time.time()

print('Duration:', end - begin)

Can you check if using a safetensors checkpoint improves your load duration? If you still observe slow load times, could you provide a reproduction so we can determine exactly which part of the model load is taking so long? Does this occur only on a specific instance type?
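
If your checkpoint is still in the legacy PyTorch .bin format, one way to obtain a safetensors checkpoint is to re-serialize it with the standard transformers API. A minimal sketch, assuming the plain transformers model fits in host memory and using placeholder model/output paths:

from transformers import AutoModelForCausalLM

# Load the original checkpoint on CPU, then re-save it with safetensors serialization.
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b', low_cpu_mem_usage=True)
model.save_pretrained('llama-2-7b-safetensors', safe_serialization=True)

The resulting directory can then be passed to NeuronAutoModelForCausalLM.from_pretrained in place of the original checkpoint.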

I just tested this change on meta-llama/Llama-2-7b-chat-hf, loading the pre-compiled model either from the legacy split files or directly from safetensors weights.

Export parameters (a sketch of the corresponding load call follows this list):

  • batch_size 4
  • tp_degree 2
  • sequence_length 4096
  • auto_cast_type fp16
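
For reference, a rough sketch of how these export parameters would map onto a direct transformers-neuronx load call (assuming the NeuronAutoModelForCausalLM API shown above; the argument names n_positions and amp are my best guess at the equivalents of sequence_length and auto_cast_type, so double-check them against the library):

from transformers_neuronx import NeuronAutoModelForCausalLM

# Assumed mapping of the export parameters above onto transformers-neuronx arguments.
model = NeuronAutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-2-7b-chat-hf',
    batch_size=4,
    tp_degree=2,
    n_positions=4096,  # sequence_length
    amp='f16',         # auto_cast_type fp16
)
model.to_neuron()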

On an ml.inf2.8xlarge:

split files: model loaded in 43.75 s.
safetensors: model loaded in 43.75 s.

So I cannot say there is a benefit to loading safetensors files.

Same test immediately after a reboot, still on an ml.inf2.8xlarge:

split files: Neuron model loaded in 134.06 s.
safetensors: model loaded in 133.50 s.

I ran the same test twice after a reboot and got consistent results: the model takes much longer to load.
Note also that even without rebooting, I occasionally see the same long loading times after several attempts.