How to test models with context lengths larger than 128K?
yaswanth-iitkgp opened this issue · 10 comments
Hi @hsiehjackson,
I tried using your repo to test HF models like gradientai/Llama-3-8B-Instruct-Gradient-1048k, but I couldn't load the entire model on a single A100 GPU. I wanted to use the accelerate library (or anything else) to load the model for experiments beyond 32K (currently I can only test up to 32K on my GPU). I would love to hear how we can achieve this.
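For reference, this is roughly what I had in mind with accelerate; a minimal sketch, assuming the weights can be placed automatically across GPU/CPU (the dtype and offload path are just examples, not something I have verified for this repo):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gradientai/Llama-3-8B-Instruct-Gradient-1048k"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets accelerate place layers on the GPU and spill the
# rest to CPU/disk; offload_folder is only used if layers do not fit on GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    offload_folder="offload",  # example path
)
```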
Have you tried running with vLLM?
Yes, but I was unable to use vLLM with this particular HF model. Any suggestions on how we could at least load it with vLLM?
You mean this one: https://huggingface.co/gradientai/Llama-3-8B-Instruct-Gradient-1048k? I think if you can load Llama3-8B-instruct, then it should be possible to load this 1M model. Do you see any errors when loading it with vLLM?
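For example, here is a rough sketch of how I would bring it up with vLLM on one GPU; the max_model_len and gpu_memory_utilization values below are only examples to keep the KV cache within a single 80 GB A100, not settings I have verified for this model:

```python
from vllm import LLM, SamplingParams

# Capping max_model_len avoids reserving KV cache (and running the memory
# profiling pass) for the full 1M-token window.
llm = LLM(
    model="gradientai/Llama-3-8B-Instruct-Gradient-1048k",
    trust_remote_code=True,
    dtype="bfloat16",
    max_model_len=65536,          # example cap instead of the full 1M
    gpu_memory_utilization=0.95,  # example value
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```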
Yes, I meant that one. The problem is that when I use the hf mode to load any version of Llama-3-8B, I can only run tests up to a 32K context; anything beyond that is not possible on a single A100 GPU (80 GB). But when I use the vllm mode, I cannot use it with Llama-3 at even a 1K context length; I am attaching the logs in output_llama3_vllm.log. I can use the vllm mode with other models such as https://huggingface.co/THUDM/chatglm3-6b-128K.
From your log, it looks like you plan to evaluate Llama-3-8B with a 16K sequence length, and you can get the final prediction files. What errors do you see when you test beyond a 1K context length?
I really appreciate your efforts to help solve this problem, thanks a lot.
Yes, but the scores are 0 and the nulls are 496/500. The error I was facing in vllm mode is in the initial part of the log file I attached earlier. To make things clear, I checked it again and realized that vLLM was giving this error even for smaller contexts, and I used to stop the run partway through. The error I mentioned in that log file also happens randomly: I was able to run experiments at a 16K context length, but I cannot do it for 8K (I tried multiple times), so I really want to find the cause of that error too. The error persists for contexts beyond 32K; I am attaching the log file for a 64K context with gradientai's 1M-context Llama-3-8B here: output_llama3_1M_vllm_64k.log. I was assuming this error might be due to the 80 GB GPU limitation; please let me know if that is not the case.
Is there any way to successfully test contexts beyond that barrier on this GPU?
Sorry, I missed your reply. When you see the errors in your log, have you checked the logs from the server side (vLLM)? You can check whether you hit an OOM or something similar.
Yes, I see an OOM error on the vLLM side. Does this mean we cannot run experiments (on gradientai/Llama-3-8B-Instruct-Gradient-1048k) with a 64K context length on a single A100 server? Here is the server-side log:
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
INFO 05-24 22:18:09 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='gradientai/Llama-3-8B-Instruct-Gradient-1048k', tokenizer='gradientai/Llama-3-8B-Instruct-Gradient-1048k', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1048576, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 05-24 22:18:20 selector.py:16] Using FlashAttention backend.
INFO 05-24 22:18:21 weight_utils.py:177] Using model weights format ['*.safetensors']
INFO 05-24 22:18:27 model_runner.py:104] Loading model weights took 15.2075 GB
Traceback (most recent call last):
File "/workspace/scripts/pred/serve_vllm.py", line 116, in <module>
engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 348, in from_engine_args
engine = cls(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 311, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 422, in _init_engine
return engine_class(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 110, in __init__
self.model_executor = executor_class(model_config, cache_config,
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 40, in __init__
self._init_cache()
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 80, in _init_cache
self.driver_worker.profile_num_available_blocks(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 131, in profile_num_available_blocks
self.model_runner.profile_run()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 742, in profile_run
self.execute_model(seqs, kv_caches)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 663, in execute_model
hidden_states = model_executable(**execute_model_kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 345, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 271, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 223, in forward
hidden_states = self.mlp(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 75, in forward
gate_up, _ = self.gate_up_proj(x)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 215, in forward
output_parallel = self.linear_method.apply_weights(
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 79, in apply_weights
return F.linear(x, weight, bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 GiB. GPU 0 has a total capacty of 79.15 GiB of which 47.23 GiB is free. Process 532467 has 31.90 GiB memory in use. Of the allocated memory 31.24 GiB is allocated by PyTorch, and 21.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
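If I am reading the numbers right, that 56 GiB allocation matches the output of the fused gate_up_proj for a profiling pass over the full 1,048,576-token max_seq_len (a back-of-the-envelope check, assuming Llama-3-8B's intermediate_size of 14336 and bf16 activations):

```python
# Rough check: vLLM's profiling pass pushes max_seq_len tokens through the
# model at once, so the fused gate_proj + up_proj output alone takes:
tokens = 1_048_576        # max_seq_len from the engine config above
gate_up = 2 * 14_336      # gate_proj + up_proj outputs, concatenated
bytes_per_elem = 2        # bfloat16
print(tokens * gate_up * bytes_per_elem / 2**30)  # -> 56.0 (GiB)
```

So it seems to fail at engine start-up, before any request is even sent, because max_seq_len is still set to 1048576.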
Yeap, looks like that. Have you tried quantization?
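For example, something like the sketch below; the AWQ checkpoint name is only a placeholder for whichever 4-bit quantized long-context Llama-3 you can find, and the max_model_len cap is again just an example:

```python
from vllm import LLM

llm = LLM(
    model="someone/Llama-3-8B-Instruct-Gradient-1048k-AWQ",  # placeholder name
    quantization="awq",     # must match the checkpoint's quantization format
    max_model_len=65536,    # example cap, as above
    trust_remote_code=True,
)
```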
No, I haven't tried a quantized version yet, but I did try a few other GGUF models and that wasn't successful either.