replicate/replicate-python

`meta/llama-2-70b` maximum input size (1024) differs from the LLaMA-2 maximum context size (4096 tokens)

jdkanu opened this issue · 0 comments

LLaMA-2 models have a maximum context size of 4096 tokens [original paper, meta llama github repo]. When prompting meta/llama-2-70b through Replicate, however, the maximum input length enforced by the model is, strangely, 1024 tokens, which causes an error that crashes my program.
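For reference, a call along the following lines is enough to trigger it (a minimal sketch, not my exact program; the prompt text is a placeholder whose only relevant property is its tokenized length, and the max_new_tokens parameter name is what I recall from the model page):

import replicate  # requires REPLICATE_API_TOKEN in the environment

# Any sufficiently long prompt will do; this one is well over 1024 tokens
# once tokenized, but still far below LLaMA-2's 4096-token context.
long_prompt = "The quick brown fox jumps over the lazy dog. " * 140

output = replicate.run(
    "meta/llama-2-70b",
    input={"prompt": long_prompt, "max_new_tokens": 128},
)
# Instead of returning text, the request fails with the TensorRT-LLM error below.
print("".join(output))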

[TensorRT-LLM][ERROR] Cannot process new request: Prompt length (1240) exceeds maximum input length (1024). (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:246)
1       0x7f2d5e9b41cd /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x7f61cd) [0x7f2d5e9b41cd]
2       0x7f2d609dbe17 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() + 423
3       0x7f2d609dda98 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 232
4       0x7f2e9b3f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f2e9b3f2253]
5       0x7f2e9b181ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f2e9b181ac3]
6       0x7f2e9b212a04 clone + 68
[TensorRT-LLM][ERROR] Encountered error for requestId 117809844: Cannot process new request: Prompt length (1240) exceeds maximum input length (1024). (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:246)
1       0x7f2d5e9b41cd /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x7f61cd) [0x7f2d5e9b41cd]
2       0x7f2d609dbe17 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() + 423
3       0x7f2d609dda98 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 232
4       0x7f2e9b3f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f2e9b3f2253]
5       0x7f2e9b181ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f2e9b181ac3]
6       0x7f2e9b212a04 clone + 68
[the same error and stack trace repeat for each subsequent request]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/cog/server/worker.py", line 224, in _handle_predict_error
    yield
  File "/usr/local/lib/python3.10/dist-packages/cog/server/worker.py", line 253, in _predict_async
    async for r in result:
  File "/src/predict.py", line 180, in predict
    output = event.json()["text_output"]
KeyError: 'text_output'

https://replicate.com/p/le6b6jtbp565nmjftq4lxsy44i

Even the smaller 7B models do not return this error when called with the same prompt (same input size).

It looks like the wrong model is being served for meta/llama-2-70b: LLaMA-2 should not complain about an input of just 1240 tokens. If that is the case, then I, and potentially many other customers calling meta/llama-2-70b, are paying for calls to a different model: not what we asked for, not what was advertised, and not returning output when it should. Please correct this!
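In the meantime, the only workaround I have found on my side is to pre-truncate prompts below the 1024-token limit before calling the 70B endpoint. A rough sketch of what I mean (the characters-per-token ratio is a conservative heuristic, not the real LLaMA tokenizer, and it obviously throws away context the model should be able to handle):

# Client-side workaround sketch: keep the prompt under the 1024-token limit
# the 70B endpoint currently enforces. The 3-characters-per-token ratio is a
# deliberately conservative heuristic, not the actual LLaMA tokenizer.
MAX_INPUT_TOKENS = 1024
CHARS_PER_TOKEN = 3  # conservative estimate for English text

def truncate_prompt(prompt: str, max_tokens: int = MAX_INPUT_TOKENS) -> str:
    """Crudely cap the prompt length so the request is not rejected."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    return prompt if len(prompt) <= max_chars else prompt[:max_chars]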