replicate/replicate-python

Do `replicate-internal/staging-llama-2-70b-mlc` and `replicate-internal/llama-2-70b-triton` have different maximum input lengths?

jdkanu opened this issue · 0 comments

I am getting an error that the prompt length exceeds the maximum input length when calling meta/llama-2-70b through the API. I have included the error log from the Replicate dashboard (see below). I have called the same model in the past without error, and I am almost certain the prompts were identical or similar in length (prediction data for older predictions has expired, so I can't verify this 100%). The prompt is also not very long: just 6 question-answering demonstrations with a few intermediate reasoning steps.
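For reference, the call is essentially the sketch below (FEW_SHOT_PROMPT stands in for my actual few-shot prompt, and the input parameter names assume the public meta/llama-2-70b schema):

```python
import replicate

# Stand-in for my actual prompt: 6 QA demonstrations with intermediate
# reasoning steps, totalling ~1251 tokens according to the error log below.
FEW_SHOT_PROMPT = "Q: ...\nA: Let's reason step by step. ...\n\n" * 6 + "Q: <new question>\nA:"

# replicate.run streams output for this model as an iterator of strings.
output = replicate.run(
    "meta/llama-2-70b",
    input={
        "prompt": FEW_SHOT_PROMPT,
        "max_new_tokens": 256,  # parameter name assumed from the model's schema
    },
)
print("".join(output))
```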

Inspecting further, I discovered that two different replicate-internal models are being used to serve these requests:

- replicate-internal/staging-llama-2-70b-mlc (this one gives me no error)
- replicate-internal/llama-2-70b-triton (this one gives the error)

Do these models have different maximum input lengths? If so, how can I call replicate-internal/staging-llama-2-70b-mlc (or another llama-2-70b model) with a large enough maximum input length?
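In the meantime, a possible client-side workaround is to count tokens before calling the API and drop demonstrations until the prompt fits under the 1024-token limit the Triton backend appears to enforce. A minimal sketch, assuming the Hugging Face Llama 2 tokenizer (the meta-llama repos are gated, but any Llama 2 tokenizer should give the same counts):

```python
from transformers import AutoTokenizer

# Gated repo: requires Hugging Face access to the Llama 2 weights.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")

MAX_INPUT_TOKENS = 1024  # limit reported by the Triton backend below


def fit_prompt(demonstrations: list[str], question: str) -> str:
    """Drop the oldest demonstrations until the prompt fits the limit."""
    demos = list(demonstrations)
    while demos:
        prompt = "\n\n".join(demos + [question])
        if len(tokenizer.encode(prompt)) <= MAX_INPUT_TOKENS:
            return prompt
        demos.pop(0)
    return question
```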

The error:

[TensorRT-LLM][ERROR] Cannot process new request: Prompt length (1251) exceeds maximum input length (1024). (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:246)
1       0x7f48929b41cd /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x7f61cd) [0x7f48929b41cd]
2       0x7f48949dbe17 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() + 423
3       0x7f48949dda98 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 232
4       0x7f49c93f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f49c93f2253]
5       0x7f49c9181ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f49c9181ac3]
6       0x7f49c9212a04 clone + 68
[TensorRT-LLM][ERROR] Encountered error for requestId 1728927168: Cannot process new request: Prompt length (1251) exceeds maximum input length (1024). (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:246)
1       0x7f48929b41cd /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x7f61cd) [0x7f48929b41cd]
2       0x7f48949dbe17 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() + 423
3       0x7f48949dda98 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 232
4       0x7f49c93f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f49c93f2253]
5       0x7f49c9181ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f49c9181ac3]
6       0x7f49c9212a04 clone + 68
[The same error and stack trace repeat three more times; omitted for brevity.]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/cog/server/worker.py", line 224, in _handle_predict_error
    yield
  File "/usr/local/lib/python3.10/dist-packages/cog/server/worker.py", line 253, in _predict_async
    async for r in result:
  File "/src/predict.py", line 180, in predict
    output = event.json()["text_output"]
KeyError: 'text_output'
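The KeyError looks secondary: when TensorRT-LLM rejects the request, the streamed event presumably carries an error payload with no text_output field, and the unguarded dictionary access in /src/predict.py turns that into a KeyError instead of surfacing the real message. A hypothetical guard (I don't have the actual predict.py source) would look like:

```python
from typing import Any


def extract_text_output(event: Any) -> str:
    """Hypothetical guard around the failing line in /src/predict.py.

    When the backend rejects a request (e.g. prompt too long), the event
    payload has no "text_output" key, which is what raises the KeyError.
    """
    data = event.json()
    if "text_output" not in data:
        raise RuntimeError(f"backend returned an error payload: {data}")
    return data["text_output"]
```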