QwenLM/qwen.cpp

GGML_ASSERT when using a long prompt

Ayahuasec opened this issue · 2 comments

My system is Ubuntu 22 running on an x86_64 CPU.
I compiled with the commands from the README, as follows:

cmake -B build
cmake --build build -j --config Release

Then I ran the main program as follows:

/opt/qwen.cpp/build/bin/main -v -t 6 --tiktoken /opt/qwen.tiktoken -m /opt/qwen14b-chat-q4_1-ggml.bin -l 8192 -c 8192 -p "$(cat prompt.txt)"

When the prompt is short, it works well, but the boundary seems to be around 2000 tokens. When a long prompt.txt is used, such as an article of about 10KB (roughly 3K tokens according to llama.cpp), it instantly shows:

system info: | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
inference config: | max_length = 8192 | max_context_length = 8192 | top_k = 0 | top_p = 0.5 | temperature = 0.95 | num_threads = 6 |
loaded qwen model from /opt/qwen14b-chat-q4_1-ggml.bin within: 46.398 ms

GGML_ASSERT: /opt/qwen.cpp/third_party/ggml/src/ggml.c:4895: view_src == NULL || data_size + view_offs <= ggml_nbytes(view_src)
Aborted (core dumped)
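
If I read the assertion correctly, it fires when a tensor view would extend past the end of its source buffer. A minimal C++ sketch of the invariant, with a simplified stand-in for the real ggml structs:

#include <cassert>
#include <cstddef>

// Simplified stand-in for a ggml tensor: only the total buffer size matters here.
struct tensor {
    size_t nbytes;  // what ggml_nbytes(view_src) returns in the real code
};

// The invariant from ggml.c:4895: a view of view_src starting at byte
// view_offs and spanning data_size bytes must fit inside the source buffer.
void assert_view_fits(const tensor *view_src, size_t view_offs, size_t data_size) {
    assert(view_src == nullptr || data_size + view_offs <= view_src->nbytes);
}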

I tried increasing MEM_SIZE to 4096MB and SCRATCH_SIZE to 10240MB in qwen.h and recompiling, but the output did not change.
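For reference, the change looked roughly like this (the exact default values and surrounding declarations in qwen.h are from memory, so treat this as a sketch):

static constexpr size_t MB = 1024 * 1024;
static constexpr size_t MEM_SIZE     = 4096 * MB;   // raised from the default
static constexpr size_t SCRATCH_SIZE = 10240 * MB;  // raised from the default
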
I also tried to reduce the context length argument as follows:

./build/bin/main -v -t 6 --tiktoken ./qwen.tiktoken -m ./qwen14b-chat-q4_1-ggml.bin -l 8192 -c 2048 -p "$(cat prompt.txt)"

The program runs for a while and then shows the same assertion:

system info: | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
inference config: | max_length = 8192 | max_context_length = 2048 | top_k = 0 | top_p = 0.5 | temperature = 0.95 | num_threads = 6 |
loaded qwen model from /opt/qwen14b-chat-q4_1-ggml.bin within: 46.87 ms

GGML_ASSERT: /opt/qwen.cpp/third_party/ggml/src/ggml.c:4895: view_src == NULL || data_size + view_offs <= ggml_nbytes(view_src)
Aborted (core dumped)

The documentation at https://github.com/QwenLM/Qwen/blob/main/README.md says the context of Qwen-14B can be extended up to 8K tokens, and that of Qwen-7B up to 32K tokens. So can the cpp version support prompts of more than 2K tokens?
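
As a sanity check on the prompt size: about 10KB of text at a typical 3-4 bytes per token lands near 3K tokens, consistent with what llama.cpp reports. A minimal sketch of that arithmetic (the bytes-per-token figure is an assumption and varies with tokenizer and text):

#include <cstdio>

int main() {
    const double prompt_bytes = 10.0 * 1024;  // prompt.txt is about 10KB
    const double bytes_per_token = 3.4;       // assumed average for mixed text
    std::printf("estimated tokens: %.0f\n", prompt_bytes / bytes_per_token);
    // prints roughly 3000, past the ~2000-token point where the crash appears
}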

Hi @Ayahuasec, can you provide the prompt.txt file?

@simonJJJ sure, the file is attached. If any additional information is needed, please feel free to let me know.

prompt8_1.txt