cmp-nct/ggllm.cpp

falcon_main chokes on larger prompts (linux)

Closed this issue · 16 comments

I tried running the following command with a 5-bit quantised model:

./bin/falcon_main -m ../falcon/ggml-model--f32-q5_1.bin -t 12 -c 2048 --repeat_penalty 1.0 --color -p "The corrected version of the following sentence 'The office is secluded, surrounded by a sea of pastures, and if you were to walk downhill in the opposite direction to the village, you would eventually find yourself in an area with houses but no people,' with grammatical and spelling corrections, if any, applied is:" --top_k 10000

but it gave me the following output:

main: build = 677 (dd3d346)
main: seed  = 1687011598
falcon.cpp: loading model from ../falcon/ggml-model--f32-q5_1.bin
falcon_model_load_internal: format     = ggjt v3 (latest)
falcon_model_load_internal: n_vocab    = 65024
falcon_model_load_internal: n_ctx      = 2048
falcon_model_load_internal: n_embd     = 4544
falcon_model_load_internal: n_head     = 71
falcon_model_load_internal: n_head_kv     = 1
falcon_model_load_internal: n_layer    = 32
falcon_model_load_internal: version      = 7
falcon_model_load_internal: ftype      = 9 (mostly Q5_1)
falcon_model_load_internal: n_ff       = 18176
falcon_model_load_internal: n_parts    = 1
falcon_model_load_internal: model size = 7B
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 5163.00 MB)
falcon_model_load_internal: mem required  = 6955.12 MB (+   32.00 MB per state)
.....................................................................................
falcon_init_from_file: kv self size  =   32.00 MB

system_info: n_threads = 12 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.000000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 10000, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 0


The corrected version of the following sentence 'The office is secluded, surrounded by a sea of pastures, and if you were to walk downhill in the opposite direction to the village, you would eventually find yourself in an area with houses but no people,' with grammatical and spelling corrections, if any, applied is:ggml_new_tensor_impl: not enough space in the context's memory pool (needed 1184709184, available 805306368)
[1]    44634 segmentation fault (core dumped)  ./bin/falcon_main -m ../falcon/ggml-model--f32-q5_1.bin -t 12 -c 2048  1.0  -

Interestingly, the command works with a smaller prompt:

./bin/falcon_main -m ../falcon/ggml-model--f32-q5_1.bin -t 12  --repeat_penalty 1.0 --color -p "Hi," --top_k 10000
main: build = 677 (dd3d346)
main: seed  = 1687012039
falcon.cpp: loading model from ../falcon/ggml-model--f32-q5_1.bin
falcon_model_load_internal: format     = ggjt v3 (latest)
falcon_model_load_internal: n_vocab    = 65024
falcon_model_load_internal: n_ctx      = 512
falcon_model_load_internal: n_embd     = 4544
falcon_model_load_internal: n_head     = 71
falcon_model_load_internal: n_head_kv     = 1
falcon_model_load_internal: n_layer    = 32
falcon_model_load_internal: version      = 7
falcon_model_load_internal: ftype      = 9 (mostly Q5_1)
falcon_model_load_internal: n_ff       = 18176
falcon_model_load_internal: n_parts    = 1
falcon_model_load_internal: model size = 7B
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 5163.00 MB)
falcon_model_load_internal: mem required  = 6955.12 MB (+    8.00 MB per state)
.....................................................................................
falcon_init_from_file: kv self size  =    8.00 MB

system_info: n_threads = 12 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.000000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 10000, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


Hi, I just purchased 2 tickets to the MAMiT #2 at the Home Depot Center in Carson, CA on October 29th.
Here's my question: I am flying in to LA the day before (October 28th) and would love to meet up with others that will be at the event if they are going.
My e-mail is [email protected] if you'd like to contact me there and any info you can give would be great!
Thank you in advance!
Sincerely,
Dave
South Carolina<|endoftext|> [end of text]

falcon_print_timings:        load time =   299.65 ms
falcon_print_timings:      sample time =   207.28 ms /   117 runs   (    1.77 ms per token)
falcon_print_timings: prompt eval time =   196.46 ms /     2 tokens (   98.23 ms per token)
falcon_print_timings:        eval time = 20454.49 ms /   116 runs   (  176.33 ms per token)
falcon_print_timings:       total time = 20897.24 ms
                                                         

to be clear, this didn't work even without -c in the first command

The context size calculation is not correct currently; it looks like that's what happened.
I'll try to get to it today or tomorrow.
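For context, the error in the log is ggml's fixed arena allocator running out of room: the evaluation code reserves a context/scratch buffer up front, and if a long prompt needs more tensor memory than was reserved, ggml_new_tensor fails with exactly the "not enough space in the context's memory pool" message above. A minimal sketch of that failure mode follows; it uses ggml's public C API but is purely illustrative (the pool size and tensor shape are made up, and this is not the actual ggllm.cpp evaluation code):

// pool_overflow_sketch.cpp -- illustrative only, not ggllm.cpp code.
// Reserve a deliberately tiny ggml memory pool, then request a tensor
// that cannot fit; ggml reports "not enough space in the context's
// memory pool", like the failing prompt evaluation above.
#include "ggml.h"
#include <cstdio>

int main() {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,  // hypothetical 16 MB pool
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // A 4544 x 65024 f32 tensor needs roughly 1.2 GB -- far more than the
    // pool. Depending on how ggml was built this may abort on an assert
    // instead of returning NULL (a NULL dereference later would match the
    // segfault reported above).
    struct ggml_tensor * t = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4544, 65024);
    if (t == NULL) {
        fprintf(stderr, "allocation failed: pool too small for this tensor\n");
    }

    ggml_free(ctx);
    return 0;
}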

ah, do you have any pointers on what would need to change? I could try tweaking things

I'm trying to use the quantised version because when I use the non-quantised f32 version, it actually kills all the apps on my computer (every single app suddenly closes) for some reason.

Please update your version to the latest; I just gave it a test and it runs fine on 40B and 7B Q5.
It looks like the context size issue is not even that bad currently; the problem you have comes from a wrong memory type setting.
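The thread doesn't spell out which memory type setting is meant; one plausible candidate is the element type of the KV cache, and for what it's worth the "kv self size" values in the logs above (32.00 MB at n_ctx = 2048, 8.00 MB at n_ctx = 512) match an f32 cache. A quick back-of-the-envelope check, using the model dimensions from the falcon_model_load_internal output (this is my own arithmetic, not code from the repo):

// kv_cache_size_sketch.cpp -- back-of-the-envelope only.
#include <cstdio>
#include <cstdint>

int main() {
    const int64_t n_layer   = 32;          // falcon 7B, from the logs above
    const int64_t n_head_kv = 1;
    const int64_t head_dim  = 4544 / 71;   // n_embd / n_head = 64
    const int64_t n_ctx     = 2048;

    // K and V caches: one head_dim-sized vector per layer per context position
    const int64_t n_elements = 2 * n_layer * n_head_kv * head_dim * n_ctx;

    printf("f32 KV cache: %6.2f MB\n", n_elements * 4 / 1024.0 / 1024.0);  // 32.00 MB, matches the log
    printf("f16 KV cache: %6.2f MB\n", n_elements * 2 / 1024.0 / 1024.0);  // 16.00 MB
    return 0;
}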

I'll try that.

EDIT

synced master, it still doesn't work (same error)

would I need to regenerate the quantised models?

It can't hurt to try a fresh quantized version to rule that out.

I suppose you used the instruct model?
I'll download and try the instruct one.

Below is the normal 7B model, which appears to do fine:

Q:\ggllm.cpp> .\build\bin\Release\falcon_main.exe -t 31 -m Q:\models\falcon-7b\q5_1 -c 2048 --repeat_penalty 1.0 --color -p "The corrected version of the following sentence 'The office is secluded, surrounded by a sea of pastures, and if you were to walk downhill in the opposite direction to the village, you would eventually find yourself in an area with houses but no people,' with grammatical and spelling corrections, if any, applied is:" --top_k 10000
main: build = 677 (dd3d346)
main: seed  = 1687013228
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090
falcon.cpp: loading model from Q:\models\falcon-7b\q5_1
falcon_model_load_internal: format     = ggjt v3 (latest)
falcon_model_load_internal: n_vocab    = 65024
falcon_model_load_internal: n_ctx      = 2048
falcon_model_load_internal: n_embd     = 4544
falcon_model_load_internal: n_head     = 71
falcon_model_load_internal: n_head_kv     = 1
falcon_model_load_internal: n_layer    = 32
falcon_model_load_internal: version      = 7
falcon_model_load_internal: ftype      = 9 (mostly Q5_1)
falcon_model_load_internal: n_ff       = 18176
falcon_model_load_internal: n_parts    = 1
falcon_model_load_internal: model size = 7B
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 5163.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: mem required  = 6955.12 MB (+   32.00 MB per state)
falcon_model_load_internal: offloading 0 layers to GPU
falcon_model_load_internal: total VRAM used: 512 MB
.....................................................................................
falcon_init_from_file: kv self size  =   32.00 MB

system_info: n_threads = 31 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.000000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 10000, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 0


The corrected version of the following sentence 'The office is secluded, surrounded by a sea of pastures, and if you were to walk downhill in the opposite direction to the village, you would eventually find yourself in an area with houses but no people,' with grammatical and spelling corrections, if any, applied is:
Tucked away in an area that was once the country's

no, I'm using the original 7B model from HF.

okay, just to make sure, here's the list of steps I took:

  1. ran python3 falcon_convert_demo.py 2 $HOME/falcon/7B/ falcon 1, and got an f32 ggml file, falcon/ggml-model--f32.bin.
  2. ran (from the build directory) ./bin/falcon_quantize ../falcon/ggml-model--f32.bin ../falcon/ggml-model--f32-q5_1.bin 9, and got a quantised model.
  3. ran the command above.

With all the latest changes, this doesn't work on my computer. Memory isn't an issue since I have 32 GB of RAM...

this might be a Linux-related problem, now that I think about it...

It's really strange, as the difference in context requirement is not small but quite huge.
When using mmap, your context size is quite small, as the tensors just point into existing memory.
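A minimal POSIX sketch of that idea (this is not ggllm.cpp's loader, just an illustration): with mmap the weights live in the kernel's page cache, so a "tensor" is effectively a pointer into the mapping, and the ggml context only needs room for metadata and scratch rather than the weights themselves.

// mmap_sketch.cpp -- minimal POSIX illustration of why mmap'd weights
// barely consume context memory: data pointers can point straight into
// the mapping instead of into a malloc'd pool.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <model-file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    // Map the whole file read-only; nothing is copied into process-owned buffers.
    void * data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    // Reading through the mapping, which is why the loader above reports a
    // ~0 MB ggml ctx size when mmap is enabled.
    const unsigned char * first_bytes = (const unsigned char *) data;
    printf("mapped %lld bytes, first byte: 0x%02x\n", (long long) st.st_size, first_bytes[0]);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}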

Can you try what happens if you use --no-mmap? Also, the -b flag might be worth a test.
Try -b 1, which should process the input prompt one token at a time.
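Roughly, -b controls how many prompt tokens are pushed through the model per evaluation call, and the per-call scratch memory grows with that chunk size, so -b 1 keeps each call tiny at the cost of more calls. A sketch of the loop under that assumption (eval_chunk is a hypothetical stand-in for the real evaluation call, not an actual ggllm.cpp function):

// batch_sketch.cpp -- rough illustration of an n_batch-sized prompt loop.
#include <cstdio>
#include <vector>
#include <algorithm>

// Hypothetical: evaluate `count` prompt tokens, with `n_past` tokens already
// in the KV cache. In the real code, scratch memory grows with `count`.
static void eval_chunk(const int * tokens, int count, int n_past) {
    printf("evaluating %d token(s) at position %d\n", count, n_past);
}

int main() {
    std::vector<int> prompt(70, 0); // pretend the prompt tokenized to 70 tokens
    const int n_batch = 1;          // what -b 1 requests

    int n_past = 0;
    while (n_past < (int) prompt.size()) {
        int count = std::min(n_batch, (int) prompt.size() - n_past);
        eval_chunk(prompt.data() + n_past, count, n_past);
        n_past += count;
    }
    return 0;
}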

yup, -b 1 works

--no-mmap still causes the error.

it's actually very fast with -b 1

I'll close this issue, since there is a way to make things work by reducing the batch size.