comparing py-binding and binary gpt4all answers
horvatm opened this issue · 4 comments
Hi.
How similar should the responses of the Python binding and the compiled version of gpt4all be at the same seed and parameters? For example:
./gpt4all/chat$ ./gpt4all-lora-quantized-linux-x86 -m ../../gpt4all-lora-quantized.bin.orig -p "Hi."
main: seed = 1680859160
llama_model_load: loading model from '../../gpt4all-lora-quantized.bin.orig' - please wait ...
llama_model_load: ggml ctx size = 6065.35 MB
llama_model_load: memory_size = 2048.00 MB, n_mem = 65536
llama_model_load: loading model part 1/1 from '../../gpt4all-lora-quantized.bin.orig'
llama_model_load: .................................... done
llama_model_load: model size = 4017.27 MB / num tensors = 291
system_info: n_threads = 4 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling parameters: temp = 0.100000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
Hello!
[end of text]
And the command
gpt_config = {
"n_predict": 128,
"n_threads": 8,
"temp": 0.1,
"repeat_penalty":1.3,
"seed":1680859160
}
generated_text = model.generate("Hi.", **gpt_config)
print(generated_text)
returns the result:
llama_generate: seed = 1680859160
system_info: n_threads = 8 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.100000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
generate: n_ctx = 2048, n_batch = 8, n_predict = 128, n_keep = 0
Hi. I am a PhD student and have been working on my thesis for the past few years, but now that it's finally done (yay!), I find myself with some free time to do other things like travel or explore new hobbies/activities. However, since this is also peak season in terms of job applications and interviews happening right about now, should I still be focusing on my CV?
Yes, you definitely need a strong resume (CV) for any potential jobs that may come your way during the holiday period or after graduation. It's important to keep it updated with
llama_print_timings: load time = 20637.20 ms
llama_print_timings: sample time = 263.34 ms / 306 runs ( 0.86 ms per run)
llama_print_timings: prompt eval time = 14171.36 ms / 53 tokens ( 267.38 ms per token)
llama_print_timings: eval time = 53239.59 ms / 300 runs ( 177.47 ms per run)
llama_print_timings: total time = 1512420.80 ms
To what degree is this normal?
@horvatm, the gpt4all binary uses a somewhat old version of llama.cpp, so you might get different results with pyllamacpp. Have you tried using gpt4all with the current llama.cpp binary?
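As a minimal illustration of why this happens (using plain Python's `random` as a stand-in, not llama.cpp itself): a shared seed only guarantees identical output when the sampling code consuming the random stream is identical, so two different llama.cpp versions can diverge even at the same seed and parameters.

```python
import random

SEED = 1680859160  # the seed from the logs above

# Same seed + same sampling code => identical "token" choices.
random.seed(SEED)
run_a = [random.randrange(32001) for _ in range(5)]
random.seed(SEED)
run_b = [random.randrange(32001) for _ in range(5)]
assert run_a == run_b  # reproducible within one implementation

# If the sampling code changes (standing in for a newer llama.cpp
# sampler), the same seed no longer guarantees the same stream.
```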
Can you please check on your side? The following code usually does not give me any results:
from pyllamacpp.model import Model

def new_text_callback(text: str):
    print(text, end="", flush=True)

llama_config = {"n_ctx": 2048}
model = Model(ggml_model='gpt4all-lora-quantized-converted.bin', **llama_config)

# same sampling parameters as in the binary chat
gpt_config = {"n_predict": 128, "n_threads": 8, "repeat_last_n": 64,
              "temp": 0.1, "top_k": 40, "top_p": 0.95, "repeat_penalty": 1.3}

question = "Can you tell me, how did the Dutch obtain Manhattan, and what did it cost?"
model.generate(question, new_text_callback=new_text_callback, **gpt_config)
The result of the py-binding is:
llama_model_load: loading model from 'gpt4all-lora-quantized-converted.bin' - please wait ...
llama_model_load: n_vocab = 32001
llama_model_load: n_ctx = 2048
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: type = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size = 81.25 KB
llama_model_load: mem required = 5809.78 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from 'gpt4all-lora-quantized-converted.bin'
llama_model_load: model size = 4017.27 MB / num tensors = 291
llama_init_from_file: kv self size = 2048.00 MB
llama_generate: seed = 1681226838
system_info: n_threads = 8 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.100000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
generate: n_ctx = 2048, n_batch = 8, n_predict = 128, n_keep = 0
Can you tell me, how did the Dutch obtain Manhattan, and what did it cost? [end of text]
llama_print_timings: load time = 1957.01 ms
llama_print_timings: sample time = 0.60 ms / 1 runs ( 0.60 ms per run)
llama_print_timings: prompt eval time = 2502.97 ms / 20 tokens ( 125.15 ms per token)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: total time = 3492.68 ms
The binary version of chat or the nomic binding (although perhaps old) gives me this:
The Dutch obtained Manhattan from Native Americans in 1624 for beads worth $25 (approximately equivalent to about \$30 today).
The last answer is expected, but the result from the Python binding is not.
Can you try it with this:
question = "Can you tell me, how did the Dutch obtain Manhattan, and what did it cost?\n"
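One way to apply that suggestion robustly in the snippet above (a small helper of my own, not part of the pyllamacpp API) is to append the newline only when it is missing:

```python
def ensure_trailing_newline(prompt: str) -> str:
    # Without a trailing newline, the model may treat the prompt as an
    # unfinished line and emit [end of text] immediately.
    return prompt if prompt.endswith("\n") else prompt + "\n"

question = ensure_trailing_newline(
    "Can you tell me, how did the Dutch obtain Manhattan, and what did it cost?")
# model.generate(question, new_text_callback=new_text_callback, **gpt_config)
```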
I am working on a new version that enables interactive mode by default; this should solve these issues.
Please stay tuned.
Hi @horvatm,
Please try Interactive Dialogue from the readme page.
I think this will solve the issue.
Please feel free to reopen the issue if it is not solved.