nomic-ai/pygpt4all

comparing py-binding and binary gpt4all answers

horvatm opened this issue · 4 comments

Hi.

How similar should the responses of the Python binding and the compiled version of gpt4all be with the same seed and parameters? For example:

./gpt4all/chat$ ./gpt4all-lora-quantized-linux-x86 -m ../../gpt4all-lora-quantized.bin.orig -p "Hi."
main: seed = 1680859160
llama_model_load: loading model from '../../gpt4all-lora-quantized.bin.orig' - please wait ...
llama_model_load: ggml ctx size = 6065.35 MB
llama_model_load: memory_size =  2048.00 MB, n_mem = 65536
llama_model_load: loading model part 1/1 from '../../gpt4all-lora-quantized.bin.orig'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

system_info: n_threads = 4 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling parameters: temp = 0.100000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000

Hello!
 [end of text]

And the command

# `model` is a model instance loaded beforehand via the Python binding
gpt_config = {
    "n_predict": 128,
    "n_threads": 8,
    "temp": 0.1,
    "repeat_penalty": 1.3,
    "seed": 1680859160
}
generated_text = model.generate("Hi.", **gpt_config)
print(generated_text)

returns the result:

llama_generate: seed = 1680859160

system_info: n_threads = 8 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: temp = 0.100000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
generate: n_ctx = 2048, n_batch = 8, n_predict = 128, n_keep = 0

 Hi. I am a PhD student and have been working on my thesis for the past few years, but now that it's finally done (yay!), I find myself with some free time to do other things like travel or explore new hobbies/activities. However, since this is also peak season in terms of job applications and interviews happening right about now, should I still be focusing on my CV?
Yes, you definitely need a strong resume (CV) for any potential jobs that may come your way during the holiday period or after graduation. It's important to keep it updated with

llama_print_timings:        load time = 20637.20 ms
llama_print_timings:      sample time =   263.34 ms /   306 runs   (    0.86 ms per run)
llama_print_timings: prompt eval time = 14171.36 ms /    53 tokens (  267.38 ms per token)
llama_print_timings:        eval time = 53239.59 ms /   300 runs   (  177.47 ms per run)
llama_print_timings:       total time = 1512420.80 ms

Is this degree of difference expected?

@horvatm, the gpt4all binary is using a somewhat old version of llama.cpp, so you might get different results with pyllamacpp. Have you tried using the gpt4all model with the actual llama.cpp binary?

Can you PLEASE check on your side? The following code usually does not give me any results:

from pyllamacpp.model import Model

def new_text_callback(text: str):
    print(text, end="", flush=True)

llama_config = {"n_ctx": 2048}
model = Model(ggml_model='gpt4all-lora-quantized-converted.bin', **llama_config)

# same sampling parameters as in the binary chat
gpt_config = {"n_predict": 128, "n_threads": 8, "repeat_last_n": 64,
              "temp": 0.1, "top_k": 40, "top_p": 0.95, "repeat_penalty": 1.3}

question = "Can you tell me, how did the Dutch obtain Manhattan, and what did it cost?"

model.generate(question, new_text_callback=new_text_callback, **gpt_config)

The result of the py-binding is:

llama_model_load: loading model from 'gpt4all-lora-quantized-converted.bin' - please wait ...
llama_model_load: n_vocab = 32001
llama_model_load: n_ctx   = 2048
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from 'gpt4all-lora-quantized-converted.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  = 2048.00 MB
llama_generate: seed = 1681226838

system_info: n_threads = 8 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: temp = 0.100000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
generate: n_ctx = 2048, n_batch = 8, n_predict = 128, n_keep = 0

Can you tell me, how did the Dutch obtain Manhattan, and what did it cost? [end of text]

llama_print_timings:        load time =  1957.01 ms
llama_print_timings:      sample time =     0.60 ms /     1 runs   (    0.60 ms per run)
llama_print_timings: prompt eval time =  2502.97 ms /    20 tokens (  125.15 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time =  3492.68 ms

The binary chat version or the nomic binding (although perhaps old) gives me this:

The Dutch obtained Manhattan from Native Americans in 1624 for beads worth $25 (approximately equivalent to about $30 today).

The last answer is expected, but the result from the Python binding is not.

@horvatm

Can you try it with this:

question = "Can you tell me, how did the Dutch obtain Manhattan, and what did it cost?\n"

I am working on a new version that enables interactive mode by default; this will solve those issues.
Please stay tuned.

Hi @horvatm,

Please try the Interactive Dialogue example from the README page.
I think this will solve the issue.
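
For reference, an interactive loop built on the same pyllamacpp API as above looks roughly like this (see the README for the exact example):

from pyllamacpp.model import Model

def new_text_callback(text: str):
    print(text, end="", flush=True)

model = Model(ggml_model='gpt4all-lora-quantized-converted.bin', n_ctx=2048)

# simple dialogue loop: each user turn ends with a newline and the reply is
# streamed through the callback
while True:
    prompt = input("\nYou: ")
    if prompt.strip().lower() in ("exit", "quit"):
        break
    model.generate(prompt + "\n", new_text_callback=new_text_callback,
                   n_predict=128, temp=0.1, top_k=40, top_p=0.95,
                   repeat_last_n=64, repeat_penalty=1.3)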

Please feel free to reopen the issue if it is not solved.