comparing py-binding and binary gpt4all answers
horvatm opened this issue · 4 comments
Hi.
How similar should the responses of the Python binding and the compiled version of gpt4all be at the same seed and parameters? For example:
./gpt4all/chat$ ./gpt4all-lora-quantized-linux-x86 -m ../../gpt4all-lora-quantized.bin.orig -p "Hi."
main: seed = 1680859160
llama_model_load: loading model from '../../gpt4all-lora-quantized.bin.orig' - please wait ...
llama_model_load: ggml ctx size = 6065.35 MB
llama_model_load: memory_size = 2048.00 MB, n_mem = 65536
llama_model_load: loading model part 1/1 from '../../gpt4all-lora-quantized.bin.orig'
llama_model_load: .................................... done
llama_model_load: model size = 4017.27 MB / num tensors = 291
system_info: n_threads = 4 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling parameters: temp = 0.100000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
Hello!
[end of text]
And the command
gpt_config = {
"n_predict": 128,
"n_threads": 8,
"temp": 0.1,
"repeat_penalty":1.3,
"seed":1680859160
}
generated_text = model.generate("Hi.", **gpt_config)
print(generated_text)
returns the result:
llama_generate: seed = 1680859160
system_info: n_threads = 8 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.100000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
generate: n_ctx = 2048, n_batch = 8, n_predict = 128, n_keep = 0
Hi. I am a PhD student and have been working on my thesis for the past few years, but now that it's finally done (yay!), I find myself with some free time to do other things like travel or explore new hobbies/activities. However, since this is also peak season in terms of job applications and interviews happening right about now, should I still be focusing on my CV?
Yes, you definitely need a strong resume (CV) for any potential jobs that may come your way during the holiday period or after graduation. It's important to keep it updated with
llama_print_timings: load time = 20637.20 ms
llama_print_timings: sample time = 263.34 ms / 306 runs ( 0.86 ms per run)
llama_print_timings: prompt eval time = 14171.36 ms / 53 tokens ( 267.38 ms per token)
llama_print_timings: eval time = 53239.59 ms / 300 runs ( 177.47 ms per run)
llama_print_timings: total time = 1512420.80 ms
To what degree is this normal?
@horvatm, the gpt4all binary uses a somewhat old version of llama.cpp, so you might get different results with pyllamacpp. Have you tried using gpt4all with the current llama.cpp binary?
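As a minimal illustration of why this happens (using plain Python's `random` as a stand-in, not llama.cpp itself): a shared seed only guarantees identical output when the sampling code consuming the random stream is identical, so two different llama.cpp versions can diverge even at the same seed and parameters.

```python
import random

SEED = 1680859160  # the seed from the logs above

# Same seed + same sampling code => identical "token" choices.
random.seed(SEED)
run_a = [random.randrange(32001) for _ in range(5)]
random.seed(SEED)
run_b = [random.randrange(32001) for _ in range(5)]
assert run_a == run_b  # reproducible within one implementation

# If the sampling code changes (standing in for a newer llama.cpp
# sampler), the same seed no longer guarantees the same stream.
```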
Can you please check on your side? The following code usually does not give me any results:
from pyllamacpp.model import Model

def new_text_callback(text: str):
    print(text, end="", flush=True)

llama_config = {"n_ctx": 2048}
model = Model(ggml_model='gpt4all-lora-quantized-converted.bin', **llama_config)

# same sampling parameters as in the binary chat
gpt_config = {"n_predict": 128, "n_threads": 8, "repeat_last_n": 64,
              "temp": 0.1, "top_k": 40, "top_p": 0.95, "repeat_penalty": 1.3}

question = "Can you tell me, how did the Dutch obtain Manhattan, and what did it cost?"
model.generate(question, new_text_callback=new_text_callback, **gpt_config)
The result of the py-binding is:
llama_model_load: loading model from 'gpt4all-lora-quantized-converted.bin' - please wait ...
llama_model_load: n_vocab = 32001
llama_model_load: n_ctx = 2048
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: type = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size = 81.25 KB
llama_model_load: mem required = 5809.78 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from 'gpt4all-lora-quantized-converted.bin'
llama_model_load: model size = 4017.27 MB / num tensors = 291
llama_init_from_file: kv self size = 2048.00 MB
llama_generate: seed = 1681226838
system_info: n_threads = 8 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.100000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
generate: n_ctx = 2048, n_batch = 8, n_predict = 128, n_keep = 0
Can you tell me, how did the Dutch obtain Manhattan, and what did it cost? [end of text]
llama_print_timings: load time = 1957.01 ms
llama_print_timings: sample time = 0.60 ms / 1 runs ( 0.60 ms per run)
llama_print_timings: prompt eval time = 2502.97 ms / 20 tokens ( 125.15 ms per token)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: total time = 3492.68 ms
The binary version of chat or the nomic binding (although perhaps old) gives me this:
The Dutch obtained Manhattan from Native Americans in 1624 for beads worth $25 (approximately equivalent to about \$30 today).
The last answer is expected, but the result from the Python binding is not.
Can you try it with this:
question = "Can you tell me, how did the Dutch obtain Manhattan, and what did it cost?\n"
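One way to apply that suggestion robustly in the snippet above (a small helper of my own, not part of the pyllamacpp API) is to append the newline only when it is missing:

```python
def ensure_trailing_newline(prompt: str) -> str:
    # Without a trailing newline, the model may treat the prompt as an
    # unfinished line and emit [end of text] immediately.
    return prompt if prompt.endswith("\n") else prompt + "\n"

question = ensure_trailing_newline(
    "Can you tell me, how did the Dutch obtain Manhattan, and what did it cost?")
# model.generate(question, new_text_callback=new_text_callback, **gpt_config)
```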
I am working on a new version that enables interactive mode by default; this should solve these issues.
Please stay tuned.
Hi @horvatm,
Please try Interactive Dialogue from the readme page.
I think this will solve the issue.
Please feel free to reopen the issue if it is not solved.