Atome-FE/llama-node

llama-node/llama-cpp uses more memory than standalone llama.cpp with the same parameters

fardjad opened this issue · 3 comments

I'm trying to process a large text file. For the sake of reproducibility, let's use this. The following code:

import { LLM } from "llama-node";
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.js";
import path from "node:path";
import fs from "node:fs";

const model = path.resolve(
    process.cwd(),
    "/path/to/model.bin"
);
const llama = new LLM(LLamaCpp);
const prompt = fs.readFileSync("./path/to/file.txt", "utf-8");

await llama.load({
    enableLogging: true,
    modelPath: model,

    nCtx: 4096,
    nParts: -1,
    seed: 0,
    f16Kv: false,
    logitsAll: false,
    vocabOnly: false,
    useMlock: false,
    embedding: false,
    useMmap: false,
    nGpuLayers: 0,
});

await llama.createCompletion(
    {
        nThreads: 8,
        nTokPredict: 256,
        topK: 40,
        prompt,
    },
    (response) => {
        process.stdout.write(response.token);
    }
);

crashes the process with a segfault:

ggml_new_tensor_impl: not enough space in the scratch memory
segmentation fault  node index.mjs

When I compile the exact same version of llama.cpp and run it with the following args:

./main -m /path/to/ggml-vic7b-q5_1.bin -t 8 -c 4096 -n 256 -f ./big-input.txt

It runs fine (albeit with a warning that the requested context is larger than what the model supports), and it doesn't crash with a segfault.

Comparing the logs:

llama-node Logs
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 4096
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4936280.75 KB
llama_model_load_internal: mem required  = 6612.59 MB (+ 2052.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  = 4096.00 MB
[Sun, 28 May 2023 14:35:50 +0000 - INFO - llama_node_cpp::context] - AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
[Sun, 28 May 2023 14:35:50 +0000 - INFO - llama_node_cpp::llama] - tokenized_stop_prompt: None
ggml_new_tensor_impl: not enough space in the scratch memory

llama.cpp Logs
main: warning: model does not support context sizes greater than 2048 tokens (4096 specified);expect poor results
main: build = 561 (5ea4339)
main: seed  = 1685284790
llama.cpp: loading model from ../my-llmatic/models/ggml-vic7b-q5_1.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 4096
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  72.75 KB
llama_model_load_internal: mem required  = 6612.59 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  = 2048.00 MB

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = 256, n_keep = 0

Looks like the ggml ctx size in llama-node is about 4.7 GB (vs. 72.75 KB in llama.cpp), and the kv self size is twice as large as what llama.cpp used (4096 MB vs. 2048 MB).

I'm not sure if I'm missing something in my Load/Invocation config or if that's an issue in llama-node. Can you please have a look?
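
A back-of-the-envelope check on the kv self size (this assumes the usual cache layout of K and V stored per layer for every position, which I haven't verified): the cache would hold 2 × n_layer × n_ctx × n_embd elements, so f32 vs. f16 storage exactly accounts for the 4096 MB vs. 2048 MB difference:

// Rough KV cache size estimate for this 7B model at nCtx = 4096
// (assumes 2 * n_layer * n_ctx * n_embd elements; numbers taken from the logs above).
const nLayer = 32, nEmbd = 4096, nCtx = 4096;
const kvBytes = (elemSize) => 2 /* K and V */ * nLayer * nCtx * nEmbd * elemSize;
console.log(kvBytes(4) / 1024 ** 2, "MB"); // 4096 MB with an f32 cache (f16Kv: false)
console.log(kvBytes(2) / 1024 ** 2, "MB"); // 2048 MB with an f16 cache (f16Kv: true)

So f16Kv: false in my load config could explain the doubled kv self size (the llama.cpp CLI appears to default to an f16 cache), but not the scratch-memory crash by itself.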

Sure, will look into this soon.

I guess it was caused by useMmap?
llama.cpp enables mmap by default. From what I can see in your llama-node code example, you did not enable mmap to reuse the file cache in memory; that is probably why you ran out of memory, I think?
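If that is the cause, flipping useMmap in the load options should be enough so the weights get mapped from the model file instead of being copied into the ggml context. An untested sketch, otherwise identical to your example:

await llama.load({
    enableLogging: true,
    modelPath: model,
    nCtx: 4096,
    nParts: -1,
    seed: 0,
    f16Kv: false,
    logitsAll: false,
    vocabOnly: false,
    useMlock: false,
    embedding: false,
    useMmap: true, // map the model file instead of copying the weights into memory
    nGpuLayers: 0,
});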

I'm afraid that is not the case. Before you updated the version of llama.cpp, I couldn't run my example (with or without setting useMmap). Now it doesn't crash, but it doesn't seem to be doing anything either.

I recorded a video comparing llama-node and llama.cpp:

llama-node-issue.mp4

As you can see, llama-node sort of freezes with the larger input, whereas llama.cpp starts emitting tokens after ~30 secs.
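
In case it is useful for debugging, this is roughly how I'd tell "frozen" apart from "very slow": log the time between callback invocations (untested sketch, reusing llama and prompt from my first snippet):

let last = Date.now();
await llama.createCompletion(
    {
        nThreads: 8,
        nTokPredict: 256,
        topK: 40,
        prompt,
    },
    (response) => {
        // Log how long each token took to arrive, to tell a stall apart from slow generation.
        const now = Date.now();
        console.error(`+${now - last}ms ${JSON.stringify(response.token)}`);
        last = now;
    }
);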