Oh hell yes
Thank you for this! I've been using llama-cpp-wasm and the 2GB size restriction was a real stumbling block.
I'm finally attempting to replace llama_cpp_wasm with Wllama.
I was wondering if you have suggestions on how to replace some callbacks:
- Is there a `downloading` and/or `loading` callback? I'd like to keep the user informed about download progress.
- Is there a `chunk` callback, where the model returns the latest generated token? The readme doesn't mention any such ability, and a search for 'chunk' in the codebase only gives results referring to breaking the LLMs into chunks.
- How do I best abort inference?
- Should I `unload` a model before switching to a different one?
Unrelated: have you by any chance tried running Phi 3 with Wllama? I know the 128K context is not officially supported yet, but there does seem to be some success with getting a 64K context. I'm personally really looking forward to when Phi 3 128K is supported, as I suspect it would be the ultimate small "do it all" model for browser-based contexts.
More questions as I'm going along:
- The documentation doesn't mention what the defaults are for the various configuration options. Perhaps those could be added? It would be nice to know what the default context size and temperature are, for example.
- Options like `cache_type_k` seem important. What happens if I don't set them, or set them incorrectly? How should I set them? I'm loading a Q4_K_M model; should I set it to `q4_0`? Or does this mean that only `q4_0` quantization is supported?
Oh darn, the advanced example answers a lot of my questions, apologies: https://github.com/ngxson/wllama/blob/master/examples/advanced/index.html
Is this a bug in the example? Setting the same property twice:
wllama/examples/advanced/index.html, line 56 at commit 2450545
Cool project! Thanks for paying attention to wllama.
Is there a downloading and/or loading callback? I'd like to keep the user informed about download progress.
I planned to add one (along with cache control options), but there are still some issues. If you want, you can implement your own download function (with a progress callback), then pass the final buffer to `loadModel()` instead of using `loadModelFromUrl()` (see line 125 at commit 2450545).
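For what it's worth, here is a minimal sketch of such a manual download with progress reporting. It assumes `loadModel()` accepts the downloaded model data as a buffer, as suggested above; the helper name and exact signature are illustrative, so check wllama's typings before relying on it.

```js
// Minimal sketch: stream the model file manually so we can report progress,
// then hand the resulting buffer to loadModel() instead of loadModelFromUrl().
// The helper name and the exact loadModel() signature are assumptions here.
async function downloadWithProgress(url, onProgress) {
  const response = await fetch(url);
  const total = Number(response.headers.get('Content-Length')) || 0;
  const reader = response.body.getReader();
  const chunks = [];
  let loaded = 0;
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks.push(value);
    loaded += value.length;
    onProgress(loaded, total); // e.g. update a progress bar
  }
  return new Blob(chunks).arrayBuffer();
}

// Usage:
// const buffer = await downloadWithProgress(modelUrl, (loaded, total) =>
//   console.log(`downloaded ${loaded} / ${total} bytes`));
// await wllama.loadModel(buffer); // instead of loadModelFromUrl(modelUrl)
```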
Is there a chunk callback, where the model returns the latest generated token? The readme doesn't mention any such ability, and a search for 'chunk' in the codebase only gives results referring to breaking the LLMs into chunks.
If you want more control over the response, you can implement your own `createCompletion()`. All the lower-level APIs like `tokenize`, `decode`, ... are exposed (see line 224 at commit 2450545).
How do I best abort inference?
By implementing your own `createCompletion()`, you can abort the inference by interrupting the loop that generates new tokens.
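As a rough illustration (not the library's actual `createCompletion()` implementation), a hand-rolled generation loop with an abort flag could look like the sketch below. `tokenize` and `decode` are the exposed calls mentioned above; the `samplingInit`/`samplingSample`/`samplingAccept` names and return shapes are assumptions, so check wllama's exported methods before using them.

```js
// Sketch of a custom completion loop that can be aborted mid-generation.
// tokenize()/decode() are mentioned above as exposed lower-level APIs;
// the sampling* calls and their return shapes are assumed names here.
let aborted = false;
document.getElementById('stop-button').onclick = () => { aborted = true; };

async function generateWithAbort(wllama, prompt, maxTokens) {
  await wllama.samplingInit({ temp: 0.7, top_k: 40, top_p: 0.9 });
  const promptTokens = await wllama.tokenize(prompt);
  await wllama.decode(promptTokens, {});

  const decoder = new TextDecoder();
  let output = '';
  for (let i = 0; i < maxTokens; i++) {
    if (aborted) break; // interrupting the loop is the abort mechanism
    const { token, piece } = await wllama.samplingSample();
    output += typeof piece === 'string' ? piece : decoder.decode(piece);
    await wllama.samplingAccept([token]);
    await wllama.decode([token], {});
    // A real implementation would also stop on the model's end-of-generation token.
  }
  return output;
}
```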
Should I unload a model before switching to a different one?
Yes, since models are loaded into RAM, it's better to unload the model before loading a new one to prevent running out of RAM.
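A minimal sketch of that switch might look like the following. Whether the same `Wllama` instance can be reused after `exit()` is an assumption here; creating a fresh instance (with whatever wasm-path config the examples use) is the safer route if it cannot.

```js
// Sketch: free the current model's RAM before loading the next one.
// Reusing the instance after exit() is an assumption; a fresh Wllama
// instance may be required instead.
await wllama.exit();                          // unload the current model
wllama = new Wllama(WASM_PATHS);              // WASM_PATHS: placeholder for the paths config from the examples
await wllama.loadModelFromUrl(nextModelUrl);  // then load the new model
```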
The documentation doesn't mention what the defaults are for the various configuration options?
No, because many default options are defined inside llama.cpp (C++ code, not at the JavaScript level). I'm planning to copy them into this project in the future. This requires parsing the C++ code and either converting it into TS/JS or simply generating markdown documentation; either way will be quite complicated.
For now, you can see the default values in the llama.h file: https://github.com/ggerganov/llama.cpp/blob/master/llama.h
Options like cache_type_k seem important. What happens if I don't set them, or set them incorrectly? How should I set them? I'm loading a Q4_K_M model, should I set it to q4_0? Or does this mean that only q4_0 quantization is supported?
`cache_type_k` is controlled by llama.cpp, not at the JavaScript level. For now, llama.cpp uses f16 by default, but it also supports q4_0. Please note that support for the quantized KV cache is still quite experimental in llama.cpp and may degrade response quality.
Is this a bug in the example? Setting the same property twice:
Yes, it's a typo. Because the index.html file is not TypeScript, I don't get any suggestions from the IDE. One should be top_p and the other should be top_k.
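In other words, line 56 of the advanced example presumably intended something like the following (the numeric values here are illustrative, not necessarily the example's actual ones):

```js
// Corrected sampling options: one top_k and one top_p instead of the same
// property set twice. The numeric values are illustrative.
const output = await wllama.createCompletion(prompt, {
  nPredict: 256,
  sampling: {
    temp: 0.7,
    top_k: 40,
    top_p: 0.9,
  },
});
```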
Whoop! I've got an initial implementation working!
Now to get to the details.
I planned to add one (along with cache control options), but there are still some issues. If you want, you can implement your own download function (with a progress callback), then pass the final buffer to `loadModel()` instead of using `loadModelFromUrl()`
I went ahead and created a very minimal implementation of a download progress callback in a PR. It should hold me over until your preferred implementation is done, at which point I'll update.
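Roughly, usage on my side looks like the sketch below; the option name (`progressCallback`) and its argument shape are just what my minimal PR assumes, not (yet) an official wllama API, so they may well change.

```js
// Sketch of the kind of progress callback the PR adds; the option name
// and its arguments are assumptions and may change.
await wllama.loadModelFromUrl(modelUrl, {
  progressCallback: ({ loaded, total }) => {
    const pct = total ? Math.round((100 * loaded) / total) : 0;
    console.log(`Model download progress: ${pct}%`);
  },
});
```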
If you want more control over the response, you can implement your own `createCompletion()`
By looking at the advanced example I found the `onNewToken: (token, piece, currentText) => {` part, which was exactly what I needed.
I'm going to see if I can hack in an abort button next :-)
It seems I can simply call `exit()` on the Wllama object when the user wants to interrupt inference. The model will then need to be reloaded, but that's ok.
I created an extremely minimalist way to interrupt the inference here:
flatsiedatsie@a9fe166
Wllama now has a built-in interruption ability.
window.interrupt_wllama = false;
let response_so_far = "";

const outputText = await window.llama_cpp_app.createCompletion(total_prompt, {
  nPredict: 500,
  sampling: {
    temp: 0.7,
    top_k: 40,
    top_p: 0.9,
  },
  onNewToken: (token, piece, currentText, { abortSignal }) => {
    if (window.interrupt_wllama) {
      console.log("sending interrupt signal to Wllama");
      abortSignal();
    } else {
      //console.log("wllama: onNewToken: token,piece,currentText:", token, piece, currentText);
      // Forward only the newly generated part of the text to the UI.
      const new_chunk = currentText.substr(response_so_far.length);
      window.handle_chunk(my_task, response_so_far, new_chunk);
      response_so_far = currentText;
    }
  },
});