C# bindings for llama.cpp, including a .NET Core library and sample projects (CLI & Web API).
Build (on Windows, use the VS2022 x64 command prompt; on Linux, make sure cmake and the .NET SDK are installed):
git clone --recursive https://github.com/dranger003/llama.cpp-dotnet.git
cd llama.cpp-dotnet
dotnet build -c Release /p:Platform="Any CPU"
If you don't need to compile the native libraries, you can also append /p:NativeLibraries=OFF to the dotnet build command above.
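For example:
dotnet build -c Release /p:Platform="Any CPU" /p:NativeLibraries=OFF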
Windows:
cd Release
LlamaCppCli.exe 0 <model_path>
Linux:
cd Release
./LlamaCppCli 0 <model_path>
Usage: LlamaCppCli.dll <SampleIndex> <SampleArgs>
Usage: LlamaCppCli.dll 0 model_path [gpu_layers] [ctx_length] [template]
Usage: LlamaCppCli.dll 1 base_url model_name [gpu_layers] [ctx_length] [template]
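For example (the model file, base URL, and model name below are placeholders):
LlamaCppCli.dll 0 ggml-model-13b-Q8_0.bin 24 2048
LlamaCppCli.dll 1 http://localhost:5021 ggml-model-13b-Q8_0 24 2048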
You will need a model in GGML format. The 13B parameter models appear to perform well if you have the memory (8-12GB depending on the quantization). If you have a lot of RAM (i.e. 48GB+), you could try a 65B version, though it is much slower at prediction, especially without a GPU.
A lot of models can be found below.
using LlamaCppLib;

// Configure some model options
var modelOptions = new LlamaCppModelOptions
{
    ContextSize = 2048,
    GpuLayers = 24,
    // ...
};

// Load model file
using var model = new LlamaCppModel();
model.Load(@"ggml-model-13b-Q8_0.bin", modelOptions);

// Configure some prediction options
var generateOptions = new LlamaCppGenerateOptions
{
    ThreadCount = 4,
    TopK = 40,
    TopP = 0.95f,
    Temperature = 0.1f,
    RepeatPenalty = 1.1f,
    Mirostat = Mirostat.Mirostat2,
    // ...
};

// Prompt template (illustrative; adjust to the chat format your model expects, {0} is replaced with the user prompt)
var template = "USER: {0}\nASSISTANT: ";

// Create conversation session
var session = model.CreateSession();

while (true)
{
    // Get a prompt
    Console.Write("> ");
    var prompt = Console.ReadLine();

    // Quit on blank prompt
    if (String.IsNullOrWhiteSpace(prompt))
        break;

    // Set-up prompt using template
    prompt = String.Format(template, prompt);

    // Generate tokens
    await foreach (var token in session.GenerateTokenStringAsync(prompt, generateOptions))
        Console.Write(token);
}
GET /model/list
GET /model/load?modelName={modelName}&modelOptions={modelOptions}
GET /model/unload
GET /model/status
GET /model/tokenize?prompt={prompt}
GET /model/reset
GET /session/create
GET /session/list
GET /session/get
GET /session/reset
POST /model/generate [RequestBody]
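Below is a minimal client sketch for these endpoints, assuming the Web API is listening locally; the port, model name, and the JSON body sent to /model/generate are placeholders, not the project's actual contract.

using System;
using System.Net.Http;
using System.Text;

var http = new HttpClient { BaseAddress = new Uri("http://localhost:5021") }; // placeholder port

// List the available models, then load one (query values are illustrative)
Console.WriteLine(await http.GetStringAsync("/model/list"));
await http.GetAsync("/model/load?modelName=ggml-model-13b-Q8_0");

// Request a prediction; check the Web API project for the actual request body schema
var response = await http.PostAsync(
    "/model/generate",
    new StringContent("{ \"prompt\": \"Hello\" }", Encoding.UTF8, "application/json"));
Console.WriteLine(await response.Content.ReadAsStringAsync());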
- Dynamic model loading
- Expose minimal API
- Support Windows/Linux
- Support BERT
ggerganov/llama.cpp for the LLaMA implementation in C++
skeskinen/bert.cpp for BERT support