C Transformers

Python bindings for Transformer models implemented in C/C++ using the GGML library.

Also see ChatDocs

Supported Models
Installation
Usage
Documentation
License

Supported Models

Models	Model Type
GPT-2	`gpt2`
GPT-J, GPT4All-J	`gptj`
GPT-NeoX, StableLM	`gpt_neox`
LLaMA	`llama`
MPT	`mpt`
Dolly V2	`dolly-v2`
StarCoder, StarChat	`starcoder`
Falcon (Experimental)	`falcon`
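
The table above can be captured as a small lookup so that scripts pass the right `model_type`. This is only a sketch: `MODEL_TYPES` and `model_type_for` are illustrative names and are not part of the library.

```python
# Hypothetical helper: maps the model names from the table above to the
# `model_type` strings listed next to them.
MODEL_TYPES = {
    "GPT-2": "gpt2",
    "GPT-J": "gptj",
    "GPT4All-J": "gptj",
    "GPT-NeoX": "gpt_neox",
    "StableLM": "gpt_neox",
    "LLaMA": "llama",
    "MPT": "mpt",
    "Dolly V2": "dolly-v2",
    "StarCoder": "starcoder",
    "StarChat": "starcoder",
    "Falcon": "falcon",  # marked experimental in the table
}


def model_type_for(name: str) -> str:
    """Return the model_type for a model name, ignoring case."""
    lookup = {key.lower(): value for key, value in MODEL_TYPES.items()}
    try:
        return lookup[name.strip().lower()]
    except KeyError:
        raise ValueError(f"Unsupported model: {name!r}") from None
```

For example, `model_type_for('StableLM')` gives the value to pass as `model_type` when loading a StableLM file.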

Installation

pip install ctransformers

For GPU (CUDA) support, set the environment variable CT_CUBLAS=1 and install from source using:

CT_CUBLAS=1 pip install ctransformers --no-binary ctransformers

On Windows PowerShell run:

$env:CT_CUBLAS=1
pip install ctransformers --no-binary ctransformers

On Windows Command Prompt run:

set CT_CUBLAS=1
pip install ctransformers --no-binary ctransformers
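
After installing, a quick check that the package is importable can save a confusing traceback later. The sketch below uses only the standard library; `is_installed` is an illustrative helper name.

```python
import importlib.util


def is_installed(package: str = "ctransformers") -> bool:
    """Return True if `package` can be found by the current interpreter."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    print("ctransformers installed:", is_installed())
```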

Usage

It provides a unified interface for all models:

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained('/path/to/ggml-gpt-2.bin', model_type='gpt2')

print(llm('AI is going to'))

If you get an illegal instruction error, try using lib='avx' or lib='basic':

llm = AutoModelForCausalLM.from_pretrained('/path/to/ggml-gpt-2.bin', model_type='gpt2', lib='avx')
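
One way to choose a `lib` value up front is to look at the CPU feature flags. The sketch below assumes you already have the flags as a collection of strings (on Linux they appear on the `flags` line of `/proc/cpuinfo`). `pick_lib` is a hypothetical helper, and the mapping from flags to `lib` values is an assumption based on the option names.

```python
from typing import Iterable


def pick_lib(cpu_flags: Iterable[str]) -> str:
    """Pick a `lib` value from CPU feature flags (illustrative heuristic)."""
    flags = {flag.lower() for flag in cpu_flags}
    if "avx2" in flags:
        return "avx2"
    if "avx" in flags:
        return "avx"
    return "basic"  # most conservative option
```

The result can then be passed as `lib=pick_lib(flags)` when loading the model.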

It provides a generator interface for more control:

tokens = llm.tokenize('AI is going to')

for token in llm.generate(tokens):
    print(llm.detokenize(token))

It can be used with a custom or Hugging Face tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')

tokens = tokenizer.encode('AI is going to')

for token in llm.generate(tokens):
    print(tokenizer.decode(token))

It also provides access to the low-level C API. See the Documentation section below.

Hugging Face Hub

It can be used with models hosted on the Hub:

llm = AutoModelForCausalLM.from_pretrained('marella/gpt-2-ggml')

If a model repo has multiple model files (.bin files), specify a model file using:

llm = AutoModelForCausalLM.from_pretrained('marella/gpt-2-ggml', model_file='ggml-model.bin')

It can be used with your own models uploaded to the Hub. For a better user experience, upload only one model per repo.

To use it with your own model, add a config.json file to your model repo specifying the model_type:

{
  "model_type": "gpt2"
}

You can also specify additional parameters under task_specific_params.text-generation.

See marella/gpt-2-ggml for a minimal example and marella/gpt-2-ggml-example for a full example.
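
As an illustration, a config.json with generation defaults could be produced like this. The sketch assumes the keys under `task_specific_params.text-generation` use the same names as the Config parameters in the Documentation section; the full example repo is the authoritative reference for the layout.

```python
import json

# Assumed layout: parameter names mirror the Config table in the
# Documentation section. The values here are only examples.
config = {
    "model_type": "gpt2",
    "task_specific_params": {
        "text-generation": {
            "top_k": 40,
            "top_p": 0.95,
            "temperature": 0.8,
            "max_new_tokens": 256,
        }
    },
}

config_json = json.dumps(config, indent=2)
print(config_json)
```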

LangChain

It is integrated into LangChain. See LangChain docs.

GPU

Note: Currently only LLaMA models have GPU support.

To run some of the model layers on GPU, set the gpu_layers parameter:

llm = AutoModelForCausalLM.from_pretrained('/path/to/ggml-llama.bin', model_type='llama', gpu_layers=50)

Documentation

Config

Parameter	Type	Description	Default
`top_k`	`int`	The top-k value to use for sampling.	`40`
`top_p`	`float`	The top-p value to use for sampling.	`0.95`
`temperature`	`float`	The temperature to use for sampling.	`0.8`
`repetition_penalty`	`float`	The repetition penalty to use for sampling.	`1.1`
`last_n_tokens`	`int`	The number of last tokens to use for repetition penalty.	`64`
`seed`	`int`	The seed value to use for sampling tokens.	`-1`
`max_new_tokens`	`int`	The maximum number of new tokens to generate.	`256`
`stop`	`List[str]`	A list of sequences to stop generation when encountered.	`None`
`stream`	`bool`	Whether to stream the generated text.	`False`
`reset`	`bool`	Whether to reset the model state before generating text.	`True`
`batch_size`	`int`	The batch size to use for evaluating tokens.	`8`
`threads`	`int`	The number of threads to use for evaluating tokens.	`-1`
`context_length`	`int`	The maximum context length to use.	`-1`
`gpu_layers`	`int`	The number of layers to run on GPU.	`0`

Note: Currently only LLaMA and MPT models support the context_length parameter and only LLaMA models support the gpu_layers parameter.
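
When the same settings are shared across model types, the restrictions in the note above can be applied before loading. This is a sketch: `RESTRICTED_PARAMS` and `supported_params` are hypothetical names, and all other parameters are assumed to apply to every model type.

```python
# Support matrix taken from the note above; everything else in the
# Config table is assumed to be accepted by all model types.
RESTRICTED_PARAMS = {
    "context_length": {"llama", "mpt"},
    "gpu_layers": {"llama"},
}


def supported_params(model_type: str, **params) -> dict:
    """Drop parameters that `model_type` does not support."""
    return {
        name: value
        for name, value in params.items()
        # Unrestricted parameters fall back to a set containing model_type.
        if model_type in RESTRICTED_PARAMS.get(name, {model_type})
    }
```

For example, `supported_params('gpt2', top_k=10, gpu_layers=50)` keeps only `top_k`.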

`class` `AutoModelForCausalLM`

`classmethod` `AutoModelForCausalLM.from_pretrained`

from_pretrained(
    model_path_or_repo_id: str,
    model_type: Optional[str] = None,
    model_file: Optional[str] = None,
    config: Optional[ctransformers.hub.AutoConfig] = None,
    lib: Optional[str] = None,
    local_files_only: bool = False,
    **kwargs
) → LLM

Loads the language model from a local file or remote repo.

Args:

model_path_or_repo_id: The path to a model file or directory or the name of a Hugging Face Hub model repo.
model_type: The model type.
model_file: The name of the model file in repo or directory.
config: AutoConfig object.
lib: The path to a shared library or one of avx2, avx, basic.
local_files_only: Whether or not to only look at local files (i.e., do not try to download the model).

Returns: LLM object.
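
A thin wrapper can make local loading fail early with a clear message. This is a minimal sketch: `load_local_model` is a hypothetical helper, and it assumes that extra keyword arguments are treated as Config values, as the `gpu_layers` example in the GPU section suggests.

```python
import os


def load_local_model(model_path: str, model_type: str, **config):
    """Load a local model, failing early if the path does not exist."""
    if not os.path.exists(model_path):
        raise FileNotFoundError(f"No model file or directory at {model_path!r}")

    # Imported lazily so the path check works even without the package.
    from ctransformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_path,
        model_type=model_type,
        local_files_only=True,  # do not try to download the model
        **config,  # assumed to be Config values, e.g. gpu_layers=50
    )
```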

`class` `LLM`

`method` `LLM.__init__`

__init__(
    model_path: str,
    model_type: str,
    config: Optional[ctransformers.llm.Config] = None,
    lib: Optional[str] = None
)

Loads the language model from a local file.

Args:

model_path: The path to a model file.
model_type: The model type.
config: Config object.
lib: The path to a shared library or one of avx2, avx, basic.

`property` LLM.config

The config object.

`property` LLM.context_length

The context length of the model.

`property` LLM.embeddings

The input embeddings.

`property` LLM.eos_token_id

The end-of-sequence token.

`property` LLM.logits

The unnormalized log probabilities.
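
Unnormalized log probabilities can be turned into a probability distribution with a softmax. The helper below is plain Python and makes no library calls; it assumes the logits are available as a flat sequence of floats.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert unnormalized log probabilities into probabilities."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Under that assumption, `softmax(llm.logits)` would give one probability per vocabulary entry.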

`property` LLM.model_path

The path to the model file.

`property` LLM.model_type

The model type.

`property` LLM.vocab_size

The number of tokens in the vocabulary.

`method` `LLM.detokenize`

detokenize(tokens: Sequence[int], decode: bool = True) → Union[str, bytes]

Converts a list of tokens to text.

Args:

tokens: The list of tokens.
decode: Whether to decode the text as UTF-8 string.

Returns: The combined text of all tokens.
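
A multi-byte UTF-8 character may be split across tokens, so decoding each token on its own can produce garbled text. One way around this, sketched below, is to collect raw bytes with `decode=False` and decode once at the end. `tokens_to_text` is an illustrative helper, not part of the library.

```python
from typing import Iterable


def tokens_to_text(llm, tokens: Iterable[int]) -> str:
    """Detokenize token by token as bytes, then decode once at the end."""
    raw = b"".join(llm.detokenize([token], decode=False) for token in tokens)
    # Replace any bytes that still do not form valid UTF-8.
    return raw.decode("utf-8", errors="replace")
```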

`method` `LLM.embed`

embed(
    input: Union[str, Sequence[int]],
    batch_size: Optional[int] = None,
    threads: Optional[int] = None
) → List[float]

Computes embeddings for a text or list of tokens.

Note: Currently only LLaMA models support embeddings.

Args:

input: The input text or list of tokens to get embeddings for.
batch_size: The batch size to use for evaluating tokens. Default: 8
threads: The number of threads to use for evaluating tokens. Default: -1

Returns: The input embeddings.
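
Embeddings are usually compared rather than read directly. As a sketch, cosine similarity between two embedding vectors can be computed in plain Python; `cosine_similarity` is an illustrative helper, not part of the library.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)
```

With a LLaMA model loaded as `llm`, this could be used as `cosine_similarity(llm.embed('first text'), llm.embed('second text'))`.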

`method` `LLM.eval`

eval(
    tokens: Sequence[int],
    batch_size: Optional[int] = None,
    threads: Optional[int] = None
) → None

Evaluates a list of tokens.

Args:

tokens: The list of tokens to evaluate.
batch_size: The batch size to use for evaluating tokens. Default: 8
threads: The number of threads to use for evaluating tokens. Default: -1

`method` `LLM.generate`

generate(
    tokens: Sequence[int],
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None,
    batch_size: Optional[int] = None,
    threads: Optional[int] = None,
    reset: Optional[bool] = None
) → Generator[int, NoneType, NoneType]

Generates new tokens from a list of tokens.

Args:

tokens: The list of tokens to generate tokens from.
top_k: The top-k value to use for sampling. Default: 40
top_p: The top-p value to use for sampling. Default: 0.95
temperature: The temperature to use for sampling. Default: 0.8
repetition_penalty: The repetition penalty to use for sampling. Default: 1.1
last_n_tokens: The number of last tokens to use for repetition penalty. Default: 64
seed: The seed value to use for sampling tokens. Default: -1
batch_size: The batch size to use for evaluating tokens. Default: 8
threads: The number of threads to use for evaluating tokens. Default: -1
reset: Whether to reset the model state before generating text. Default: True

Returns: The generated tokens.
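
The signature above has no `max_new_tokens` parameter, so the caller decides when to stop consuming the generator. A minimal sketch using `itertools.islice` follows; `generate_capped` is a hypothetical helper.

```python
from itertools import islice


def generate_capped(llm, prompt: str, max_new_tokens: int = 32, **sampling) -> str:
    """Generate at most `max_new_tokens` tokens and return them as text."""
    tokens = llm.tokenize(prompt)
    # islice stops pulling from the generator once the cap is reached.
    new_tokens = list(islice(llm.generate(tokens, **sampling), max_new_tokens))
    return llm.detokenize(new_tokens)
```

Sampling options such as `temperature=0.5` or `seed=42` are forwarded unchanged to `generate`.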

`method` `LLM.is_eos_token`

is_eos_token(token: int) → bool

Checks if a token is an end-of-sequence token.

Args:

token: The token to check.

Returns: True if the token is an end-of-sequence token else False.

`method` `LLM.reset`

reset() → None

Resets the model state.

`method` `LLM.sample`

sample(
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None
) → int

Samples a token from the model.

Args:

top_k: The top-k value to use for sampling. Default: 40
top_p: The top-p value to use for sampling. Default: 0.95
temperature: The temperature to use for sampling. Default: 0.8
repetition_penalty: The repetition penalty to use for sampling. Default: 1.1
last_n_tokens: The number of last tokens to use for repetition penalty. Default: 64
seed: The seed value to use for sampling tokens. Default: -1

Returns: The sampled token.
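
For full control over the loop, `reset`, `eval`, `sample` and `is_eos_token` can be combined by hand. The sketch below assumes that `sample` draws from the model state left by the most recent `eval` call; `generate_manually` is an illustrative name and this is not necessarily how `generate` is implemented.

```python
from typing import List, Sequence


def generate_manually(llm, prompt_tokens: Sequence[int], max_new_tokens: int = 32) -> List[int]:
    """Generate tokens one at a time using the low-level methods."""
    llm.reset()             # start from a clean model state
    llm.eval(prompt_tokens)  # feed the prompt
    output = []
    for _ in range(max_new_tokens):
        token = llm.sample()
        if llm.is_eos_token(token):
            break
        output.append(token)
        llm.eval([token])   # feed the new token back into the model
    return output
```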

`method` `LLM.tokenize`

tokenize(text: str) → List[int]

Converts text into a list of tokens.

Args:

text: The text to tokenize.

Returns: The list of tokens.
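
Token counts are useful for checking that a prompt leaves room for the output. The sketch below assumes `LLM.context_length` reports a positive number of tokens; `fits_context` is a hypothetical helper.

```python
def fits_context(llm, prompt: str, max_new_tokens: int = 256) -> bool:
    """Check whether the prompt plus the requested output fits the context."""
    limit = llm.context_length
    if limit is None or limit <= 0:
        return True  # no usable limit reported; nothing to check against
    return len(llm.tokenize(prompt)) + max_new_tokens <= limit
```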

`method` `LLM.__call__`

__call__(
    prompt: str,
    max_new_tokens: Optional[int] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None,
    batch_size: Optional[int] = None,
    threads: Optional[int] = None,
    stop: Optional[Sequence[str]] = None,
    stream: Optional[bool] = None,
    reset: Optional[bool] = None
) → Union[str, Generator[str, NoneType, NoneType]]

Generates text from a prompt.

Args:

prompt: The prompt to generate text from.
max_new_tokens: The maximum number of new tokens to generate. Default: 256
top_k: The top-k value to use for sampling. Default: 40
top_p: The top-p value to use for sampling. Default: 0.95
temperature: The temperature to use for sampling. Default: 0.8
repetition_penalty: The repetition penalty to use for sampling. Default: 1.1
last_n_tokens: The number of last tokens to use for repetition penalty. Default: 64
seed: The seed value to use for sampling tokens. Default: -1
batch_size: The batch size to use for evaluating tokens. Default: 8
threads: The number of threads to use for evaluating tokens. Default: -1
stop: A list of sequences to stop generation when encountered. Default: None
stream: Whether to stream the generated text. Default: False
reset: Whether to reset the model state before generating text. Default: True

Returns: The generated text, or a generator that yields text chunks when stream is True.
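
With `stream=True` the call returns a generator of text chunks instead of a single string. A small sketch that prints chunks as they arrive and also returns the full text follows; `stream_to_string` is an illustrative helper.

```python
def stream_to_string(llm, prompt: str, **options) -> str:
    """Print chunks as they arrive and return the full text."""
    chunks = []
    for chunk in llm(prompt, stream=True, **options):
        print(chunk, end="", flush=True)
        chunks.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(chunks)
```

Other documented parameters, such as `stop` or `max_new_tokens`, can be passed through `options`.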

License

MIT
