InferenceMax is a flexible library for text generation using the MAX engine. It provides a streamlined workflow for loading, exporting, and running inference on various model architectures. The project is organized as follows:
```
inferencemax/
├── data/
│   ├── __init__.py
│   ├── export.py
│   ├── hf.py
│   ├── load.py
│   └── onnx.py
├── utils/
│   ├── __init__.py
│   ├── decorators.py
│   └── logger.py
├── __init__.py
├── generator.py
├── initializer.py
├── kv_cache.py
├── sampler.py
├── text_generation.py
└── tokenizer.py
```
Key features:

- Support for loading models from Hugging Face and ONNX formats
- Efficient model export to ONNX format
- Customizable text generation pipeline
- KV-cache support for improved inference speed
- Flexible sampling strategies (temperature, top-k); see the sampling sketch after this list
- Comprehensive logging and timing decorators
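The sampling strategies listed above follow the usual pattern: logits are scaled by a temperature and optionally restricted to the k most likely tokens before drawing the next token. The snippet below is a minimal, illustrative sketch of that idea in NumPy; it is not the code in `sampler.py`, and the function name `sample_next_token` is hypothetical.

```python
from typing import Optional

import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 1.0,
                      top_k: Optional[int] = None) -> int:
    """Illustrative temperature + top-k sampling over a 1-D logits vector."""
    # Temperature scaling: higher values flatten the distribution, lower values sharpen it.
    scaled = logits / max(temperature, 1e-8)

    # Top-k filtering: mask everything below the k-th largest logit.
    if top_k is not None and top_k < scaled.shape[0]:
        cutoff = np.sort(scaled)[-top_k]
        scaled = np.where(scaled < cutoff, -np.inf, scaled)

    # Numerically stable softmax over the (possibly filtered) logits.
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Draw a token id from the resulting distribution.
    return int(np.random.choice(scaled.shape[0], p=probs))
```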
Here's a basic example of how to use InferenceMax:
```python
from inferencemax.data.load import load_model, load_tokenizer
from inferencemax.text_generation import generate_text

# Load model and tokenizer
model_path = "path/to/your/model"
model = load_model(model_path)
tokenizer = load_tokenizer(model_path)

# Generate text
input_text = "Once upon a time"
generated_text = generate_text(model, tokenizer, input_text)
print(generated_text)
```
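Under the hood, generation of this kind is an autoregressive loop: the model produces logits for the next token, a token is chosen, and it is appended to the input for the next step. The sketch below shows the general shape of such a loop with a KV cache (as suggested by `kv_cache.py` and the feature list above); the model interface, the `past_key_values` name, and the helper calls are assumptions in the style of Hugging Face models, not InferenceMax's actual API.

```python
import torch

def greedy_generate(model, tokenizer, prompt: str, max_new_tokens: int = 50) -> str:
    """Illustrative greedy decoding loop that reuses the KV cache between steps."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    past_key_values = None          # the KV cache, filled on the first forward pass
    generated = input_ids

    for _ in range(max_new_tokens):
        with torch.no_grad():
            # After the first step, only the newest token needs to be fed in,
            # because keys/values for earlier positions are already cached.
            step_input = generated if past_key_values is None else generated[:, -1:]
            out = model(step_input, past_key_values=past_key_values, use_cache=True)

        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)

        if next_token.item() == tokenizer.eos_token_id:
            break

    return tokenizer.decode(generated[0], skip_special_tokens=True)
```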
InferenceMax also provides a command-line interface for easy text generation:
```bash
python cli.py --model_path "path/to/your/model" --input_text "Once upon a time" --max_new_tokens 50
```
You can customize the generation parameters using a YAML configuration file:
```yaml
max_new_tokens: 50
temperature: 0.8
top_k: 40
```
Then use it with the CLI:
```bash
python cli.py --model_path "path/to/your/model" --input_text "Once upon a time" --config_path "path/to/config.yaml"
```
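A CLI like the one above is typically a thin wrapper that parses the flags, optionally reads the YAML file, and lets explicit flags override config values. The sketch below is a hypothetical reconstruction of that pattern using argparse and PyYAML; it mirrors the documented flags but is not necessarily how `cli.py` is written.

```python
import argparse

import yaml

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="InferenceMax text generation CLI")
    parser.add_argument("--model_path", required=True)
    parser.add_argument("--input_text", required=True)
    parser.add_argument("--max_new_tokens", type=int, default=None)
    parser.add_argument("--config_path", default=None)
    return parser.parse_args()

def load_generation_config(args: argparse.Namespace) -> dict:
    """Merge YAML config values with CLI flags (explicit flags win when both are set)."""
    config = {"max_new_tokens": 50, "temperature": 1.0, "top_k": None}
    if args.config_path:
        with open(args.config_path) as f:
            config.update(yaml.safe_load(f) or {})
    if args.max_new_tokens is not None:
        config["max_new_tokens"] = args.max_new_tokens
    return config

if __name__ == "__main__":
    args = parse_args()
    gen_config = load_generation_config(args)
    print(gen_config)
```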
Contributions are welcome! Please feel free to submit a pull request. This project does not aim to replace vLLM or similar engines; rather, it is a place to learn and experiment.
This project is licensed under the terms described in the LICENSE file in the repository root.