This project implements a secure HTTPS server for running private inference sessions on Large Language Models (LLMs) using Rust and Actix Web.
Features:

- Secure HTTPS server with SSL/TLS support
- CORS-enabled API endpoints
- Configurable LLM model loading and inference
- Health check endpoint
- Server information endpoint
- Generate text based on prompts
- Support for multiple model architectures
Prerequisites:

- Rust programming environment
- OpenSSL library
- SSL certificates (cert.pem and key.pem)
The server is configurable through environment variables or a configuration file. Key configuration options include:
- LLM model path and architecture
- Server address and port
- Allowed origin for CORS
- Maximum token count for inference
Create a file named `.env` in the root directory of your project with the following content:

```
SERVER_ADDRESS=localhost
SERVER_PORT=8080
SERVER_REQUEST_TIMEOUT_IN_SECONDS=10
MACHINE_COMMAND_TIMEOUT_IN_SECONDS=10
DATABASE_URL=""
MAX_CONNECTIONS=10
ALLOWED_ORIGIN="localhost"
MAX_AGE=4600
LLM_MODEL="open_llama_7b-f16.bin"
LLM_MODEL_ARCHITECTURE="llama"
LLM_INFERENCE_MAX_TOKEN_COUNT=400
```
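For reference, below is a minimal sketch of how these variables might be read at startup, assuming the `dotenvy` crate and `std::env`. The `Settings` struct, its field names, and the fallback defaults are illustrative only; the project's actual settings module may load and validate configuration differently.

```rust
use std::env;

/// Illustrative settings struct; field names mirror the .env keys above.
#[derive(Debug)]
struct Settings {
    server_address: String,
    server_port: u16,
    llm_model: String,
    llm_model_architecture: String,
    llm_inference_max_token_count: usize,
}

impl Settings {
    /// Read each variable from the environment, falling back to a default
    /// when it is missing. Real code would likely surface errors instead.
    fn from_env() -> Self {
        let get = |key: &str, default: &str| env::var(key).unwrap_or_else(|_| default.to_string());
        Settings {
            server_address: get("SERVER_ADDRESS", "localhost"),
            server_port: get("SERVER_PORT", "8080").parse().unwrap_or(8080),
            llm_model: get("LLM_MODEL", "open_llama_7b-f16.bin"),
            llm_model_architecture: get("LLM_MODEL_ARCHITECTURE", "llama"),
            llm_inference_max_token_count: get("LLM_INFERENCE_MAX_TOKEN_COUNT", "400")
                .parse()
                .unwrap_or(400),
        }
    }
}

fn main() {
    // Load the .env file into the process environment before reading values.
    dotenvy::dotenv().ok();
    let settings = Settings::from_env();
    println!("{settings:?}");
}
```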
API endpoints:

- `GET /`: Server information
- `POST /api/generate`: Generate text based on a prompt
- `GET /api/health`: Health check endpoint
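As a rough illustration of how these three endpoints could be wired up with Actix Web, here is a minimal sketch. The handler names (`info`, `generate`, `health`), the `GenerateRequest` struct, and the response bodies are hypothetical; the real server also layers in TLS, CORS, middleware, and the loaded model state, which are omitted here.

```rust
use actix_web::{web, App, HttpResponse, HttpServer, Responder};
use serde::Deserialize;

// Hypothetical request body for /api/generate: {"prompt": "..."}
#[derive(Deserialize)]
struct GenerateRequest {
    prompt: String,
}

async fn info() -> impl Responder {
    HttpResponse::Ok().body("server information")
}

async fn health() -> impl Responder {
    HttpResponse::Ok().body("ok")
}

async fn generate(req: web::Json<GenerateRequest>) -> impl Responder {
    // A real handler would run the loaded model on req.prompt.
    HttpResponse::Ok().body(format!("echo: {}", req.prompt))
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| {
        App::new()
            .route("/", web::get().to(info))
            .route("/api/generate", web::post().to(generate))
            .route("/api/health", web::get().to(health))
    })
    .bind(("localhost", 8080))? // plain HTTP for brevity; see the Security section for TLS
    .run()
    .await
}
```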
To build the project, run:
```
cargo build --release --features serve
```
If everything is configured correctly, you should be greeted with output like:

```
...
Loaded tensor 288/291
Loading of model complete
Model size = 12853.47 MB / num tensors = 291
open_llama_7b-f16.bin model (llama) has been started!
Elapsed: 96ms
Starting server at https://localhost:8080.
```
To start an inference session, send a POST request to the `/api/generate` endpoint:

```
curl -k -X POST -H "Content-Type: application/json" -d '{"prompt": "Say hello!"}' https://localhost:8080/api/generate
```
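The `-k` flag tells curl to skip certificate verification, which is only appropriate for self-signed certificates during local testing. For completeness, here is a hedged sketch of the same request made from Rust using the `reqwest` and `tokio` crates (not part of this project); the response is simply printed as text, since its exact shape depends on the server:

```rust
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // Accept the self-signed certificate, mirroring curl's -k flag.
    // Never do this against a production endpoint.
    let client = reqwest::Client::builder()
        .danger_accept_invalid_certs(true)
        .build()?;

    let response = client
        .post("https://localhost:8080/api/generate")
        .json(&json!({ "prompt": "Say hello!" }))
        .send()
        .await?;

    println!("status: {}", response.status());
    println!("body: {}", response.text().await?);
    Ok(())
}
```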
Performance and outputs depend significantly on the model size and hardware capabilities. Below are estimated performance metrics for Llama models on various MacBook configurations:
Model Size | Hardware | Approx. Inference Time | Memory Usage | Max Tokens/Second |
---|---|---|---|---|
7B | M1 Air (8GB) | 150-200ms/token | ~14GB | ~5-6 |
7B | M1 Pro (16GB) | 120-170ms/token | ~14GB | ~6-8 |
7B | M2 Air (16GB) | 100-150ms/token | ~14GB | ~7-10 |
7B | M2 Pro (32GB) | 80-130ms/token | ~14GB | ~8-12 |
13B | M1 Pro (32GB) | 250-350ms/token | ~28GB | ~3-4 |
13B | M2 Pro (32GB) | 200-300ms/token | ~28GB | ~3-5 |
Note: These figures are approximate and can vary based on specific configurations, prompt length, optimization techniques, and other factors. Always benchmark on your specific setup for accurate performance metrics.
Key Observations:
- Larger models (13B, 30B) require more memory and process fewer tokens per second.
- Newer M-series chips (M2, M3) generally offer better performance for the same model size.
- Models larger than 7B may not be practical on MacBooks with less than 32GB of RAM.
- The 30B model is only feasible on high-end configurations with significant RAM.
The quality and coherence of outputs generally improve with larger models, but this comes at the cost of increased computational requirements and slower inference times. Users should balance their specific needs for performance, output quality, and available hardware when selecting a model.
This server implements several security measures:
- HTTPS: All communication is encrypted using SSL/TLS.
- CORS: Cross-Origin Resource Sharing is configured to restrict access to specified origins.
- Middleware: The server uses Actix Web's middleware for logging and compression.
- SSL/TLS Configuration: The server uses Mozilla's intermediate configuration for SSL/TLS settings.
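Below is a minimal sketch of how this kind of setup is typically expressed with `actix-web`, `actix-cors`, and `openssl`. The exact wiring in this project may differ; the certificate paths, allowed origin, and max age shown here are illustrative, and routes are omitted (see the endpoint sketch above).

```rust
use actix_cors::Cors;
use actix_web::{middleware, App, HttpServer};
use openssl::ssl::{SslAcceptor, SslFiletype, SslMethod};

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    // Mozilla's intermediate TLS configuration, loading the PEM files
    // listed in the prerequisites (key.pem and cert.pem).
    let mut builder = SslAcceptor::mozilla_intermediate(SslMethod::tls()).unwrap();
    builder
        .set_private_key_file("key.pem", SslFiletype::PEM)
        .unwrap();
    builder.set_certificate_chain_file("cert.pem").unwrap();

    HttpServer::new(|| {
        // Restrict cross-origin requests to a configured origin and
        // enable logging and compression middleware.
        let cors = Cors::default()
            .allowed_origin("https://localhost")
            .allowed_methods(vec!["GET", "POST"])
            .max_age(3600);

        App::new()
            .wrap(cors)
            .wrap(middleware::Logger::default())
            .wrap(middleware::Compress::default())
    })
    .bind_openssl("localhost:8080", builder)?
    .run()
    .await
}
```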
Always ensure that your SSL certificates are up-to-date and properly configured. Keep your private keys secure and never commit them to version control.
This server supports multiple model architectures. The specific architecture can be configured in the settings; the supported architectures are determined by the `match_model_architecture` function in the utils module (see the sketch after the list below). The supported architectures are:
- Llama
- Bloom
- GPT2
- GptJ
- NeoX
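The implementation of `match_model_architecture` is not reproduced here; the following is only an illustrative sketch of mapping the `LLM_MODEL_ARCHITECTURE` configuration string onto an architecture value, using a local enum rather than whatever types the project actually uses:

```rust
/// Illustrative stand-in for the architectures listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ModelArchitecture {
    Llama,
    Bloom,
    Gpt2,
    GptJ,
    NeoX,
}

/// Map a configuration string (e.g. LLM_MODEL_ARCHITECTURE="llama")
/// onto an architecture, returning None for unsupported values.
fn match_model_architecture(name: &str) -> Option<ModelArchitecture> {
    match name.to_ascii_lowercase().as_str() {
        "llama" => Some(ModelArchitecture::Llama),
        "bloom" => Some(ModelArchitecture::Bloom),
        "gpt2" => Some(ModelArchitecture::Gpt2),
        "gptj" => Some(ModelArchitecture::GptJ),
        "neox" => Some(ModelArchitecture::NeoX),
        _ => None,
    }
}

fn main() {
    assert_eq!(match_model_architecture("llama"), Some(ModelArchitecture::Llama));
    assert_eq!(match_model_architecture("unknown"), None);
}
```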
Contributions to this project are welcome! Here's how you can contribute:
1. Fork the repository
2. Create a new branch for your feature or bug fix
3. Make your changes and commit them with clear, descriptive messages
4. Push your changes to your fork
5. Submit a pull request to the main repository
Please ensure your code adheres to the project's coding standards and include tests for new features or bug fixes.
- Add a test suite covering server and utilities;
- Add Docker support;
- Add inference cancelling support;
- Add database support for requests history;
- Add stream support for tokens;
IMPORTANT: READ BEFORE USE
- This software is provided "as is", without warranty of any kind.
- The authors are not liable for any damages or consequences arising from its use.
- Users are solely responsible for their use of this software and compliance with applicable laws.
- This software is not intended for use in critical systems where failure could cause harm.
- The authors do not support any illegal or unethical use of this software.
- Performance and outputs of language models may be unpredictable.
- Users are responsible for security and should thoroughly test any deployments.
USE OF THIS SOFTWARE INDICATES ACCEPTANCE OF THESE TERMS.