VLLaMA is a convenient command-line interface for vLLM that provides an Ollama-like experience on top of the high-performance vLLM backend. It supports multi-GPU configurations and automatic memory optimization.
- Ollama-like interface - familiar `pull`, `list`, `serve`, `rm` commands
- Multi-GPU support - automatic utilization of multiple GPUs
- Memory optimization - intelligent memory distribution across GPUs
- OpenAI-compatible API - full compatibility with OpenAI clients
- Automatic error handling - workarounds for torch.compile and libcuda issues
- Easy to use - single command to download and run models
- Python 3.8+
- CUDA-compatible GPU (multi-GPU configurations supported)
- vLLM 0.4.0+
# Install vLLM
pip install vllm
# Clone the repository
git clone https://github.com/kshvakov/vllama.git
cd vllama
# Copy the script to your PATH
sudo cp vllama /usr/local/bin/
sudo chmod +x /usr/local/bin/vllama
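With the script installed, a quick sanity check can confirm that both vLLM and the vllama command are visible (a minimal sketch, assuming a standard Python environment):

```python
# Verify the vLLM package and that the vllama script is on PATH.
import shutil

import vllm

print("vLLM version:", vllm.__version__)           # should be 0.4.0 or newer
print("vllama found at:", shutil.which("vllama"))  # None means it is not on PATH
```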
# Create configuration directory
mkdir -p ~/.vllama
# Create default configuration
cat > ~/.vllama/config << 'EOF'
VLLM_GPU_MEMORY_UTILIZATION=0.85
VLLM_MAX_MODEL_LEN=8192
VLLM_TENSOR_PARALLEL_SIZE=2
VLLM_GPU_DEVICES="0,1"
VLLM_DISABLE_COMPILE=1
EOF

# Download manually and add to your PATH
wget https://github.com/kshvakov/vllama/raw/main/vllama
chmod +x vllama
sudo mv vllama /usr/local/bin/
# Or run directly from the cloned directory
./vllama --help

# Download a model
vllama pull mistralai/Mistral-7B-Instruct-v0.1
# List available models
vllama list
# Start API server
vllama serve mistralai-mistral-7b-instruct-v0.1
# Stop server
vllama stop
# Check status
vllama status
# Remove model from registry
vllama rm mistralai-mistral-7b-instruct-v0.1

# 1. Download model
vllama pull codellama/CodeLlama-7b-Instruct-hf
# 2. Check available models
vllama list
# 3. Start server
vllama serve codellama-codellama-7b-instruct-hf
# 4. Check status (in another terminal)
vllama status
# 5. Test API
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama-codellama-7b-instruct-hf",
    "prompt": "def fibonacci(n):",
    "max_tokens": 100
  }'
# 6. Stop server
vllama stop

The configuration file is located at ~/.vllama/config:
# GPU memory settings (0.1-1.0)
VLLM_GPU_MEMORY_UTILIZATION=0.85
# Maximum context length
VLLM_MAX_MODEL_LEN=8192
# Number of GPUs for tensor parallelism
VLLM_TENSOR_PARALLEL_SIZE=2
# Which GPUs to use (comma-separated)
VLLM_GPU_DEVICES="0,1"
# Disable torch.compile to fix libcuda errors
VLLM_DISABLE_COMPILE=1
# Quantization (awq, gptq) - optional
# VLLM_QUANTIZATION=awq

# Show current configuration
vllama config

# Show GPU information
vllama gpu-info
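For reference, these settings map naturally onto vLLM's own engine arguments. The sketch below uses the vLLM Python API directly with illustrative values; the exact pass-through logic (and how VLLM_DISABLE_COMPILE is applied) is up to the vllama script:

```python
# Illustrative mapping of the config values onto vLLM engine arguments.
# Assumes vLLM is installed and the GPUs listed in VLLM_GPU_DEVICES exist.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # VLLM_GPU_DEVICES, most likely applied this way

from vllm import LLM

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.1",  # illustrative model
    gpu_memory_utilization=0.85,  # VLLM_GPU_MEMORY_UTILIZATION
    max_model_len=8192,           # VLLM_MAX_MODEL_LEN
    tensor_parallel_size=2,       # VLLM_TENSOR_PARALLEL_SIZE
    enforce_eager=True,           # one way to avoid compile-related issues (cf. VLLM_DISABLE_COMPILE)
)

outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
```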
# 7B models - optimal for 2x12GB
vllama pull mistralai/Mistral-7B-Instruct-v0.1
vllama pull meta-llama/Llama-2-7b-chat-hf
# 13B models - work well with quantization
vllama pull codellama/CodeLlama-13b-Instruct-hf
# Code generation
vllama pull codellama/CodeLlama-7b-Python-hf

# Change config for single GPU
echo 'VLLM_TENSOR_PARALLEL_SIZE=1' >> ~/.vllama/config
echo 'VLLM_GPU_DEVICES="0"' >> ~/.vllama/config
vllama pull microsoft/DialoGPT-medium
vllama pull mistralai/Mistral-7B-v0.1

If you encounter a `cannot find -lcuda` error, VLLaMA automatically disables the problematic optimizations:
# Check libcuda availability
vllama serve your-model # automatic fix enabled
# Or install manually
sudo apt install nvidia-cuda-toolkit libcuda1
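To confirm that the CUDA driver is actually visible to PyTorch (which vLLM builds on), a quick check like this can narrow the problem down (assumes torch is installed, which it is whenever vLLM is):

```python
# Quick CUDA visibility check; False / 0 here points at a driver-level
# problem rather than anything in vLLM or VLLaMA.
import torch

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
print("Torch CUDA build:", torch.version.cuda)
```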
VLLaMA automatically detects available memory and adjusts parameters:

# Automatic configuration on startup
vllama serve large-model
# Manual configuration via config
echo "VLLM_GPU_MEMORY_UTILIZATION=0.7" >> ~/.vllama/config# Check GPU configuration
vllama gpu-info
# Force single GPU usage
VLLM_TENSOR_PARALLEL_SIZE=1 VLLM_GPU_DEVICES="0" vllama serve model-name

When the server is running, the following endpoints are available:
- `http://localhost:8000/v1/completions` - Text completion
- `http://localhost:8000/v1/chat/completions` - Chat completion
- `http://localhost:8000/v1/models` - List models
- `http://localhost:8000/docs` - OpenAPI documentation
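For example, you can query the model list with nothing beyond the Python standard library (a minimal sketch, assuming the server is running on the default port 8000):

```python
# List the models registered with the running server.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    print(json.dumps(json.load(resp), indent=2))
```

Because the API is OpenAI-compatible, the official OpenAI Python client works against it as well: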
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123"
)

response = client.completions.create(
    model="your-model-name",
    prompt="Explain AI in simple terms:",
    max_tokens=100
)

print(response.choices[0].text)
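Since the server also exposes /v1/chat/completions, the same client can be used for chat-style requests (a short sketch; the model name is a placeholder for whatever `vllama list` shows):

```python
from openai import OpenAI

# Connect to the local VLLaMA/vLLM server (same base URL as above).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

response = client.chat.completions.create(
    model="your-model-name",  # placeholder: use a name from `vllama list`
    messages=[{"role": "user", "content": "Explain AI in simple terms."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```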
| Feature | Ollama | VLLaMA + vLLM |
|---|---|---|
| Performance | 🟡 Average | 🟢 High (2-4x) |
| Memory | 🟡 Static allocation | 🟢 Dynamic (PagedAttention) |
| Multi-GPU | ❌ No | 🟢 Full support |
| API | 🔵 Custom format | 🟢 OpenAI-compatible |
| Simplicity | 🟢 Very simple | 🟢 Simple |
Mistral-7B-Instruct:
- Throughput: 45 tokens/second
- Parallel requests: 16
- Memory usage: 6.7GB per GPU
CodeLlama-13B (with quantization):
- Throughput: 28 tokens/second
- Parallel requests: 8
- Memory usage: 8.1GB per GPU
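These numbers depend heavily on hardware, context length, and settings. A rough way to reproduce a throughput figure yourself is to send a batch of concurrent requests and divide the generated tokens by the elapsed time (a minimal sketch, assuming the server is running locally and the `openai` package is installed; the model name is a placeholder):

```python
# Rough throughput probe: send N concurrent completion requests and
# compute generated tokens per second from the reported usage.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

def one_request(_):
    response = client.completions.create(
        model="your-model-name",  # placeholder: use a name from `vllama list`
        prompt="Explain AI in simple terms:",
        max_tokens=128,
    )
    return response.usage.completion_tokens

start = time.time()
with ThreadPoolExecutor(max_workers=16) as pool:
    total_tokens = sum(pool.map(one_request, range(16)))
elapsed = time.time() - start

print(f"{total_tokens} tokens in {elapsed:.1f}s -> {total_tokens / elapsed:.1f} tokens/second")
```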
We welcome contributions! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is distributed under the MIT License. See LICENSE file for details.
- vLLM - High-performance LLM inference backend
- Ollama - Inspiration for user interface
- Hugging Face - Models and infrastructure
VLLaMA - Combining Ollama's simplicity with vLLM's power! 🚀