VLLaMA is a convenient command-line interface for vLLM that provides an Ollama-like experience on top of the high-performance vLLM backend. It supports multi-GPU configurations and automatic memory optimization.
- Ollama-like interface - familiar `pull`, `list`, `serve`, `rm` commands
- Multi-GPU support - automatic utilization of multiple GPUs
- Memory optimization - intelligent memory distribution across GPUs
- OpenAI-compatible API - full compatibility with OpenAI clients
- Automatic error handling - workarounds for torch.compile and libcuda issues
- Easy to use - single command to download and run models
- Python 3.8+
- CUDA-compatible GPU (multi-GPU configurations supported)
- vLLM 0.4.0+
# Install vLLM
pip install vllm
# Clone the repository
git clone https://github.com/kshvakov/vllama.git
cd vllama
# Copy the script to your PATH
sudo cp vllama /usr/local/bin/
sudo chmod +x /usr/local/bin/vllama
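With the script installed, a quick sanity check can confirm that both vLLM and the vllama command are visible (a minimal sketch, assuming a standard Python environment):

```python
# Verify the vLLM package and that the vllama script is on PATH.
import shutil

import vllm

print("vLLM version:", vllm.__version__)           # should be 0.4.0 or newer
print("vllama found at:", shutil.which("vllama"))  # None means it is not on PATH
```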
# Create configuration directory
mkdir -p ~/.vllama
# Create default configuration
cat > ~/.vllama/config << 'EOF'
VLLM_GPU_MEMORY_UTILIZATION=0.85
VLLM_MAX_MODEL_LEN=8192
VLLM_TENSOR_PARALLEL_SIZE=2
VLLM_GPU_DEVICES="0,1"
VLLM_DISABLE_COMPILE=1
EOF

# Download manually and add to your PATH
wget https://github.com/kshvakov/vllama/raw/main/vllama
chmod +x vllama
sudo mv vllama /usr/local/bin/
# Or run directly from the cloned directory
./vllama --help

# Download a model
vllama pull mistralai/Mistral-7B-Instruct-v0.1
# List available models
vllama list
# Start API server
vllama serve mistralai-mistral-7b-instruct-v0.1
# Stop server
vllama stop
# Check status
vllama status
# Remove model from registry
vllama rm mistralai-mistral-7b-instruct-v0.1

# 1. Download model
vllama pull codellama/CodeLlama-7b-Instruct-hf
# 2. Check available models
vllama list
# 3. Start server
vllama serve codellama-codellama-7b-instruct-hf
# 4. Check status (in another terminal)
vllama status
# 5. Test API
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama-codellama-7b-instruct-hf",
    "prompt": "def fibonacci(n):",
    "max_tokens": 100
  }'
# 6. Stop server
vllama stop

The configuration file is located at ~/.vllama/config:
# GPU memory settings (0.1-1.0)
VLLM_GPU_MEMORY_UTILIZATION=0.85
# Maximum context length
VLLM_MAX_MODEL_LEN=8192
# Number of GPUs for tensor parallelism
VLLM_TENSOR_PARALLEL_SIZE=2
# Which GPUs to use (comma-separated)
VLLM_GPU_DEVICES="0,1"
# Disable torch.compile to fix libcuda errors
VLLM_DISABLE_COMPILE=1
# Quantization (awq, gptq) - optional
# VLLM_QUANTIZATION=awq

# Show current configuration
vllama config

# Show GPU information
vllama gpu-info
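For reference, these settings map naturally onto vLLM's own engine arguments. The sketch below uses the vLLM Python API directly with illustrative values; the exact pass-through logic (and how VLLM_DISABLE_COMPILE is applied) is up to the vllama script:

```python
# Illustrative mapping of the config values onto vLLM engine arguments.
# Assumes vLLM is installed and the GPUs listed in VLLM_GPU_DEVICES exist.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # VLLM_GPU_DEVICES, most likely applied this way

from vllm import LLM

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.1",  # illustrative model
    gpu_memory_utilization=0.85,  # VLLM_GPU_MEMORY_UTILIZATION
    max_model_len=8192,           # VLLM_MAX_MODEL_LEN
    tensor_parallel_size=2,       # VLLM_TENSOR_PARALLEL_SIZE
    enforce_eager=True,           # one way to avoid compile-related issues (cf. VLLM_DISABLE_COMPILE)
)

outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
```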
# 7B models - optimal for 2x12GB
vllama pull mistralai/Mistral-7B-Instruct-v0.1
vllama pull meta-llama/Llama-2-7b-chat-hf
# 13B models - work well with quantization
vllama pull codellama/CodeLlama-13b-Instruct-hf
# Code generation
vllama pull codellama/CodeLlama-7b-Python-hf

# Change config for single GPU
echo 'VLLM_TENSOR_PARALLEL_SIZE=1' >> ~/.vllama/config
echo 'VLLM_GPU_DEVICES="0"' >> ~/.vllama/config
vllama pull microsoft/DialoGPT-medium
vllama pull mistralai/Mistral-7B-v0.1

If you encounter a `cannot find -lcuda` error, VLLaMA automatically disables the problematic optimizations:
# Check libcuda availability
vllama serve your-model # automatic fix enabled
# Or install manually
sudo apt install nvidia-cuda-toolkit libcuda1
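To confirm that the CUDA driver is actually visible to PyTorch (which vLLM builds on), a quick check like this can narrow the problem down (assumes torch is installed, which it is whenever vLLM is):

```python
# Quick CUDA visibility check; False / 0 here points at a driver-level
# problem rather than anything in vLLM or VLLaMA.
import torch

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
print("Torch CUDA build:", torch.version.cuda)
```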
VLLaMA automatically detects available memory and adjusts parameters:

# Automatic configuration on startup
vllama serve large-model
# Manual configuration via config
echo "VLLM_GPU_MEMORY_UTILIZATION=0.7" >> ~/.vllama/config# Check GPU configuration
vllama gpu-info
# Force single GPU usage
VLLM_TENSOR_PARALLEL_SIZE=1 VLLM_GPU_DEVICES="0" vllama serve model-name

When the server is running, the following endpoints are available:
- `http://localhost:8000/v1/completions` - Text completion
- `http://localhost:8000/v1/chat/completions` - Chat completion
- `http://localhost:8000/v1/models` - List models
- `http://localhost:8000/docs` - OpenAPI documentation
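For example, you can query the model list with nothing beyond the Python standard library (a minimal sketch, assuming the server is running on the default port 8000):

```python
# List the models registered with the running server.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    print(json.dumps(json.load(resp), indent=2))
```

Because the API is OpenAI-compatible, the official OpenAI Python client works against it as well: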
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123"
)

response = client.completions.create(
    model="your-model-name",
    prompt="Explain AI in simple terms:",
    max_tokens=100
)

print(response.choices[0].text)
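Since the server also exposes /v1/chat/completions, the same client can be used for chat-style requests (a short sketch; the model name is a placeholder for whatever `vllama list` shows):

```python
from openai import OpenAI

# Connect to the local VLLaMA/vLLM server (same base URL as above).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

response = client.chat.completions.create(
    model="your-model-name",  # placeholder: use a name from `vllama list`
    messages=[{"role": "user", "content": "Explain AI in simple terms."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```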
| Feature | Ollama | VLLaMA + vLLM |
|---|---|---|
| Performance | 🟡 Average | 🟢 High (2-4x) |
| Memory | 🟡 Static allocation | 🟢 Dynamic (PagedAttention) |
| Multi-GPU | ❌ No | 🟢 Full support |
| API | 🔵 Custom format | 🟢 OpenAI-compatible |
| Simplicity | 🟢 Very simple | 🟢 Simple |
Mistral-7B-Instruct:
- Throughput: 45 tokens/second
- Parallel requests: 16
- Memory usage: 6.7GB per GPU
CodeLlama-13B (with quantization):
- Throughput: 28 tokens/second
- Parallel requests: 8
- Memory usage: 8.1GB per GPU
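These numbers depend heavily on hardware, context length, and settings. A rough way to reproduce a throughput figure yourself is to send a batch of concurrent requests and divide the generated tokens by the elapsed time (a minimal sketch, assuming the server is running locally and the `openai` package is installed; the model name is a placeholder):

```python
# Rough throughput probe: send N concurrent completion requests and
# compute generated tokens per second from the reported usage.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

def one_request(_):
    response = client.completions.create(
        model="your-model-name",  # placeholder: use a name from `vllama list`
        prompt="Explain AI in simple terms:",
        max_tokens=128,
    )
    return response.usage.completion_tokens

start = time.time()
with ThreadPoolExecutor(max_workers=16) as pool:
    total_tokens = sum(pool.map(one_request, range(16)))
elapsed = time.time() - start

print(f"{total_tokens} tokens in {elapsed:.1f}s -> {total_tokens / elapsed:.1f} tokens/second")
```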
We welcome contributions! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is distributed under the MIT License. See LICENSE file for details.
- vLLM - High-performance LLM inference backend
- Ollama - Inspiration for user interface
- Hugging Face - Models and infrastructure
VLLaMA - Combining Ollama's simplicity with vLLM's power! 🚀