A Python-based document indexing and search tool that combines dense and sparse vector embeddings for multiple search methods (vector, sparse, hybrid), using Qdrant as the vector database backend and Streamlit for the browser frontend.
It provides an integrated framework for document indexing and retrieval, making use of:
- Dense Embeddings: For semantic similarity, using MLX and transformer models (BERT, BGE, E5, etc.)
- Sparse Embeddings: Using SPLADE for lexical / token-based retrieval with term expansion
- Hybrid Search: Combines dense and sparse approaches for improved retrieval quality
- MLX Acceleration: Uses Apple's MLX for efficient embedding generation on Apple Silicon
- Document Processing: Support for TXT, MD, HTML, PDF, JSON, and CSV files; chunking with token-limit enforcement; basic metadata extraction and payload storage
A Streamlit-based frontend enables indexing and searching from the browser.
- Python 3.7+
- Qdrant (either local or remote server)
- Optional but recommended:
  - MLX (for Apple Silicon acceleration)
  - mlx_embedding_models (for enhanced model support)
  - PyTorch (fallback when MLX is not available)
  - psutil (for memory monitoring)
  - tqdm (for progress bars)
  - streamlit (for the browser UI)
# Basic installation
pip install qdrant-client transformers
# Recommended additional packages
pip install mlx mlx_embedding_models PyPDF2 tqdm psutil torch streamlit
A Streamlit-based search interface simplifies interactive exploration and retrieval from indexed documents.
streamlit run mlxrag_ui.py
python mlxrag.py index documents_directory \
--use-mlx-models \
--dense-model bge-small \
--sparse-model distilbert-splade \
--collection my_documents
# Hybrid search (both dense and sparse)
python mlxrag.py search "your search query" \
--search-type hybrid \
--use-mlx-models
# Sparse-only search for lexical matching
python mlxrag.py search "exact terms to match" \
--search-type sparse
# Dense vector search for semantic similarity
python mlxrag.py search "semantic concept" \
--search-type vector
python mlxrag.py list-models
General options:
- --verbose: Enable detailed logging
- --storage-path: Path for local Qdrant storage

Index options:
- directory: Directory containing documents to index
- --include: File patterns to include (e.g., "*.pdf *.txt")
- --limit: Maximum number of files to index
- --host: Qdrant host (default: localhost)
- --port: Qdrant port (default: 6333)
- --collection: Collection name
- --model: Fallback model name
- --weights: Path to model weights
- --recreate: Recreate collection if it exists
- --dense-model: MLX dense embedding model name
- --sparse-model: MLX sparse embedding model name
- --top-k: Top-k tokens for sparse vectors
- --custom-repo-id: Custom model repository ID
- --custom-ndim: Custom embedding dimension
- --custom-pooling: Custom pooling strategy (mean, first, max)
- --custom-normalize: Normalize embeddings
- --custom-max-length: Custom max sequence length

Search options:
- query: Search query
- --search-type: Type of search (hybrid, vector, sparse, keyword)
- --limit: Maximum number of results
- --prefetch-limit: Prefetch limit for hybrid search
- --fusion: Fusion strategy (rrf, dbsf)
- --relevance-tuning: Apply relevance tuning
- --context-size: Size of context window for preview
- --score-threshold: Minimum score threshold for results
- --debug: Show detailed debug information
- --no-color: Disable colored output
Dense models:
- bge-micro: 3 layers, 384-dim
- gte-tiny: 6 layers, 384-dim
- minilm-l6: 6 layers, 384-dim
- bge-small: 12 layers, 384-dim
- bge-base: 12 layers, 768-dim
- bge-large: 24 layers, 1024-dim
- snowflake-lg: 24 layers, 1024-dim

Sparse models:
- distilbert-splade: 6 layers
- neuralcherche-sparse-embed: 6 layers
- opensearch: 6 layers
- naver-splade-distilbert: 6 layers
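When MLX is not available, dense embeddings fall back to PyTorch via transformers. A minimal sketch of that fallback path, assuming the bge-small registry name corresponds to the BAAI/bge-small-en-v1.5 checkpoint (384-dim; the actual mapping may differ):

```python
# Sketch of the PyTorch/transformers fallback for dense embeddings.
# Assumes "bge-small" maps to BAAI/bge-small-en-v1.5 (384-dim); this mapping is illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-small-en-v1.5")
model.eval()

def embed(texts, max_length=512):
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    # BGE-style models take the [CLS] token as the sentence embedding.
    emb = out.last_hidden_state[:, 0]
    # Normalize so cosine similarity reduces to a dot product.
    return torch.nn.functional.normalize(emb, p=2, dim=1)

vectors = embed(["hybrid search with dense and sparse vectors"])
print(vectors.shape)  # torch.Size([1, 384])
```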
# Index a directory with both dense and sparse embeddings
python mlxrag.py index ~/documents \
--include "*.pdf *.txt *.md" \
--use-mlx-models \
--dense-model bge-small \
--sparse-model distilbert-splade \
--collection docs_collection \
--verbose
# Run a hybrid search with relevance tuning
python mlxrag.py search "quantum computing applications" \
--search-type hybrid \
--use-mlx-models \
--dense-model bge-small \
--sparse-model distilbert-splade \
--collection docs_collection \
--limit 15 \
--prefetch-limit 50 \
--fusion rrf \
--context-size 400 \
--relevance-tuning
# Index using a custom model from Hugging Face
python mlxrag.py index ~/documents \
--use-mlx-models \
--custom-repo-id "my-org/my-custom-model" \
--custom-ndim 768 \
--custom-pooling first \
--collection custom_collection
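For context on the --custom-pooling option, the three strategies differ in how per-token hidden states are collapsed into a single vector. A rough illustration (hypothetical helper, not the tool's actual code):

```python
# Illustration of the three --custom-pooling strategies (hypothetical helper).
# token_embeddings: (seq_len, hidden_dim); attention_mask: (seq_len,) of 0/1.
import numpy as np

def pool(token_embeddings: np.ndarray, attention_mask: np.ndarray, strategy: str) -> np.ndarray:
    mask = attention_mask.astype(bool)
    if strategy == "first":  # first token ([CLS]) embedding
        return token_embeddings[0]
    if strategy == "mean":   # average over non-padding tokens
        return token_embeddings[mask].mean(axis=0)
    if strategy == "max":    # element-wise maximum over non-padding tokens
        return token_embeddings[mask].max(axis=0)
    raise ValueError(f"unknown pooling strategy: {strategy}")
```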
- Document Processing:
  - Files are extracted based on their format
  - Text is split into chunks respecting token limits (see the chunking sketch below)
  - Each chunk is processed for embedding
- Vector Generation:
  - Dense vectors capture semantic meaning
  - Sparse vectors capture lexical information with term expansion
  - Vectors are optimized for storage efficiency
- Indexing:
  - Vectors are stored in Qdrant with appropriate configuration
  - Metadata is preserved for filtering and display
- Search:
  - Queries are processed similarly to documents
  - Various search strategies can be employed (see the hybrid query sketch below)
  - Results are ranked and formatted for display
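As a rough illustration of the Document Processing step, chunking against a token limit can be done with the embedding model's tokenizer; the helper and parameter values below are illustrative, not the tool's actual code:

```python
# Sketch of token-limit-aware chunking (hypothetical helper and parameters).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")

def chunk_text(text: str, max_tokens: int = 512, overlap: int = 50):
    ids = tokenizer.encode(text, add_special_tokens=False)
    step = max_tokens - overlap
    for start in range(0, len(ids), step):
        # Each chunk stays within the model's token limit, with some overlap.
        yield tokenizer.decode(ids[start:start + max_tokens])
        if start + max_tokens >= len(ids):
            break
```

For the Indexing and Search steps, a minimal sketch using a recent qdrant-client (Query API) with named dense and sparse vectors and RRF fusion; the collection name, vector names, and placeholder values are illustrative, not the tool's actual schema:

```python
# Sketch: a collection with named dense + sparse vectors and an RRF hybrid query.
# Names ("docs", "dense", "sparse") and values are illustrative.
from qdrant_client import QdrantClient, models

client = QdrantClient(host="localhost", port=6333)

client.create_collection(
    collection_name="docs",
    vectors_config={"dense": models.VectorParams(size=384, distance=models.Distance.COSINE)},
    sparse_vectors_config={"sparse": models.SparseVectorParams()},
)

# One chunk, with both embeddings plus payload metadata for filtering and display.
client.upsert(
    collection_name="docs",
    points=[
        models.PointStruct(
            id=1,
            vector={
                "dense": [0.1] * 384,  # placeholder dense embedding
                "sparse": models.SparseVector(indices=[12, 345], values=[0.8, 0.3]),
            },
            payload={"path": "example.txt", "chunk": 0, "text": "..."},
        )
    ],
)

# Hybrid search: prefetch dense and sparse candidates, then fuse the lists with RRF.
results = client.query_points(
    collection_name="docs",
    prefetch=[
        models.Prefetch(query=[0.1] * 384, using="dense", limit=50),
        models.Prefetch(
            query=models.SparseVector(indices=[12, 345], values=[0.8, 0.3]),
            using="sparse",
            limit=50,
        ),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=10,
)
```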
This implementation uses SPLADE (Sparse Lexical and Expansion Model) for generating sparse vectors, which provides several advantages over simple bag-of-words approaches:
- Term Expansion: Includes semantically related terms not in the original text
- Learned Weights: Assigns importance to terms based on context
- Efficient Storage: Only non-zero values are stored
- Interpretable Results: Each dimension corresponds to a specific token
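A minimal sketch of how a SPLADE-style sparse vector can be produced from a masked-language-model head, using naver/splade-cocondenser-ensembledistil as an assumed example checkpoint (the tool's own model registry and code may differ):

```python
# Sketch of SPLADE sparse vector generation: log(1 + ReLU(MLM logits)),
# max-pooled over the sequence, keeping only the non-zero top-k dimensions.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "naver/splade-cocondenser-ensembledistil"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)
model.eval()

def splade_vector(text: str, top_k: int = 64):
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**batch).logits  # (1, seq_len, vocab_size)
    # Term weights: max over positions of log(1 + ReLU(logit)), masking padding.
    weights = torch.log1p(torch.relu(logits))
    weights = (weights * batch["attention_mask"].unsqueeze(-1)).max(dim=1).values.squeeze(0)
    values, indices = torch.topk(weights, k=top_k)
    keep = values > 0
    return indices[keep].tolist(), values[keep].tolist()

indices, values = splade_vector("quantum computing applications")
# Each index corresponds to a vocabulary token, which keeps the vector interpretable.
```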
MIT
This tool builds upon several open-source projects:
- Qdrant for vector storage and search
- MLX for efficient embedding computation (esp. mlx-examples, where we modified the BERT approach: https://github.com/CrispStrobe/mlx-examples/tree/main/bert)
- SPLADE for sparse vector generation
- Transformers for model loading and tokenization
- mlx_embedding_models