MHLE-RAG: Multiscale, Hierarchical & Layered Embeddings for RAG

Overview

MHLE-RAG (MuLeRAG) is a prototype for parsing, analyzing, and querying codebases written in multiple programming languages at several scales: repository, file, class, and function. It parses code with Tree-sitter and uses embedding-based search to support natural language code queries and retrieval-augmented generation.

Key Features

Multiscale Analysis 🔍: Examines code at repository, file, class, and function levels.
Hierarchical Processing 🏗️: Recognizes and utilizes the structured nature of code repositories.
Layered Embeddings 🧩: Creates rich, contextual embeddings that capture code semantics at multiple granularities.
Multi-Language Support 🌐: Parses and analyzes code in Java, Kotlin, JavaScript, Go, Python, C++, C, and Swift.
Intelligent Querying 🤖: Allows natural language queries to find relevant code snippets across the codebase.
Augmented Generation 🚀: Utilizes retrieved context to enhance code generation capabilities.
Dependency Analysis 🕸️: Generates comprehensive dependency graphs at various scales.
Requirements Integration 📝: Optionally processes and integrates software requirements for holistic analysis.
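
To make the file, class, and function levels concrete, the sketch below extracts that hierarchy from a Python source string. It uses Python's standard `ast` module as a stand-in for Tree-sitter so that it runs without building any grammar. The `extract_units` helper and the record layout are illustrative assumptions, not MHLE-RAG's actual traverser.

```python
import ast


def extract_units(source: str, file_path: str) -> list[dict]:
    """Return one record per file, class, and function found in `source`.

    Hypothetical helper: stdlib `ast` stands in for a Tree-sitter traverser.
    """
    tree = ast.parse(source)
    units = [{"level": "file", "name": file_path, "parent": None}]

    def visit(node: ast.AST, parent: str) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                qualified = f"{parent}::{child.name}"
                units.append({"level": "class", "name": qualified, "parent": parent})
                visit(child, qualified)  # methods become children of the class
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                qualified = f"{parent}::{child.name}"
                units.append({"level": "function", "name": qualified, "parent": parent})
                # Nested definitions inside functions are skipped in this sketch.
            else:
                visit(child, parent)

    visit(tree, file_path)
    return units


SAMPLE = '''
class Greeter:
    def greet(self, name):
        return "Hello, " + name

def main():
    print(Greeter().greet("world"))
'''

units = extract_units(SAMPLE, "greeter.py")
for unit in units:
    print(unit["level"], unit["name"])
```

Each record keeps a pointer to its parent, which is the piece of information a hierarchical index needs in order to move from a function hit up to its class or file.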

Components

Tree-sitter Integration 🌳: Uses Tree-sitter grammars for accurate code parsing.
Multiscale AST Traversers 🛠️: Custom-written for each supported language to extract relevant code information at multiple levels.
Layered Embedding Generation 📚: Uses the configured embedding models (CODE_EMBEDDING_MODEL and REQUIREMENT_EMBEDDING_MODEL) to build hierarchical code representations.
Retrieval Augmented Query Engine 🔍: Implements similarity search on layered embeddings for efficient and context-aware code retrieval.
Multiscale Graph Generation 🗺️: Creates JSON representations of code dependencies at various levels of granularity.
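
As a rough illustration of similarity search over layered embeddings, the sketch below ranks entries separately within each level using cosine similarity. The index layout, the `search_layers` name, and the three-dimensional toy vectors are assumptions made for the example; the project's actual storage format and ranking logic may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_layers(query_vec, index, top_k=2):
    """Rank entries within each level; return {level: [(score, name), ...]}."""
    by_level = {}
    for entry in index:
        score = cosine(query_vec, entry["vector"])
        by_level.setdefault(entry["level"], []).append((score, entry["name"]))
    return {
        level: sorted(hits, reverse=True)[:top_k]
        for level, hits in by_level.items()
    }


# Toy index: real embeddings would come from the embedding model.
INDEX = [
    {"level": "file", "name": "auth.py", "vector": [0.9, 0.1, 0.0]},
    {"level": "file", "name": "report.py", "vector": [0.0, 0.2, 0.9]},
    {"level": "function", "name": "auth.py::login", "vector": [1.0, 0.0, 0.0]},
    {"level": "function", "name": "report.py::render", "vector": [0.0, 0.0, 1.0]},
]

results = search_layers([1.0, 0.0, 0.0], INDEX, top_k=1)
for level, hits in results.items():
    print(level, hits[0][1])
```

Ranking per level rather than globally keeps a coarse file-level match from crowding out a precise function-level match, or the other way around.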

Setup

Initialize Tree-sitter grammars:

```
python grammar_utils/language_grammar_builder.py
```

Install required Python packages:

```
pip install -r requirements.txt
```

Configure the Ollama backend, or adjust EMBEDDING_API_URL and LLM_API_URL to point at a different backend.
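
One possible way to handle the two URLs is sketched below. It is not stated here whether MHLE-RAG reads them from the environment or defines them as constants in the script, so the environment-variable lookup and the `load_endpoints` helper are assumptions. The defaults use Ollama's standard local port (11434) and its `/api/embeddings` and `/api/generate` endpoints.

```python
import os

# Assumed defaults: a local Ollama server on its standard port.
DEFAULT_EMBEDDING_API_URL = "http://localhost:11434/api/embeddings"
DEFAULT_LLM_API_URL = "http://localhost:11434/api/generate"


def load_endpoints(env=None):
    """Return the two endpoint URLs, letting the environment override defaults.

    Hypothetical helper; `env` is injectable so the lookup can be tested
    without touching the real process environment.
    """
    env = os.environ if env is None else env
    return {
        "EMBEDDING_API_URL": env.get("EMBEDDING_API_URL", DEFAULT_EMBEDDING_API_URL),
        "LLM_API_URL": env.get("LLM_API_URL", DEFAULT_LLM_API_URL),
    }


endpoints = load_endpoints()
print(endpoints["EMBEDDING_API_URL"])
print(endpoints["LLM_API_URL"])
```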

Usage

Process a codebase:

```
python mhle_rag.py process --root_dir /path/to/your/codebase
```

(Optional) Process requirements:

```
python mhle_rag.py process_requirements --requirements_csv /path/to/requirements.csv
```

Query the processed codebase:

```
python mhle_rag.py query
```

Key Files

mhle_rag.py: Main script for processing, querying, and generation.
grammar_utils/ast_traversers.py: Contains language-specific multiscale AST traversal logic.
assets/: Directory where processed data (embeddings, multiscale graphs) is stored.

Customization

Extend LANGUAGE_DATA in the main script to add or modify supported languages.
Adjust embedding models by modifying CODE_EMBEDDING_MODEL and REQUIREMENT_EMBEDDING_MODEL.
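
The real structure of LANGUAGE_DATA is defined in mhle_rag.py and is not reproduced here. Purely as an illustration of what extending a language table could look like, the sketch below assumes a hypothetical shape that maps a language name to its file extensions; the `extensions` key and the `language_for` helper are invented for the example. Since the AST traversers are written per language, a new language would presumably also need a Tree-sitter grammar and a matching traverser.

```python
import os

# Hypothetical shape; check mhle_rag.py for the actual fields.
LANGUAGE_DATA = {
    "python": {"extensions": [".py"]},
    "go": {"extensions": [".go"]},
}

# Adding a language under this assumed shape is a single new entry.
LANGUAGE_DATA["rust"] = {"extensions": [".rs"]}


def language_for(path):
    """Return the language whose extension list matches `path`, else None."""
    ext = os.path.splitext(path)[1].lower()
    for name, data in LANGUAGE_DATA.items():
        if ext in data["extensions"]:
            return name
    return None


print(language_for("src/main.rs"))
print(language_for("README.md"))
```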

Advanced Features

Hierarchical Querying 🏙️: Implements a multi-level approach to code retrieval, considering repo, file, class, and function levels.
Dynamic Multiscale Graph Building 🖼️: Constructs graphs of query results to visualize code relationships across different scales.
Context-Aware Extended Retrieval 🔎: Uses hierarchical dependency information to intelligently broaden the search scope.
Augmented Code Generation 💡: Leverages retrieved context to generate or suggest code improvements.
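
The idea behind context-aware extended retrieval can be sketched as a bounded graph walk: start from the units returned by similarity search and follow dependency edges for a limited number of hops. The JSON layout (a unit name mapped to the units it depends on), the `expand_hits` name, and the sample graph are assumptions for illustration, not the format MHLE-RAG writes to assets/.

```python
import json
from collections import deque


def expand_hits(hits, graph, max_hops=1):
    """Breadth-first expansion of `hits` along dependency edges.

    Returns {unit_name: hop_distance}; the initial hits have distance 0.
    """
    seen = {hit: 0 for hit in hits}
    queue = deque(hits)
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand past the hop limit
        for dep in graph.get(node, []):
            if dep not in seen:
                seen[dep] = seen[node] + 1
                queue.append(dep)
    return seen


# Toy dependency graph in an assumed adjacency-list JSON layout.
GRAPH = json.loads("""
{
  "api.py::handle_request": ["auth.py::login", "db.py::fetch_user"],
  "auth.py::login": ["db.py::fetch_user", "crypto.py::verify"],
  "db.py::fetch_user": []
}
""")

expanded = expand_hits(["api.py::handle_request"], GRAPH, max_hops=1)
for name, hops in expanded.items():
    print(hops, name)
```

Keeping the hop distance alongside each unit lets a caller rank direct hits above units that were only pulled in through dependencies.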

Notes

Ensure sufficient computational resources and disk space for processing and storing multiscale embeddings and hierarchical data.
Retrieval quality depends on the quality of the embedding models and on how well structured your codebase is.
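
For a rough sense of the disk space involved, the raw size of the embeddings can be estimated as units × dimensions × bytes per value. The figures below (50,000 embedded units, 768 dimensions, 32-bit floats) are hypothetical inputs chosen for the arithmetic, not measurements of MHLE-RAG. Text-based storage such as JSON would take noticeably more space than this raw estimate.

```python
def embedding_storage_bytes(num_units, dim, bytes_per_value=4):
    """Raw size of `num_units` vectors of length `dim` (float32 by default)."""
    return num_units * dim * bytes_per_value


# Hypothetical workload: 50,000 units (files, classes, functions), 768-dim vectors.
size = embedding_storage_bytes(50_000, 768)
print(f"{size} bytes = {size / 1024**2:.1f} MiB")
```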

itsPreto/MhLe.RAG