Graph-Code: A Graph-Based RAG System for Python Codebases

A Retrieval-Augmented Generation (RAG) system that analyzes Python repositories, builds a knowledge graph of their structure, and lets you query that structure and its relationships in natural language.

🚀 Features

AST-based Code Analysis: Deep parsing of Python files to extract classes, functions, methods, and their relationships
Knowledge Graph Storage: Uses Memgraph to store codebase structure as an interconnected graph
Natural Language Querying: Ask questions about your codebase in plain English
AI-Powered Cypher Generation: Uses Google Gemini to translate natural-language questions into Cypher queries
Code Snippet Retrieval: Retrieves the source code of the functions and methods that a query returns
Dependency Analysis: Parses pyproject.toml to understand external dependencies

🏗️ Architecture

The system consists of two main components:

Repository Parser (repo_parser.py): Analyzes Python codebases and ingests data into Memgraph
RAG System (codebase_rag/): Interactive CLI for querying the stored knowledge graph

Core Components

Graph Database: Memgraph for storing code structure as nodes and relationships
LLM Integration: Google Gemini for natural language processing
Code Analysis: AST traversal for extracting code elements
Query Tools: Specialized tools for graph querying and code retrieval
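
To make the AST traversal concrete, here is a minimal sketch that uses the standard `ast` module to pull classes, module-level functions, and methods out of a source string. The helper name `extract_definitions` and the output shape are assumptions for illustration; the actual parser may walk the tree differently and record more detail.

```python
import ast


def extract_definitions(source: str) -> dict[str, list[str]]:
    """Collect class, module-level function, and method names from source."""
    tree = ast.parse(source)
    found: dict[str, list[str]] = {"classes": [], "functions": [], "methods": []}
    function_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    for node in tree.body:  # top-level statements only
        if isinstance(node, function_types):
            found["functions"].append(node.name)
        elif isinstance(node, ast.ClassDef):
            found["classes"].append(node.name)
            for item in node.body:  # direct children of the class body
                if isinstance(item, function_types):
                    found["methods"].append(f"{node.name}.{item.name}")
    return found


sample = '''
class User:
    def login(self): ...
    async def refresh(self): ...

def connect(): ...
'''
print(extract_definitions(sample))
```

This sketch deliberately ignores nested classes and functions defined inside other functions, which keeps the mapping to Class, Function, and Method nodes one-to-one.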

📋 Prerequisites

Python 3.12+
Docker & Docker Compose (for Memgraph)
Google Gemini API key
uv package manager

🛠️ Installation

Clone the repository:

git clone <repository-url>
cd graph-code

Install dependencies:

uv sync

Set up environment variables:

cp .env.example .env
# Edit .env with your configuration

Required environment variables:

GEMINI_API_KEY=your-api-key
GEMINI_MODEL_ID=gemini-2.5-pro
MODEL_CYPHER_ID=gemini-2.5-flash-lite-preview-06-17
MEMGRAPH_HOST=localhost
MEMGRAPH_PORT=7687

Start Memgraph database:

docker-compose up -d

🎯 Usage

Step 1: Parse a Repository

Parse and ingest a Python repository into the knowledge graph:

python repo_parser.py /path/to/your/python/repo --clean

Options:

--clean: Clear existing data before parsing
--host: Memgraph host (default: localhost)
--port: Memgraph port (default: 7687)
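
For readers who want to see how these options fit together, the following `argparse` sketch mirrors the documented flags and defaults. It is an approximation, not the contents of repo_parser.py; in particular the positional argument name `repo_path` is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented repo_parser.py options."""
    parser = argparse.ArgumentParser(
        prog="repo_parser.py",
        description="Parse a Python repository and ingest it into Memgraph.",
    )
    # Hypothetical name for the positional repository path.
    parser.add_argument("repo_path", help="Path to the repository to parse")
    parser.add_argument(
        "--clean", action="store_true", help="Clear existing data before parsing"
    )
    parser.add_argument("--host", default="localhost", help="Memgraph host")
    parser.add_argument("--port", type=int, default=7687, help="Memgraph port")
    return parser


args = build_parser().parse_args(["/path/to/your/python/repo", "--clean"])
print(args.repo_path, args.clean, args.host, args.port)
```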

Step 2: Query the Codebase

Start the interactive RAG CLI:

python -m codebase_rag.main --repo-path /path/to/your/repo

Example queries:

"Show me all classes that contain 'user' in their name"
"Find functions related to database operations"
"What methods does the User class have?"
"Show me functions that handle authentication"

📊 Graph Schema

The knowledge graph uses the following node types and relationships:

Node Types

Project: Root node representing the entire repository
Package: Python packages (directories with __init__.py)
Module: Individual Python files
Class: Class definitions
Function: Module-level functions
Method: Methods defined inside classes
Folder: Regular directories
File: Non-Python files
ExternalPackage: External dependencies
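
The distinction between Package and Folder, and between Module and File, can be sketched as a small classifier over filesystem paths. The function name `classify` is hypothetical, and treating `__init__.py` itself as a Module is an assumption based on the "individual Python files" definition above.

```python
import tempfile
from pathlib import Path


def classify(path: Path) -> str:
    """Map a filesystem path to one of the node labels described above."""
    if path.is_dir():
        # A directory counts as a Package only if it holds an __init__.py.
        return "Package" if (path / "__init__.py").is_file() else "Folder"
    return "Module" if path.suffix == ".py" else "File"


# Build a throwaway tree to demonstrate the four labels.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pkg").mkdir()
    (root / "pkg" / "__init__.py").touch()
    (root / "pkg" / "models.py").touch()
    (root / "docs").mkdir()
    (root / "docs" / "README.md").touch()
    labels = {
        p.relative_to(root).as_posix(): classify(p)
        for p in sorted(root.rglob("*"))
    }

for name, label in labels.items():
    print(f"{label:8} {name}")
```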

Relationships

CONTAINS_PACKAGE, CONTAINS_MODULE, CONTAINS_FILE, CONTAINS_FOLDER: Hierarchical containment
DEFINES: Module defines classes/functions
DEFINES_METHOD: Class defines methods
DEPENDS_ON_EXTERNAL: Project depends on external packages
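
Given this schema, a question such as "What methods does the User class have?" could map to a Cypher query like the one built below. This is a hand-written sketch, not output from the system: the `name` property on nodes is an assumption (the schema above lists labels and relationships but not properties), and the helper names are hypothetical. The `run` function uses pymgclient's DB-API style interface and needs a running Memgraph instance.

```python
def methods_of_class_query(class_name: str) -> tuple[str, dict[str, str]]:
    """Build a parameterized Cypher query listing the methods of a class."""
    query = (
        "MATCH (c:Class {name: $class_name})-[:DEFINES_METHOD]->(m:Method) "
        "RETURN m.name AS method ORDER BY method"
    )
    return query, {"class_name": class_name}


def run(query: str, params: dict[str, str],
        host: str = "localhost", port: int = 7687) -> list[tuple]:
    """Execute the query against Memgraph and return all rows."""
    import mgclient  # installed by the pymgclient package

    connection = mgclient.connect(host=host, port=port)
    try:
        cursor = connection.cursor()
        cursor.execute(query, params)
        return cursor.fetchall()
    finally:
        connection.close()


query, params = methods_of_class_query("User")
print(query)
print(params)
```

Passing the class name as a parameter rather than formatting it into the query string avoids quoting problems with unusual identifiers.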

🔧 Configuration

Configuration is managed through environment variables and the config.py file:

MEMGRAPH_HOST = "localhost"
MEMGRAPH_PORT = 7687
GEMINI_MODEL_ID = "gemini-2.5-pro"  # Main RAG orchestrator model
MODEL_CYPHER_ID = "gemini-2.5-flash-lite-preview-06-17"  # Cypher generation model
TARGET_REPO_PATH = "."
GEMINI_API_KEY = "required"
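
One way to picture how these values combine with environment variables is a small settings loader with the defaults listed above. This is a sketch under assumptions: the real config.py may be built differently (for example on pydantic or python-dotenv), and the names `Settings` and `load_settings` are hypothetical.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str
    memgraph_host: str = "localhost"
    memgraph_port: int = 7687
    gemini_model_id: str = "gemini-2.5-pro"
    model_cypher_id: str = "gemini-2.5-flash-lite-preview-06-17"
    target_repo_path: str = "."


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from env, falling back to the documented defaults."""
    api_key = env.get("GEMINI_API_KEY", "")
    if not api_key:
        raise ValueError("GEMINI_API_KEY is required")
    return Settings(
        gemini_api_key=api_key,
        memgraph_host=env.get("MEMGRAPH_HOST", Settings.memgraph_host),
        memgraph_port=int(env.get("MEMGRAPH_PORT", Settings.memgraph_port)),
        gemini_model_id=env.get("GEMINI_MODEL_ID", Settings.gemini_model_id),
        model_cypher_id=env.get("MODEL_CYPHER_ID", Settings.model_cypher_id),
        target_repo_path=env.get("TARGET_REPO_PATH", Settings.target_repo_path),
    )


# Demonstrate with an explicit mapping instead of the real environment.
settings = load_settings({"GEMINI_API_KEY": "your-api-key", "MEMGRAPH_PORT": "7688"})
print(settings)
```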

🏃‍♂️ Development

Project Structure

graph-code/
├── repo_parser.py             # Repository analysis and ingestion
├── codebase_rag/              # RAG system package
│   ├── main.py                # CLI entry point
│   ├── config.py              # Configuration management
│   ├── prompts.py             # LLM prompts and schemas
│   ├── schemas.py             # Pydantic models
│   ├── services/              # Core services
│   │   ├── graph_db.py        # Memgraph integration
│   │   └── llm.py             # Gemini LLM integration
│   └── tools/                 # RAG tools
│       ├── codebase_query.py  # Graph querying tool
│       └── code_retrieval.py  # Code snippet retrieval
├── docker-compose.yaml        # Memgraph setup
└── pyproject.toml             # Project dependencies

Key Dependencies

pydantic-ai: AI agent framework
pymgclient: Memgraph Python client
loguru: Advanced logging
python-dotenv: Environment variable management

🐛 Debugging

Check Memgraph connection:
- Ensure Docker containers are running: docker-compose ps
- Verify Memgraph is accessible on port 7687
View database in Memgraph Lab:
- Open http://localhost:3000
- Connect to memgraph:7687
Enable debug logging:
- The RAG orchestrator runs in debug mode by default
- Check logs for detailed execution traces
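
The port check above can also be scripted. The sketch below only tests whether a TCP connection to the given host and port succeeds; it does not speak the Bolt protocol, so a positive result means something is listening there, not necessarily a healthy Memgraph. The function name `is_port_open` is an assumption for this example.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    print("Memgraph reachable:", is_port_open("localhost", 7687))
```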

🤝 Contributing

Follow the established code structure
Keep files under 100 lines (as per user rules)
Use type annotations
Follow conventional commit messages
Use DRY principles

🙋‍♂️ Support

For issues or questions:

Check the logs for error details
Verify Memgraph connection
Ensure all environment variables are set
Check that the graph schema matches your expectations

jcvikl/code-graph-rag