# Docs MCP
GitHub repository analysis and query system with Model Context Protocol server capabilities.
## Overview
Docs MCP ingests GitHub repositories and documentation websites, processes them into semantic chunks, stores them in a vector database, and exposes an intelligent query interface via MCP server. Perfect for understanding large codebases and their documentation through natural language queries.
## Features
- **Repository Ingestion**: Convert GitHub repos to searchable documents using GitIngest
- **Documentation Scraping**: Extract content from documentation websites via sitemap
- **Semantic Chunking**: Split content into overlapping chunks for better vector search
- **Vector Storage**: Persistent Chroma database with OpenAI embeddings
- **Directory Summaries**: AI-generated overviews of repository structure
- **Dual Search**: Independent search across code and documentation content
- **MCP Server**: Model Context Protocol server for external integration
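
The overlapping-chunk idea behind semantic chunking can be sketched as follows. This is a minimal illustration with hypothetical `chunk_size`/`overlap` defaults, not the actual `chunker.py` implementation:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks that overlap, so content
    straddling a boundary still appears intact in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # final window already reached the end of the text
    return chunks
```

Each chunk is then embedded and stored in the vector database; the overlap keeps sentences that span a chunk boundary retrievable from either side.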
## Prerequisites
- Python >=3.10
- OpenAI API key
## Quick Start
### 1. Setup
```bash
# Clone the repository
git clone <repository-url>
cd docs-mcp
# Install dependencies
uv sync
# Create environment file with your OpenAI API key
echo "OPENAI_API_KEY=your-api-key-here" > .env
```
### 2. Analyze a GitHub Repository
```bash
# Basic repository ingestion
uv run python main.py ingest --repo https://github.com/username/repository
# With documentation (if the project has docs)
uv run python main.py ingest --repo https://github.com/username/repository --docs-url https://docs.example.com
# For private repositories
uv run python main.py ingest --repo https://github.com/username/private-repo --token your-github-token
```
### 3. Query the Data
```bash
# Interactive query interface
uv run python query_db.py
```
### 4. Start MCP Server
```bash
# Start the server for external MCP clients
uv run python main.py serve
# Or run the full pipeline (ingest + serve)
uv run python main.py run --repo https://github.com/username/repository
```
## Example Workflow
Here's how to analyze a popular open-source project:
```bash
# 1. Ingest the FastAPI repository with documentation
uv run python main.py ingest \
--repo https://github.com/tiangolo/fastapi \
--docs-url https://fastapi.tiangolo.com
# 2. Query the ingested data
uv run python query_db.py
# Try queries like:
# - "How do I create API endpoints?"
# - "What are the main components of FastAPI?"
# - "How does dependency injection work?"
# 3. Start MCP server for integration
uv run python main.py serve --host 0.0.0.0 --port 8000
```
## CLI Commands
### `ingest` - Ingest Repository and Documentation
```bash
uv run python main.py ingest [OPTIONS]
Options:
--repo TEXT GitHub repository URL or local path [required]
--docs-url TEXT Documentation website URL (optional)
--token TEXT GitHub token for private repositories
--collection TEXT Vector store collection name [default: project]
```
### `serve` - Start MCP Server
```bash
uv run python main.py serve [OPTIONS]
Options:
--host TEXT Host to bind the server to [default: 0.0.0.0]
--port INTEGER Port to bind the server to [default: 8000]
```
### `docs` - Ingest Documentation Only
```bash
uv run python main.py docs [OPTIONS]
Options:
--docs-url TEXT Documentation website URL [required]
--collection TEXT Vector store collection name [default: project]
```
### `run` - Full Pipeline (Ingest + Serve)
```bash
uv run python main.py run [OPTIONS]
Options:
--repo TEXT GitHub repository URL or local path [required]
--docs-url TEXT Documentation website URL (optional)
--token TEXT GitHub token for private repositories
--collection TEXT Vector store collection name [default: project]
--host TEXT Host to bind the server to [default: 0.0.0.0]
--port INTEGER Port to bind the server to [default: 8000]
```
## MCP Server Installation & Integration
### Installing as MCP Server
To use Docs MCP as a Model Context Protocol server with compatible clients (like Claude Desktop), follow these steps:
1. **Ingest Your Data First**:
```bash
# Ingest the repository you want to analyze
uv run python main.py ingest --repo https://github.com/username/repository
# Or with documentation
uv run python main.py ingest --repo https://github.com/username/repository --docs-url https://docs.example.com
```
2. **Configure Claude Desktop**:
Add to your Claude Desktop configuration file (`claude_desktop_config.json`):
```json
{
"mcpServers": {
"docs-mcp": {
"command": "uv",
"args": [
"run",
"--python",
"/path/to/docs-mcp/.venv/bin/python",
"/path/to/docs-mcp/mcp_standalone.py"
],
"cwd": "/path/to/docs-mcp",
"env": {
"ANTHROPIC_API_KEY": "your-anthropic-api-key-here",
"OPENAI_API_KEY": "your-openai-api-key-here"
}
}
}
}
```
**Important Notes**:
- Replace `/path/to/docs-mcp` with the actual path to your cloned repository
- The `--python` flag ensures the correct virtual environment Python is used
- The MCP server runs automatically when Claude Desktop starts - no separate server startup needed
- Communication happens over stdio (standard input/output), not HTTP
3. **Required API Keys**:
- `ANTHROPIC_API_KEY` - For the Claude LLM used by the agent
- `OPENAI_API_KEY` - For the OpenAI embeddings used by the vector store
4. **Restart Claude Desktop** after updating the configuration file
### How It Works
The MCP server uses stdio (standard input/output) communication with Claude Desktop:
- Claude Desktop launches `mcp_standalone.py` as a subprocess
- The server communicates via JSON messages over stdin/stdout
- No network ports or HTTP servers are involved
- The server automatically loads your ingested repository data from the vector database
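
Concretely, each exchange is a JSON-RPC 2.0 message written to the server's stdin. A tool invocation looks roughly like this (an illustrative, abbreviated message, not a captured transcript):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "search_repo",
    "arguments": { "query": "authentication implementation" }
  }
}
```

The server replies on stdout with a response carrying the matching `id` and the tool's result.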
### MCP Tools Available
When connected via MCP, the server provides these tools:
- **search_repo**: Search through repository code and files
- **search_docs**: Search through documentation content
Example MCP tool usage:
```json
{
"tool": "search_repo",
"arguments": {
"query": "authentication implementation"
}
}
```
## Project Structure
```
docs-mcp/
├── main.py               # CLI entry point
├── config.py             # Configuration settings
├── query_db.py           # Interactive query tool
├── ingestion/            # Data ingestion modules
│   ├── repo_ingestor.py  # GitIngest integration
│   └── docs_scraper.py   # Documentation scraping
├── processing/           # Text processing
│   └── chunker.py        # Semantic text chunking
├── vectordb/             # Vector database operations
│   └── vector_store.py   # Chroma vector store
├── summaries/            # Directory summarization
│   └── dir_summarizer.py # AI-generated overviews
├── agent/                # LangChain agent system
│   └── agent_builder.py  # Agent with search tools
└── server/               # MCP server implementation
    └── mcp_server.py     # FastAPI server
```
## Data Storage
- `data/vectordb/` - Persistent Chroma vector database collections
- `data/summaries/` - Generated directory overview markdown files
## Development
```bash
# Lint code
uv run ruff check .
# Auto-fix linting issues
uv run ruff check . --fix --unsafe-fixes
# Format code
uv run ruff format .
# Type check
uv run mypy .
```
## Environment Variables
Create a `.env` file in the project root:
```bash
OPENAI_API_KEY=your-openai-api-key-here
ANTHROPIC_API_KEY=your-anthropic-api-key-here
```
**Required API Keys**:
- `OPENAI_API_KEY` - Used for vector embeddings via OpenAI's text-embedding-3-small model
- `ANTHROPIC_API_KEY` - Used for the Claude LLM (claude-3-5-sonnet-20241022) in the agent
## Troubleshooting
### Common Issues
1. **Python Version**: Ensure you're using Python >=3.10 (required by onnxruntime dependency)
2. **API Keys**: Make sure both OpenAI and Anthropic API keys are set in the `.env` file
3. **GitHub Rate Limits**: For large repositories or frequent ingestion, consider using a GitHub token
4. **Memory Usage**: Large repositories may require significant RAM for processing
### Getting Help
- Check the logs for detailed error messages
- Use `uv run python query_db.py` to verify data was ingested correctly
- Test the HTTP server (`uv run python main.py serve`) with `curl` before integrating with HTTP clients; the Claude Desktop integration communicates over stdio and does not use HTTP
## License
[Your License Here]