# Docs MCP

GitHub repository analysis and query system with Model Context Protocol server capabilities.

## Overview

Docs MCP ingests GitHub repositories and documentation websites, processes them into semantic chunks, stores them in a vector database, and exposes an intelligent query interface via an MCP server. Perfect for understanding large codebases and their documentation through natural language queries.

## Features

- **Repository Ingestion**: Convert GitHub repos to searchable documents using GitIngest
- **Documentation Scraping**: Extract content from documentation websites via sitemap
- **Semantic Chunking**: Split content into overlapping chunks for better vector search
- **Vector Storage**: Persistent Chroma database with OpenAI embeddings
- **Directory Summaries**: AI-generated overviews of repository structure
- **Dual Search**: Independent search across code and documentation content
- **MCP Server**: Model Context Protocol server for external integration

## Prerequisites

- Python >=3.10
- OpenAI API key
- Anthropic API key (used by the query agent; see Environment Variables below)

## Quick Start

### 1. Setup

```bash
# Clone the repository
git clone <repository-url>
cd docs-mcp

# Install dependencies
uv sync

# Create environment file with your API keys
echo "OPENAI_API_KEY=your-api-key-here" > .env
echo "ANTHROPIC_API_KEY=your-anthropic-api-key-here" >> .env
```

### 2. Analyze a GitHub Repository

```bash
# Basic repository ingestion
uv run python main.py ingest --repo https://github.com/username/repository

# With documentation (if the project has docs)
uv run python main.py ingest --repo https://github.com/username/repository --docs-url https://docs.example.com

# For private repositories
uv run python main.py ingest --repo https://github.com/username/private-repo --token your-github-token
```

### 3. Query the Data

```bash
# Interactive query interface
uv run python query_db.py
```

### 4. Start MCP Server

```bash
# Start the server for external MCP clients
uv run python main.py serve

# Or run the full pipeline (ingest + serve)
uv run python main.py run --repo https://github.com/username/repository
```

## Example Workflow

Here's how to analyze a popular open-source project:

```bash
# 1. Ingest the FastAPI repository with documentation
uv run python main.py ingest \
  --repo https://github.com/tiangolo/fastapi \
  --docs-url https://fastapi.tiangolo.com

# 2. Query the ingested data
uv run python query_db.py
# Try queries like:
# - "How do I create API endpoints?"
# - "What are the main components of FastAPI?"
# - "How does dependency injection work?"

# 3. Start MCP server for integration
uv run python main.py serve --host 0.0.0.0 --port 8000
```
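If you want to inspect the ingested data outside the interactive tool, the persistent Chroma collection can also be opened directly. Below is a minimal sketch, assuming the defaults documented in this README (collection name `project`, database path `data/vectordb/`, OpenAI's `text-embedding-3-small` embeddings); the project's actual storage layout may differ, and `query_db.py` remains the supported interface:

```python
# Minimal sketch: query the persistent Chroma database directly.
# Assumes the documented defaults (collection "project", data/vectordb/,
# text-embedding-3-small); the real storage layout may differ.
import os

import chromadb
from chromadb.utils import embedding_functions

# Use the same embedding model the ingestion pipeline uses
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",
)

client = chromadb.PersistentClient(path="data/vectordb")
collection = client.get_collection(name="project", embedding_function=openai_ef)

# Embed the question and return the five closest chunks
results = collection.query(
    query_texts=["How does dependency injection work?"],
    n_results=5,
)
for doc in results["documents"][0]:
    print(doc[:200], "...")
```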
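For background on the "semantic chunking" feature listed above: chunks overlap so that context near a chunk boundary appears in both neighbouring chunks, which keeps boundary-spanning passages retrievable. A toy sketch of the overlap mechanic, with illustrative sizes rather than the actual settings in the project's `chunker.py`:

```python
# Toy sketch of overlapping chunking; chunk_size/overlap values are
# illustrative, not the project's actual settings.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks that overlap by `overlap` characters."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    step = chunk_size - overlap  # each window advances by chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last window already reached the end of the text
    return chunks

# Example: 25 chars, chunk_size=10, overlap=4 -> windows advance by 6 chars
print(chunk_text("abcdefghijklmnopqrstuvwxy", chunk_size=10, overlap=4))
```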
## CLI Commands

### `ingest` - Ingest Repository and Documentation

```bash
uv run python main.py ingest [OPTIONS]

Options:
  --repo TEXT        GitHub repository URL or local path [required]
  --docs-url TEXT    Documentation website URL (optional)
  --token TEXT       GitHub token for private repositories
  --collection TEXT  Vector store collection name [default: project]
```

### `serve` - Start MCP Server

```bash
uv run python main.py serve [OPTIONS]

Options:
  --host TEXT     Host to bind the server to [default: 0.0.0.0]
  --port INTEGER  Port to bind the server to [default: 8000]
```

### `docs` - Ingest Documentation Only

```bash
uv run python main.py docs [OPTIONS]

Options:
  --docs-url TEXT    Documentation website URL [required]
  --collection TEXT  Vector store collection name [default: project]
```

### `run` - Full Pipeline (Ingest + Serve)

```bash
uv run python main.py run [OPTIONS]

Options:
  --repo TEXT        GitHub repository URL or local path [required]
  --docs-url TEXT    Documentation website URL (optional)
  --token TEXT       GitHub token for private repositories
  --collection TEXT  Vector store collection name [default: project]
  --host TEXT        Host to bind the server to [default: 0.0.0.0]
  --port INTEGER     Port to bind the server to [default: 8000]
```

## MCP Server Installation & Integration

### Installing as MCP Server

To use Docs MCP as a Model Context Protocol server with compatible clients (like Claude Desktop), follow these steps:

1. **Ingest Your Data First**:

   ```bash
   # Ingest the repository you want to analyze
   uv run python main.py ingest --repo https://github.com/username/repository

   # Or with documentation
   uv run python main.py ingest --repo https://github.com/username/repository --docs-url https://docs.example.com
   ```

2. **Configure Claude Desktop**:

   Add to your Claude Desktop configuration file (`claude_desktop_config.json`):

   ```json
   {
     "mcpServers": {
       "docs-mcp": {
         "command": "uv",
         "args": [
           "run",
           "--python",
           "/path/to/docs-mcp/.venv/bin/python",
           "/path/to/docs-mcp/mcp_standalone.py"
         ],
         "cwd": "/path/to/docs-mcp",
         "env": {
           "ANTHROPIC_API_KEY": "your-anthropic-api-key-here",
           "OPENAI_API_KEY": "your-openai-api-key-here"
         }
       }
     }
   }
   ```

   **Important Notes**:
   - Replace `/path/to/docs-mcp` with the actual path to your cloned repository
   - The `--python` flag ensures the correct virtual environment Python is used
   - The MCP server runs automatically when Claude Desktop starts - no separate server startup needed
   - Communication happens over stdio (standard input/output), not HTTP

3. **Required API Keys**:
   - `ANTHROPIC_API_KEY` - For the Claude LLM used by the agent
   - `OPENAI_API_KEY` - For the OpenAI embeddings used by the vector store

4. **Restart Claude Desktop** after updating the configuration file
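For orientation, here is a minimal sketch of what a stdio MCP server exposing the project's two search tools can look like, written against the official `mcp` Python SDK (FastMCP). This is an illustrative assumption about the shape of `mcp_standalone.py`, not its actual contents:

```python
# Hypothetical sketch of a stdio MCP server exposing the two search tools.
# Uses the official `mcp` Python SDK; the real mcp_standalone.py may differ.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("docs-mcp")

@mcp.tool()
def search_repo(query: str) -> str:
    """Search through repository code and files."""
    # The real server would query the Chroma vector store here.
    return f"(placeholder) repo results for: {query}"

@mcp.tool()
def search_docs(query: str) -> str:
    """Search through documentation content."""
    return f"(placeholder) docs results for: {query}"

if __name__ == "__main__":
    # stdio transport: the client talks to this process over stdin/stdout
    mcp.run(transport="stdio")
```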
### How It Works

The MCP server uses stdio (standard input/output) communication with Claude Desktop:

- Claude Desktop launches `mcp_standalone.py` as a subprocess
- The server communicates via JSON messages over stdin/stdout
- No network ports or HTTP servers are involved
- The server automatically loads your ingested repository data from the vector database

### MCP Tools Available

When connected via MCP, the server provides these tools:

- **search_repo**: Search through repository code and files
- **search_docs**: Search through documentation content

Example MCP tool usage:

```json
{
  "tool": "search_repo",
  "arguments": {
    "query": "authentication implementation"
  }
}
```

## Project Structure

```
docs-mcp/
├── main.py               # CLI entry point
├── config.py             # Configuration settings
├── query_db.py           # Interactive query tool
├── ingestion/            # Data ingestion modules
│   ├── repo_ingestor.py  # GitIngest integration
│   └── docs_scraper.py   # Documentation scraping
├── processing/           # Text processing
│   └── chunker.py        # Semantic text chunking
├── vectordb/             # Vector database operations
│   └── vector_store.py   # Chroma vector store
├── summaries/            # Directory summarization
│   └── dir_summarizer.py # AI-generated overviews
├── agent/                # LangChain agent system
│   └── agent_builder.py  # Agent with search tools
└── server/               # MCP server implementation
    └── mcp_server.py     # FastAPI server
```

## Data Storage

- `data/vectordb/` - Persistent Chroma vector database collections
- `data/summaries/` - Generated directory overview markdown files

## Development

```bash
# Lint code
uv run ruff check .

# Auto-fix linting issues
uv run ruff check . --fix --unsafe-fixes

# Format code
uv run ruff format .

# Type check
uv run mypy .
```

## Environment Variables

Create a `.env` file in the project root:

```bash
OPENAI_API_KEY=your-openai-api-key-here
ANTHROPIC_API_KEY=your-anthropic-api-key-here
```

**Required API Keys**:

- `OPENAI_API_KEY` - Used for vector embeddings via OpenAI's text-embedding-3-small model
- `ANTHROPIC_API_KEY` - Used for the Claude LLM (claude-3-5-sonnet-20241022) in the agent

## Troubleshooting

### Common Issues

1. **Python Version**: Ensure you're using Python >=3.10 (required by the onnxruntime dependency)
2. **API Keys**: Make sure both the OpenAI and Anthropic API keys are set in the `.env` file
3. **GitHub Rate Limits**: For large repositories or frequent ingestion, consider using a GitHub token
4. **Memory Usage**: Large repositories may require significant RAM for processing

### Getting Help

- Check the logs for detailed error messages
- Use `uv run python query_db.py` to verify data was ingested correctly
- Test the HTTP server started by `main.py serve` with `curl` before integrating with external clients (the stdio MCP server used by Claude Desktop has no HTTP endpoint)

## License

[Your License Here]