Forked from @lishenxydlgzs
A Model Context Protocol (MCP) server that provides semantic search capabilities across files. This server watches specified directories and creates vector embeddings of file contents, enabling semantic search across your documents.
Enhanced Version - This fork includes additional file processing capabilities for documents, PDFs, and images.
Add to your MCP settings file:
{
  "mcpServers": {
    "files-vectorstore": {
      "command": "npx",
      "args": [
        "/path/to/simple-files-vectorstore/build/index.js"
      ],
      "env": {
        "WATCH_DIRECTORIES": "/path/to/your/directories"
      },
      "disabled": false,
      "autoApprove": []
    }
  }
}
MCP settings file locations:
- VSCode Cline Extension: ~/Library/Application Support/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json
- Claude Desktop App: ~/Library/Application Support/Claude/claude_desktop_config.json
This version supports automatic text extraction from multiple file types:
Documents (converted with Pandoc):
- Word documents (.docx)
- OpenDocument Text (.odt)
- EPUB files (.epub)
- Rich Text Format (.rtf)
- LaTeX documents (.tex)
- reStructuredText (.rst)
PDFs:
- Text extraction from PDFs using pdftotext
- Handles text-based PDFs efficiently
Images (text extracted with Tesseract OCR):
- JPEG/JPG (.jpg, .jpeg)
- PNG (.png)
- GIF (.gif)
- BMP (.bmp)
- TIFF (.tiff)
- WebP (.webp)
- Document files → Convert with Pandoc
- PDF files → Extract text with pdftotext
- Image files → Extract text with Tesseract OCR
- Text files → Process directly (original behavior)
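The routing above can be sketched as a lookup from file extension to processor. This is a minimal illustration, not the server's actual code: the `Processor` labels and the `pickProcessor` and `extensionOf` helpers are hypothetical names, and the extension sets are copied from the lists in this README.

```typescript
// Hypothetical processor labels; the server's internal names may differ.
type Processor = "pandoc" | "pdftotext" | "tesseract" | "text";

// Extension sets copied from the supported-format lists in this README.
const DOCUMENT_EXTS = new Set([".docx", ".odt", ".epub", ".rtf", ".tex", ".rst"]);
const IMAGE_EXTS = new Set([".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp"]);

// Return the lowercased extension (with the dot), or "" when there is none.
// Assumes "/" as the path separator.
function extensionOf(filePath: string): string {
  const base = filePath.slice(filePath.lastIndexOf("/") + 1);
  const dot = base.lastIndexOf(".");
  return dot > 0 ? base.slice(dot).toLowerCase() : "";
}

// Map a file path to the processor that would handle it.
function pickProcessor(filePath: string): Processor {
  const ext = extensionOf(filePath);
  if (DOCUMENT_EXTS.has(ext)) return "pandoc";
  if (ext === ".pdf") return "pdftotext";
  if (IMAGE_EXTS.has(ext)) return "tesseract";
  return "text"; // everything else is treated as plain text in this sketch
}
```

The sketch sends every unrecognized extension to the text path. The real server evidently does more checking there, since its log can report "Not a text file" and "No processor found".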
To use the enhanced file processing features, install these dependencies:
# macOS
brew install pandoc # Document conversion
brew install poppler # PDF text extraction (pdftotext)
brew install tesseract # Image OCR
# Ubuntu/Debian
sudo apt-get install pandoc poppler-utils tesseract-ocr
# Windows (via Chocolatey)
choco install pandoc poppler tesseract
The server requires configuration through environment variables:
You must specify directories to watch using ONE of the following methods:
- WATCH_DIRECTORIES: Comma-separated list of directories to watch
- WATCH_CONFIG_FILE: Path to a JSON configuration file with a watchList array
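As a rough sketch of how these two options might be resolved, the code below picks one source and validates a watchList file. It rests on assumptions: `resolveWatchSource` and `parseWatchList` are hypothetical helpers, rejecting the case where both variables are set is a guess at what "ONE of" implies, and the watchList entries are assumed to be plain path strings.

```typescript
type Env = Record<string, string | undefined>;

type WatchSource =
  | { kind: "directories"; paths: string[] }
  | { kind: "configFile"; path: string };

// Decide which of the two documented variables supplies the watch targets.
function resolveWatchSource(env: Env): WatchSource {
  const dirs = (env["WATCH_DIRECTORIES"] ?? "").trim();
  const file = (env["WATCH_CONFIG_FILE"] ?? "").trim();
  if (dirs && file) {
    // Assumption: setting both is treated as a configuration error.
    throw new Error("Set only one of WATCH_DIRECTORIES or WATCH_CONFIG_FILE");
  }
  if (dirs) {
    // Split on commas and drop empty segments such as a trailing comma.
    const paths = dirs.split(",").map((p) => p.trim()).filter((p) => p.length > 0);
    return { kind: "directories", paths };
  }
  if (file) return { kind: "configFile", path: file };
  throw new Error("Set WATCH_DIRECTORIES or WATCH_CONFIG_FILE");
}

// Validate the contents of a WATCH_CONFIG_FILE, e.g. {"watchList": ["/path/a"]}.
function parseWatchList(jsonText: string): string[] {
  const parsed: unknown = JSON.parse(jsonText);
  const list =
    typeof parsed === "object" && parsed !== null
      ? (parsed as { watchList?: unknown }).watchList
      : undefined;
  if (!Array.isArray(list) || !list.every((p) => typeof p === "string")) {
    throw new Error("watchList must be an array of strings");
  }
  return list as string[];
}
```

Passing the environment in as a plain object keeps the helpers easy to test; a caller would typically hand in `process.env`.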
Example using WATCH_DIRECTORIES:
{
  "mcpServers": {
    "files-vectorstore": {
      "command": "npx",
      "args": [
        "/path/to/simple-files-vectorstore/build/index.js"
      ],
      "env": {
        "WATCH_DIRECTORIES": "/path/to/dir1,/path/to/dir2"
      },
      "disabled": false,
      "autoApprove": []
    }
  }
}
Optional environment variables:
- CHUNK_SIZE: Size of text chunks for processing (default: 1000)
- CHUNK_OVERLAP: Overlap between chunks (default: 200)
- IGNORE_FILE: Path to a .gitignore style file to exclude files/directories based on patterns
- INGESTION_LOG_PATH: Path to ingestion log file (default: /Users/onk/Documents/Vector/.ingestionlog)
- VECTOR_STORE_PATH: Directory for persistent vector storage (default: ~/.simple-files-vectorstore)
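To make the interaction between CHUNK_SIZE and CHUNK_OVERLAP concrete, here is a minimal sliding-window splitter. It assumes both values are measured in characters, which this README does not state, and `chunkText` is a hypothetical helper rather than the server's own function.

```typescript
// Split text into windows of chunkSize characters, where each window
// starts chunkSize - chunkOverlap characters after the previous one.
function chunkText(text: string, chunkSize = 1000, chunkOverlap = 200): string[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be in [0, chunkSize)");
  }
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window reached the end
  }
  return chunks;
}
```

Under these assumptions the defaults advance 800 characters per chunk, so a 2500-character file yields three chunks starting at offsets 0, 800, and 1600. A larger overlap preserves more context across chunk boundaries at the cost of more embeddings to store.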
Example with optional parameters:
{
  "mcpServers": {
    "files-vectorstore": {
      "command": "npx",
      "args": [
        "/path/to/simple-files-vectorstore/build/index.js"
      ],
      "env": {
        "WATCH_DIRECTORIES": "/path/to/dir1,/path/to/dir2",
        "CHUNK_SIZE": "2000",
        "CHUNK_OVERLAP": "500",
        "IGNORE_FILE": "/path/to/.gitignore",
        "INGESTION_LOG_PATH": "/custom/path/to/ingestion.log"
      },
      "disabled": false,
      "autoApprove": []
    }
  }
}
The server maintains a detailed log of all file processing activities:
2025-09-01T23:47:00.000Z | ADD | SUCCESS | /path/to/file.pdf
2025-09-01T23:47:05.000Z | ADD | FAILED | /path/to/image.jpg | OCR extraction failed
2025-09-01T23:47:10.000Z | REMOVE | SUCCESS | /path/to/deleted.txt
- ADD SUCCESS: File successfully processed and indexed
- ADD FAILED: File processing failed (with reason)
- REMOVE SUCCESS: File successfully removed from index
Common failure reasons:
- Pandoc conversion failed
- PDF extraction failed
- OCR extraction failed
- Not a text file
- No processor found
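Because each log entry is a single line with " | " between fields, the log is easy to post-process. The parser below is a sketch based only on the sample lines shown above; `parseLogLine` and `IngestionLogEntry` are hypothetical names, and the field layout is inferred from those samples rather than from a published format.

```typescript
interface IngestionLogEntry {
  timestamp: Date;
  action: string; // e.g. "ADD" or "REMOVE"
  status: string; // e.g. "SUCCESS" or "FAILED"
  path: string;
  reason?: string; // only present on lines that carry a fifth field
}

// Parse one log line; returns null when the line does not match the layout.
function parseLogLine(line: string): IngestionLogEntry | null {
  const parts = line.trim().split(" | ");
  if (parts.length < 4) return null;
  const [ts, action, status, path, ...rest] = parts;
  const timestamp = new Date(ts);
  if (Number.isNaN(timestamp.getTime())) return null;
  const entry: IngestionLogEntry = { timestamp, action, status, path };
  // Rejoin any extra fields so a reason containing " | " stays intact.
  if (rest.length > 0) entry.reason = rest.join(" | ");
  return entry;
}
```

Filtering the parsed entries for `status === "FAILED"` gives a quick list of files that need attention. Note that a file path containing " | " would be split incorrectly by this sketch.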
This server provides the following MCP tools:
Perform semantic search across indexed files.
Parameters:
- query (required): The search query string
- limit (optional): Maximum number of results to return (default: 5, max: 20)
- folder (optional): Folder path to limit search scope
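For intuition about how a query and the limit parameter could combine, the sketch below ranks stored chunks by cosine similarity and trims the result list. Cosine similarity is a common choice for embedding search, but this README does not say which metric the server uses. Clamping an out-of-range limit (rather than rejecting it) is likewise an assumption, and `IndexedChunk`, `cosineSimilarity`, `clampLimit`, and `rank` are hypothetical names. Folder scoping is not modeled.

```typescript
interface IndexedChunk {
  source: string;
  content: string;
  embedding: number[];
}

// Cosine similarity of two equal-length vectors; 0 when either is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector lengths differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Apply the documented default (5) and maximum (20).
function clampLimit(limit?: number): number {
  if (limit === undefined || !Number.isFinite(limit)) return 5;
  return Math.min(20, Math.max(1, Math.floor(limit)));
}

// Score every chunk against the query embedding and keep the best matches.
function rank(queryEmbedding: number[], chunks: IndexedChunk[], limit?: number) {
  return chunks
    .map((c) => ({
      source: c.source,
      content: c.content,
      score: cosineSimilarity(queryEmbedding, c.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, clampLimit(limit));
}
```

A linear scan like this is fine for illustration; a real vector store would normally use an index to avoid scoring every chunk on each query.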
Example usage:
// Search all files
search({query: "infrastructure documentation"})
// Search within specific folder
search({query: "infrastructure", folder: "General"})
Example response:
[
  {
    "content": "matched text content",
    "source": "/path/to/file",
    "fileType": "txt",
    "score": 0.85,
    "lastModified": 1706123456789,
    "lastModifiedDate": "2024-01-24T19:10:56.789Z"
  }
]
Search files by modification date with optional semantic search.
Parameters:
- after (optional): ISO date string - files modified after this date
- before (optional): ISO date string - files modified before this date
- query (optional): Search query to combine with date filtering
- limit (optional): Maximum number of results to return (default: 5, max: 20)
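One way to read the after and before parameters is as a filter on the lastModified value (epoch milliseconds) that appears in search results. The sketch below assumes both bounds are exclusive, which the README does not specify; `filterByDate` is a hypothetical helper.

```typescript
// Keep items whose lastModified (epoch ms) falls strictly between the bounds.
// A missing bound leaves that side of the range open.
function filterByDate<T extends { lastModified: number }>(
  items: T[],
  after?: string,
  before?: string,
): T[] {
  const afterMs = after === undefined ? -Infinity : Date.parse(after);
  const beforeMs = before === undefined ? Infinity : Date.parse(before);
  if (Number.isNaN(afterMs) || Number.isNaN(beforeMs)) {
    throw new Error("after and before must be ISO date strings");
  }
  return items.filter((it) => it.lastModified > afterMs && it.lastModified < beforeMs);
}
```

Date-only strings such as "2024-01-01" are parsed by `Date.parse` as midnight UTC, so a file modified late on 31 December in a local time zone west of UTC can still count as "after" 1 January.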
Example usage:
// Files modified after a specific date
search_by_date({after: "2024-01-01"})
// Files modified in a date range
search_by_date({after: "2024-01-01", before: "2024-02-01"})
// Combine date filtering with semantic search
search_by_date({after: "2024-01-01", query: "documentation"})
Get statistics about indexed files.
Parameters: None
Example response:
{
  "totalDocuments": 42,
  "watchedDirectories": ["/path/to/docs"],
  "processingFiles": []
}
- Enhanced file support: Documents, PDFs, and images via OCR
- Real-time file watching and indexing
- Semantic search using vector embeddings
- Folder-scoped search: Limit searches to specific directories
- Persistent vector storage: Eliminates re-ingestion on restart
- Comprehensive logging of all ingestion activities
- Configurable processing with environment variables
- Background processing of files
- Automatic handling of file changes and deletions
- Flexible configuration via environment variables
npm install
npm run build
The built files will be in the build/ directory.