deepdoc: A Python repository from Datalore.ai

Overview

DeepDoc is a tool that performs deep research on your local resources instead of the internet. It uses a research-style workflow to explore your documents, organize the findings, and generate a clear markdown report. This way, you can quickly uncover insights from your own files without manually digging through them.

How It Works

Start by uploading local resources (PDF, DOCX, JPG, TXT, etc.).
The system extracts text and splits it into page-wise chunks.
These chunks are stored in a vector database for semantic similarity search.
Based on your instruction query, a content structure is generated.
You can provide feedback to refine the structure.
The tool then generates report sections and section topics.
For each section, research agents:
- Generate knowledge for the section.
- Create research queries.
- Run search agents over the chunked local data.
- Use reflection agents to refine results.
- Generate final section content.
Section-wise content is compiled and passed to a final report writer.
The output is a complete, structured report in markdown format.
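
The chunking and search steps above can be sketched in a few lines. This is a minimal illustration and not DeepDoc's actual code: the `Chunk` class and the `chunk_by_page`, `embed`, and `search` functions are hypothetical names, and a bag-of-words count stands in for the real embedding model and vector database so that the example runs without any services.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # file the text was extracted from
    page: int    # 1-based page number within that file
    text: str


def chunk_by_page(source: str, pages: list[str]) -> list[Chunk]:
    """Create one chunk per non-empty page, keeping the original page number."""
    return [
        Chunk(source=source, page=number, text=text.strip())
        for number, text in enumerate(pages, start=1)
        if text.strip()
    ]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(chunks: list[Chunk], query: str, top_k: int = 2) -> list[Chunk]:
    """Return the top_k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine(embed(chunk.text), query_vector),
        reverse=True,
    )
    return ranked[:top_k]


# Example input: three extracted pages, the second one empty.
pages = [
    "Quarterly revenue grew due to strong cloud sales.",
    "",
    "The hiring plan adds ten engineers next year.",
]
chunks = chunk_by_page("report.pdf", pages)
best = search(chunks, "cloud revenue growth", top_k=1)[0]
print(best.page, best.text)
```

Keeping the page number on each chunk is what would let a report cite where a finding came from; a real embedding model would also match on meaning and not only on shared words.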

Workflow

The workflow diagram (not reproduced here) shows how DeepDoc takes your local resources and instructions, processes and analyzes the content, and turns it into a structured report.
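
The per-section part of this workflow can also be read as plain control flow. The sketch below is an assumption about the overall shape, not the project's implementation: `research_section`, `build_report`, and the four agent callables are hypothetical, and the agents are replaced by trivial stubs so the example runs on its own.

```python
from typing import Callable


def research_section(
    topic: str,
    make_queries: Callable[[str], list[str]],
    search: Callable[[str], list[str]],
    reflect: Callable[[str, list[str]], list[str]],
    write: Callable[[str, list[str]], str],
    num_reflections: int = 2,
) -> str:
    """Run the query -> search -> reflect -> write loop for one section."""
    findings: list[str] = []
    for query in make_queries(topic):
        findings.extend(search(query))

    for _ in range(num_reflections):
        # Assumption: a reflection step returns follow-up queries,
        # and an empty list means the findings are good enough.
        follow_ups = reflect(topic, findings)
        if not follow_ups:
            break
        for query in follow_ups:
            findings.extend(search(query))

    return write(topic, findings)


def build_report(title: str, topics: list[str], **agents) -> str:
    """Research every section, then join the sections into one markdown report."""
    sections = [research_section(topic, **agents) for topic in topics]
    return f"# {title}\n\n" + "\n\n".join(sections)


# Stub agents so the control flow can be run without any model or database.
report = build_report(
    "Demo report",
    ["Revenue", "Hiring"],
    make_queries=lambda topic: [f"{topic} summary"],
    search=lambda query: [f"note on {query}"],
    reflect=lambda topic, findings: [],
    write=lambda topic, findings: f"## {topic}\n\n"
    + "\n".join(f"- {item}" for item in findings),
)
print(report)
```

Passing the agents in as callables keeps the loop itself easy to test; real agents would call an LLM and the vector database in place of the lambdas.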

Getting Started

Follow these steps to set up and run the project locally.

Prerequisite: Install `uv`

uv is required to manage the virtual environment and dependencies.

You can download it from the official uv GitHub repository, which includes platform-specific installation instructions.

1. Clone the Repository

git clone https://github.com/Datalore-ai/deepdoc.git
cd deepdoc

2. Create a Virtual Environment

Use uv to create a virtual environment:

uv venv

3. Activate the Virtual Environment

Activate the environment depending on your OS:

Windows:

.venv\Scripts\activate

macOS/Linux:

source .venv/bin/activate

4. Set Up Environment Variables

Copy the example .env file and add your API keys:

cp .env.example .env

Open the .env file in a text editor and fill in the required fields:

MISTRAL_API_KEY=
TAVILY_API_KEY=
OPENAI_API_KEY=

# Default
QDRANT_URL=http://localhost:6333
COLLECTION_NAME=knowledge_base
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
QDRANT_DISABLE_THREADING=true # Don't change this

These keys are essential for the application to work correctly.
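
A quick way to catch an incomplete .env file is to validate the values before starting a run. The sketch below is illustrative only: `load_settings` is a hypothetical helper, it assumes the variables have already been loaded into the process environment, and it assumes all three API keys are required, as the text above suggests.

```python
import os
from typing import Mapping, Optional

# Assumption: all three API keys listed in .env.example are required.
REQUIRED_KEYS = ("MISTRAL_API_KEY", "TAVILY_API_KEY", "OPENAI_API_KEY")

# Defaults copied from the example .env file.
DEFAULTS = {
    "QDRANT_URL": "http://localhost:6333",
    "COLLECTION_NAME": "knowledge_base",
    "EMBEDDING_MODEL": "BAAI/bge-small-en-v1.5",
}


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return validated settings, raising if a required key is empty or missing."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
    if missing:
        raise RuntimeError("Missing values in .env: " + ", ".join(missing))

    settings = {key: env[key] for key in REQUIRED_KEYS}
    for key, default in DEFAULTS.items():
        settings[key] = env.get(key) or default
    return settings


# Example with placeholder values in place of real keys.
example_env = {
    "MISTRAL_API_KEY": "placeholder-1",
    "TAVILY_API_KEY": "placeholder-2",
    "OPENAI_API_KEY": "placeholder-3",
}
print(load_settings(example_env)["QDRANT_URL"])
```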

5. Install Dependencies

Install required packages using:

uv pip install -r requirements.txt

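6. Set Up Docker for the Qdrant Vector Database

Make sure you have Docker and Docker Compose installed. Then start the required services (e.g., Qdrant) using:

docker-compose up --build

This builds and starts the services in the foreground of the current terminal, so run the next step in a second terminal. To run the services in the background instead, add the -d flag.

7. Run the Application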

Once the environment and services are ready, start the application:

python main.py

You're all set! The application guides you through the report generation process step by step, and the final report is saved in the output_files directory.
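
If the application cannot connect to the vector database at startup, it may help to confirm that something is listening at the configured QDRANT_URL. The following sketch uses only the standard library; `parse_host_port` and `is_reachable` are hypothetical helpers, and an open port only shows that a service is listening, not that Qdrant itself is healthy.

```python
import socket
from urllib.parse import urlparse


def parse_host_port(url: str) -> tuple[str, int]:
    """Split a URL such as http://localhost:6333 into (host, port)."""
    parsed = urlparse(url)
    default_port = 443 if parsed.scheme == "https" else 80
    return parsed.hostname or "localhost", parsed.port or default_port


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    host, port = parse_host_port(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    url = "http://localhost:6333"  # default QDRANT_URL from the example .env
    print(url, "is reachable" if is_reachable(url) else "is not reachable")
```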

Optional: `configuration.py`

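You can customize how the tool behaves by editing the configuration.py file. It defines two groups of settings: LLM_CONFIG, which selects the model provider, model, and temperature, and THREAD_CONFIG, which holds per-run research settings such as max_queries and num_reflections.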

import uuid

LLM_CONFIG = {
    "provider": "openai",
    "model": "gpt-4o-mini", 
    "temperature": 0.5,
}

THREAD_CONFIG = {
    "configurable": {
        "thread_id": str(uuid.uuid4()),
        "max_queries": 3,
        "search_depth": 2,
        "num_reflections": 2,
        "n_points": 1,
    }
}
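
Because `str(uuid.uuid4())` runs once when configuration.py is imported, every run in the same Python process shares one thread_id. If separate runs should get separate IDs, one option is a small factory like the sketch below. `BASE_THREAD_CONFIG` and `new_thread_config` are hypothetical names, not part of the project, and whether the application needs a unique ID per run is an assumption.

```python
import copy
import uuid

# Same values as THREAD_CONFIG above, without the import-time thread_id.
BASE_THREAD_CONFIG = {
    "configurable": {
        "max_queries": 3,
        "search_depth": 2,
        "num_reflections": 2,
        "n_points": 1,
    }
}


def new_thread_config(**overrides) -> dict:
    """Return a fresh config with its own thread_id and optional overrides."""
    unknown = set(overrides) - set(BASE_THREAD_CONFIG["configurable"])
    if unknown:
        raise KeyError(f"Unknown settings: {sorted(unknown)}")

    # Deep copy so that changes never leak back into the shared base config.
    config = copy.deepcopy(BASE_THREAD_CONFIG)
    config["configurable"].update(overrides)
    config["configurable"]["thread_id"] = str(uuid.uuid4())
    return config


quick_run = new_thread_config(num_reflections=1)
default_run = new_thread_config()
print(quick_run["configurable"]["num_reflections"])
```

Rejecting unknown keys is a deliberate choice here: a misspelled setting raises an error instead of being silently ignored.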

Contributing

If something here could be improved, please open an issue or submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for more details.