DataSeek is a versatile, extensible framework for autonomous data collection and prospecting. It uses AI agents to discover, validate, and organize data from sources like academic papers, Wikipedia, and web searches, targeting configurable data quality characteristics.
- Automated data prospecting from multiple sources
- Configurable mission plans for targeted data collection
- Support for academic papers (arXiv), Wikipedia, and web searches
- Built-in validation and filtering mechanisms
- Command-line interface for easy execution
- Clone and set up the project:
    git clone <repository-url>
    cd dataseek
- Create and activate a virtual environment:
    uv venv
    source .venv/bin/activate
- Install dependencies:
    uv pip install -e .
- Review and customize configuration files:
  - Agent configuration: config/seek_config.yaml (see Configuration Guide)
  - Mission configuration: config/mission_config.yaml (see Mission Configuration Guide)
- Run the default mac_ai_corpus_v1 mission:
    dataseek --mission mac_ai_corpus_v1
  This collects 50 samples per characteristic into examples/datasets/mac_ai_corpus/samples/, with audit trails in examples/datasets/mac_ai_corpus/PEDIGREE.md.
  Note: this can also be run with the TUI for a more elegant experience:
    dataseek-tui --log tui.log
- Monitor progress through the terminal interface or check the output directories:
  - examples/datasets/mac_ai_corpus/samples/: Raw data samples
  - examples/datasets/mac_ai_corpus/PEDIGREE.md: Audit trail of the data collection process
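Checking the output directory can also be scripted. The following is a minimal sketch, assuming each sample is written as its own file somewhere under the samples directory; that layout is an assumption, and count_samples is a hypothetical helper, not part of DataSeek.

```python
from pathlib import Path


def count_samples(samples_dir: str) -> int:
    """Count regular files under samples_dir, recursively.

    Returns 0 if the directory does not exist yet (for example,
    before the mission has written its first sample).
    """
    root = Path(samples_dir)
    if not root.is_dir():
        return 0
    # Assumption: one file per sample; subdirectories are walked too.
    return sum(1 for path in root.rglob("*") if path.is_file())


if __name__ == "__main__":
    # Path taken from the default mission described above.
    print(count_samples("examples/datasets/mac_ai_corpus/samples"))
```

Run it from the repository root while a mission is in progress to see the file count grow.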
- DataSeek Overview: High-level project description and purpose.
- Mission Runner: Manages the execution and state of data collection missions.
- Search Graph: Defines the AI agent workflow for data prospecting and validation.
- Tool Manager: Handles tool registration, execution, and integration for agents.
- TUI (Terminal User Interface): Provides an interactive terminal interface for monitoring missions.
- Data Seek Agent Guide: Instructions for curating datasets using the agent.
- Configuration Guide: Details setup and customization of agent and mission configurations.
- Prompting Guide: Explains prompt assembly for different agent nodes.
- Tools Guide: Documentation on available tools and their usage.
- From Idea to Dataset: Step-by-step guide to configuring your first mission.
- Plugin System Tutorial: How to create and integrate custom plugins.
To set up the development environment:
    uv venv
    source .venv/bin/activate
    uv pip install -e .
After installation, you can run DataSeek in two ways:
    dataseek --mission mac_ai_corpus_v1
    dataseek-tui config/mission_config.yaml
You can also use Python module syntax:
    python -m seek.main --mission research_dataset
To use a custom agent configuration file:
    dataseek --mission research_dataset --config config/my_seek_config.yaml
DataSeek uses two types of configuration files:
The agent configuration file (config/seek_config.yaml) configures the overall behavior of the DataSeek agent, including model settings, search parameters, and output paths.
See Agent Configuration Guide for detailed documentation.
The mission configuration file (config/mission_config.yaml) defines specific missions with their goals and parameters, including target sizes, synthetic budgets, and topic lists.
See Mission Configuration Guide for detailed documentation.
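For scripted runs, the command-line invocations shown earlier can be assembled from Python. This is a minimal sketch that uses only the --mission and --config flags that appear in this README; build_dataseek_command and run_mission are hypothetical helpers, not part of DataSeek.

```python
import subprocess
from typing import List, Optional


def build_dataseek_command(mission: str, config: Optional[str] = None) -> List[str]:
    """Return the argument list for one dataseek run (hypothetical helper)."""
    cmd = ["dataseek", "--mission", mission]
    if config is not None:
        # --config points at a custom agent configuration file.
        cmd += ["--config", config]
    return cmd


def run_mission(mission: str, config: Optional[str] = None) -> int:
    """Run the CLI and return its exit code.

    Requires the dataseek entry point to be on PATH, for example
    inside the activated virtual environment.
    """
    result = subprocess.run(build_dataseek_command(mission, config), check=False)
    return result.returncode
```

Passing the arguments as a list avoids shell quoting issues when a configuration path contains spaces.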
To set up the development environment with test dependencies:
    uv venv
    source .venv/bin/activate
    uv pip install -e ".[dev]"
- Create venv + install dev deps
    uv venv && source .venv/bin/activate
    uv pip install -e ".[dev]"
- Optional: install pre-commit hooks
    pip install pre-commit
    pre-commit install
- Without activating the venv, you can prefix commands with uv run:
    uv run pytest -q
    uv run ruff check --fix .
    uv run black .
    uv run mypy seek
    uv run bandit -c pyproject.toml -r seek
- Format (Black)
    black .
- Lint + import sort + modernize (Ruff)
    ruff check --fix .
- Type check (MyPy)
    mypy seek plugins --exclude tests
- Security scan (Bandit)
    bandit -c pyproject.toml -r seek
- Tests with coverage
    pytest -q --cov=seek
- check_prompts.py: Verifies prompt templates in config/prompts.yaml against code references (missing, unused, placeholder mismatches).
- dup.awk: Identifies groups of adjacent duplicate lines in log files by extracting core messages.
- local_ci_runner.py: Generates shell scripts to execute specific CI jobs locally (e.g., uv run python scripts/local_ci_runner.py quality-checks).
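The kind of check that check_prompts.py describes (missing, unused, and placeholder-mismatched templates) can be illustrated with a short sketch. This is not the script itself: it assumes templates use str.format-style {name} placeholders, and the function names and the shape of the inputs are hypothetical.

```python
from string import Formatter
from typing import Dict, List, Set


def placeholders(template: str) -> Set[str]:
    """Return the named fields of a str.format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def check_prompts(
    templates: Dict[str, str], expected: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """Compare defined templates with what the code is assumed to reference.

    templates: template name -> template text (as loaded from a config file).
    expected:  template name -> placeholder names the code would supply.
    """
    report: Dict[str, List[str]] = {"missing": [], "unused": [], "mismatched": []}
    for name in sorted(expected):
        if name not in templates:
            report["missing"].append(name)  # referenced but not defined
        elif placeholders(templates[name]) != expected[name]:
            report["mismatched"].append(name)  # placeholder sets differ
    # Defined in the config but never referenced by the code.
    report["unused"] = sorted(set(templates) - set(expected))
    return report
```

A non-empty list under any key would indicate that the prompt file and the code have drifted apart.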
Run tests with:
    pytest
Test discovery looks in tests/ and component-local seek/components/**/tests/ as configured in pyproject.toml.
Core modules live under seek/:
- Common utilities: seek/common/ (config, models, utils)
- Components: seek/components/
  - Mission Runner: seek/components/mission_runner/
  - Search Graph: seek/components/search_graph/
  - Tool Manager: seek/components/tool_manager/
  - TUI: seek/components/tui/
Note on LiteLLM/Ollama
- Some Ollama models (e.g., gpt-oss-20b) may not work reliably with LiteLLM’s default transformations.
- We apply a small runtime shim in seek/components/patch.py and import it early in seek/__init__.py to stabilize tool-call handling.
- This patch is intentionally not part of Tool Manager because it affects LLM call plumbing across components.
