SciPhi is a simple framework for generating synthetic / fine-tuning data, and for robust evaluation of LLMs.

SciPhi [ΨΦ]: A Framework for Data Creation

SciPhi is a configurable Python framework designed to tackle the challenges of efficiently training large language models (LLMs) with synthetic data. At its core, SciPhi offers:

  • Configurable Data Generation: Efficiently produce LLM-mediated synthetic training and tuning datasets tailored to your specific needs.
  • The Library of Phi: An initiative that leverages AI-driven techniques to craft high-quality open source textbooks.

Getting Started & Support

  • Engage with our active Discord community for discussions, troubleshooting, and collaboration.

  • For specialized support or collaboration inquiries, feel free to reach out directly.

Library of Phi Generation

Introduction:
The Library of Phi is an initiative sponsored by SciPhi. Its primary goal is to democratize access to high-quality textbooks. The project utilizes AI-driven techniques to generate textbooks by processing information from the MIT OCW course webpages.

Workflow:
The workflow encompasses data scraping, data processing, YAML configuration creation, and RAG over all of Wikipedia, with intermittent work done by LLMs.

  1. Scrape MIT OCW Course Webpages.
  2. Extract Syllabi.
  3. Formulate Table of Contents.
  4. Craft Textbooks.

Generating the default Textbook:

poetry run python sciphi/examples/library_of_phi/generate_textbook.py run --do-wiki=False --textbook=Aerodynamics_of_Viscous_Fluids --log-level=DEBUG

See the example output here

Using a Custom Table of Contents:

  1. Draft a table of contents and save as textbook_name.yaml.
  2. Place it in [Your Working Directory]/sciphi/data/library_of_phi/table_of_contents.
  3. Format similarly to Aerodynamics_of_Viscous_Fluids.yaml.
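
The steps above can be sketched in Python as a small helper that works out where a custom table of contents should live and assembles the matching generation command. This is a minimal sketch, not part of SciPhi: the function names toc_path and textbook_command are hypothetical, and it assumes that --textbook takes the YAML file name without its extension, as the default Aerodynamics_of_Viscous_Fluids example suggests.

```python
from pathlib import Path

# Directory named in step 2, relative to your working directory.
TOC_SUBDIR = Path("sciphi/data/library_of_phi/table_of_contents")


def toc_path(working_dir, textbook_name):
    """Return the expected location of <textbook_name>.yaml (hypothetical helper)."""
    return Path(working_dir) / TOC_SUBDIR / f"{textbook_name}.yaml"


def textbook_command(textbook_name, do_wiki=False, log_level="DEBUG"):
    """Build the generate_textbook.py command shown earlier as an argument list.

    Assumption: --textbook matches the YAML file name without ".yaml".
    """
    return [
        "poetry", "run", "python",
        "sciphi/examples/library_of_phi/generate_textbook.py", "run",
        f"--do-wiki={do_wiki}",
        f"--textbook={textbook_name}",
        f"--log-level={log_level}",
    ]


print(toc_path(".", "textbook_name"))
print(" ".join(textbook_command("textbook_name")))
```

The argument list can be passed to subprocess.run(..., check=True) once the YAML file is in place.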

Incorporating RAG over Wikipedia:

  1. Set the --do-wiki flag to True (--do-wiki=True).
  2. In .env, set:
    • WIKI_SERVER_URL
    • WIKI_SERVER_USERNAME
    • WIKI_SERVER_PASSWORD
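
Before running with --do-wiki=True, it may help to confirm that the three settings are actually present. The check below is a minimal sketch: missing_wiki_settings is a hypothetical helper, and it assumes the values from .env have already been loaded into the process environment (how SciPhi itself loads .env is not covered here).

```python
import os

# The three variables listed above.
REQUIRED_WIKI_VARS = (
    "WIKI_SERVER_URL",
    "WIKI_SERVER_USERNAME",
    "WIKI_SERVER_PASSWORD",
)


def missing_wiki_settings(env=None):
    """Return the names of required variables that are unset or empty.

    `env` is any mapping; it defaults to the current process environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_WIKI_VARS if not env.get(name)]


missing = missing_wiki_settings()
if missing:
    print("Missing Wikipedia RAG settings:", ", ".join(missing))
else:
    print("All Wikipedia RAG settings are present.")
```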

Output:
Generated textbooks reside in:
[Your Working Directory]/sciphi/data/library_of_phi

Note: The Wikipedia embeddings server is not yet public. In the meantime, ensure your configuration aligns with our specifications if you wish to use Wikipedia for RAG. If you would like to peruse more example textbooks, go here.

Installation

# Clone the repository
git clone https://github.com/emrgnt-cmplxty/sciphi.git
cd sciphi

# Install dependencies
# If you don't have poetry installed: pip3 install poetry
poetry install -E all

# Set up your environment
# Note: Modify the .env file as needed after copying
cp .env.example .env && vim .env

Requirements

  • Python: >= 3.11 and < 3.12
  • Poetry: For package management
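
The Python constraint above can be verified before installing. The snippet is a minimal sketch; is_supported_python is a hypothetical helper name and is not provided by SciPhi.

```python
import sys


def is_supported_python(version_info=None):
    """Return True when the interpreter is >= 3.11 and < 3.12."""
    v = sys.version_info if version_info is None else version_info
    # Only the (major, minor) pair matters for this range.
    return (3, 11) <= tuple(v[:2]) < (3, 12)


if not is_supported_python():
    print("Unsupported Python:", sys.version.split()[0], "(need >= 3.11 and < 3.12)")
```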

Optional Features

Install optional dependencies for enhanced features:

poetry install -E <extra_name>

Options include:

  • anthropic_support: For Anthropic models.
  • hf_support: For diverse model access with the HuggingFace package.
  • openai_support: For OpenAI models.
  • vllm_support: For vLLM, which provides fast inference.
  • llama_index_support: For LlamaIndex, enhancing grounded synthesis.
  • chroma_support: For Chroma support in large vector databases.
  • all: Includes all dependencies (excluding vllm, which needs separate installation).
  • all_with_cuda: Includes all dependencies, vllm included.

Customizable Data Generation

For fully configurable and flexible data generation, execute the relevant runner.py with various command-line arguments.

poetry run python sciphi/examples/basic_data_gen/runner.py --provider_name=openai --model_name=gpt-4 --log_level=INFO --batch_size=1 --num_samples=1 --output_file_name=example_output.jsonl --example_config=textbooks_are_all_you_need_basic_split

The above command will generate a single sample from GPT-4. This sample is generated using the textbooks_are_all_you_need_basic_split configuration, and the output is appended to example_output.jsonl.
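
Because the output is a JSON Lines file, it can be inspected with the standard library. The following is a minimal sketch: read_jsonl is a hypothetical helper, and since the fields of each record are not described here, it only parses each line as JSON without assuming any keys.

```python
import json


def read_jsonl(path):
    """Parse a JSON Lines file into a list, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Usage, once example_output.jsonl exists:
#   samples = read_jsonl("example_output.jsonl")
#   print(len(samples), "samples loaded")
```

Since the runner appends to the file, repeated runs accumulate samples in the same output.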

The long-term view of the SciPhi framework is to provide a training-feedback loop as shown below:

[Figure: SciPhi training-feedback loop]

Command-Line Arguments

See arguments and their default values in the README. Notable ones include --provider_name, --model_name, and --temperature.

Replicating Full Table of Contents Generation

Step 0: Scrape MIT OCW for course details.

poetry run python sciphi/examples/library_of_phi/raw_data/ocw_scraper.py scrape

Step 1: Convert scraped data into 'draft' syllabi YAMLs.

poetry run python sciphi/examples/library_of_phi/gen_step_1_draft_syllabi.py run

Step 2: Refine the draft YAML into the finalized syllabi.

poetry run python sciphi/examples/library_of_phi/gen_step_2_clean_syllabi.py run

Step 3: Transition the syllabi to a 'draft' table of contents.

poetry run python sciphi/examples/library_of_phi/gen_step_3_draft_table_of_contents.py run

Step 4: Produce clean table of contents YAML files.

poetry run python sciphi/examples/library_of_phi/gen_step_4_clean_table_of_contents.py run
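
Steps 0 through 4 can also be chained from a single script. This is a minimal sketch rather than a tool shipped with SciPhi: step_commands and run_all are hypothetical names, the commands are copied from the steps above, and it assumes the script is run from the repository root with Poetry available. A failing step stops the chain because subprocess.run is called with check=True.

```python
import subprocess

# (script, sub-command) pairs in the order given above.
STEPS = [
    ("sciphi/examples/library_of_phi/raw_data/ocw_scraper.py", "scrape"),
    ("sciphi/examples/library_of_phi/gen_step_1_draft_syllabi.py", "run"),
    ("sciphi/examples/library_of_phi/gen_step_2_clean_syllabi.py", "run"),
    ("sciphi/examples/library_of_phi/gen_step_3_draft_table_of_contents.py", "run"),
    ("sciphi/examples/library_of_phi/gen_step_4_clean_table_of_contents.py", "run"),
]


def step_commands():
    """Return one argument list per step."""
    return [["poetry", "run", "python", script, action] for script, action in STEPS]


def run_all(runner=subprocess.run):
    """Run every step in order; raises CalledProcessError on the first failure."""
    for cmd in step_commands():
        print("Running:", " ".join(cmd))
        runner(cmd, check=True)


# Call run_all() from the repository root to execute the full chain.
```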

License

Licensed under the Apache-2.0 License.

Citations

  1. Textbooks Are All You Need
  2. Textbooks Are All You Need II: Phi-1.5 Technical Report

Citation

If using SciPhi in academic work, please cite:

@software{Emergent_AGI_SciPhi,
author = {Colegrove, Owen},
doi = {Pending},
month = {09},
title = {{SciPhi}},
url = {https://github.com/emrgnt-cmplxty/sciphi},
year = {2023}
}