AI Dataset Generation and Evaluation Project

This project consists of three main components:

  1. Ollama Dataset Generator
  2. Groq Quality Evaluator
  3. Hugging Face Dataset Uploader

Setup

Prerequisites

  • Python 3.8 or higher
  • pip (Python package installer)
  • venv (Python's built-in virtual environment module)

Installation

  1. Clone this repository:

    git clone https://github.com/dustinwloring1988/data-generator.git
    cd data-generator
    
  2. Create a virtual environment:

    python -m venv venv
    
  3. Activate the virtual environment:

    • On Windows:
      venv\Scripts\activate
      
    • On macOS and Linux:
      source venv/bin/activate
      
  4. Install the required packages:

    pip install -r requirements.txt
    
  5. Create a .env file in the project root directory and add your configuration variables (see the Configuration section below).

Configuration

Create a .env file in the project root with the following variables:

# Hugging Face Upload Script
HUGGINGFACE_TOKEN=your_huggingface_token_here
LOCAL_FILE_PATH=path/to/your/local/file.jsonl
REPO_NAME=your-repo-name

# Groq Quality Evaluator
INPUT_DATASET_FILENAME=dataset.jsonl
OUTPUT_DATASET_FILENAME=evaluated_dataset.jsonl
MAX_RETRIES=3
TIMEOUT=20
NUM_THREADS=2
DELAY_BETWEEN_EVALUATIONS=25
GROQ_TOKENS=token1,token2,token3,token4,token5,token6,token7,token8

# Ollama Dataset Generator
OLLAMA_API_URL=http://localhost:11434/api/generate
NUM_SAMPLES=10000
MAX_RETRIES=5
TIMEOUT=30
NUM_THREADS=6
DATASET_FILENAME=dataset.jsonl

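Replace the placeholder values with your actual configuration. MAX_RETRIES, TIMEOUT, and NUM_THREADS are listed under both the Groq and Ollama sections; a single .env file normally yields one value per key, so check which value each script actually receives if the two need different settings.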
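To see how repeated keys in a .env file resolve, here is a minimal sketch that parses a shortened copy of the listing above using only the standard library. The helper names (parse_env_file, typed_settings, load_sample) are hypothetical, and the project's scripts may load their settings differently, for example through a third-party loader.

```python
import os
import tempfile


def parse_env_file(path):
    """Parse KEY=VALUE lines, skipping blank lines and # comments.

    Values are stored in a dict, so a later assignment to the same key
    overwrites the earlier one.
    """
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values


def typed_settings(values):
    """Convert raw strings into integers and a list of tokens."""
    tokens = [t.strip() for t in values.get("GROQ_TOKENS", "").split(",")]
    return {
        "max_retries": int(values.get("MAX_RETRIES", "3")),
        "timeout": int(values.get("TIMEOUT", "20")),
        "num_threads": int(values.get("NUM_THREADS", "2")),
        "groq_tokens": [t for t in tokens if t],
    }


# Shortened copy of the configuration listing, including the repeated keys.
SAMPLE = """\
# Groq Quality Evaluator
MAX_RETRIES=3
TIMEOUT=20
NUM_THREADS=2
GROQ_TOKENS=token1,token2,token3

# Ollama Dataset Generator
MAX_RETRIES=5
TIMEOUT=30
NUM_THREADS=6
"""


def load_sample():
    """Write SAMPLE to a temporary file, parse it, and clean up."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".env", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(SAMPLE)
    try:
        return typed_settings(parse_env_file(tmp.name))
    finally:
        os.remove(tmp.name)


if __name__ == "__main__":
    print(load_sample())
```

With this parser the Ollama values replace the Groq values for the three shared keys, which is why it is worth confirming how the real scripts read them.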

Usage

  1. Generate dataset using Ollama:

    python ollama-dataset-generator.py
    
  2. Evaluate dataset quality using Groq:

    python groq-quality-evaluator.py
    
  3. Upload dataset to Hugging Face:

    python huggingface-upload-script.py
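The three steps above can also be chained from one small driver, sketched below. It assumes the scripts sit in the current directory and signal failure with a non-zero exit code; build_commands and run_pipeline are hypothetical names, not part of this repository.

```python
import subprocess
import sys

# Script names taken from the usage steps, in the order they should run.
STEPS = [
    "ollama-dataset-generator.py",
    "groq-quality-evaluator.py",
    "huggingface-upload-script.py",
]


def build_commands(python=sys.executable):
    """Return one command per step, using the active interpreter."""
    return [[python, script] for script in STEPS]


def run_pipeline(run=subprocess.run):
    """Run each step in order and stop at the first failure."""
    for command in build_commands():
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run on incomplete output.
        run(command, check=True)


if __name__ == "__main__":
    run_pipeline()
```

Using sys.executable keeps every step inside the activated virtual environment instead of whichever python happens to be first on PATH.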
    

Scripts Description

ollama-dataset-generator.py

This script generates a dataset using the Ollama API. It builds instruction-response pairs from a set of templates and topics and writes them to the file named by DATASET_FILENAME (dataset.jsonl in the example configuration).
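As a rough illustration of this step, the sketch below fills a template with a topic, posts the prompt to Ollama's /api/generate endpoint, and formats the result as one JSONL line. The templates, topics, model name ("llama3"), output field names, and all function names are assumptions made for the example; the actual script may differ on each point.

```python
import json
import random
import urllib.request

# Hypothetical templates and topics; the script's real lists are not shown here.
TEMPLATES = [
    "Explain {topic} to a beginner.",
    "List three common mistakes when using {topic}.",
]
TOPICS = ["Python decorators", "SQL joins"]


def build_instruction(rng):
    """Pick a random template and topic and combine them."""
    return rng.choice(TEMPLATES).format(topic=rng.choice(TOPICS))


def build_payload(instruction, model="llama3"):
    # "stream": False asks Ollama for a single JSON object
    # instead of a stream of partial chunks.
    return {"model": model, "prompt": instruction, "stream": False}


def post_json(url, payload, timeout=30):
    """POST a JSON body and decode the JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))


def generate_sample(instruction, url, send=post_json, max_retries=5):
    """Request one completion, retrying on network or parsing errors."""
    last_error = None
    for _ in range(max_retries):
        try:
            body = send(url, build_payload(instruction))
            return {
                "instruction": instruction,
                "response": body["response"].strip(),
            }
        except (OSError, KeyError, ValueError) as error:
            last_error = error
    raise RuntimeError(f"gave up after {max_retries} attempts") from last_error


def to_jsonl_line(sample):
    """Serialize one sample as a single line of JSON."""
    return json.dumps(sample, ensure_ascii=False) + "\n"
```

A caller would loop NUM_SAMPLES times, passing OLLAMA_API_URL as url and appending each to_jsonl_line result to the output file. The send parameter exists so the retry logic can be exercised without a running Ollama server.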

groq-quality-evaluator.py

This script evaluates the quality of the generated dataset using the Groq API. It reads samples from INPUT_DATASET_FILENAME, assigns each one a quality score with an explanation, and writes the results to OUTPUT_DATASET_FILENAME.
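The configuration lists several Groq tokens, a retry limit, and a delay between evaluations. One plausible way these fit together is sketched below: tokens are used round-robin, each sample is retried a limited number of times, and the loop pauses between samples. The round-robin policy, the 1 to 10 score range, the JSON reply shape, and the ask callable (which would send a sample to Groq with a given key and return the reply text) are all assumptions for illustration.

```python
import itertools
import json
import time


def token_cycle(raw_tokens):
    """Turn a comma-separated GROQ_TOKENS string into an endless rotation."""
    tokens = [t.strip() for t in raw_tokens.split(",") if t.strip()]
    if not tokens:
        raise ValueError("GROQ_TOKENS is empty")
    return itertools.cycle(tokens)


def parse_verdict(reply_text):
    """Parse a reply assumed to look like {"score": 7, "explanation": "..."}."""
    verdict = json.loads(reply_text)
    score = int(verdict["score"])
    if not 1 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    return {"quality_score": score, "explanation": str(verdict["explanation"])}


def evaluate_all(samples, tokens, ask, max_retries=3, delay=25.0,
                 sleep=time.sleep):
    """Score every sample, switching to the next token on each attempt."""
    results = []
    for index, sample in enumerate(samples):
        if index:
            sleep(delay)  # pause between samples, not before the first one
        for attempt in range(max_retries):
            try:
                verdict = parse_verdict(ask(sample, next(tokens)))
                results.append({**sample, **verdict})
                break
            except (OSError, KeyError, ValueError):
                if attempt == max_retries - 1:
                    # Keep the sample but mark it as unscored.
                    results.append({
                        **sample,
                        "quality_score": None,
                        "explanation": "evaluation failed",
                    })
    return results
```

Rotating tokens on every attempt means a retry after a rate-limit error goes out under a different key, which is one reason a project might configure several of them.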

huggingface-upload-script.py

This script uploads the generated and evaluated dataset, the file at LOCAL_FILE_PATH, to the REPO_NAME repository on the Hugging Face Hub.
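An upload of this kind can be sketched with the huggingface_hub library's HfApi.create_repo and HfApi.upload_file calls. The example assumes the target is a dataset repository and that the file keeps its local name on the Hub; build_upload_plan and upload are hypothetical helpers, and the real script may make different choices.

```python
import os


def build_upload_plan(env):
    """Collect the upload settings from a mapping such as os.environ."""
    local_path = env["LOCAL_FILE_PATH"]
    return {
        "token": env["HUGGINGFACE_TOKEN"],
        "repo_name": env["REPO_NAME"],
        "local_path": local_path,
        # Assumption: store the file under its own name at the repo root.
        "path_in_repo": os.path.basename(local_path),
    }


def upload(plan):
    """Create the dataset repo if needed, then upload the file."""
    # Imported here so the planning helper works without the
    # third-party dependency installed.
    from huggingface_hub import HfApi

    api = HfApi(token=plan["token"])
    # exist_ok=True makes repeated runs safe when the repo already exists.
    repo = api.create_repo(
        repo_id=plan["repo_name"], repo_type="dataset", exist_ok=True
    )
    api.upload_file(
        path_or_fileobj=plan["local_path"],
        path_in_repo=plan["path_in_repo"],
        repo_id=repo.repo_id,  # full "namespace/name" id returned by the Hub
        repo_type="dataset",
    )
    return repo.repo_id


if __name__ == "__main__":
    print(upload(build_upload_plan(os.environ)))
```

Taking repo_id from the create_repo result avoids guessing the account namespace when REPO_NAME contains only the short repository name.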

Contributing

If you'd like to contribute to this project, please fork the repository and create a pull request with your changes.

License

MIT

Contact

Dustin Loring - DLoring1988@gmail.com

Project Link: https://github.com/dustinwloring1988/data-generator