This project consists of three main components:
- Ollama Dataset Generator
- Groq Quality Evaluator
- Hugging Face Dataset Uploader
Prerequisites:
- Python 3.8 or higher
- pip (Python package installer)
- virtualenv, or the built-in venv module used in the steps below
- Clone this repository:
git clone https://github.com/dustinwloring1988/data-generator.git
cd data-generator
- Create a virtual environment:
python -m venv venv
- Activate the virtual environment:
- On Windows:
venv\Scripts\activate
- On macOS and Linux:
source venv/bin/activate
- Install the required packages:
pip install -r requirements.txt
- Create a .env file in the project root directory and add your configuration variables (described below).
Create a .env file in the project root with the following variables:
# Hugging Face Upload Script
HUGGINGFACE_TOKEN=your_huggingface_token_here
LOCAL_FILE_PATH=path/to/your/local/file.jsonl
REPO_NAME=your-repo-name
# Groq Quality Evaluator
INPUT_DATASET_FILENAME=dataset.jsonl
OUTPUT_DATASET_FILENAME=evaluated_dataset.jsonl
MAX_RETRIES=3
TIMEOUT=20
NUM_THREADS=2
DELAY_BETWEEN_EVALUATIONS=25
GROQ_TOKENS=token1,token2,token3,token4,token5,token6,token7,token8
# Ollama Dataset Generator
OLLAMA_API_URL=http://localhost:11434/api/generate
NUM_SAMPLES=10000
MAX_RETRIES=5
TIMEOUT=30
NUM_THREADS=6
DATASET_FILENAME=dataset.jsonl
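Replace the placeholder values with your actual configuration. MAX_RETRIES, TIMEOUT, and NUM_THREADS appear under both the Groq and Ollama groups; a .env loader typically keeps only one value per key, so check that the values in effect suit the script you are about to run.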
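To make the shape of these settings concrete, here is a minimal sketch of reading them with only the standard library. It is not the project's actual loading code, which is not shown here; `parse_env` and `split_tokens` are hypothetical helpers. The sketch assumes that a repeated key is resolved by letting the last occurrence win and that GROQ_TOKENS is a comma-separated list.

```python
from typing import Dict, List


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments.

    Assumption: when a key is repeated, the last occurrence wins.
    """
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def split_tokens(raw: str) -> List[str]:
    """Split a comma-separated token list, dropping empty entries."""
    return [token.strip() for token in raw.split(",") if token.strip()]


# A shortened copy of the variables listed above.
SAMPLE_ENV = """\
# Groq Quality Evaluator
MAX_RETRIES=3
TIMEOUT=20
NUM_THREADS=2
GROQ_TOKENS=token1,token2,token3
# Ollama Dataset Generator
MAX_RETRIES=5
TIMEOUT=30
NUM_THREADS=6
"""

config = parse_env(SAMPLE_ENV)
max_retries = int(config["MAX_RETRIES"])   # numeric settings arrive as strings
groq_tokens = split_tokens(config["GROQ_TOKENS"])
print(max_retries, groq_tokens)
```

In this sketch the Ollama value of MAX_RETRIES replaces the Groq one, which illustrates why the repeated keys deserve a second look before running either script.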
- Generate dataset using Ollama:
python ollama-dataset-generator.py
- Evaluate dataset quality using Groq:
python groq-quality-evaluator.py
- Upload dataset to Hugging Face:
python huggingface-upload-script.py
ollama-dataset-generator.py generates a dataset using the Ollama API. It creates instruction-response pairs by combining a set of prompt templates with a set of topics.
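The template-and-topic approach can be sketched roughly as follows. This is an illustration, not the script's actual code: the templates, topics, model name `llama3`, and the JSONL field names `instruction` and `response` are all assumptions. The request body follows the shape accepted by Ollama's `/api/generate` endpoint with streaming disabled.

```python
import itertools
import json
import urllib.request
from typing import Dict, Iterable, Iterator

# Hypothetical templates and topics; the real script's lists are not shown here.
TEMPLATES = [
    "Explain {topic} to a beginner.",
    "Write a short quiz question about {topic}.",
]
TOPICS = ["recursion", "HTTP caching"]


def iter_instructions(templates: Iterable[str], topics: Iterable[str]) -> Iterator[str]:
    """Yield one instruction for every (template, topic) combination."""
    for template, topic in itertools.product(templates, topics):
        yield template.format(topic=topic)


def build_payload(instruction: str, model: str = "llama3") -> Dict[str, object]:
    """Build a non-streaming request body for /api/generate (model name assumed)."""
    return {"model": model, "prompt": instruction, "stream": False}


def generate(instruction: str,
             api_url: str = "http://localhost:11434/api/generate",
             timeout: float = 30.0) -> Dict[str, str]:
    """Send one instruction to a running Ollama server and return the pair."""
    body = json.dumps(build_payload(instruction)).encode("utf-8")
    request = urllib.request.Request(
        api_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        answer = json.loads(reply.read().decode("utf-8"))["response"]
    return {"instruction": instruction, "response": answer}


def to_jsonl_line(pair: Dict[str, str]) -> str:
    """Serialize one pair as a single JSONL line (field names assumed)."""
    return json.dumps(pair, ensure_ascii=False)


instructions = list(iter_instructions(TEMPLATES, TOPICS))
print(len(instructions), instructions[0])
```

Calling `generate(instructions[0])` requires an Ollama server listening at OLLAMA_API_URL; the other helpers run without one.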
groq-quality-evaluator.py evaluates the quality of the generated dataset using the Groq API. It assigns a quality score to each sample and records an explanation for that score.
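One plausible way to combine GROQ_TOKENS, MAX_RETRIES, and DELAY_BETWEEN_EVALUATIONS is to rotate through the tokens and retry a failed evaluation with the next one. The sketch below assumes that design; it is not taken from the script. `TokenRotator`, `evaluate_with_retries`, and the output fields `quality_score` and `explanation` are hypothetical names, and the Groq call itself is passed in as a function so the example runs without network access.

```python
import itertools
import threading
import time
from typing import Callable, Dict, Iterable, Iterator, Optional


class TokenRotator:
    """Hand out API tokens round-robin; safe to share across threads."""

    def __init__(self, tokens: Iterable[str]) -> None:
        token_list = list(tokens)
        if not token_list:
            raise ValueError("at least one token is required")
        self._cycle: Iterator[str] = itertools.cycle(token_list)
        self._lock = threading.Lock()

    def next_token(self) -> str:
        with self._lock:
            return next(self._cycle)


def evaluate_with_retries(sample: Dict[str, str],
                          call_api: Callable[[Dict[str, str], str], Dict[str, object]],
                          rotator: TokenRotator,
                          max_retries: int = 3,
                          delay: float = 0.0) -> Dict[str, object]:
    """Try up to max_retries times, using a fresh token on each attempt."""
    last_error: Optional[Exception] = None
    for _ in range(max_retries):
        token = rotator.next_token()
        try:
            result = call_api(sample, token)
            # Attach the score and explanation to a copy of the sample.
            return {**sample,
                    "quality_score": result["score"],
                    "explanation": result["explanation"]}
        except Exception as error:  # a real script would catch narrower errors
            last_error = error
            time.sleep(delay)
    raise RuntimeError(f"evaluation failed after {max_retries} attempts") from last_error


def fake_api(sample: Dict[str, str], token: str) -> Dict[str, object]:
    """Stand-in for the Groq request: the first token simulates a timeout."""
    if token == "token1":
        raise TimeoutError("simulated timeout")
    return {"score": 8, "explanation": f"scored with {token}"}


rotator = TokenRotator(["token1", "token2"])
evaluated = evaluate_with_retries({"instruction": "a", "response": "b"}, fake_api, rotator)
print(evaluated["quality_score"], evaluated["explanation"])
```

Injecting `call_api` keeps the retry logic testable; a real implementation would replace `fake_api` with a function that sends the sample to Groq and parses the score from the reply.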
huggingface-upload-script.py uploads the generated and evaluated dataset to the Hugging Face Hub.
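An upload of this kind might look like the sketch below, which uses `HfApi.create_repo` and `HfApi.upload_file` from the `huggingface_hub` package. Whether the script uses these exact calls is an assumption, as are the helper names `build_upload_args` and `upload_dataset` and the `your-username/your-repo-name` repository ID format.

```python
import os
from typing import Dict


def build_upload_args(local_path: str, repo_id: str) -> Dict[str, str]:
    """Collect keyword arguments for HfApi.upload_file.

    The file keeps its own name inside the dataset repository.
    """
    return {
        "path_or_fileobj": local_path,
        "path_in_repo": os.path.basename(local_path),
        "repo_id": repo_id,
        "repo_type": "dataset",
    }


def upload_dataset(local_path: str, repo_id: str, token: str) -> None:
    """Create the dataset repo if needed, then upload one file to it."""
    # Third-party dependency, imported lazily so the helper above works without it.
    from huggingface_hub import HfApi

    api = HfApi(token=token)
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    api.upload_file(**build_upload_args(local_path, repo_id))


# Placeholder values standing in for LOCAL_FILE_PATH and REPO_NAME.
args = build_upload_args("data/evaluated_dataset.jsonl", "your-username/your-repo-name")
print(args["path_in_repo"])
```

`upload_dataset` is defined but not called here, since it needs a valid HUGGINGFACE_TOKEN and network access.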
If you'd like to contribute to this project, please fork the repository and create a pull request with your changes.
License: MIT
Contact: Dustin Loring - DLoring1988@gmail.com
Project Link: https://github.com/dustinwloring1988/data-generator