/dataset-generator

A new way to generate large quantities of high-quality synthetic data (on par with GPT-4), with better controllability, at a fraction of the cost of prompting LLMs directly.

Primary language: Jupyter Notebook. License: MIT.

Navigating the Geometry of Language

A New Approach to Synthetic Text Generation

This demo is a practical example of the geometric approach to latent space sampling described in the paper Navigating the Geometry of Language: A New Approach to Synthetic Text Generation. Given some reference text, it generates new synthetic data using OpenAI’s ada-002 embedding model. You can browse the live demo here.

Quickstart

This demo requires a CUDA-enabled build of PyTorch.

pip install -r requirements.txt
OPENAI_API_KEY=<KEY> OPENAI_ORGANIZATION=<ORG> streamlit run streamlit_app.py
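Before launching the app, it can help to confirm that the prerequisites above are in place. The following is a minimal preflight sketch, not part of this repository: the helper names `missing_env_vars` and `cuda_status` are hypothetical, and the only things taken from the Quickstart are the two environment variable names and the CUDA requirement.

```python
import importlib.util
import os

# Environment variables named in the Quickstart command above.
REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "OPENAI_ORGANIZATION")


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


def cuda_status():
    """Report whether PyTorch is installed and can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch  # imported lazily so the check works without PyTorch

    return "available" if torch.cuda.is_available() else "unavailable"


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("OpenAI environment variables are set.")
    print("CUDA:", cuda_status())
```

Saved under a name of your choosing (for example `preflight.py`, a hypothetical file) and run with the same environment variables as the `streamlit` command, it reports which prerequisites still need attention.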