RawToPrompt converts raw, unstructured text into structured prompt-completion pairs suitable for fine-tuning Large Language Models (LLMs). It ingests raw .txt files, cleans and preprocesses their content, and generates contextual prompts and completions from that content using NLP techniques.
The objective is to create a data pipeline with the following stages:
- Ingestion: Accept raw .txt files.
- Processing: Clean and preprocess the text data.
- Interactive Q/A: Optionally ask the user for specific prompts or topics.
- Prompt Completion Generation: Generate AI-driven contextual prompts and completions.
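The stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`ingest`, `clean`, `make_pairs`, `run_pipeline`) are hypothetical and are not taken from the RawToPrompt code, and the pairing rule (each sentence prompts the one that follows it) is a simple stand-in for the AI-driven generation step.

```python
import re
from pathlib import Path


def ingest(directory):
    """Ingestion: read every .txt file in `directory` into a {filename: text} dict."""
    files = sorted(Path(directory).glob("*.txt"))
    return {p.name: p.read_text(encoding="utf-8") for p in files}


def clean(text):
    """Processing: collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def make_pairs(text, topic=None):
    """Generation (placeholder heuristic, not a model).

    Each sentence becomes the prompt for the sentence that follows it.
    If `topic` is given (the optional interactive step), only pairs that
    mention the topic are kept.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    pairs = []
    for prev, nxt in zip(sentences, sentences[1:]):
        if topic and topic.lower() not in f"{prev} {nxt}".lower():
            continue
        pairs.append({"prompt": prev, "completion": nxt})
    return pairs


def run_pipeline(directory, topic=None):
    """Run ingestion, processing, and generation over a directory of .txt files."""
    pairs = []
    for raw in ingest(directory).values():
        pairs.extend(make_pairs(clean(raw), topic))
    return pairs
```

Calling `run_pipeline("data")` would return a list of `{"prompt": ..., "completion": ...}` dictionaries; the `data` directory name is an example, not a path the project requires.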
- Clone the repository:
    git clone <repository-link>
    cd RawToPrompt
- Install the required libraries:
    pip install -r requirements.txt
- Place your .txt files in the project root or any directory of your choice.
- Run main.py:
    python main.py
- Follow the on-screen instructions. You can provide your own prompts or allow the system to generate them contextually.
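The choice between a user-supplied prompt and a generated one could look roughly like the sketch below. The function name `ask_for_prompt`, its parameters, and the wording of the question are assumptions made for illustration; the actual on-screen flow of main.py may differ.

```python
def ask_for_prompt(generate, read=input):
    """Return the prompt typed by the user, or a generated one if left blank.

    `generate` is any zero-argument callable that produces a prompt.
    `read` defaults to the built-in input() and can be swapped out in tests.
    """
    typed = read("Enter a prompt (leave blank to generate one): ").strip()
    return typed if typed else generate()
```

Passing the reader as a parameter keeps the function easy to test without a terminal, for example `ask_for_prompt(lambda: "auto", read=lambda _: "")`.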
├── data_ingestion
│   └── Module responsible for ingesting raw .txt files
├── data_processing
│   └── Module for preprocessing and cleaning the text
├── interactive_qa
│   └── Handles user interactions for prompt specifications
└── prompt_completion
    └── Generates prompt-completion pairs using AI-driven techniques
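The README does not state how the generated pairs are stored. One common choice for fine-tuning data is JSON Lines, with one JSON object per line. The sketch below assumes that format and uses hypothetical helper names (`write_jsonl`, `read_jsonl`); it is not the project's own output code.

```python
import json


def write_jsonl(pairs, path):
    """Write prompt-completion pairs to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for pair in pairs:
            record = {"prompt": pair["prompt"], "completion": pair["completion"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the pairs back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Writing one record per line keeps the file appendable and lets large datasets be streamed without loading everything into memory.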
If you'd like to contribute to RawToPrompt, please fork the repository and use a feature branch. Pull requests are warmly welcome!
This project is licensed under the MIT License - see the LICENSE.md file for details.