JSONLFineTunePrep

JSONLFineTunePrep is a tool that transforms raw data into JSONL format for fine-tuning machine learning models. It handles the conversion of text, images, or structured data, and is aimed at data scientists and AI developers.

Primary Language: Python


RawToPrompt 📜➡️❓

RawToPrompt converts raw, unstructured text into structured prompt-completion pairs suitable for fine-tuning Large Language Models (LLMs). It ingests raw .txt files, preprocesses the content, and uses NLP techniques to generate contextual prompts and completions from it.
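
Prompt-completion pairs for fine-tuning are commonly stored as JSONL, with one JSON object per line. The sketch below shows one way such pairs could be written and read back. The field names `prompt` and `completion` and the helper names `to_jsonl` and `from_jsonl` are assumptions for illustration; this README does not specify the project's actual output schema.

```python
import json


def to_jsonl(pairs):
    """Serialize (prompt, completion) tuples to JSONL text, one object per line."""
    lines = []
    for prompt, completion in pairs:
        # Assumed schema: {"prompt": ..., "completion": ...}
        record = {"prompt": prompt, "completion": completion}
        # json.dumps escapes embedded newlines, so each record stays on one line.
        lines.append(json.dumps(record, ensure_ascii=False))
    return ("\n".join(lines) + "\n") if lines else ""


def from_jsonl(text):
    """Parse JSONL text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


if __name__ == "__main__":
    sample = [("What is JSONL?", "A format with one JSON object per line.")]
    print(to_jsonl(sample), end="")
```

Splitting on `"\n"` rather than using `str.splitlines()` avoids breaking records on other Unicode line separators that may appear unescaped inside a string when `ensure_ascii=False` is used.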

Project Objective 🎯

To create a data pipeline with four stages:

  1. Ingestion: Accept raw .txt files.
  2. Processing: Clean and preprocess the text data.
  3. Interactive Q/A: Optionally ask the user for specific prompts or topics.
  4. Prompt Completion Generation: Generate AI-driven contextual prompts and completions.
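
The stages above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function names (`ingest`, `clean`, `make_pairs`, `run_pipeline`) are hypothetical, and the pair-generation step is a simple template that stands in for the AI-driven generation the project describes.

```python
import re
from pathlib import Path


def ingest(directory):
    """Stage 1: read every .txt file in `directory` into a {filename: text} dict."""
    files = sorted(Path(directory).glob("*.txt"))
    return {p.name: p.read_text(encoding="utf-8") for p in files}


def clean(text):
    """Stage 2: normalize line endings and collapse redundant whitespace."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t]+", " ", text)      # runs of spaces/tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)     # drop spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)   # at most one blank line in a row
    return text.strip()


def make_pairs(text, topic=None):
    """Stages 3-4 (placeholder): one pair per paragraph.

    `topic` plays the role of the optional user-supplied prompt topic. The
    template below is a stand-in for model-driven generation (an assumption).
    """
    pairs = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.replace("\n", " ").strip()
        if not paragraph:
            continue
        subject = topic or " ".join(paragraph.split()[:5])
        pairs.append({
            "prompt": f"Write a passage about: {subject}",
            "completion": paragraph,
        })
    return pairs


def run_pipeline(directory, topic=None):
    """Run all stages over a directory and return the combined list of pairs."""
    pairs = []
    for _name, raw in ingest(directory).items():
        pairs.extend(make_pairs(clean(raw), topic))
    return pairs
```

Keeping each stage as a separate function mirrors the module layout of the project and lets any one stage be swapped out, for example replacing the template in `make_pairs` with a call to a language model.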

Setup and Installation ⚙️

  1. Clone the repository:

    git clone <repository-link>
    cd RawToPrompt
  2. Install the required libraries:

    pip install -r requirements.txt

Usage 🚀

  1. Place your .txt files in the project root or any directory of your choice.

  2. Run main.py:

    python main.py
  3. Follow the on-screen instructions. You can provide your own prompts or allow the system to generate them contextually.
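
The choice between a user-supplied prompt and a generated one could be handled along these lines. This is only a sketch under assumptions: `resolve_prompt` and `ask_for_prompt` are hypothetical names, the console message is invented, and the fallback template stands in for the project's contextual generation.

```python
def resolve_prompt(user_text, passage, max_words=8):
    """Return the user's prompt if one was given, else derive one from the passage."""
    user_text = (user_text or "").strip()
    if user_text:
        return user_text
    # Fallback (assumed template): build a prompt from the passage's first words.
    preview = " ".join(passage.split()[:max_words])
    return f"Continue the following text: {preview}"


def ask_for_prompt(passage, read=input):
    """Ask once on the console; an empty answer falls back to a generated prompt.

    `read` defaults to the built-in input() and can be replaced in tests.
    """
    answer = read("Prompt for this passage (leave blank to auto-generate): ")
    return resolve_prompt(answer, passage)
```

Passing the reader function as a parameter keeps the interactive step testable without a terminal.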

Project Structure 🌳

├───data_ingestion
│   └─── Module responsible for ingesting raw .txt files
├───data_processing
│   └─── Module for preprocessing and cleaning the text
├───interactive_qa
│   └─── Handles user interactions for prompt specifications
└───prompt_completion
    └─── Generates prompt-completion pairs using AI-driven techniques

Contributing 🤝

If you'd like to contribute to RawToPrompt, please fork the repository and use a feature branch. Pull requests are warmly welcome!

License 📄

This project is licensed under the MIT License - see the LICENSE.md file for details.