Tiny LLM Trainer

This experiment implements a tiny language model trainer using PyTorch. I designed it to train on Wikipedia data and to generate text from the learned patterns.

Features

  • PyTorch-based implementation
  • Transformer architecture
  • Configurable model size and training parameters
  • Text generation with temperature and top-k sampling
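
To make the last feature concrete, here is a minimal sketch of temperature and top-k sampling. It is written in plain Python so it runs without PyTorch installed, and it is not taken from the repository: the function name `sample_next_token` and its parameters are assumptions chosen for illustration, and the actual trainer presumably operates on PyTorch tensors instead of lists.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Pick one token index from raw logits using temperature and top-k sampling.

    Hypothetical helper for illustration; not the repository's implementation.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Rank token indices by logit, highest first, and keep the top k
    # (all of them when top_k is None).
    indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        indices = indices[:max(1, top_k)]

    # Temperature scaling: values below 1 sharpen the distribution,
    # values above 1 flatten it.
    scaled = [logits[i] / temperature for i in indices]

    # Numerically stable softmax weights over the surviving candidates.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]

    # Draw one candidate in proportion to its weight.
    return rng.choices(indices, weights=weights, k=1)[0]


if __name__ == "__main__":
    example_logits = [2.0, 1.0, 0.5, -1.0]
    # With top_k=2 only indices 0 and 1 can be returned.
    print(sample_next_token(example_logits, temperature=0.8, top_k=2))
```

Setting `top_k=1` reduces this to greedy decoding, while a large `top_k` with a high temperature gives more varied but less coherent output.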

Requirements

  • Python 3.7+
  • PyTorch
  • NumPy
  • Pillow
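
A quick way to confirm these requirements are met is a small check script. The sketch below is an illustration, not part of the repository; the function name `check_environment` is an assumption. It relies only on the standard library, and it uses the import names `torch`, `numpy`, and `PIL` (Pillow is imported as `PIL`).

```python
import importlib.util
import sys

# Import names for the packages listed above; Pillow is imported as "PIL".
REQUIRED_MODULES = {"PyTorch": "torch", "NumPy": "numpy", "Pillow": "PIL"}


def check_environment(min_python=(3, 7)):
    """Return a dict mapping each requirement to True (satisfied) or False.

    Hypothetical helper for illustration; not shipped with the project.
    """
    status = {"Python": sys.version_info >= min_python}
    for package, module in REQUIRED_MODULES.items():
        # find_spec locates a top-level module without importing it.
        status[package] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```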

Project Structure

.
├── data
├── models
├── wikipedia_data.py
├── tiny_llm_trainer.py
├── flickr_data.py
├── tiny_llm_trainer_vqa.py
├── cvc_data.py
└── tiny_llm_trainer_cvc.py

Files

  • data/: Directory where preprocessed training data from Wikipedia is saved.
  • models/: Directory where trained models are saved.
  • wikipedia_data.py: Script for downloading and preprocessing Wikipedia data.
  • tiny_llm_trainer.py: The main script for training the model.
  • flickr_data.py: Script for downloading and preprocessing Flickr image data.
  • tiny_llm_trainer_vqa.py: Script for training the model on Visual Question Answering (VQA) tasks using Flickr data.
  • cvc_data.py: Script for downloading and preprocessing Common Voice Corpus 1 data.
  • tiny_llm_trainer_cvc.py: Script for training a text-to-speech (TTS) model using Common Voice Corpus 1 data.
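
The Features list mentions a configurable model size and training parameters, but this README does not show how the training scripts expose them. As a rough sketch of what such a configuration might look like, the example below defines a hypothetical `TinyConfig` dataclass. Every field name and default is an assumption. It also estimates the weight count of a GPT-style decoder, assuming learned positional embeddings, a 4x-wide feed-forward layer, and an output head tied to the token embedding.

```python
from dataclasses import dataclass


@dataclass
class TinyConfig:
    # All field names and defaults are hypothetical, chosen for illustration.
    vocab_size: int = 8000
    context_length: int = 256
    d_model: int = 128
    n_layers: int = 4
    n_heads: int = 4
    learning_rate: float = 3e-4
    batch_size: int = 32

    def __post_init__(self):
        # Multi-head attention splits d_model evenly across the heads.
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")


def approx_parameter_count(cfg):
    """Rough weight count for a GPT-style decoder, ignoring biases and layer norms."""
    # Token embedding plus learned positional embedding.
    embeddings = cfg.vocab_size * cfg.d_model + cfg.context_length * cfg.d_model
    # Per block: 4*d^2 for attention (Q, K, V, output) + 8*d^2 for a 4x-wide MLP.
    blocks = cfg.n_layers * 12 * cfg.d_model ** 2
    return embeddings + blocks


if __name__ == "__main__":
    print(approx_parameter_count(TinyConfig()))  # prints 1843200
```

With these illustrative defaults the embeddings account for more than half of the weights, which is typical for very small models with a comparatively large vocabulary.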

Usage

  1. Python Package Installer:

    pip3 install uv
  2. Prerequisites:

    python3 -m venv .venv
    source .venv/bin/activate
    uv pip install -r requirements.txt
    python3 -m pip install --upgrade pip
    deactivate # run only when you are finished; keep the environment active for the steps below

Text Generation

  1. Prepare Data:

    python3 wikipedia_data.py
  2. Train LLM:

    python3 tiny_llm_trainer.py

Visual Question Answering (VQA)

  1. Prepare Data:

    python3 flickr_data.py
  2. Train VQA — Multimodal:

    python3 tiny_llm_trainer_vqa.py

Text-to-Speech (TTS)

  1. Prepare Data:

    python3 cvc_data.py
  2. Train TTS:

    python3 tiny_llm_trainer_cvc.py
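
All three workflows above follow the same pattern: one data preparation script, then one training script. A small driver can capture that pairing. The sketch below is an illustration, not part of the repository. The task keys (`text`, `vqa`, `tts`) and the function names are assumptions; the script names come from the steps above. It uses `sys.executable`, so the scripts run with the interpreter of the active virtual environment.

```python
import subprocess
import sys

# Script pairs taken from the usage steps: (data preparation, training).
# The task keys are hypothetical labels chosen for this sketch.
PIPELINES = {
    "text": ("wikipedia_data.py", "tiny_llm_trainer.py"),
    "vqa": ("flickr_data.py", "tiny_llm_trainer_vqa.py"),
    "tts": ("cvc_data.py", "tiny_llm_trainer_cvc.py"),
}


def build_commands(task):
    """Return the two commands for a task, using the current interpreter."""
    if task not in PIPELINES:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(PIPELINES)}")
    return [[sys.executable, script] for script in PIPELINES[task]]


def run_pipeline(task):
    """Run data preparation, then training; stop if either step fails."""
    for command in build_commands(task):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)
```

Calling `run_pipeline("text")` from the repository root would run `wikipedia_data.py` followed by `tiny_llm_trainer.py`.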

License

This project is licensed under the Apache License 2.0.

Citation

@misc{tlt2024,
  author       = {Oketunji, A.F.},
  title        = {Tiny LLM Trainer},
  year         = 2024,
  version      = {0.0.6},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.12593929},
  url          = {https://doi.org/10.5281/zenodo.12593929}
}

Copyright

(c) 2024 Finbarrs Oketunji. All Rights Reserved.