A novel algorithm that integrates a text diffusion LLM as a draft model to boost the performance of traditional autoregressive LLMs.
Built for the 2025 Mercor x Cognition x Etched Hackathon
We've implemented two memory optimization strategies for the LLaMA-LLaDA distillation process to address memory constraints when working with large language models:
- Simplified Strategy (`train_simple_strat.py`): Focuses on the most essential memory optimizations:
  - Gradient checkpointing for the student model
  - No gradient computation for the teacher model
  - Mixed precision training (bfloat16)
  - Periodic CUDA cache clearing
- Comprehensive Strategy (`train_all_strat.py`): Implements all optimizations from the simplified strategy, plus:
  - Data pre-tokenization
  - Gradient accumulation
  - Advanced optimizer configuration
  - Learning rate scheduling
For detailed information on these optimizations, see OPTIMIZATION.md.
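To make the strategies concrete, here is a minimal sketch of how these optimizations typically fit together in a single distillation step. It is an illustration rather than the code in train_simple_strat.py: the model identifiers, the dataloader, and the KL-style distillation_loss are placeholders.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soft-target KL divergence between teacher and student token distributions.
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# Placeholder model identifiers, not the actual checkpoints the training scripts load.
teacher = AutoModelForCausalLM.from_pretrained("teacher-model", torch_dtype=torch.bfloat16).cuda()
student = AutoModelForCausalLM.from_pretrained("student-model", torch_dtype=torch.bfloat16).cuda()

student.gradient_checkpointing_enable()   # trade recompute for activation memory
student.config.use_cache = False          # KV caching is incompatible with checkpointing
teacher.eval()
for p in teacher.parameters():            # teacher is frozen: no gradients are stored
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

# `dataloader` is assumed to yield tokenized batches already moved to the GPU.
for step, batch in enumerate(dataloader):
    with torch.no_grad():                 # teacher forward pass without an autograd graph
        teacher_logits = teacher(**batch).logits
    with torch.autocast("cuda", dtype=torch.bfloat16):   # mixed precision training
        student_logits = student(**batch).logits
        loss = distillation_loss(student_logits, teacher_logits)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step % 50 == 0:
        torch.cuda.empty_cache()          # periodic CUDA cache clearing
```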
- combine_datasets.py: Loads and combines datasets from different sources, ensuring all columns are present in each dataset, and saves the final dataset as a Parquet file (a minimal sketch of this step appears after this list).
- `scripts/`: Contains various scripts for dataset handling and model evaluation:
  - custom_dataset.py: Custom dataset handling.
  - download_dataset.py: Script to download datasets.
  - evaluate_direct.py: Direct evaluation of models.
  - evaluate_speculative.py: Speculative evaluation of models.
  - fine_tune.py: Script for fine-tuning models.
  - generate.py: Script to generate outputs from models.
  - speculative_decoding.py: Script for speculative decoding.
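As referenced above, a minimal sketch of the column-alignment step that combine_datasets.py performs could look like the following; the source dataset names and the output path are placeholders, not the ones the script actually uses.

```python
import pandas as pd
from datasets import load_dataset

# Placeholder source datasets; the real script combines its own set of sources.
source_names = ["dataset_a", "dataset_b"]

# Convert each source to a DataFrame; pd.concat aligns on the union of columns,
# filling values missing from a source with NaN so every column is present.
frames = [load_dataset(name, split="train").to_pandas() for name in source_names]
combined = pd.concat(frames, ignore_index=True, sort=False)

# Save the combined dataset as a Parquet file (output path is a placeholder).
combined.to_parquet("combined_dataset.parquet", index=False)
```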
The project requires Python and several dependencies listed in requirements.txt. To install them, use:

```bash
pip install -r requirements.txt
```
- Combine Datasets: Run combine_datasets.py to load, process, and save a combined dataset.

  ```bash
  python combine_datasets.py
  ```
- Scripts: Use the scripts in the `scripts/` directory for specific tasks like downloading datasets, evaluating models, fine-tuning, and generating outputs.
- custom_dataset.py: Defines a custom dataset class for loading data from a directory where each entry is stored as a JSON file.

  ```python
  from custom_dataset import get_dataloader

  dataloader = get_dataloader('path/to/dataset', batch_size=8, shuffle=True)
  ```
- download_dataset.py: Downloads a dataset from Hugging Face and saves each entry under a directory named after the dataset.

  ```bash
  python download_dataset.py --dataset_name <dataset_name> --split <split> --save_dir <save_directory>
  ```
- evaluate_direct.py: Evaluates model performance using direct decoding.

  ```bash
  python evaluate_direct.py --model_name <model_name> --evaluation_dataset <evaluation_dataset> --max_length <max_length>
  ```
- evaluate_speculative.py: Evaluates model performance using speculative decoding with a teacher and student model.

  ```bash
  python evaluate_speculative.py --teacher_model <teacher_model> --student_model <student_model> --evaluation_dataset <evaluation_dataset> --max_length <max_length> --speculative_steps <speculative_steps>
  ```
- fine_tune.py: Fine-tunes a Hugging Face model on a specified dataset.

  ```bash
  python fine_tune.py --model_name <model_name> --dataset_name <dataset_name> --fine_tuned_model_name <fine_tuned_model_name> --batch_size <batch_size> --learning_rate <learning_rate> --num_train_epochs <num_train_epochs> --max_length <max_length> --checkpoint <checkpoint>
  ```
- generate.py: Generates model outputs based on a specified dataset and configuration.

  ```bash
  python generate.py --model_name <model_name> --dataset_name <dataset_name> --batch_size <batch_size> --config <config> --max_length <max_length>
  ```
- speculative_decoding.py: Performs speculative decoding using a teacher and student model.

  ```python
  from speculative_decoding import speculative_generate

  output = speculative_generate(teacher_model, student_model, teacher_tokenizer, student_tokenizer, input_text, max_length=50, speculative_steps=3)
  ```
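For intuition about what speculative decoding does, below is a conceptual sketch of a single greedy draft-and-verify step. It is not the logic in speculative_decoding.py: the function name is invented, the draft step is shown as ordinary autoregressive greedy decoding for simplicity (whereas this project's draft model is a diffusion LLM that proposes its block of tokens differently), and probabilistic acceptance rules are omitted.

```python
import torch

def draft_and_verify(teacher, student, input_ids, speculative_steps=3):
    # Assumes batch size 1 and HF-style causal LMs that return `.logits`.
    # Student cheaply proposes `speculative_steps` draft tokens (greedy here
    # purely for illustration).
    draft = input_ids
    with torch.no_grad():
        for _ in range(speculative_steps):
            next_token = student(draft).logits[:, -1, :].argmax(-1, keepdim=True)
            draft = torch.cat([draft, next_token], dim=-1)

        # Teacher scores the whole drafted sequence in one forward pass.
        teacher_logits = teacher(draft).logits

    # Keep draft tokens as long as they match the teacher's greedy choice;
    # on the first mismatch, substitute the teacher's token and stop.
    accepted = input_ids
    prompt_len = input_ids.shape[1]
    for i in range(prompt_len, draft.shape[1]):
        teacher_token = teacher_logits[:, i - 1, :].argmax(-1, keepdim=True)
        accepted = torch.cat([accepted, teacher_token], dim=-1)
        if not torch.equal(teacher_token, draft[:, i:i + 1]):
            break
    return accepted  # at least one new token per teacher forward pass
```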
The project relies on various Python packages, including but not limited to:

- datasets
- pandas
- torch
- transformers
For a full list of dependencies, refer to the requirements.txt file.
This project is licensed under the MIT License.