A minimal yet capable Rust implementation of a GPT-style transformer model using burn.
FeGPT implements a small-scale transformer model focused on text generation. Built to explore transformer architectures in Rust, it provides a complete training and inference pipeline with approximately 14.2M parameters.
Key features:
- Complete transformer architecture implementation
- WikiText-2 dataset integration with extensible dataset system
- Efficient training on consumer hardware (Apple Silicon/CUDA)
- Comprehensive experiment tracking and checkpointing
- Temperature-controlled text generation (see the sampling sketch after this list)
- Real-time training metrics and progress monitoring
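The temperature control mentioned above follows the standard recipe: divide the logits by the temperature before the softmax, then sample from the resulting distribution. A minimal sketch of that technique, not FeGPT's actual generation code (assumes the `rand` crate, 0.8-style `Rng::gen`):

```rust
use rand::Rng;

/// Sample a token index from raw logits after temperature scaling.
/// Temperature < 1.0 sharpens the distribution; > 1.0 flattens it.
fn sample_with_temperature(logits: &[f32], temperature: f32, rng: &mut impl Rng) -> usize {
    // Scale, then compute a numerically stable softmax.
    let scaled: Vec<f32> = logits.iter().map(|&l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();

    // Draw from the categorical distribution without normalizing:
    // scale the uniform draw by the partition sum instead.
    let mut threshold = rng.gen::<f32>() * sum;
    for (i, &e) in exps.iter().enumerate() {
        threshold -= e;
        if threshold <= 0.0 {
            return i;
        }
    }
    exps.len() - 1
}
```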
```
fegpt/
├── src/
│   ├── cli.rs               # Command-line interface
│   ├── model.rs             # Transformer implementation
│   ├── data/                # Data processing
│   │   ├── batcher.rs       # Training batch creation
│   │   ├── dataset/
│   │   │   ├── wikitext.rs  # WikiText-2 implementation
│   │   │   └── utils.rs     # Dataset utilities
│   │   └── tokenizer/
│   │       ├── character.rs # Alternative tokenizer
│   │       └── gpt2.rs      # GPT-2 tokenizer
│   ├── perplexity.rs        # Perplexity metrics
│   └── session.rs           # Training management
└── models/                  # Saved model checkpoints
```
Add FeGPT to your `Cargo.toml`:
```toml
[dependencies]
fegpt = { git = "https://github.com/airstrike/FeGPT.git" }
```
Training a model:

```sh
cargo run -- train --d-model 128 --n-layer 4 --n-head 4 --max-iters 10000
```
Generating text:

```sh
cargo run -- generate --prompt "Once upon a time" --num-tokens 100 --model latest
```
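Prompts are encoded with the GPT-2 tokenizer before generation. As a rough illustration using the Hugging Face `tokenizers` crate directly (requires its `http` feature; FeGPT's own wrapper lives in `src/data/tokenizer/gpt2.rs`):

```rust
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch the pretrained GPT-2 vocabulary (50,257 tokens).
    let tokenizer = Tokenizer::from_pretrained("gpt2", None)?;

    // Encode the prompt into the token ids the model consumes.
    let encoding = tokenizer.encode("Once upon a time", false)?;
    println!("{:?}", encoding.get_ids());
    Ok(())
}
```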
Listing trained models:

```sh
cargo run -- list
```
- 4 transformer layers with 4 attention heads each
- 128-dimensional embeddings
- 512-dimensional feed-forward networks
- GPT-2 tokenizer (50,257-token vocabulary)
- ~14.2M parameters total
  - Embeddings/projection: ~12.9M
  - Transformer layers: ~1.3M
  - Positional embeddings: ~8K
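The ~12.9M and ~8K figures follow directly from the configuration: two vocab-sized 128-dimensional matrices (assuming an untied embedding table and output projection) plus a learned positional table covering the 64-token context. A quick back-of-the-envelope check:

```rust
fn main() {
    let vocab: u64 = 50_257;
    let d_model: u64 = 128;
    let context: u64 = 64;

    let embedding = vocab * d_model;    // token embedding table
    let projection = vocab * d_model;   // output (LM head) projection, untied
    let positional = context * d_model; // learned positional embeddings

    // ~12.9M for embeddings/projection, ~8K positional.
    println!("embeddings + projection: ~{:.1}M", (embedding + projection) as f64 / 1e6);
    println!("positional: ~{}K", positional / 1_000);
}
```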
Default configuration:
- Batch size: 12
- Context length: 64 tokens
- Learning rate: 1e-4
- Warmup steps: 1,000
- Training iterations: 10,000
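With these defaults, every optimizer step consumes a `[12, 64]` tensor of token ids whose training targets are the same windows shifted right by one position. A minimal sketch of that input/target pairing (FeGPT's actual batching lives in `src/data/batcher.rs` and may differ in details):

```rust
/// Build one (input, target) pair for next-token prediction; a full
/// batch stacks `batch_size` (default 12) of these `context`-length
/// (default 64) windows.
fn lm_example(tokens: &[u32], start: usize, context: usize) -> (&[u32], &[u32]) {
    assert!(start + context < tokens.len(), "window out of range");
    let input = &tokens[start..start + context];
    let target = &tokens[start + 1..start + context + 1];
    (input, target)
}
```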
Typical results:
- Training time: ~19 minutes (10K iterations)
- Final perplexity: ~31
- Hardware: Apple M2 Max (32GB)
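Perplexity here is the standard definition behind the metric tracked in `src/perplexity.rs` (assuming natural-log cross-entropy): the exponential of the mean per-token loss, so ~31 corresponds to roughly ln(31) ≈ 3.43 nats per token.

```rust
/// Perplexity from mean cross-entropy measured in nats.
fn perplexity(mean_cross_entropy_nats: f64) -> f64 {
    mean_cross_entropy_nats.exp()
}

fn main() {
    // A mean validation loss of ~3.43 nats/token gives perplexity ~31.
    println!("{:.1}", perplexity(3.43)); // ≈ 30.9
}
```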
Building:

```sh
cargo build --release
```

Running tests:

```sh
cargo test
```
Hardware requirements:
- Apple Silicon with Metal support, or
- NVIDIA GPU with CUDA support
- 32GB RAM recommended
Known limitations:
- Text generation quality needs improvement
- Limited context window (64 tokens)
- Large embedding layer due to the full GPT-2 vocabulary
Planned improvements:
- Vocabulary size optimization
- Context length extension
- Enhanced attention mechanisms
- Dataset flexibility (Tiny Shakespeare support)
- Performance optimization for consumer hardware
This project is licensed under the MIT License - see the LICENSE file for details.
Built with:
- burn - Rust ML framework
- tokenizers - Hugging Face tokenizers
Contributions are welcome! Please feel free to submit a Pull Request.