This is my first branch-off from my new customGPT repo, meant mainly as a test of that repo since I don't expect this idea to perform all that well. In it I'm saying fuck it to efficiency & RAM and giving each attention block heads of a variety of sizes in a kind of self-similar manner.
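
To give a rough idea of what that means, here's a minimal sketch of causal self-attention where the heads have different sizes. This is *not* the actual code in `modules/fha.py` (the real thing is an edit of the multi-query attention mechanism); the head-size split, class name, and plain multi-head setup below are purely illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariableHeadSelfAttention(nn.Module):
    """Toy causal self-attention where each head gets its own dimension.

    Instead of num_heads equal slices of size d_model // num_heads,
    head_dims is something self-similar like [32, 16, 8, 8] for d_model=64.
    """
    def __init__(self, d_model: int, head_dims: list):
        super().__init__()
        self.head_dims = head_dims
        # one q/k/v projection per head, each with its own output size
        self.q_projs = nn.ModuleList([nn.Linear(d_model, d, bias=False) for d in head_dims])
        self.k_projs = nn.ModuleList([nn.Linear(d_model, d, bias=False) for d in head_dims])
        self.v_projs = nn.ModuleList([nn.Linear(d_model, d, bias=False) for d in head_dims])
        self.out_proj = nn.Linear(sum(head_dims), d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        outs = []
        for wq, wk, wv, d in zip(self.q_projs, self.k_projs, self.v_projs, self.head_dims):
            q, k, v = wq(x), wk(x), wv(x)                # (batch, seq_len, d)
            scores = q @ k.transpose(-2, -1) / d ** 0.5  # (batch, seq_len, seq_len)
            scores = scores.masked_fill(mask, float('-inf'))
            outs.append(F.softmax(scores, dim=-1) @ v)   # (batch, seq_len, d)
        return self.out_proj(torch.cat(outs, dim=-1))    # (batch, seq_len, d_model)

# e.g. a "self-similar" split of d_model=64 into halving head sizes
attn = VariableHeadSelfAttention(d_model=64, head_dims=[32, 16, 8, 8])
print(attn(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```

A per-head for loop like that is the sort of thing the TODO list further down wants to turn into pure tensor operations.
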
This repo is part of a larger project of mine called micro_model_sandbox that's basically a hub for all the novel model experiments I do, with the goal of facilitating easy comparison between the different models. For each of those experiments I just use the customGPT repo as a template to start editing, and then once I'm happy with the project (or once I've abandoned it but it's at least moderately functional) I add it to the sandbox. If you end up using that repo as a template, as I did here for FHA, feel free to contribute your project to the sandbox as well!
File structure:

- `modules/`: where all of the code for the actual model goes
    - `fha.py`: the primary file that makes this project unique from customGPT. At some point I'll update this readme, or maybe add a Jupyter notebook, to provide a full detailed visual walkthrough of the edit I've made to the traditional multi-query attention mechanism; for now, the rough sketch near the top of this readme gives the gist
    - `layer.py`: defines each residual connection layer of our GPT
    - `logging.py`: defines the `LoggingModule` class, a wrapper you should use instead of PyTorch's `nn.Module` in order to facilitate easy demonstration of how tensor shapes change throughout a given module (there's a rough sketch of the idea after the contributions list below)
    - `mlp.py`: a two-layer multi-layer perceptron with an optional gate and either ReLU, GeLU, or SiLU nonlinearities, all configurable in `config.py`. Adding more nonlinearities is also absurdly easy
    - `model.py`: the primary class for our GPT
    - `norm.py`: a norm module with an optional affine layer that lets you switch between RMSNorm, LayerNorm, and CosineNorm using a setting over in `config.py`. Adding different normalization methods is also absurdly easy (see the sketch just below the file list for the pattern)
- `tokenizers/bpe/`
    - `models/`
        - `{95, 128, 256, 512, 1024, 2048}.model`: the 95 one is character-wise tokenization. All others are byte-pair encoding, except instead of bytes I use the 95 unique characters that show up in the TinyStories dataset
    - `build.ipynb`: the notebook where I built my BPE tokenizers. My pairing rules could certainly be improved upon
    - `tokenizer.py`: an overly simplistic and annoyingly inefficient tokenizer with bos & eos tokens, post-sequence padding, and a `display` function to help you visualize how a given string gets broken down
- `trained/`
    - `FHA_GPT_{0.3m_2024-05-07|13-05-29, 0.8m_2024-05-05|10-54-35}/`: a series of tiny models designed to be compared against one another. They're not large enough to give intelligible output; I'm planning to make some bigger ones and train them for longer at some point
        - `log_data.csv`: record of loss & perplexity data over the course of training
        - `model_config.json`: hyperparameters of the model
        - `model.pth`: weights of the model
        - `train_config.json`: hyperparameters of the training loop used to train the model
- `inference.ipynb`: open this notebook if you just want to see the output of one of the models
- `model_comparison.ipynb`: open this notebook to compare different models against each other. It includes loss curve plots and a top-k teacher-forcing accuracy rate (there's a sketch of that metric just below this file list)
- `testing_modules.ipynb`: creates easy printouts that let you follow the progression of tensor shapes through all the modules in `model.py`, for demonstration & debugging purposes. If you're building new modules for a novel architecture idea of your own, this notebook will be of extreme value for debugging & visualization
- `train.ipynb`: open this notebook to train a new model
- `config.py`: all of the editable model and training settings
- `inference.py`: functions for performing inference, used in `inference.ipynb` and `train.ipynb`
- `requirements.txt` - I should probably change this to only include the packages that are actually necessary and not be so strict on versions. The command I used to generate it is `pip freeze | grep -v " @ file://" > requirements.txt`; lmk if you know of a better method
- `tools.py`: a variety of functions & classes that don't fit elsewhere and/or are used by more than one of the Jupyter notebooks
- `train.py`: functions for training a model, used in `train.ipynb`
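
To illustrate the config-driven switching that `norm.py` does (and that `mlp.py` does in the same way for its nonlinearities), here's a rough sketch of the pattern. It is not the repo's exact code; the option strings and argument names are just placeholders.

```python
import torch
import torch.nn as nn

class Norm(nn.Module):
    """Normalization with a selectable method and an optional affine scale (rough illustration)."""
    def __init__(self, dim: int, norm_type: str = "RMSNorm", affine: bool = True, eps: float = 1e-6):
        super().__init__()
        self.norm_type, self.eps = norm_type, eps
        self.weight = nn.Parameter(torch.ones(dim)) if affine else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.norm_type == "RMSNorm":      # rescale by root-mean-square only
            x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        elif self.norm_type == "LayerNorm":  # subtract mean, divide by std
            x = (x - x.mean(-1, keepdim=True)) * torch.rsqrt(x.var(-1, keepdim=True, unbiased=False) + self.eps)
        elif self.norm_type == "CosineNorm": # project onto the unit sphere
            x = x / x.norm(dim=-1, keepdim=True).clamp(min=self.eps)
        else:
            raise ValueError(f"unknown norm_type: {self.norm_type}")
        return x * self.weight if self.weight is not None else x
```

Adding another normalization method is then just one more branch plus a new option string over in `config.py`, which is what I mean by "absurdly easy".
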
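And since `model_comparison.ipynb` reports a top-k teacher-forcing accuracy rate, here's roughly what that metric means (my own sketch rather than the notebook's code; it assumes the model's forward pass returns raw logits, and it ignores padding tokens for simplicity): run the model on the ground-truth sequence and count how often the true next token lands among the model's k most likely predictions.

```python
import torch

@torch.no_grad()
def topk_teacher_forcing_accuracy(model, tokens: torch.Tensor, k: int = 5) -> float:
    """Fraction of positions where the true next token is in the model's top-k predictions.

    tokens: (batch, seq_len) ground-truth token ids; the model always sees the
    true prefix at every step (teacher forcing) rather than its own samples.
    """
    logits = model(tokens[:, :-1])                      # (batch, seq_len-1, vocab_size)
    targets = tokens[:, 1:]                             # true next token at each position
    topk = logits.topk(k, dim=-1).indices               # (batch, seq_len-1, k)
    hits = (topk == targets.unsqueeze(-1)).any(dim=-1)  # (batch, seq_len-1)
    return hits.float().mean().item()
```
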
TODO:
- train some larger models for a full 1000 iterations that I can then compare against the customGPT ones over in micro_model_sandbox
- see if I can figure out how to implement it with efficient tensor operations instead of for loops
Other than the above TODO list, appreciated contributions include:
- bug fixes
- adding more detailed comment explanations of what the code is doing
- general readability edits
- efficiency edits
- editing the code in `modules/` to take better advantage of the `LoggingModule` (see the sketch below this list); this means splitting up each class into more and tinier functions
- training more models (especially if they're bigger than what's already here!)
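
Since a couple of the items above lean on the `LoggingModule`, here's a rough sketch of the kind of wrapper I mean. It is *not* the actual implementation in `modules/logging.py`; the decorator name and details are illustrative. The idea is that when logging is switched on, every wrapped method prints the shapes of the tensors going in and coming out, which is what `testing_modules.ipynb` relies on, and it's why splitting classes into more and tinier functions gives more useful printouts.

```python
import functools
import torch
import torch.nn as nn

def log_io(fn):
    """Decorator that prints the tensor shapes entering and leaving a method when logging is on."""
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        out = fn(self, *args, **kwargs)
        if getattr(self, "logging_enabled", False):
            in_shapes = [tuple(a.shape) for a in args if isinstance(a, torch.Tensor)]
            out_shapes = [tuple(t.shape) for t in (out if isinstance(out, tuple) else (out,))
                          if isinstance(t, torch.Tensor)]
            print(f"{self.__class__.__name__}.{fn.__name__}: {in_shapes} -> {out_shapes}")
        return out
    return wrapper

class LoggingModule(nn.Module):
    """Drop-in replacement for nn.Module carrying the switch that @log_io checks."""
    def __init__(self):
        super().__init__()
        self.logging_enabled = False

    def enable_logging(self):
        for module in self.modules():
            module.logging_enabled = True

# usage: subclass LoggingModule instead of nn.Module and decorate the methods you care about
class ToyMLP(LoggingModule):
    def __init__(self, dim: int):
        super().__init__()
        self.up, self.down = nn.Linear(dim, 4 * dim), nn.Linear(4 * dim, dim)

    @log_io
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(torch.relu(self.up(x)))

mlp = ToyMLP(8)
mlp.enable_logging()
mlp(torch.randn(2, 8))  # prints ToyMLP.forward: [(2, 8)] -> [(2, 8)]
```
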
Because I'm not super knowledgeable about how collaborating on Git projects works and I tend to edit directly on the main branch, please reach out and communicate with me about any edits you plan to make so that we don't end up editing the same files. Click here to join my Discord server.
Also check out:
- my guides on how to build miniature versions of popular models from scratch, with a hand-holding walkthrough of every single tensor operation: minGemma, minGrok, and minLlama3
- my YouTube channel
- my other links