FractalFormer

A GPT with self-similar nested properties

This is a project where I create self-similarity at (hopefully) all levels of a decoder-only transformer. The idea builds on everything I learned in my matryoshkaGPT replication project, except that instead of a series of single Russian nesting dolls inside each other, each "inside" contains multiple similar nesting dolls. Think of how each triangle in Sierpinski's Triangle has three triangles within it. I think at some point this will allow me to do interesting things such as:

  • multiple small models for MoE speculative decoding in parallel, to increase the chances of a match
  • a new, weird version of MoE where all experts exist simultaneously rather than being gated
  • infinite fusion of transformer models of a given size into transformer models of a larger size
  • taking advantage of the fact that language has a fractal structure[1][2] to create an (infinitely? effectively infinitely?) extendable maximum context length, if I can figure out how to properly borrow ideas from my previous next-concept prediction project and/or from Multi-Word Tokenization for Sequence Compression. More on this later.
  • specializing a model for use with conversational swarm intelligence
  • (I think) eventually meeting the criteria for consciousness as defined in the psychology of consciousness paper

Repo Guide

  • FractalFormer_base.ipynb: currently the only file that is both functional and readable. This is where I recommend you start if you're curious about the project, because it's not only heavily commented but also has extensive print statements so you can see for yourself what's happening. I do, however, still need to update all the images and give more thorough walkthroughs in the pre-code markdown cells. I also made a video walkthrough of this file.

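To make the nesting idea concrete, here is a minimal sketch of one way the self-similar weights could work, assuming each sub-model's weight matrix is a block carved out of its parent's weight matrix. `FractalLinear`, the quadrant layout, and the dimensions are illustrative assumptions, not the notebook's exact implementation:

```python
import torch
import torch.nn as nn

class FractalLinear(nn.Module):
    """Toy layer: the parent weight matrix holds four sub-model weight
    matrices as its quadrants, so each sub-model can run on its own slice."""
    def __init__(self, dim: int):
        super().__init__()
        self.dim = dim
        self.weight = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)

    def forward(self, x: torch.Tensor, level: int = 0, index: int = 0) -> torch.Tensor:
        if level == 0:
            w = self.weight                      # full (dim x dim) parent
        else:
            half = self.dim // 2
            row, col = divmod(index, 2)          # pick one of the 4 quadrants
            w = self.weight[row * half:(row + 1) * half,
                            col * half:(col + 1) * half]
        return x @ w.T

layer = FractalLinear(dim=8)
print(layer(torch.randn(2, 8)).shape)                    # parent: (2, 8)
print(layer(torch.randn(2, 4), level=1, index=3).shape)  # sub-model: (2, 4)
```

Because the sub-models share parameters with the parent in this sketch, a forward/backward pass through the parent touches all of them at once.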

  • FractalFormer_ModelMerging.ipynb: This notebook is in a very early stage; I think all I've done so far is write a dynamically defined hyperparameter config. Basically, I'd like to train multiple separate non-FractalFormer small models, freeze their weights, and then concatenate & merge them into a proper FractalFormer as defined in the previous notebook (there's a rough merging sketch after this list). If you'd like to contribute and want more details on the task at hand, let me know; I think this is one I could properly convey to someone who knows their way around coding transformers in PyTorch.
  • FractalFormer_UpstreamResidual.ipynb: Further work on this file is on hold until the above is finished. I'm not sure I can fully convey why I'm doing what I'm doing here, since I'm still working largely off of intuition. In the base version, when you run inference you choose which of the models you want to run, and each is capable of running separately. Here in UpstreamResidual, for any given model you run inference on, all of its sub-models also run in parallel, and their residual states are concatenated and added to that of the model of interest (sketched after this list). This is essentially how I create a connection between all the models in my eventual goal of creating a kind of hive-mind. I might switch this to an additive or multiplicative LoRA later to filter the total amount of information transferred down to a smaller subspace.
  • FractalFormer_DownstreamResidual.ipynb: Like the previous notebook, except that instead of sending the residual streams from the small models up to the big one, I split apart the residual stream of the big model and send the pieces down to the small ones (also sketched after this list). I think this may be useful for my MoE idea down the line.
  • FractalFormer_InbuiltTokenizer.ipynb: The idea here is to use byte-level tokens and let the model essentially create its own tokens, thereby getting rid of the tokenization step in language modeling entirely while still having a similarly emergent framework that allows an effective vocab size larger than 256. I'm messing around with different potential ways to do this over in weird_embeddings.ipynb, but we're a ways off from me having something concrete to explain. For now I just plan on adding & norming the embeddings of bytes together to create a "concept" and then having the higher-level models think in terms of concepts rather than the 256 vocab options they have (see the byte-concept sketch after this list).
  • config.py, tokenizer.py, and FractalFormer_base.py WERE all code copied directly from FractalFormer_base.ipynb so that the classes & functions can be imported into the other files. I say "were" because I've recently changed the Jupyter notebook and need to remember to copy the new code into the .py files.
  • input.txt is just TinyShakespeare. Eventually we'll branch out, but this is fine for testing on my MacBook.
  • tokenizers/tokenizer.model is a very simple tokenizer that takes the 65 unique characters in TinyShakespeare and turns them into 128 tokens. It was originally made for the repo that I build all my models off of here. It might eventually get deleted if the concept-byte thing works, or it might hang around for even longer.
  • models/ contains all of the models I've trained so far, which as of right now consist of 3 checkpoints of a roughly 1M-parameter model from FractalFormer_base.ipynb. I don't think I'll be going past 1 million parameters for the foreseeable future, really until I nail down an architecture that I like well enough to start doing legit testing.
  • images/ is where I put drawings that help demonstrate what's happening in the code. One of these days I'll make a far easier guide.
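For the ModelMerging idea, the core operation would presumably be something like tiling the frozen small models' weight matrices into the quadrants of a larger one. A rough sketch under that assumption (the function name and shapes are mine, not the notebook's):

```python
import torch

def merge_into_parent(w00: torch.Tensor, w01: torch.Tensor,
                      w10: torch.Tensor, w11: torch.Tensor) -> torch.Tensor:
    """Tile four frozen (d x d) weight matrices from independently trained
    small models into one (2d x 2d) parent weight matrix."""
    top = torch.cat([w00, w01], dim=1)
    bottom = torch.cat([w10, w11], dim=1)
    return torch.cat([top, bottom], dim=0)

small_weights = [torch.randn(4, 4) for _ in range(4)]  # stand-ins for frozen models
parent_weight = merge_into_parent(*small_weights)
print(parent_weight.shape)  # torch.Size([8, 8])
```

The open question is how the merged parent behaves before any fine-tuning, since the small models never saw each other during training.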
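For UpstreamResidual, my reading of the description is: the sub-models' residual streams get concatenated along the feature dimension so they match the parent's width, then added into the parent's residual stream. A toy sketch of just that tensor plumbing (all shapes are made up):

```python
import torch

batch, seq, d = 2, 16, 32                                     # toy sizes
parent_resid = torch.randn(batch, seq, 2 * d)                 # parent stream, width 2d
sub_resids = [torch.randn(batch, seq, d) for _ in range(2)]   # two sub-model streams

upstream = torch.cat(sub_resids, dim=-1)  # (batch, seq, 2d), matches parent width
parent_resid = parent_resid + upstream    # inject the sub-models' information
print(parent_resid.shape)                 # torch.Size([2, 16, 64])
```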
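DownstreamResidual would be the reverse direction: chunk the parent's residual stream into sub-model-sized pieces and add each piece to the corresponding sub-model's stream. Again a toy sketch, not the notebook's code:

```python
import torch

batch, seq, d = 2, 16, 32
parent_resid = torch.randn(batch, seq, 2 * d)
sub_resids = [torch.randn(batch, seq, d) for _ in range(2)]

pieces = parent_resid.chunk(2, dim=-1)                    # two (batch, seq, d) slices
sub_resids = [s + p for s, p in zip(sub_resids, pieces)]  # push info down
print(sub_resids[0].shape)  # torch.Size([2, 16, 32])
```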
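And for the InbuiltTokenizer direction, the "add & norm byte embeddings into a concept" step could look roughly like this (the embedding size and the choice of layer norm are my assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 64
byte_embed = nn.Embedding(256, d_model)      # one embedding per possible byte value

span = torch.tensor(list("hello".encode()))  # the raw bytes of one "concept"
concept = F.layer_norm(byte_embed(span).sum(dim=0), (d_model,))
print(concept.shape)  # torch.Size([64]) -- a single vector for the whole span
```

The higher-level models would then attend over these concept vectors instead of the 256 raw byte tokens.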

Relevant inspiration papers that weren't already cited: