Agora is a new open source Multi-Modality AI Research Organization devoted to advancing Humanity!
Kosmos-X is ready to train, and Agora is actively seeking cloud providers or grant providers to train this all-new model and release it as open source. If you would like to learn more, please email me at kye@apac.ai
Join our Agora Discord and contribute to this project or 40+ others!
This repository is a rudimentary reimplementation of the KOSMOS-1 model described in Microsoft's recent paper Language Is Not All You Need: Aligning Perception with Language Models. Since the code is yet to be published at microsoft/unilm, this is an attempt to follow what is described in the paper as closely as possible.
Help us create a Model Roadmap on Kosmos-X Figma
This repo requires apex to be installed from source:
git clone https://github.com/kyegomez/Kosmos-X
cd Kosmos-X
# Basic requirements (transformers, torch, etc.)
pip install -r requirements.txt
# apex
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd ..
cd Kosmos
accelerate config
# then launch distributed training
accelerate launch train_distributed.py
We're just at the beginning of our journey. As we continue to develop and refine Kosmos-X, we invite you to join us. Whether you're a developer, researcher, or simply an enthusiast, your insights and contributions can help shape the future of Kosmos-X.
We are thrilled to invite you to be a part of the Kosmos-X project. This is not just an open source project but a community initiative, and we value your expertise and creativity. To show our appreciation, we have instituted a unique rewards system that directly compensates contributors from the revenue generated by the Kosmos-X API.
Contributing to Kosmos-X not only enhances your skills and profile but also comes with financial rewards. When you contribute code, documentation, or any form of improvement to the Kosmos-X project, you are adding value. As such, we believe it's only fair that you share in the rewards.
Here's how the Kosmos-X Rewards Program works:
- Submit a Pull Request: This can be a code enhancement, bug fix, documentation update, new feature, or any improvement to the project.
- Review and Approval: Our team will review your contribution. If it gets approved and merged, you become eligible for the rewards program.
- Revenue Share: Once your pull request is merged, you will receive a percentage of the revenue generated by the Kosmos-X API. The percentage will be determined based on the significance and impact of your contribution.
This means you're not just contributing to an open source project; you're becoming a part of the Kosmos-X ecosystem. Your efforts can yield ongoing benefits as the Kosmos-X API grows and evolves.
As part of our growth strategy, we will be deploying Kosmos-X as a Paid API. The revenue generated from this API will not only sustain and further the project, but also fund the rewards program.
If you're ready to become a part of Kosmos-X and contribute to the future of multimodal embeddings, here's what you need to do:
- Fork the repository.
- Make your improvements or additions in your forked repository.
- Submit a pull request detailing the changes you've made.
- Our team will review your submission. If it's approved, it will be merged into the main repository, and you will become part of the Kosmos-X Rewards Program.
Thank you for considering contributing to Kosmos-X. Your expertise and commitment to this project are what make it thrive. Let's build the future of multimodal embeddings together.
KOSMOS-1 uses a decoder-only Transformer architecture based on Magneto (Foundation Transformers), i.e. an architecture that employs a so-called sub-LN approach, where layer normalization is added both before the attention module (pre-LN) and afterwards (post-LN), combining the advantages that the two approaches have for language modelling and image understanding respectively. The model is also initialized according to a specific scheme described in the paper, allowing for more stable training at higher learning rates.
They encode images to image features using a CLIP ViT-L/14 model and use the perceiver resampler introduced in Flamingo to pool the image features from 256 -> 64 tokens. The image features are combined with the token embeddings by adding them to the input sequence, surrounded by the special tokens <image> and </image>. An example is <s> <image> image_features </image> text </s>. This allows image(s) to be interwoven with text in the same sequence.
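The layout described above can be sketched in plain Python. This is only an illustration under stated assumptions: `build_layout` and the `img_i` placeholder strings are hypothetical names, and in the real model each position holds an embedding vector rather than a string.

```python
# Hypothetical helper: lays out one image followed by text, i.e.
# <s> <image> image_features </image> text </s>
NUM_IMAGE_TOKENS = 64  # image features left after the perceiver resampler


def build_layout(text_tokens, num_image_tokens=NUM_IMAGE_TOKENS):
    """Return one label per position of the interleaved sequence."""
    image_slots = [f"img_{i}" for i in range(num_image_tokens)]
    return ["<s>", "<image>", *image_slots, "</image>", *text_tokens, "</s>"]


layout = build_layout(["a", "photo", "of", "a", "cat"])
# 64 image slots + 5 text tokens + 4 special tokens
print(len(layout))
```

Further images would repeat the `<image> ... </image>` segment at the point in the text where they occur.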
We follow the hyperparameters described in the paper visible in the following image:
We use the torchscale implementation of the decoder-only Transformer architecture from Foundation Transformers:
from torchscale.architecture.config import DecoderConfig
from torchscale.architecture.decoder import Decoder
config = DecoderConfig(
    decoder_layers=24,
    decoder_embed_dim=2048,
    decoder_ffn_embed_dim=8192,
    decoder_attention_heads=32,
    dropout=0.1,
    activation_fn="gelu",
    attention_dropout=0.1,
    vocab_size=32002,
    subln=True,  # sub-LN approach
    xpos_rel_pos=True,  # rotary positional embeddings
    max_rel_pos=2048
)
decoder = Decoder(
    config,
    embed_tokens=embed,
    embed_positions=embed_positions,
    output_projection=output_projection
)
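For a sense of scale, the configuration above can be turned into a rough weight count. This is a back-of-the-envelope sketch, not a figure from the paper: it assumes an untied output projection and ignores biases, layer norms and positional embeddings.

```python
# Rough weight count for the decoder configured above.
# Assumption: biases, layer norms and positional embeddings are ignored.
layers, dim, ffn_dim, vocab = 24, 2048, 8192, 32002

attention = 4 * dim * dim         # q, k, v and output projections
feed_forward = 2 * dim * ffn_dim  # the two feed-forward projections
decoder_params = layers * (attention + feed_forward)
embedding_params = vocab * dim    # token embedding table
output_params = vocab * dim       # separate (untied) output projection

total = decoder_params + embedding_params + output_params
print(f"{total / 1e9:.2f}B parameters")
```

Under these assumptions the decoder stack accounts for most of the weights, with the two vocabulary-sized matrices adding roughly 0.13B on top.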
For the image model (CLIP ViT-L/14) we use a pretrained OpenCLIP model:
from transformers import CLIPModel
clip_model = CLIPModel.from_pretrained("laion/CLIP-ViT-L-14-laion2B-s32B-b82K").vision_model
# encodes images to [batch_size, 257, 1024] (256 patch tokens plus the class token)
features = clip_model(pixel_values=images)["last_hidden_state"]
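As a sanity check on the sequence length, the token count of a ViT follows directly from the image and patch sizes. The sketch below assumes the usual 224x224 input resolution for ViT-L/14, which the text does not state explicitly.

```python
image_size = 224  # assumed input resolution
patch_size = 14   # the "/14" in ViT-L/14

patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2
seq_len_with_cls = num_patches + 1  # one class token sits alongside the patches

print(num_patches, seq_len_with_cls)
```

The 256 quoted in the text counts patch tokens; `last_hidden_state` from the Hugging Face CLIP vision model has one extra position for the class token.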
We follow the default hyperparams for the perceiver resampler as no hyperparams are given in the paper:
from flamingo_pytorch import PerceiverResampler
perceiver = PerceiverResampler(
    dim = 1024,
    depth = 2,
    dim_head = 64,
    heads = 8,
    num_latents = 64,
    num_media_embeds = 256
)
# projects image features to [batch_size, 64, 1024]
image_features = perceiver(features).squeeze(1)
Because the model expects a hidden dimension of 2048, we use a nn.Linear layer to project the image features to the correct dimension and initialize it according to Magneto's initialization scheme:
import torch

image_proj = torch.nn.Linear(1024, 2048, bias=False)
torch.nn.init.normal_(
    image_proj.weight, mean=0, std=2048**-0.5
)
scaled_image_features = image_proj(image_features)
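To get a feel for the scale of this initialization, the standard deviation `2048**-0.5` can be evaluated and checked against sampled weights. This is a pure-Python sketch; `init_std` is a hypothetical helper, and it reproduces only the `std` expression used in the snippet above, not the full scheme from the paper.

```python
import math
import random


def init_std(dim):
    """Standard deviation used above: dim ** -0.5."""
    return dim ** -0.5


std = init_std(2048)  # roughly 0.0221

# Draw weights from N(0, std^2) and measure their spread.
random.seed(0)
weights = [random.gauss(0.0, std) for _ in range(100_000)]
empirical = math.sqrt(sum(w * w for w in weights) / len(weights))

print(f"target std {std:.4f}, empirical std {empirical:.4f}")
```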
The paper describes a SentencePiece tokenizer with a vocabulary of 64007 tokens. For simplicity (as we don't have the training corpus available), we use the next best open-source alternative, which is the pretrained T5-large tokenizer from Hugging Face. This tokenizer has a vocabulary of 32002 tokens.
from transformers import T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained(
    "t5-large",
    additional_special_tokens=["<image>", "</image>"],
    extra_ids=0,
    model_max_length=1984  # 2048 - 64 (image features)
)
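The two numbers in this configuration can be derived by hand. The sketch below assumes the stock t5-large tokenizer consists of 32,000 SentencePiece pieces plus 100 sentinel tokens (which `extra_ids=0` disables), and that the context length is 2048; the variable names are illustrative only.

```python
# Assumption: stock t5-large has 32,000 SentencePiece pieces plus
# 100 sentinel tokens (<extra_id_0> ... <extra_id_99>).
sentencepiece_pieces = 32_000
extra_ids = 0                             # sentinels disabled, as above
special_tokens = ["<image>", "</image>"]  # added for image boundaries

vocab_size = sentencepiece_pieces + extra_ids + len(special_tokens)

context_length = 2048  # assumed total sequence budget
image_tokens = 64      # image features after the perceiver resampler
model_max_length = context_length - image_tokens

print(vocab_size, model_max_length)
```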
We then embed the tokens with a nn.Embedding layer. We actually use a bnb.nn.Embedding from bitsandbytes, which allows us to use 8-bit AdamW later.
import bitsandbytes as bnb
embed = bnb.nn.Embedding(
    32002,  # Num embeddings
    2048,  # Embedding dim
    padding_idx
)
For positional embeddings, we use:
from torchscale.component.embedding import PositionalEmbedding
embed_positions = PositionalEmbedding(
    2048,  # Num embeddings
    2048,  # Embedding dim
    padding_idx
)
Also, we add an output projection layer to project the hidden dimension to the vocabulary size and initialize it according to Magneto's initialization scheme:
output_projection = torch.nn.Linear(
    2048, 32002, bias=False
)
torch.nn.init.normal_(
    output_projection.weight, mean=0, std=2048**-0.5
)
I had to make some slight changes to the decoder to allow it to accept already embedded features in the forward pass. This was necessary to allow the more complex input sequence described above. The changes are visible in the following diff at line 391 of torchscale/architecture/decoder.py:
+if kwargs.get("passed_x", None) is None:
+    x, _ = self.forward_embedding(
+        prev_output_tokens, token_embeddings, incremental_state
+    )
+else:
+    x = kwargs["passed_x"]
-x, _ = self.forward_embedding(
-    prev_output_tokens, token_embeddings, incremental_state
-)
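The pattern in this diff, using precomputed embeddings when the caller supplies them and falling back to the embedding lookup otherwise, can be illustrated without torchscale. `TinyDecoder` below is a hypothetical toy class, not torchscale's `Decoder`; only the `passed_x` keyword mirrors the diff.

```python
class TinyDecoder:
    """Toy stand-in for a decoder; not torchscale's Decoder."""

    def __init__(self, embedding_table):
        self.embedding_table = embedding_table  # token id -> vector

    def forward_embedding(self, token_ids):
        # Normal path: look up an embedding for every token id.
        return [self.embedding_table[t] for t in token_ids]

    def forward(self, token_ids, **kwargs):
        if kwargs.get("passed_x", None) is None:
            x = self.forward_embedding(token_ids)
        else:
            # Caller already built the embedded sequence (e.g. text and
            # image features interleaved), so skip the lookup.
            x = kwargs["passed_x"]
        return x


toy_decoder = TinyDecoder({0: [0.0, 0.0], 1: [1.0, 1.0]})
from_ids = toy_decoder.forward([1, 0])
from_embeddings = toy_decoder.forward([1, 0], passed_x=[[0.5, 0.5], [0.25, 0.25]])
```

Keeping the fallback means existing callers that pass only token ids keep working unchanged.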
- We're actively seeking cloud providers or grant providers to train this all-new model and release it open source. If you would like to learn more, please email me at kye@apac.ai
- Integrate flash attention inside torchscale/component/multihead_attention.py
- Integrate "One Write-Head Is All You Need" (multi-query attention)
- Look into integrating qk_norm
- Look into integrating the Falcon LLM tokenizer if it allows special tokens
- Prepare datasets, training strategies, and infrastructure for massive production-level training
- Run tests and make sure the model trains well with all optimizations on a small dataset