M3rlin

Multilingual, Multimodal, Multidomain (M3) Model

Training uses the OpenLM and FMEngine codebases, both of which can run on JUWELS.
This repo contains only the M3rlin-specific code: data loading, interleaving of embeddings, and an extra MSE loss.
Clone or pip install OpenLM or FMEngine directly to use it for training.

Add extraction of images or embeddings from an HF dataset or JSONL (JSONL is usually faster).
Test that the embeddings are saved in WebDataset format.
Test loading embeddings and token ids in train.py.
Write the code to insert token_id into the token sequences.
Embeddings should be saved in a 3D tensor (batch, embedding_id, embedding_dim) and returned.
Positions are an array of 3D tensors (batch, sequence_id, column_id).
Add the MSE loss.
Confirm and test the MSE loss.
Add up-projection and down-projection for the input and output embeddings.
Add PEFT and freezing of the base model.
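
The interleaving, projection, and MSE loss items above can be sketched roughly as follows. This is a minimal NumPy illustration, not the repo's implementation: the function names `interleave_embeddings` and `embedding_mse_loss` are hypothetical, the projections are assumed to be plain matrices, and positions are simplified to a 2D integer array (batch, embedding_id) holding the sequence index of each embedding, which may differ from the layout the training code ends up using.

```python
import numpy as np


def interleave_embeddings(token_embeds, extra_embeds, positions, up_proj):
    """Write up-projected external embeddings into a token-embedding sequence.

    token_embeds: (batch, seq_len, model_dim) embeddings of the token ids.
    extra_embeds: (batch, num_embeddings, embedding_dim) external embeddings.
    positions:    (batch, num_embeddings) sequence index of each embedding
                  (assumed layout, for illustration only).
    up_proj:      (embedding_dim, model_dim) up-projection matrix.
    """
    out = token_embeds.copy()
    projected = extra_embeds @ up_proj  # (batch, num_embeddings, model_dim)
    batch_idx = np.arange(out.shape[0])[:, None]  # broadcasts against positions
    out[batch_idx, positions] = projected
    return out


def embedding_mse_loss(hidden, targets, positions, down_proj):
    """MSE between down-projected hidden states at `positions` and `targets`.

    hidden:    (batch, seq_len, model_dim) hidden states.
    targets:   (batch, num_embeddings, embedding_dim) target embeddings.
    down_proj: (model_dim, embedding_dim) down-projection matrix.
    """
    batch_idx = np.arange(hidden.shape[0])[:, None]
    predicted = hidden[batch_idx, positions] @ down_proj
    return float(np.mean((predicted - targets) ** 2))


# Toy usage with random data; all sizes here are arbitrary.
rng = np.random.default_rng(0)
batch, seq_len, model_dim, embedding_dim = 2, 6, 8, 4
token_embeds = rng.normal(size=(batch, seq_len, model_dim))
extra_embeds = rng.normal(size=(batch, 2, embedding_dim))
positions = np.array([[1, 4], [0, 3]])
up_proj = rng.normal(size=(embedding_dim, model_dim))
down_proj = rng.normal(size=(model_dim, embedding_dim))

inputs = interleave_embeddings(token_embeds, extra_embeds, positions, up_proj)
# Stand-in for model output: the inputs are reused as hidden states here.
loss = embedding_mse_loss(inputs, extra_embeds, positions, down_proj)
```

In actual training the hidden states would come from the language model rather than from the inputs, and the loss would be computed in the training framework so that gradients can flow through both projections.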

huu4ontocord/M3rlin