training/pretraining notes for the mm (multimodal) model

other notes

flash-attn

recent transformers versions force flash-attn on some modules (e.g. mistral), so if anything changes in the env and i update transformers/torch, it will break a lot of what i have. borah's GPUs are quadro RTX 8000s, which are turing, so do not try to install flash-attn 2+: it will either fail to build or install but break at runtime, since flash-attn 2 requires ampere or newer. the max version that seems to work and installs easily is flash-attn 1.0.8:

MAX_JOBS=4 pip install flash-attn==1.0.8 --no-build-isolation
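
a quick sanity check i can run on a node before touching the env (just a sketch; the (8, 0) cutoff is because flash-attn 2 targets ampere/sm80 and newer, while turing is compute capability 7.5):

```python
import torch


def flash_attn2_supported() -> bool:
    """flash-attn 2 needs ampere (sm80) or newer; turing (quadro RTX 8000) is sm75."""
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 0)


if __name__ == "__main__":
    cc = torch.cuda.get_device_capability() if torch.cuda.is_available() else None
    print("compute capability:", cc)
    print("flash-attn 2 ok:", flash_attn2_supported())
```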

layout

  • scripts
    • should have 2 versions of the train script
      • 1 single-gpu version so i can verify the dataset/model/etc works as expected and debug more easily
      • 1 fsdp version that is simple and based mostly on finetune llama
  • src/config
    • configs, model init, etc. for local/eng dev vs borah (small sketch after this list)
  • src/pretrain_mm
    • datasets
      • not clear to me yet how much data i need or how much data from others is actually useful
      • from most useful to least is probably: mind2web, silatus, common_scenes
      • mind2web is the one i can make most similar to what the eventual mm agent will look like
    • distributed
      • fsdp related; keep as much of it out of the main scripts as possible since it makes debugging much harder (see the fsdp sketch after this list)
    • processor
      • need a more customizable way to do image+text tokenization/vectorizing for later in-context tasks (sketch after this list)
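
rough idea of the config split for src/config (file/field names are placeholders, just sketching the local/eng vs borah distinction; the paths are the ones from the data section below):

```python
# src/config/env.py (hypothetical) -- per-environment paths/settings
from dataclasses import dataclass


@dataclass
class EnvConfig:
    data_dir: str        # where datasets live on this machine
    num_workers: int = 4  # dataloader workers, placeholder value


# local/eng dev box vs borah scratch
LOCAL = EnvConfig(data_dir="/data/graham/datasets", num_workers=8)
BORAH = EnvConfig(data_dir="/bsuhome/gannett/scratch/datasets")
```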
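
for the distributed piece, a minimal sketch of what i mean by keeping fsdp out of the train scripts (module/function names are made up; the fsdp script would call this once, the single-gpu script would skip it entirely):

```python
# src/pretrain_mm/distributed/fsdp_utils.py (hypothetical module)
import functools

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


def wrap_model_fsdp(model: torch.nn.Module, layer_cls: type) -> FSDP:
    """wrap the model so the train script never touches fsdp details.

    shards at the given transformer-layer class (e.g. the model's decoder layer).
    assumes torch.distributed is already initialized (e.g. via torchrun).
    """
    policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={layer_cls},
    )
    return FSDP(
        model,
        auto_wrap_policy=policy,
        device_id=torch.cuda.current_device(),
    )
```

then the fsdp train script is just `model = wrap_model_fsdp(model, DecoderLayerCls)` and everything else stays identical to the single-gpu version.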
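
and roughly what i mean by a more customizable processor (names and signature are placeholders, assuming a standard transformers tokenizer + image processor; the point is just that image/text vectorizing is controlled in one place for later in-context tasks):

```python
# src/pretrain_mm/processor/mm_processor.py (hypothetical)
from dataclasses import dataclass
from typing import Any

import torch


@dataclass
class MMProcessor:
    """pairs a tokenizer with an image processor so tasks can control
    how image patches and text tokens get combined for in-context examples."""

    tokenizer: Any         # e.g. a transformers tokenizer
    image_processor: Any   # e.g. a transformers image processor
    image_token: str = "<image>"

    def __call__(self, text: str, image=None) -> dict[str, torch.Tensor]:
        enc = self.tokenizer(text, return_tensors="pt")
        out = {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]}
        if image is not None:
            pixels = self.image_processor(image, return_tensors="pt")
            out["pixel_values"] = pixels["pixel_values"]
        return out
```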

data

  • keeping all the data on eng402001 since i have plenty of space there and i know it won't get auto-deleted like on the cluster

  • mind2web

    • need to use globus; it was a bit of an ordeal to get the transfer done
    • once i have it somewhere i can stage/dev, transfer to borah with:
      • scp -r /data/graham/datasets/mind2web/data borah:/bsuhome/gannett/scratch/datasets/mind2web