Machine Learning Engineering Guides and Tools

An open collection of methodologies to help with successful training of large language models and multi-modal models.

This is technical material suitable for LLM/VLM training engineers and operators. That is, the content here contains many scripts and copy-and-paste commands so that you can quickly address your needs.

This repo is an ongoing brain dump of my experiences training Large Language Models (LLMs); much of the know-how was acquired while training the open-source BLOOM-176B model in 2022 and the IDEFICS-80B model in 2023. Currently, I'm working on developing and training open-source retrieval models at Contextual.AI.

I've been compiling this information mostly for myself, so that I can quickly find solutions that I have already researched and that have worked, but as usual I'm happy to share them with the wider ML community.

Contributing

If you find a bug or typo, or would like to propose an improvement, please don't hesitate to open an Issue or contribute a PR.

License

The content of this site is distributed under the Creative Commons Attribution-ShareAlike 4.0 International license (CC BY-SA 4.0).

Debugging software and hardware failures

Performance

Multi-Node networking

Model parallelism

Tensor precision / Data types

Training hyper-parameters and model initializations

Reproducibility

Instabilities

SLURM

Resources

HF Transformers notes