lamma2-MP

A fork of FAIR's LLaMA that can run inference on two 8GB GPUs without quantization. Easy-to-understand code changes.


Llama 2 MP (Model Parallel)

This is a fork of LLaMA. The purpose of this fork is to run the smallest 7B model on two 8GB GPUs (e.g. 2× RTX 2080 8GB).

How to run

  1. Get the model files by following the original repo's instructions.
  2. Install the dependencies.
  3. Run the code in simple-example.py line by line (a sketch of those steps follows below).
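
For orientation, here is a minimal sketch of the kind of steps simple-example.py walks through, based on the upstream LLaMA example API. The checkpoint paths, sequence limits, and generation parameters are illustrative, not necessarily this fork's exact values:

```python
import json
import torch
from llama import ModelArgs, Transformer, Tokenizer, LLaMA  # modules from the upstream repo

ckpt_dir = "7B"  # illustrative path to the downloaded weights
tokenizer = Tokenizer(model_path="tokenizer.model")

# Build the model config from the checkpoint's params.json plus runtime limits.
params = json.load(open(f"{ckpt_dir}/params.json"))
model_args = ModelArgs(max_seq_len=512, max_batch_size=1, **params)
model_args.vocab_size = tokenizer.n_words

torch.set_default_tensor_type(torch.HalfTensor)  # fp16 weights, no quantization
model = Transformer(model_args)  # in this fork, layers are placed on cuda:0 and cuda:1
checkpoint = torch.load(f"{ckpt_dir}/consolidated.00.pth", map_location="cpu")
model.load_state_dict(checkpoint, strict=False)

generator = LLaMA(model, tokenizer)
print(generator.generate(["The capital of France is"],
                         max_gen_len=32, temperature=0.8, top_p=0.95)[0])
```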

What I did

  1. Replaced fairscale's model-parallel layers with plain torch layers (see the first sketch after this list).
  2. Initialized the model across two GPUs. BLOCKS_IN_GPU0 controls how the transformer blocks are split between them (second sketch below).
  3. Made minor changes in generation.py to move the model's output back to GPU 0, since some operations in generation.py have to run on GPU 0 (third sketch below).
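
On point 1: upstream LLaMA builds its attention and feed-forward projections with fairscale's tensor-parallel layers. With one model split across two local GPUs there is no tensor parallelism to coordinate, so they can be swapped for plain nn.Linear with the same shapes. A sketch of the substitution (the dimensions are the 7B model's; the fork's exact code may differ):

```python
import torch.nn as nn

# Upstream (fairscale tensor-parallel layers, which need a process group):
#   from fairscale.nn.model_parallel.layers import ColumnParallelLinear, RowParallelLinear
#   self.wq = ColumnParallelLinear(dim, n_heads * head_dim, bias=False, gather_output=False)
#   self.wo = RowParallelLinear(n_heads * head_dim, dim, bias=False, input_is_parallel=True)

# This fork (plain torch layers, same shapes, no distributed setup needed):
dim, n_heads, head_dim = 4096, 32, 128  # 7B dimensions
wq = nn.Linear(dim, n_heads * head_dim, bias=False)
wo = nn.Linear(n_heads * head_dim, dim, bias=False)
```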
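On point 2, a minimal sketch of a layer split controlled by BLOCKS_IN_GPU0: the first BLOCKS_IN_GPU0 transformer blocks live on GPU 0, the rest on GPU 1, and the activations hop devices once per forward pass. The class name and the value 16 are illustrative, not the fork's exact code:

```python
import torch
import torch.nn as nn

BLOCKS_IN_GPU0 = 16  # illustrative: 16 of the 7B model's 32 blocks on GPU 0, 16 on GPU 1

class SplitBlocks(nn.Module):
    """Pipeline-style split of a stack of transformer blocks across two GPUs."""

    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        for i, block in enumerate(blocks):
            block.to("cuda:0" if i < BLOCKS_IN_GPU0 else "cuda:1")
        self.blocks = blocks

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        for i, block in enumerate(self.blocks):
            if i == BLOCKS_IN_GPU0:
                h = h.to("cuda:1")  # cross the split point once
            h = block(h)
        return h  # note: the output ends up on cuda:1
```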
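On point 3, because the last blocks (and therefore the logits) sit on GPU 1, the sampling step in generation.py needs the output moved back first. A runnable stand-in for that change (the tensor shape and temperature are illustrative):

```python
import torch

logits = torch.randn(1, 32000, device="cuda:1")  # stand-in for the model's last-layer output
logits = logits.to("cuda:0")                     # the fork's change: bring the output to GPU 0
probs = torch.softmax(logits / 0.8, dim=-1)      # temperature-scaled sampling now runs on GPU 0
next_token = torch.multinomial(probs, num_samples=1)
```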