This is a fork of LLaMA. The purpose of this fork is to run the smallest 7B model on two 8GB GPUs (e.g. 2 × RTX 2080 8GB).
- Get the model files following the original repo's instructions.
- Install the dependencies.
- Run the code in `simple-example.py` line by line.
- Replace fairscale's complex parallel layers with plain torch layers (see the first sketch after this list).
- Initialize the model on two GPUs. `BLOCKS_IN_GPU0` controls how the model is split between them (see the second sketch below).
- Minor changes in `generation.py` to move the model's output to GPU0 (there are some operations in `generation.py` that need to be done on GPU0; see the third sketch below).