docker build crashes my machine
dustyatx opened this issue · 7 comments
Not sure if this is a problem with my machine or not, but I have a 32-core i9 machine with 128 GB of RAM, and this pegs all my cores at 100%. It looks like there is some substantial compiling going on. The second time it spikes, my machine crashes. I was trying to figure out how to limit the number of cores it uses. It might also spike memory usage up to 128 GB, but that could be an artifact of htop over a disconnected SSH session; hard to say.
This is not my area of expertise, but I haven't run into too many containers doing this kind of intensive build; normally it's just software installation and configuration.
I'm including the terminal logging that I captured.
dockerbuild_terminal_logging.txt
OK, just confirmed that it also crashes on a Google Cloud VM with 22 cores and 128 GB of RAM. Looks like this one ran out of RAM as well.
gc_docker_build_fail_22core_128.txt
Hi! I had the same issue. I changed the Dockerfile to set the `MAX_JOBS` environment variable:

```dockerfile
RUN rm -rf ./flash-attention/* && \
    pip uninstall flash_attn -y && \
    git clone https://github.com/Dao-AILab/flash-attention.git && \
    cd flash-attention/csrc/rotary && MAX_JOBS=4 python setup.py install && \
    cd ../layer_norm && MAX_JOBS=4 python setup.py install && \
    cd ../../ && MAX_JOBS=4 python setup.py install

RUN MAX_JOBS=4 pip install ninja tokenizers==0.14.1 einops transformers==4.34.1
```
I used 4, but I guess you can go for something higher (maybe 10).
@nedRad88 Ah, thank you!! It's been a long time since I worked with Docker. I was trying to pass MAX_JOBS in with the `docker build` command and it wouldn't work. Looking forward to giving this a try.
Yes! @nedRad88's suggestion should work 👍. `ninja` can use a lot of resources when building flash-attn if `MAX_JOBS` is not specified. Feel free to re-open if the issue persists.
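On the earlier point about passing MAX_JOBS on the `docker build` command line: `docker build` does not inherit the caller's shell environment, so the variable has to be declared inside the Dockerfile. A minimal sketch using a build argument (the default of 4 here is just an illustration, not a recommendation):

```dockerfile
# Build-time knob; override with: docker build --build-arg MAX_JOBS=2 .
ARG MAX_JOBS=4
# Persist it as an env var so every subsequent RUN step (and ninja) sees it
ENV MAX_JOBS=${MAX_JOBS}
```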
This workaround did not work for me. I updated the Dockerfile with MAX_JOBS set to 4 and I'm still running out of memory. Any other advice?
```dockerfile
FROM nvcr.io/nvidia/pytorch:23.06-py3

WORKDIR /workdir

RUN rm -rf ./flash-attention/* && \
    pip uninstall flash_attn -y && \
    git clone https://github.com/Dao-AILab/flash-attention.git && \
    cd flash-attention/csrc/rotary && MAX_JOBS=4 python setup.py install && \
    cd ../layer_norm && python setup.py install && \
    cd ../../ && python setup.py install

RUN MAX_JOBS=4 pip install ninja tokenizers==0.14.1 einops transformers==4.34.1
```
OK, I got it to build, but I had to rent a VM big enough to handle it; for some reason the MAX_JOBS setting didn't seem to do anything for me.
96 cores & 604 GB of RAM (300+ GB used at peak).
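On large machines the bottleneck is usually RAM rather than cores, since each parallel compile job can need several GB. A rough shell sketch for picking a cap (the 4-job ceiling is an assumption, not a flash-attention recommendation):

```shell
#!/bin/sh
# Cap parallel compile jobs: use the core count on small machines,
# but never more than 4 jobs to keep peak memory bounded.
cores=$(nproc)
jobs=$(( cores < 4 ? cores : 4 ))
echo "MAX_JOBS=$jobs"
```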
`MAX_JOBS` controls how many processes are launched in parallel for compilation. You could try adding it as an environment variable, or setting it for each flash_attn install step, e.g.:

```dockerfile
cd flash-attention/csrc/rotary && MAX_JOBS=4 python setup.py install && \
cd ../layer_norm && MAX_JOBS=4 python setup.py install && \
cd ../../ && MAX_JOBS=4 python setup.py install
```
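The environment-variable route can be done once in the Dockerfile, so every later `RUN` step inherits the cap instead of repeating it per command. A sketch, assuming the same build layout as above:

```dockerfile
# Set once; all subsequent RUN steps see MAX_JOBS=4
ENV MAX_JOBS=4

RUN git clone https://github.com/Dao-AILab/flash-attention.git && \
    cd flash-attention && python setup.py install
```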