nixified-ai/flake

textgen-nvidia build fails on NixOS-WSL


I updated the flake and now I am getting this error for textgen-nvidia:

error: builder for '/nix/store/900bmg4iknf0yb7r1b3f5xdfarqc9yzy-triton-llvm-14.0.6-f28c006a5895.drv' failed with exit code 1;
       last 10 log lines:
       > In file included from /build/source/llvm/include/llvm/Support/YAMLTraits.h:23,
       >                  from /build/source/llvm/include/llvm/CodeGen/MIRYamlMapping.h:22,
       >                  from /build/source/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h:21,
       >                  from /build/source/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h:18:
       > /build/source/llvm/include/llvm/Support/SourceMgr.h: In member function ‘bool llvm::SMFixIt::operator<(const llvm::SMFixIt&) const’:
       > /build/source/llvm/include/llvm/Support/SourceMgr.h:241: note: ‘-Wmisleading-indentation’ is disabled from this point onwards, since column-tracking was disabled due to the size of the code/headers
       >   241 |     if (Range.Start.getPointer() != Other.Range.Start.getPointer())
       >       |
       > /build/source/llvm/include/llvm/Support/SourceMgr.h:241: note: adding ‘-flarge-source-files’ will allow for more column-tracking support, at the expense of compilation time and memory
       > ninja: build stopped: subcommand failed.
       For full logs, run 'nix log /nix/store/900bmg4iknf0yb7r1b3f5xdfarqc9yzy-triton-llvm-14.0.6-f28c006a5895.drv'.
error: 1 dependencies of derivation '/nix/store/0xf7hi05hpx45khnbwvrhh1rxc5vc9j2-python3.11-triton-2.0.0.drv' failed to build
error (ignored): error: cannot unlink '/tmp/nix-build-nccl-2.18.5-1.drv-3': Directory not empty
error (ignored): error: cannot unlink '/tmp/nix-build-magma-2.7.2.drv-1': Directory not empty
error: 1 dependencies of derivation '/nix/store/vr6knfixvhazw998iqz207dr99ffhbv7-python3-3.11.5-env.drv' failed to build
error: 1 dependencies of derivation '/nix/store/9wbs0ybrpkc81b0x26wrsdhb7c86iqa2-textgen.drv' failed to build

@alexvorobiev please check the output of dmesg to see what the Linux kernel reports about the situation. Are you sure you're not simply running out of memory during the build? If the kernel kills a compiler process, the build fails with a spurious error like this one.

Can you also post more of the log? The excerpt above is truncated, so it does not show the full story.
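For anyone following along, the memory check suggested above can be sketched as a small filter over the kernel log. The function name `oom_lines` is hypothetical, and the patterns are an assumption about how the kernel's OOM killer usually words its messages; they may need adjusting.

```shell
# Hypothetical helper: keep only kernel log lines that look like OOM-killer
# activity. Reads log text on stdin and prints the matching lines.
oom_lines() {
  grep -i -E 'out of memory|oom-kill|killed process'
}
```

Typical use would be `sudo dmesg -T | oom_lines` right after a failed build. A line naming a compiler process such as `cc1plus` would point to memory exhaustion rather than a real compile error.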

Yes, the issue was memory. I increased the amount of memory given to WSL from 16G to 30G, and I also had to reduce the build parallelism Nix uses with --cores. It was using all 24 cores available on my CPU; I tried 8 at random and the build completed. I haven't tried loading any models yet. I was surprised that it wants to build the whole world from scratch (specifically torch and triton-llvm). Shouldn't the packages be in the binary cache?
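As a rough sketch of the workaround described above, the core count can be derived from the memory available instead of guessed. The helper `pick_cores` and the per-job memory budget of 4 GiB are assumptions for illustration, not measured values for triton-llvm or torch.

```shell
# Hypothetical helper: cap build cores so each compile job gets a memory budget.
# Args: total memory (GiB), assumed budget per job (GiB), CPU count.
pick_cores() {
  mem_gib=$1
  per_job_gib=$2
  ncpu=$3
  n=$(( mem_gib / per_job_gib ))   # integer division
  if [ "$n" -lt 1 ]; then n=1; fi  # always allow at least one core
  if [ "$n" -gt "$ncpu" ]; then n=$ncpu; fi
  echo "$n"
}

# Example invocation (not run here): 30 GiB for WSL, assumed 4 GiB per job, 24 CPUs.
# nix build github:nixified-ai/flake#textgen-nvidia --cores "$(pick_cores 30 4 24)"
```

With those numbers the helper suggests 7 cores, close to the 8 that worked here. The WSL memory limit itself is normally raised with a `memory=` entry under `[wsl2]` in `.wslconfig` on the Windows side.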