qwopqwop200/GPTQ-for-LLaMa

Sample code does not work

foamliu opened this issue · 2 comments

Thanks for the great work. Here are the errors on my side (one host with eight V100 GPUs):

CUDA_VISIBLE_DEVICES=0 python llama_inference.py /home/xxx/models/hf_converted_llama/7B/ --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama"
Loading model ...
Found 3 unique KN Linear values.
Warming up autotune cache ...
100%|████████████████████| 12/12 [00:33<00:00, 2.80s/it]
Found 1 unique fused mlp KN values.
Warming up autotune cache ...
0%| | 0/12 [00:00<?, ?it/s]python: /project/lib/Analysis/Allocation.cpp:42: std::pair<llvm::SmallVector, llvm::SmallVector > mlir::triton::getCvtOrder(const mlir::Attribute&, const mlir::Attribute&): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
Aborted (core dumped)

I have the same issue.

Have you tried the --no_fused_mlp option when running the command? If that solves it, we can add a note to the README and close this issue. I added that option because I ran into the same error.
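
For reference, this would just be the original command from the report with the suggested flag appended (everything else unchanged, assuming the paths from the first post):

CUDA_VISIBLE_DEVICES=0 python llama_inference.py /home/xxx/models/hf_converted_llama/7B/ --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama" --no_fused_mlp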