[help wanted] Why does `torch.compile` dump each Triton kernel?

Question

imShZh opened this issue 7 months ago · 1 comments

Everything in depyf works fine.

After I ran the README example with depyf, the target directory contained multiple files:

├── __compiled_fn_1 AFTER POST GRAD 0.py
├── __compiled_fn_1 Captured Graph 0.py
├── __compiled_fn_1 Forward graph 0.py
├── __compiled_fn_1 kernel 0.py
├── __compiled_fn_1 kernel 1.py
├── __compiled_fn_1 kernel 2.py
├── __compiled_fn_5 AFTER POST GRAD 0.py
├── __compiled_fn_5 Captured Graph 0.py
├── __compiled_fn_5 Forward graph 0.py
├── __compiled_fn_5 kernel 0.py
├── __compiled_fn_5 kernel 1.py
├── full_code_for_toy_example_0.py
├── __transformed_code_0_for_torch_dynamo_resume_in_toy_example_at_9.py
└── __transformed_code_0_for_toy_example.py
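
To see at a glance which artifacts belong to which compiled function, the listing can be grouped with a small script. This is only a minimal sketch and not part of depyf or PyTorch: `group_dumps` and `DUMP_NAME` are hypothetical names, and the filename pattern `<function> <kind> <index>.py` is an assumption read off the listing above.

```python
import re
from collections import defaultdict

# Assumed pattern, taken from the listing: "<function> <kind> <index>.py".
DUMP_NAME = re.compile(r"^(__compiled_fn_\d+) (.+) (\d+)\.py$")


def group_dumps(filenames):
    """Group dump filenames by compiled function, then by artifact kind."""
    groups = defaultdict(lambda: defaultdict(list))
    for name in filenames:
        match = DUMP_NAME.match(name)
        if match is None:
            # Skips files such as full_code_* and __transformed_code_*.
            continue
        fn, kind, index = match.groups()
        groups[fn][kind].append(int(index))
    # Convert to plain dicts with sorted indices for stable output.
    return {
        fn: {kind: sorted(indices) for kind, indices in kinds.items()}
        for fn, kinds in groups.items()
    }


files = [
    "__compiled_fn_1 AFTER POST GRAD 0.py",
    "__compiled_fn_1 Captured Graph 0.py",
    "__compiled_fn_1 Forward graph 0.py",
    "__compiled_fn_1 kernel 0.py",
    "__compiled_fn_1 kernel 1.py",
    "__compiled_fn_1 kernel 2.py",
    "__compiled_fn_5 AFTER POST GRAD 0.py",
    "__compiled_fn_5 Captured Graph 0.py",
    "__compiled_fn_5 Forward graph 0.py",
    "__compiled_fn_5 kernel 0.py",
    "__compiled_fn_5 kernel 1.py",
    "full_code_for_toy_example_0.py",
    "__transformed_code_0_for_torch_dynamo_resume_in_toy_example_at_9.py",
    "__transformed_code_0_for_toy_example.py",
]

summary = group_dumps(files)
for fn, kinds in summary.items():
    print(fn, kinds)
```

For a real dump directory, the `files` list could be replaced with `os.listdir(...)` on that directory.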

Why does `torch.compile` dump `__compiled_fn_1 kernel 1.py` and `__compiled_fn_1 kernel 2.py` in addition to `__compiled_fn_1 kernel 0.py`, given that `__compiled_fn_1 kernel 0.py` already contains the string form of those two Triton kernels?

Answer 1 · 2024-08-21T18:14:14.000Z

Thanks for your interest!

These are the intermediate steps of `torch.compile`. Possibly `torch.compile` generates the two kernels first and then merges them into a single file 🤔