vosen/ZLUDA

Add a compiler pass to convert 32 bit PTX to 64 bit PTX

Opened this issue · 5 comments

Expected outcome:

Comments:

  • It should be either:
    • Its own compiler flow (make changes to 32 bit PTX and emit textual 64 bit PTX) to work with . That's what we ultimately want.
    • It might be easier to just add it to the existing compiler flow as the first pass. It will work only for ZLUDA.
  • This pass does not have to be fully general. It only needs to be good enough to convert the constructs found in 32 bit PhysX PTX.
  • The biggest changes will be to memory accesses. Every kernel will receive an additional implicit 64 bit pointer argument that points into the memory pool. We will convert every 32 bit pointer access into an offset access into this pool. Texrefs will probably remain unchanged if all the tex instructions use global variables.
  • This pass will be in cahoots with the runtime (#354). For example, the runtime needs to know there's an implicit 64 bit pointer as the first argument.
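To make the pool model above concrete, here is a minimal Python sketch of the two pieces of arithmetic involved: rebasing a 32 bit guest pointer onto the 64 bit pool, and prepending the implicit pointer to the kernel parameters. The function names and the little-endian, naturally aligned parameter layout are assumptions made for illustration only; they are not ZLUDA's actual runtime API.

```python
import struct

MASK32 = 0xFFFF_FFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def to_address64(pool_base: int, ptr32: int) -> int:
    """Zero-extend a 32 bit guest pointer and add it to the 64 bit pool base.

    This mirrors the cvt.u64.u32 + add.u64 pair the pass would emit.
    """
    if not 0 <= ptr32 <= MASK32:
        raise ValueError("guest pointer does not fit in 32 bits")
    return (pool_base + ptr32) & MASK64


def pack_kernel_params(pool_base: int, params32: list[int]) -> bytes:
    """Prepend the implicit 64 bit pool pointer to the original u32 parameters.

    Assumed layout (hypothetical): u64 at offset 0, then each u32 in order.
    """
    return struct.pack(f"<Q{len(params32)}I", pool_base, *params32)
```

For the `add` kernel, `pack_kernel_params(pool_base, [input_offset, output_offset])` would yield a 16 byte parameter buffer, and every load or store inside the kernel would resolve to `to_address64(pool_base, offset)`.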

An example. From:

.version 6.5
.target sm_30
.address_size 32

.visible .entry add(
    .param .u32 input,
    .param .u32 output
)
{
    .reg .u32       in_addr;
    .reg .u32       out_addr;
    .reg .u32       temp;
    .reg .u32       temp2;

    ld.param.u32    in_addr, [input];
    ld.param.u32    out_addr, [output];

    ld.u32          temp, [in_addr];
    add.u32         temp2, temp, 1;
    st.u32          [out_addr], temp2;
    ret;
}

to:

.version 6.5
.target sm_30
.address_size 64

.visible .entry add(
    .param .u64 _implicit_buffer,
    .param .u32 input,
    .param .u32 output
)
{
    .reg .u32       in_addr;
    .reg .u32       out_addr;
    .reg .u32       temp;
    .reg .u32       temp2;
    .reg .u64       _implicit_buffer_ptr;

    ld.param.u64    _implicit_buffer_ptr, [_implicit_buffer];
    ld.param.u32    in_addr, [input];
    ld.param.u32    out_addr, [output];

    {
        .reg .u64   _address_64;
        .reg .u64   _offset_64;
        cvt.u64.u32 _offset_64, in_addr;
        add.u64     _address_64, _implicit_buffer_ptr, _offset_64;
        ld.u32      temp, [_address_64];
    }
    add.u32         temp2, temp, 1;
    {
        .reg .u64   _address_64;
        .reg .u64   _offset_64;
        cvt.u64.u32 _offset_64, out_addr;
        add.u64     _address_64, _implicit_buffer_ptr, _offset_64;
        st.u32      [_address_64], temp2;
    }
    ret;
}
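A rough sketch of how the textual variant of this pass could look, written in Python purely for illustration (a real pass would work on a parsed representation, not on regexes). It only handles the shapes in the example above: generic `ld`/`st` through a register that holds a 32 bit address, kernels with at least one parameter, and a kernel body that closes with an unindented `}`. The function names are hypothetical.

```python
import re

# Only the simple forms from the example: "ld.u32 dst, [reg];" and
# "st.u32 [reg], src;". ld.param and state-space qualified accesses do not match.
LOAD = re.compile(r"^(\s*)ld\.(\w+)\s+([%\w]+),\s*\[([%\w]+)\];\s*$")
STORE = re.compile(r"^(\s*)st\.(\w+)\s+\[([%\w]+)\],\s*([%\w]+);\s*$")


def _rebased(indent: str, addr32: str, access: str) -> list[str]:
    """Wrap one access in a block that turns the 32 bit offset into a 64 bit address."""
    inner = indent + "    "
    return [
        indent + "{",
        inner + ".reg .u64   _address_64;",
        inner + ".reg .u64   _offset_64;",
        inner + f"cvt.u64.u32 _offset_64, {addr32};",
        inner + "add.u64     _address_64, _implicit_buffer_ptr, _offset_64;",
        inner + access,
        indent + "}",
    ]


def rewrite_access(line: str) -> list[str]:
    """Rewrite a single load or store; return any other line unchanged."""
    m = LOAD.match(line)
    if m:
        indent, ty, dst, addr = m.groups()
        return _rebased(indent, addr, f"ld.{ty} {dst}, [_address_64];")
    m = STORE.match(line)
    if m:
        indent, ty, addr, src = m.groups()
        return _rebased(indent, addr, f"st.{ty} [_address_64], {src};")
    return [line]


def convert_32_to_64(ptx: str) -> str:
    """Convert the narrow subset of 32 bit PTX shown in the example to 64 bit PTX."""
    out: list[str] = []
    state = "top"  # top -> signature -> decls -> body -> top
    for line in ptx.splitlines():
        stripped = line.strip()
        if stripped == ".address_size 32":
            out.append(".address_size 64")
            continue
        if state == "top":
            out.append(line)
            if ".entry" in stripped and stripped.endswith("("):
                # Implicit pool pointer becomes the first kernel parameter.
                out.append("    .param .u64 _implicit_buffer,")
                state = "signature"
            continue
        if state == "signature":
            out.append(line)
            if stripped == "{":
                out.append("    .reg .u64       _implicit_buffer_ptr;")
                state = "decls"
            continue
        if state == "decls" and stripped and not stripped.startswith("."):
            # First instruction reached: load the pool base before it.
            out.append("    ld.param.u64    _implicit_buffer_ptr, [_implicit_buffer];")
            state = "body"
        if state == "body":
            out.extend(rewrite_access(line))
            if line.startswith("}"):
                state = "top"
        else:
            out.append(line)
    return "\n".join(out)
```

Feeding the 32 bit `add` kernel through `convert_32_to_64` gives output equivalent to the 64 bit listing, except that the `_implicit_buffer_ptr` declaration lands at the top of the body. Address operands with an immediate offset (`[reg+4]`), vector accesses, and accesses that name a variable directly are deliberately left out of this sketch.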

I'd love to contribute to the project. However, I don't have access to an AMD GPU. Is there a way to set up a testing environment or a simulation to verify functionality without the target hardware?

So, is it possible that after the 32-bit (x86) program is converted to 64-bit (x64), I'll be able to run it correctly on my NVIDIA GPU? I'd like to confirm if this is the expected outcome.

I've looked at the FluidMark program, which is a benchmark for PhysXCore. It's a fatbin composed of SASS for sm_10, sm_20, and sm_30. I tried to use the PhysX SDK to compile PTX but ran into complex environment issues. Could you provide a PhysX benchmark that only uses PTX?

So, is it possible that after the 32-bit (x86) program is converted to 64-bit (x64), I'll be able to run it correctly on my NVIDIA GPU? I'd like to confirm if this is the expected outcome.

It's up to the implementer. After it's converted from 32 bit to 64 bit, it could run on either (with the AMDGPU-only path probably being a bit simpler).

I've looked at the FluidMark program, which is a benchmark for PhysXCore. It's a fatbin composed of SASS for sm_10, sm_20, and sm_30. I tried to use the PhysX SDK to compile PTX but ran into complex environment issues. Could you provide a PhysX benchmark that only uses PTX?

I'm not sure what you mean; FluidMark fatbins contain PTX. You can see the logs here: #352