sionescu/static-vectors

32 byte alignment for AVX2

digikar99 opened this issue · 4 comments

Following up on this discussion: the test below suggests that static-vectors are currently aligned to 16-byte memory addresses, not 32-byte ones:

SB-VM> (disassemble 'd4-ref)
; disassembly for D4-REF
; Size: 83 bytes. Origin: #x52B8FA9A                          ; D4-REF
; 9A:       498B4510         MOV RAX, [R13+16]                ; thread.binding-stack-pointer
; 9E:       488945F8         MOV [RBP-8], RAX
; A2:       C5FD2844BA01     VMOVAPD YMM0, [RDX+RDI*4+1]
; A8:       49896D28         MOV [R13+40], RBP                ; thread.pseudo-atomic-bits
; AC:       498B5568         MOV RDX, [R13+104]               ; thread.alloc-region
; B0:       4C8D5A30         LEA R11, [RDX+48]
; B4:       4D3B5D70         CMP R11, [R13+112]
; B8:       7729             JNBE L2
; BA:       4D895D68         MOV [R13+104], R11               ; thread.alloc-region
; BE: L0:   66C7026905       MOV WORD PTR [RDX], 1385
; C3:       80CA0F           OR DL, 15
; C6:       49316D28         XOR [R13+40], RBP                ; thread.pseudo-atomic-bits
; CA:       7402             JEQ L1
; CC:       CC09             INT3 9                           ; pending interrupt trap
; CE: L1:   48C742F904000000 MOV QWORD PTR [RDX-7], 4
; D6:       C5FC114201       VMOVUPS [RDX+1], YMM0
; DB:       488BE5           MOV RSP, RBP
; DE:       F8               CLC
; DF:       5D               POP RBP
; E0:       C3               RET
; E1:       CC10             INT3 16                          ; Invalid argument count trap
; E3: L2:   6A31             PUSH 49
; E5:       E84B0947FF       CALL #x52000435                  ; CONS->RNN.AVX2
; EA:       5A               POP RDX
; EB:       EBD1             JMP L0
NIL
SB-VM> (loop for i below 1000000 do
            (static-vectors:with-static-vector
                (a 8 :element-type 'double-float
                   :initial-contents '(2.0d0 3.0d0 4.0d0 5.0d0 2.0d0 3.0d0 4.0d0 5.0d0))
              (declare (optimize speed))
              (d4-ref a 2)))
NIL

Is it possible to get 32-byte alignment, perhaps as an option, for aligned AVX2 access? One alternative I can think of is to use displaced arrays.

EDIT: Running the loop proves nothing! It turned out that the same memory was being reused on every iteration. Some further testing does suggest, however, that allocations are 16-byte aligned but not 32-byte aligned.

No, you cannot get guaranteed 32-byte alignment. What you can do is allocate a slightly larger vector than needed (add 32 to the desired size, then round upwards to a multiple of 32), then use STATIC-VECTOR-POINTER to compute the offset that is 32-byte aligned.
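A minimal sketch of the arithmetic in the reply above, in plain Common Lisp. The helper names `padded-byte-size` and `alignment-byte-offset` are hypothetical and not part of static-vectors. The commented usage at the end assumes the static-vectors and cffi systems are loaded and that the `:offset` argument of `static-vector-pointer` is counted in bytes; treat it as an illustration rather than a tested recipe.

```lisp
;; Hypothetical helpers, not part of static-vectors.

(defun padded-byte-size (nbytes &optional (alignment 32))
  "Add ALIGNMENT to NBYTES, then round up to a multiple of ALIGNMENT."
  (* alignment (ceiling (+ nbytes alignment) alignment)))

(defun alignment-byte-offset (address &optional (alignment 32))
  "Number of bytes to skip from ADDRESS to reach the next address
that is a multiple of ALIGNMENT (0 if ADDRESS is already aligned)."
  ;; MOD returns a result with the sign of the divisor, so this is
  ;; always in the range [0, ALIGNMENT).
  (mod (- address) alignment))

;; Assumed usage (requires static-vectors and cffi; not run here):
;;
;; (let* ((v (static-vectors:make-static-vector
;;            (padded-byte-size 64) :element-type '(unsigned-byte 8)))
;;        (base (cffi:pointer-address
;;               (static-vectors:static-vector-pointer v)))
;;        (aligned (static-vectors:static-vector-pointer
;;                  v :offset (alignment-byte-offset base))))
;;   ;; ALIGNED should now point at a 32-byte boundary inside V.
;;   (static-vectors:free-static-vector v))
```

With a 16-byte-aligned base address, the offset is either 0 or 16 bytes, so the extra 32 bytes of padding always leave room for the full payload after the aligned start.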

Does this require support on the implementation side? M-. led me to %foreign-alloc, which in turn calls sb-alien:make-alien on SBCL.

Yes, this would need implementation support.

Alright, thanks for confirming!