32 byte alignment for AVX2
digikar99 opened this issue · 4 comments
Following up on this discussion: the test below suggests that static-vectors are currently aligned to 16-byte memory addresses:
SB-VM> (disassemble 'd4-ref)
; disassembly for D4-REF
; Size: 83 bytes. Origin: #x52B8FA9A ; D4-REF
; 9A: 498B4510 MOV RAX, [R13+16] ; thread.binding-stack-pointer
; 9E: 488945F8 MOV [RBP-8], RAX
; A2: C5FD2844BA01 VMOVAPD YMM0, [RDX+RDI*4+1]
; A8: 49896D28 MOV [R13+40], RBP ; thread.pseudo-atomic-bits
; AC: 498B5568 MOV RDX, [R13+104] ; thread.alloc-region
; B0: 4C8D5A30 LEA R11, [RDX+48]
; B4: 4D3B5D70 CMP R11, [R13+112]
; B8: 7729 JNBE L2
; BA: 4D895D68 MOV [R13+104], R11 ; thread.alloc-region
; BE: L0: 66C7026905 MOV WORD PTR [RDX], 1385
; C3: 80CA0F OR DL, 15
; C6: 49316D28 XOR [R13+40], RBP ; thread.pseudo-atomic-bits
; CA: 7402 JEQ L1
; CC: CC09 INT3 9 ; pending interrupt trap
; CE: L1: 48C742F904000000 MOV QWORD PTR [RDX-7], 4
; D6: C5FC114201 VMOVUPS [RDX+1], YMM0
; DB: 488BE5 MOV RSP, RBP
; DE: F8 CLC
; DF: 5D POP RBP
; E0: C3 RET
; E1: CC10 INT3 16 ; Invalid argument count trap
; E3: L2: 6A31 PUSH 49
; E5: E84B0947FF CALL #x52000435 ; CONS->RNN.AVX2
; EA: 5A POP RDX
; EB: EBD1 JMP L0
NIL
SB-VM> (loop for i below 1000000 do
         (static-vectors:with-static-vector
             (a 8 :element-type 'double-float
                  :initial-contents '(2.0d0 3.0d0 4.0d0 5.0d0 2.0d0 3.0d0 4.0d0 5.0d0))
           (declare (optimize speed))
           (d4-ref a 2)))
NIL
Is it possible to get 32-byte alignment, perhaps as an option, for aligned AVX2 access? An alternative I could think of is to use displaced arrays.
EDIT: Running the loop proves nothing! It turned out that the same memory was being reallocated on every iteration. Some further testing does suggest, however, that allocations are 16-byte aligned but not 32-byte aligned.
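One way to run such a check is to look at the largest power of two that divides a data address. The sketch below is only an illustration: `address-alignment` is a hypothetical helper, not part of static-vectors, and the addresses in the comments are made-up examples rather than measured values.

```lisp
;; Hypothetical helper (not part of static-vectors or SBCL).
(defun address-alignment (address)
  "Return the largest power of two that divides the integer ADDRESS.
Returns 0 for address 0."
  ;; In two's complement, (logand x (- x)) isolates the lowest set bit.
  (logand address (- address)))

;; Example addresses (made up for illustration):
;;   #x1010 is a multiple of 16 but not of 32
;;   #x1020 is a multiple of 32
(defun aligned-p (address alignment)
  "True if ADDRESS is a multiple of ALIGNMENT (a power of two)."
  (zerop (mod address alignment)))
```

With static-vectors and CFFI loaded, the integer address could presumably be obtained with `(cffi:pointer-address (static-vectors:static-vector-pointer a))` and collected over several differently sized allocations that are kept alive, so that the same memory is not handed out again.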
No, you cannot get guaranteed 32-byte alignment. What you can do is allocate a slightly larger vector than needed (add 32 to the desired size, then round upwards to a multiple of 32), then use STATIC-VECTOR-POINTER to compute the offset that is 32-byte aligned.
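A minimal sketch of that arithmetic, assuming sizes are counted in bytes. The helper names `padded-byte-size` and `aligned-byte-offset` are hypothetical and not part of static-vectors; only `static-vector-pointer` and `cffi:pointer-address`, shown in the commented usage, come from the libraries.

```lisp
;; Hypothetical helpers illustrating the over-allocation approach.

(defun padded-byte-size (byte-size &optional (alignment 32))
  "Add ALIGNMENT bytes of slack to BYTE-SIZE, then round up to a
multiple of ALIGNMENT."
  (* alignment (ceiling (+ byte-size alignment) alignment)))

(defun aligned-byte-offset (address &optional (alignment 32))
  "Number of bytes to skip from ADDRESS to reach the next multiple of
ALIGNMENT. Returns 0 if ADDRESS is already aligned."
  (mod (- address) alignment))

;; Usage sketch (requires static-vectors and CFFI to be loaded):
;; 8 double-floats occupy 64 bytes, padded to 96 bytes = 12 elements.
;;
;; (static-vectors:with-static-vector (v 12 :element-type 'double-float)
;;   (let* ((address (cffi:pointer-address
;;                    (static-vectors:static-vector-pointer v)))
;;          (start (floor (aligned-byte-offset address) 8)))
;;     ;; Elements START .. START+7 of V begin on a 32-byte boundary,
;;     ;; provided ADDRESS is itself a multiple of 8.
;;     (values v start)))
```

The offset has to be recomputed for every allocation, since nothing guarantees that two static vectors start at the same position relative to a 32-byte boundary.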
Does this require support on the implementation side? M-. led me to %foreign-alloc, which, in turn, calls sb-alien:make-alien on SBCL.
Yes, this would need implementation support.
Alright, thanks for confirming!