dblalock/bolt

About `zip_bolt_colmajor`

sunhs opened this issue · 2 comments

sunhs commented

    out_ptrs[gg] = codes_out + (col_out * simd_vec_sz);

I'm sorry but I don't quite understand this line.

Let's say we have ncols_out_per_group=4. Then we have

    out_ptrs[0] = codes_out + (col_out_1 * 32)
    out_ptrs[1] = codes_out + (col_out_2 * 32)
    out_ptrs[2] = codes_out + (col_out_3 * 32)
    out_ptrs[3] = codes_out + (col_out_4 * 32)

where col_out_2 = col_out_1 + 1, col_out_3 = col_out_2 + 1, and col_out_4 = col_out_3 + 1.

Then, when we're done with one block, as in

    out_ptrs[gg] += simd_vec_sz * ncolgroups;

out_ptrs[gg] is increased by 32 * ncolgroups. If we have, say, ncolgroups=2, then out_ptrs[0] would point to where out_ptrs[2] started, and out_ptrs[1] would point to where out_ptrs[3] started.
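
The offsets described above can be checked with a few lines of Python. This is only a sketch of the arithmetic as stated in this question, not of what bolt actually does; the starting column `col_out_first = 0` and the variable names are assumptions made for illustration.

```python
SIMD_VEC_SZ = 32          # simd_vec_sz from the snippet above
ncols_out_per_group = 4   # value assumed in the question
ncolgroups = 2            # value assumed in the question
col_out_first = 0         # hypothetical starting column, chosen for illustration

# Byte offsets (relative to codes_out) of out_ptrs[0..3] at the start of a block.
start = [(col_out_first + gg) * SIMD_VEC_SZ for gg in range(ncols_out_per_group)]

# Offsets after one "out_ptrs[gg] += simd_vec_sz * ncolgroups" step.
after_one_block = [off + SIMD_VEC_SZ * ncolgroups for off in start]

print(start)            # [0, 32, 64, 96]
print(after_one_block)  # [64, 96, 128, 160]
```

With these numbers, the first two advanced offsets coincide with the starting offsets of out_ptrs[2] and out_ptrs[3], which is the overlap being asked about.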

If I'm mistaken, could you please point it out? It would be best if you could describe the layout of the zipped output.

Thanks a lot!

dblalock commented

If I recall correctly, we go from storing 4-bit codes as u8 values in column-major order to storing pairs of 4-bit codes in a blocked column-major layout. In more detail:

  • Suppose our input is of shape (N, C).
  • Assuming N % B == 0 for some block size B (in this case 32), we can view it as chunks of shape (N/B, B, C).
  • Then we do (0-based indices, inclusive ranges):

    for n in 0 to N/B - 1:
        for c in 0 to C/2 - 1:
            output[n, :, c] = pack(input[n, :, (2*c):(2*c + 1)])  # pack two 32B cols into one 32B col

where pack() combines pairs of 4-bit values into a single byte by left-shifting one of the values by 4. The goal is to let us do sequential, vectorized reads across all codebooks for blocks of 32 rows (with the codebook indices packed to reduce reads).
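
As a rough illustration of that layout, here is a minimal pure-Python sketch. It is not the bolt implementation: the function names `pack` and `zip_blocked`, the flat-list representation, and the choice of which code goes into the high nibble are all assumptions made for the example.

```python
B = 32  # block size in rows, as in the description above

def pack(lo, hi):
    # Combine two 4-bit codes into one byte.
    # Assumption: the second code is the one shifted into the high nibble.
    return (lo & 0x0F) | ((hi & 0x0F) << 4)

def zip_blocked(codes, nrows, ncols):
    # codes: flat column-major list, element (r, c) lives at codes[c * nrows + r].
    # Returns a flat list where block n, packed column c starts at
    # offset (n * (ncols // 2) + c) * B and holds B consecutive bytes.
    assert nrows % B == 0 and ncols % 2 == 0
    nblocks = nrows // B
    ncols_out = ncols // 2
    out = [0] * (nrows * ncols_out)
    for n in range(nblocks):
        for c in range(ncols_out):
            base = (n * ncols_out + c) * B
            for r in range(B):
                row = n * B + r
                lo = codes[(2 * c) * nrows + row]
                hi = codes[(2 * c + 1) * nrows + row]
                out[base + r] = pack(lo, hi)
    return out

# Small example: 64 rows (two blocks), 4 codebooks, codes in [0, 15].
nrows, ncols = 64, 4
codes = [(r + c) % 16 for c in range(ncols) for r in range(nrows)]
packed = zip_blocked(codes, nrows, ncols)
```

In this sketch every run of 32 output bytes covers one packed column of one block, so reading the output sequentially walks across all packed columns of block 0, then all packed columns of block 1, and so on.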

Hope that helps. Let me know if it's still not clear (this function has no docs, after all...)

sunhs commented

@dblalock Thanks a lot. I'll look into it again.