google/heir

Halevi-Shoup Matrix Multiplication


Matrix multiplications are ubiquitous across applications involving machine learning, computer vision, search, and more. Providing an efficient method of matrix multiplications over homomorphically encrypted data can enable the secure evaluation of these applications.

Baseline approach

Let’s consider a typical matrix-vector multiplication where the vector is encrypted but the matrix is unencrypted (i.e. plaintext):

1 2 3 4   a   1a + 2b + 3c + 4d
5 6 7 8 x b = 5a + 6b + 7c + 8d
9 10 11 12   c   9a + 10b + 11c + 12d
13 14 15 16   d   13a + 14b + 15c + 16d

Since homomorphic encryption enables SIMD operations by packing multiple values into a single ciphertext and supporting slot-wise addition and multiplication, one (older) conventional method of matrix-vector multiplication is to pack each row of the matrix into its own vector and pack the to-be-multiplied vector into a vector as well, then compute a dot product for each value of the output vector: a pairwise multiplication between a matrix row and the vector, followed by summing all slots of the resulting vector (with rotation and add operations). For an n-by-n matrix multiplied by an n-sized vector, this requires n multiplications, n(log2(n)) additions, and n(log2(n)) rotations, and produces n ciphertexts, one per value of the output vector.

[1, 2, 3, 4] 		x 	[a, b, c, d]	=	[1a, 2b, 3c, 4d]
[5, 6, 7, 8]		x	[a, b, c, d]	=	[5a, 6b, 7c, 8d]
[9, 10, 11, 12]    	x	[a, b, c, d]	= 	[9a, 10b, 11c, 12d]
[13, 14, 15, 16]	x	[a, b, c, d]	= 	[13a, 14b, 15c, 16d]

Individual values of the output vector can be summed up as follows (only one shown for brevity):

[1a, 2b, 3c, 4d] + rot2([1a, 2b, 3c, 4d]) = [1a+3c, 2b+4d, 1a+3c, 2b+4d]
[1a+3c, 2b+4d, 1a+3c, 2b+4d] + rot1([1a+3c, 2b+4d, 1a+3c, 2b+4d]) = [1a+2b+3c+4d, 1a+2b+3c+4d, 1a+2b+3c+4d, 1a+2b+3c+4d]

The values of the vector can be packed back into one vector by using extract operations and summing the values together.
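
To make the operation counts concrete, here is a small plaintext sketch of the baseline approach, using numpy rolls as a stand-in for ciphertext rotations (purely illustrative; no FHE library involved):

    import numpy as np

    M = np.arange(1, 17).reshape(4, 4)   # the matrix [[1, 2, 3, 4], ..., [13, 14, 15, 16]]
    v = np.array([2.0, 3.0, 5.0, 7.0])   # stand-in for the encrypted vector [a, b, c, d]
    n = 4

    outputs = []
    for row in M:
        prod = row * v                    # one SIMD multiplication per matrix row
        # rotate-and-reduce: log2(n) rotations and log2(n) additions per row
        shift = n // 2
        while shift >= 1:
            prod = prod + np.roll(prod, -shift)
            shift //= 2
        outputs.append(prod[0])           # every slot now holds the row's dot product

    assert np.allclose(outputs, M @ v)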

Halevi-Shoup Matrix Multiplication

A better approach is to intelligently pack the ciphertexts and use rotations to enable a more efficient matrix multiplication resulting in a packed output ciphertext with fewer operations. By diagonalizing the matrix and performing rotation operations where appropriate:

[1, 6, 11, 16] 	x	[a, b, c, d]	=	[1a, 6b, 11c, 16d]
[2, 7, 12, 13]	x	[b, c, d, a]	=	[2b, 7c, 12d, 13a]
[3, 8, 9, 14]  	x	[c, d, a, b]	=	[3c, 8d, 9a, 14b]
[4, 5, 10, 15]	x	[d, a, b, c]	=	[4d, 5a, 10b, 15c]

Finally, we can add the four resulting vectors to obtain one packed ciphertext:

[1a + 2b + 3c + 4d, 5a + 6b + 7c + 8d, 9a + 10b + 11c + 12d, 13a + 14b + 15c + 16d]

This improves performance by requiring only n multiplications, n additions, and n-1 rotations for a packed output ciphertext.
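
A plaintext sketch of the diagonal method, again with numpy rolls standing in for ciphertext rotations (illustrative only):

    import numpy as np

    M = np.arange(1, 17).reshape(4, 4)
    v = np.array([2.0, 3.0, 5.0, 7.0])   # stand-in for the encrypted vector [a, b, c, d]
    n = 4

    # Pack the i-th generalized diagonal: diag_i[j] = M[j][(j + i) % n]
    diagonals = [np.array([M[j, (j + i) % n] for j in range(n)]) for i in range(n)]

    # n multiplications and n additions, with rotations of the single encrypted vector
    result = np.zeros(n)
    for i in range(n):
        result = result + diagonals[i] * np.roll(v, -i)   # np.roll models rotating the ciphertext by i

    assert np.allclose(result, M @ v)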

Optimizing Rotations

One more noteworthy point is that performing a rotation on a ciphertext typically requires a rotation key for that rotation amount. It is possible to generate a single rotation key that rotates the ciphertext by one slot and apply it n times; however, this is inefficient for parallelization and performance, because the last SIMD multiplication must wait for n-1 rotations (which are known to be expensive) to finish first. One possible solution is to have n-1 rotation keys, rotating by 1, 2, 3, …, n-1, but this can be a large number of keys, so a better solution is to have log2(n) rotation keys, rotating by 1, 2, 4, …, 2^(log2(n)-1), and to compose all needed rotations from these. Ultimately this still only requires n-1 rotations, but enables better parallelization by being able to “skip” straight to the later rotations needed.
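
As a sketch of how the power-of-two keys would be used, the helper below (hypothetical, with numpy rolls standing in for keyed rotations) composes a rotation by an arbitrary amount k out of at most log2(n) keyed rotations:

    import numpy as np

    def rotate_by(v, k):
        """Rotate v by k slots using only rotations by powers of two,
        i.e. assuming we only hold keys for the amounts 1, 2, 4, ..., n/2."""
        n = len(v)
        k %= n
        bit = 0
        while k:
            if k & 1:
                v = np.roll(v, -(1 << bit))   # one keyed rotation by 2^bit
            k >>= 1
            bit += 1
        return v

    v = np.arange(8)
    for k in range(8):
        assert np.array_equal(rotate_by(v, k), np.roll(v, -k))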

Other Notes

  • Matrix-matrix multiplications are a natural extension of multiple Halevi-Shoup matrix-vector multiplications.
  • In some applications, such as machine learning training, it is useful for the matrix (in the matrix-vector multiplication) to also be encrypted. Halevi-Shoup matrix-vector multiplication is still possible in that case; however, diagonally packing the matrix could be tricky if the matrix is already packed a different way.

Last Questions

  • What dialect is best to apply this lowering or transform pass? I'm guessing we could be transforming or lowering the linalg.quantized_matmul. Also, what do we lower to?

Progress

  • This issue is currently being addressed by #944 (merged).
j2kun commented

Some minor notes to add.

By diagonalizing the matrix and performing rotation operations where appropriate

Is diagonalize really the term of art? I think we should avoid overloading that with the normal use of the term in linear algebra.

A better solution is to have log2(n) rotation keys, each rotating 1, 2, 4, …, 2^(log2(n)-1) times and doing all possible rotations using these keys

For a longer discussion of how we might do this more generally, see #744

What dialect is best to apply this lowering or transform pass?

Ideally we can apply it at the linalg level. A linalg.generic has a particular form where the output entry is defined by the combination of "affine maps" (index mappings) and the per-output-entry scalar computation (see https://www.lei.chat/posts/mlir-linalg-dialect-and-patterns/). If we can represent this matrix encoding as a change to a linalg.generic, that would prove much more generalizable, and make the extension to matrix-matrix multiplication quite simple. To this end, the first question I have is: can this encoding be expressed purely in terms of affine maps? I'd guess the answer is "yes" (isn't the mapping (i, j) -> (j, i+j mod n)? I think the modulus is supported in affine_map, implicitly or explicitly, but I'd have to check).
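
As a quick sanity check of that mapping (a numpy sketch, not tied to any actual affine_map syntax), applying (i, j) -> (j, (i + j) mod n) to the example matrix reproduces the diagonal packing shown above:

    import numpy as np

    n = 4
    M = np.arange(1, 17).reshape(n, n)

    # Candidate index mapping for the diagonal layout: (i, j) -> (j, (i + j) mod n),
    # i.e. row i of the packed layout holds the i-th generalized diagonal of M.
    packed = np.empty_like(M)
    for i in range(n):
        for j in range(n):
            packed[i, j] = M[j, (i + j) % n]

    assert list(packed[0]) == [1, 6, 11, 16]
    assert list(packed[1]) == [2, 7, 12, 13]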

IMO, expressing the semantics (sum the result vectors) instead of the mechanism (log # of rotate-and-adds) would be a good place to start, since we already have a rotate-and-reduce pass. This would imply you don't need to lower to any special dialect. Though, if we decide we need to lower in an FHE-aware manner, we can use tensor_ext for rotate ops, etc.

I think the least-preferable (but not all that bad) option would be to define Halevi-Shoup operations specifically (e.g., in encode and/or tensor_ext).

diagonally packing the matrix could be tricky if the matrix is already packed a different way.

Repacking seems hard, would prefer static analysis to determine an optimal packing ahead of time, and possibly asking the client to provide multiple packings of the same data. Also cf. #800, which I haven't yet read, for more recent developments on this topic.

Is diagonalize really the term of art? I think we should avoid overloading that with the normal use of the term in linear algebra.

Quickly skimming Halevi's and Shoup's HElib paper, they describe pre-processing the matrix in a "diagonal order" and refer to the matrix-vector multiplication algorithm as a "diagonal-based algorithm," so we could use that terminology instead. Interestingly, Halevi and Shoup cite other sources regarding the origins of the diagonal-based matrix-vector multiplication.

Will respond to the other notes later.

j2kun commented

In https://www.shoup.net/papers/helib.pdf the authors describe it as

the parallel “systolic” multiplication algorithm

But I don't quite understand how this is related to systolic arrays...

@Naming: In my experience, people do call this the "Halevi-Shoup Method" or "Halevi-Shoup Diagonal(ization) Method". The "diagonalization" is definitively unfortunate name overloading, but "systolic" isn't great either. I guess "diagonal-based" works, and one could describe their packing as "diagonal-major" (similar to row-major/column-major)?

While the algorithm is really elegant, it's primarily useful for ptxt-ctxt matrix-vector products, where the "diagonal packing" of the (ptxt) matrix is essentially free, and all you need to do is rotate your ctxt vector a few times. I guess it'd also work for the first ctxt-ctxt mvp, as you could encode the matrix as diagonals. If you want to apply it later on in your program, the cost of transforming to the diagonal packing is usually pretty prohibitive, and people have come up with lots of different approaches that try to balance the efficiency of the actual linalg operations with the cost of getting things into the right representation.

EDIT: See also https://www.microsoft.com/en-us/research/video/private-ai-bootcamp-techniques-in-ppml/ for a nice overview of the technique and how to make it more general (non-square matrices)

See the discussions on HELayers, Viaduct-HE 280 or FHElipe:

I remember Edward Chen (@edwjchen) was looking into this at some point?

can this encoding be expressed purely in terms of affine maps? I'd guess the answer is "yes" (isn't the mapping (i, j) -> (j, i+j mod n)? I think the modulus is supported in affine_map, implicitly or explicitly, but I'd have to check)

I like the simplicity of this, and I guess it's likely to work, but I feel like it's not really in line with the other uses of affine.map? I think of affine.map as something that tells me how a given bit of code should iterate over the elements of a tensor, whereas a packing is really more about a transformation from one tensor type to another, after which you do SIMD operations on the entire tensor, never really touching individual elements.

j2kun commented

That's true, and it would break the lowerings from linalg. And now that I read the Fhelipe paper, I think long term we'd want a custom attribute to represent the layout/packing.

asraa commented

I agree about the packing. I was drafting this out with Lawrence the other day and think we probably want some kind of operation that will eventually lower down to a series of fancy lwe.rlwe_encodes

(secret wrapped code)
%0 = arith.constant [[*]] : tensor<32x32xi4>
%1 = tensor_ext.encode %0 { diagonal_encoding } : tensor<32x32xi4, encoding=#diagonal>
%2 = tosa.matmul/stablehlo.dot_general/etc(%arg0, %1)  

edit: maybe we don't need a tensor_ext operation to model diagonally embedding a tensor, maybe we just lower the matmul/dots in secret-to-scheme lowerings directly to the RLWE operations

Hi everyone! I've been working on a representation and layout assignment protocol that utilizes diagonal packing. Specifically, my representation builds on top of the layout abstraction defined in Viaduct, and includes additional rules to use the Roll operation (and thus diagonalization) to find more optimized compute kernels for linear algebra operations (e.g., matmul or conv).

Rolls are equivalent to the numpy.roll operation, where one dimension is added to another, modulo the length of the target dimension. For example, we can represent the diagonal packing of a square tensor using a Roll.

Visualized, we want to go from a row-major packing to a diagonal packing.

[0, 1] 		=>                  [0, 1, 2, 3]              =>             [0,3,2,1]
[2, 3]		                      

The corresponding layout representation would look like this:

[0, 1]		=>           [0:2:1][1:2:1]                =>           Roll(0,1)[0:2:1][1:2:1]   	    
[2, 3]                  

[0:2:1] represents indexing into dimension 0 with an extent (length) of 2 and a stride of 1.

To interpret the layout representation, we can think of each dimension as a loop body, and we'll index into the original tensor based on produced indices.
[0:2:1] => [0, 0, 1, 1] => [0][0], [0][1], [1][0], [1][1] => [0, 1, 2, 3]
[1:2:1] => [0, 1, 0, 1]

Applying a Roll(0,1) would be adding the dimension at index 1 to the dimension at index 0.
Roll(0,1) [0:2:1] => [0, 1, 1, 0] => [0][0], [1][1], [1][0], [0][1] => [0,3,2,1]
              [1:2:1] => [0, 1, 0, 1]
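
A small numpy sketch of this interpretation (names like dim0 and extent are illustrative): the traversals are materialized as index vectors, and the roll is just elementwise addition modulo the extent.

    import numpy as np

    A = np.array([[0, 1],
                  [2, 3]])
    extent = 2

    # Traversals for the layout [0:2:1][1:2:1]: dim 0 varies slowest, dim 1 fastest
    dim0 = np.repeat(np.arange(extent), extent)   # [0, 0, 1, 1]
    dim1 = np.tile(np.arange(extent), extent)     # [0, 1, 0, 1]

    row_major = A[dim0, dim1]
    assert list(row_major) == [0, 1, 2, 3]

    # Roll(0,1): dim 0 becomes (dim 0 + dim 1) mod extent(dim 0)
    rolled_dim0 = (dim0 + dim1) % extent           # [0, 1, 1, 0]
    diagonal = A[rolled_dim0, dim1]
    assert list(diagonal) == [0, 3, 2, 1]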

One cool application once you're able to reason about rolls is that we can represent ciphertext-ciphertext square matrix multiplication in a compact packing (particularly if tensor sizes are much larger than ciphertext sizes as with Transformer models)!

e.g., (a @ b) where a is [64,64], and b is [64,64], and n is 4096.
a: Roll(0,1) Roll(0,2) [1:64];[0:64][64]
b: Roll(0,1) Roll(0,2) [0:64];[64][1:64]

Starting from a compact row-major packing [0:64][1:64], we can apply cheap permutation operations to get to a desired input for such a tensor operation.

[0:64][1:64]
=> [64];[0:64][1:64] (creating 64 copies)
=> Roll(2,1) [64];[0:64][1:64]
=> Roll(2,0) Roll(2,1) [64];[0:64][1:64]

Because rolling is essentially addition, and addition is commutative, we can swap rolled dimensions for free, thus getting our layout back to the desired representation.
Roll(2,0) Roll(2,1) [64];[0:64][1:64] == Roll(0,1) Roll(0,2) [1:64];[0:64][64]

Implementation Proposal

@asraa and I spent some time discussing how to implement the Halevi-Shoup Matrix Multiplication in the context of machine learning. Here's what we came up with, concretely.

Where do we lower from?

I know @j2kun mentioned lowering from linalg.matmul (or linalg.quantized_matmul). Currently, lowering from machine learning front-ends goes from StableHLO to linalg (with nothing in between) or TOSA to linalg (again, with nothing in between). Ideally, linalg.matmul would be the right op to lower from; after all, both StableHLO and TOSA lower to linalg.matmul. The problem is that when lowering from tosa.fully_connected to linalg, we get an intermediate representation like this:

%c-128_i32 = arith.constant -128 : i32
%c0_i32 = arith.constant 0 : i32
%alloc = memref.alloc() {alignment = 64 : i64} : memref<4x4xi8>
linalg.transpose ins(%arg1 : memref<4x4xi8, strided<[?, ?], offset: ?>>) outs(%alloc : memref<4x4xi8>) permutation = [1, 0] 
%alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<1x4xi32>
linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["parallel", "parallel"]} ins(%arg2 : memref<4xi32, strided<[?], offset: ?>>) outs(%alloc_0 : memref<1x4xi32>) {
^bb0(%in: i32, %out: i32):
      linalg.yield %in : i32
}
linalg.quantized_matmul ins(%arg0, %alloc, %c-128_i32, %c0_i32 : memref<1x4xi8, strided<[?, ?], offset: ?>>, memref<4x4xi8>, i32, i32) outs(%alloc_0 : memref<1x4xi32>)

Basically, there are extra operations before the matmul such as transpose that we would have to account for, making it complicated to correctly lower both StableHLO and TOSA (and possibly more in the future!) to a Halevi-Shoup matrix multiplication.

Instead, our idea is to lower tosa.fully_connected and StableHLO to another intermediate representation (maybe we'll call it "matvec" or "tensor_ext.matvec") and then lower that to an implementation of our Halevi-Shoup matrix multiplication. In this way, we can account for both StableHLO and TOSA with a middle layer that unifies both abstractions. In theory, this middle layer could even be lowered to linalg itself for plaintext only computations, but we can restrict this lowering to only operate inside secret.generic wrapped TOSA or StableHLO IRs.

As I'm writing this, I realize one thing this does not cover: programs we are trying to compile that perform a generic matrix multiplication for which we would also like to use Halevi-Shoup matrix multiplication.

What do we lower to?

Next, a question is what do we lower the "matvec" IR to? We think it's best to lower to arith, so that later passes (namely, a future secret-to-ckks or secret-to-bgv) can both convert the arith operations (wrapped in a secret.generic) to either the CKKS, BGV, or some other scheme dialects themselves. This might look like the following:

Starting with TOSA:

// bias, which will be converted to an arith.constant
%1 = "tosa.const"() {value = dense<some constants> : tensor<16xi32>} : () -> tensor<16xi32>
// weights (here, it is specified at compile time), which will be converted to an arith.constant
%2 = "tosa.const"() {value = dense<more constants> : tensor<16x16xi8>} : () -> tensor<16x16xi8>
secret.generic(%secret_vector: secret.secret<tensor<1x16xi8>>) {
^bb0(%converted_secret_vec : tensor<1x16xi8>):
  %3 = "tosa.fully_connected"(%converted_secret_vec, %2, %1) : (tensor<1x16xi8>, tensor<16x16xi8>, tensor<16xi32>) -> tensor<1x16xi32>
}

We'll convert the tosa.fully_connected to a "matvec", then when we lower the "matvec", we'll find the operand holding the multiplying matrix (the weights), rewrite that matrix in a diagonal form, and then expand out the multiplication.

%bias_vec = arith.constant <some constants> : tensor<16xi32>
// diagonal matrix now in a 16 by 16 form
%diagonal_matrix = arith.constant <diagonalized attribute, some constants> : tensor<16x16xi32, encoding = #diagonal>
%c1 = arith.constant 1 : index
secret.generic(%secret_vector: secret.secret<tensor<1x16xi8>>) { 
^bb0(%converted_secret_vec : tensor<1x16xi8>):
  %sum = %bias_vec
  %rotated_vec = %converted_secret_vec
  for i = 0 to 15 {
      %extracted = tensor.extract_slice %diagonal_matrix[the right slice] : tensor<16xi32>
      // this is a ctxt-ptxt multiplication, will convert to an encode when lowering secret to scheme
      %multiplied = arith.muli %extracted, %rotated_vec : tensor<16xi32>
      %sum = arith.addi %multiplied, %sum : tensor<16xi32>
      %rotated_vec = tensor_ext.rotate %rotated_vec, %c1
  }
  %extracted = tensor.extract_slice %diagonal_matrix[slice 15] : tensor<16xi32>
  %multiplied = arith.muli %extracted, %rotated_vec : tensor<16xi32>
  %sum = arith.addi %multiplied, %sum : tensor<16xi32>
}
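
For reference, here is a plaintext numpy sketch of what the lowered loop above computes (ignoring the transpose bookkeeping that tosa.fully_connected implies, and assuming tensor_ext.rotate by 1 behaves like a left cyclic shift; all names are illustrative):

    import numpy as np

    n = 16
    weights = np.random.randint(-8, 8, size=(n, n))
    bias = np.random.randint(-100, 100, size=n)
    vec = np.random.randint(-8, 8, size=n)             # stands in for the secret vector

    # Diagonal form of the weights: diagonal i holds weights[j][(j + i) % n] in slot j
    diagonals = np.array([[weights[j, (j + i) % n] for j in range(n)] for i in range(n)])

    acc = bias.copy()
    rotated = vec.copy()
    for i in range(n - 1):
        acc = acc + diagonals[i] * rotated              # ctxt-ptxt multiply and add
        rotated = np.roll(rotated, -1)                  # tensor_ext.rotate by 1
    acc = acc + diagonals[n - 1] * rotated              # final iteration, no trailing rotate

    assert np.array_equal(acc, weights @ vec + bias)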

For now, the rotations are done one at a time, which is suboptimal for parallelization, but might be fine for now. I know about the discussion regarding rotation keys #744.

@edwjchen, I don't totally follow how using the Roll operation allows you to obtain a diagonal representation. A few clarification questions for you:

  • Does Roll(0, 1) of [[0, 0, 1, 1], [0, 1, 0, 1]] mean shifting only the first row (the zero parameter of Roll) by 1 (the one parameter of Roll), or what does that actually mean?
  • How does the [0, 3, 2, 1] representation help us? Wouldn't a representation like [[0, 3], [1, 2]] help us more because we can now perform pairwise multiplications into rotations of our secret vector?
j2kun commented

Do arithmetic FHE approaches to ML use quantization? We did the quantization in those passes for CGGI.

@lawrencekhlim Great questions! I'll start with a primer on the layout abstractions, and then dive into your questions.

How do I read the Layout abstraction?

Let A be a tensor of shape (4,4,4). Let n = 16.
Recall that layouts (e.g., A[0:4:1]; [1:4:1][2:4:1]) correspond to indexing back into the original tensor.

For example, if the layout was A[1:4:1]; [0:4:1][2:4:1]

A[0]; [0][0] => A[0][0][0]
A[0]; [1][2] => A[1][0][2]
A[3]; [2][1] => A[2][3][1]

Note, dimensions before the semicolon determine which ciphertext (if multiple are needed to represent a tensor), and the dimensions after the semicolon determine which slot the value should map to in the ciphertext.

When interpreting Roll(0,1), this represents rolling the 0th indexed dimension with the 1st indexed dimension. (sorry, the numbering is overloaded here).

Example 1:
A[0:4:1][1:4:1][2:4:1]

[0:4:1] is dim 0 
[1:4:1] is dim 1
[2:4:1] is dim 2

Example 2:
[2:4:1][1:4:1][0:4:1]

[2:4:1] is dim 0 
[1:4:1] is dim 1
[0:4:1] is dim 2

What is a Roll?

Does Roll(0, 1) of [[0, 0, 1, 1], [0, 1, 0, 1]] mean shifting only the first row (the zero parameter of Roll) by 1 (the one parameter of Roll), or what does that actually mean?

A Roll(0,1) means to roll the 0th indexed dimension with the 1st indexed dimension. In other words, Roll(0,1) sets the indices of the 0th indexed dimension = (dim 0 + dim 1) % (dim 0).extent.

Given the layout A [1:4:1][0:4:1],
The 0th indexed dimension has a traversal [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3].
The 1st indexed dimension has a traversal [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3].

Roll(0,1) transforms dim 0 to [0, 1, 2, 3, 1, 2, 3, 0, 2, 3, 0, 1, 3, 0, 1, 2].

 ([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
+ [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]) % 4

=>

[0, 1, 2, 3, 1, 2, 3, 0, 2, 3, 0, 1, 3, 0, 1, 2]

Given the layout A Roll(0,1) [1:4:1][0:4:1],
The 0th indexed dimension has a traversal [0, 1, 2, 3, 1, 2, 3, 0, 2, 3, 0, 1, 3, 0, 1, 2].
The 1st indexed dimension has a traversal [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3].

A Roll(0,1) [0][0] => A[0][0]
A Roll(0,1) [0][1] => A[1][1]
A Roll(0,1) [0][2] => A[2][2]
A Roll(0,1) [0][3] => A[3][3]
A Roll(0,1) [1][0] => A[0][1]
A Roll(0,1) [1][1] => A[1][2]
A Roll(0,1) [1][2] => A[2][3]
A Roll(0,1) [1][3] => A[3][0]
...
A Roll(0,1) [3][0] => A[0][3]
A Roll(0,1) [3][1] => A[1][0]
A Roll(0,1) [3][2] => A[2][1]
A Roll(0,1) [3][3] => A[3][2]

Using Rolls in the Matrix-Vector Multiplication example

To start, let's name the numbered tensor M and the lettered vector V. Let's also assume that M and V are intermediate values that come from some prior computation, so there is a cost associated with repacking. Finally, let's assume M is given in a column-major packing.

M:
[1, 5, 9, 13]
[2, 6, 10, 14]
[3, 7, 11, 15]
[4, 8, 12, 16] 
V: 
[a, b, c, d]
Target:
[1a, 2b, 3c, 4d]
[5a, 6b, 7c, 8d]
[9a, 10b, 11c, 12d]
[13a, 14b, 15c, 16d]

To demonstrate a Roll, let's use V as an example. We want to transform V as shown below.

[a, b, c, d]             Layout representation: V [0:4:1]
 
=> 

[a, b, c, d]             Layout representation: V [4]; [0:4:1]
[a, b, c, d]	
[a, b, c, d]
[a, b, c, d]
     
=>

[a, b, c, d]	         Layout representation: V Roll(1,0) [4]; [0:4:1]
[b, c, d, a]	
[c, d, a, b]	
[d, a, b, c]	

V [0:4:1] represents the base vector V. It has only a single dimension, 0, and extent of 4, and a stride of 1.
To interpret this layout: V [0:4:1] => [V[0], V[1], V[2], V[3]] => [a, b, c, d]

V [4]; [0:4:1] represents the 4 copies of the vector V. [4] is an empty dimension that represents an extent of 4. The semicolon differentiates between ciphertext dimensions (which ciphertext to refer to) and slot dimensions (the slot within the ciphertext).
To interpret this layout:

V [0]; [0:4:1] => V [0]; [V[0], V[1], V[2], V[3]] => [a, b, c, d]
V [1]; [0:4:1] => V [1]; [V[0], V[1], V[2], V[3]] => [a, b, c, d]
V [2]; [0:4:1] => V [2]; [V[0], V[1], V[2], V[3]] => [a, b, c, d]
V [3]; [0:4:1] => V [3]; [V[0], V[1], V[2], V[3]] => [a, b, c, d]

V Roll(1,0) [4]; [0:4:1] represents the 4 copies of the vector V, where dimension indexed at 1 (i.e., [0:4:1]) is rolled with the dimension indexed at 0 (i.e., [4]).
dim 0 has the traversal [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3].
dim 1 has the traversal [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3].

Roll(1,0) transforms dim 1 to [0, 1, 2, 3, 1, 2, 3, 0, 2, 3, 0, 1, 3, 0, 1, 2].

 ([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
+ [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]) % 4

=>

[0, 1, 2, 3, 1, 2, 3, 0, 2, 3, 0, 1, 3, 0, 1, 2]

To interpret this layout:

V [0]; [0:4:1] => V [0]; [V[0], V[1], V[2], V[3]] => [a, b, c, d]
V [1]; [0:4:1] => V [1]; [V[1], V[2], V[3], V[0]] => [b, c, d, a]
V [2]; [0:4:1] => V [2]; [V[2], V[3], V[0], V[1]] => [c, d, a, b]
V [3]; [0:4:1] => V [3]; [V[3], V[0], V[1], V[2]] => [d, a, b, c]

Note that we can also write V Roll(1,0) [4]; [0:4:1] as V Roll(0,1) [0:4:1]; [4], and the physical layouts would still be the same.

Assuming that M is column-major, we can apply a Roll to M to also orient it into the necessary layout for Halevi-Shoup matrix-vector multiplication.

M Roll(0,1) [1:4:1]; [0:4:1]

[1, 6, 11, 16] 	
[2, 7, 12, 13]	
[3, 8, 9, 14]  
[4, 5, 10, 15]	

For this layout M Roll(0,1) [1:4:1]; [0:4:1],
Dimension 0 has the traversal [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3].
Dimension 1 has the traversal [0, 1, 2, 3, 1, 2, 3, 0, 2, 3, 0, 1, 3, 0, 1, 2].

To interpret this layout:

M Roll(0,1) [0]; [0] => M [0][0] => 1
M Roll(0,1) [0]; [1] => M [1][1] => 6
M Roll(0,1) [0]; [2] => M [2][2] => 11
M Roll(0,1) [0]; [3] => M [3][3] => 16
...
M Roll(0,1) [3]; [3] => M [3][2] => 12

Finally, we can use these abstract layouts to

  1. perform alignment checks (are we multiplying the correct values together or are we creating garbage values that need to be masked out)
  2. determine the output packing
M Roll(0,1) [1:4:1]; [0:4:1]
V Roll(0,1) [0:4:1]; [4]

=>

R [0:4:1]

How does the [0, 3, 2, 1] representation help us? Wouldn't a representation like [[0, 3], [1, 2]] help us more because we can now perform pairwise multiplications into rotations of our secret vector?

Yep, you're right here! In this example, [0, 3, 2, 1] doesn't help us. But we can represent [[0, 3], [1, 2]] using the abstraction with rolls! It's possible that [0, 3, 2, 1] could help us in another example though.

Cost of rolls

Arguably, for this particular example, rolls are very expensive. However, if we assume that both M and V fit into a single ciphertext or plaintext vector, then we can perform rolls using O(2n) adds, rotations, and multiplications. We can even optimize the number of rotations using baby-step giant-step. So there are still some cases where using rolls is useful!
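
To make that cost concrete, here is a rough plaintext sketch of one way such a roll could be carried out on a single replicated ciphertext with masks, rotations, and additions (illustrative only; numpy stands in for the SIMD slot operations, and the roll-via-separate-ciphertexts case would look analogous):

    import numpy as np

    n = 4
    v = np.array([10, 20, 30, 40])
    c = np.tile(v, n)                      # one "ciphertext" holding 4 replicated copies of v

    out = np.zeros_like(c)
    for r in range(n):
        mask = np.zeros_like(c)
        mask[r * n:(r + 1) * n] = 1        # plaintext mask selecting block r
        out = out + mask * np.roll(c, -r)  # rotate the whole ciphertext by r, keep block r

    # out now holds the n rotated copies of v packed contiguously,
    # using n rotations, n plaintext multiplications, and n additions.
    expected = np.concatenate([np.roll(v, -r) for r in range(n)])
    assert np.array_equal(out, expected)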

asraa commented

Do arithmetic FHE approaches to ML use quantization? We did the quantization in those passes for CGGI.

Here's one complication with our CGGI quantization approach. It quantizes weights to i8 and biases to i32 (even using qKeras, I can't get the TFLite quantizer to quantize biases to fewer than 32 bits, even though I can get weights down to i4), which means that arithmetic FHE compilation would have to handle the conversion from i32 back to i8 quantized values.

I'm still experimenting with quantization methods to see if I can do all-i32 quantization; I don't know whether choosing a nice scaling param for an f32 model would give us something accurate or not.

asraa commented

I know @j2kun mentioned lowering from linalg.matmul (or linalg.quantized_matmul).

Another weird complication came up. It turns out that the TFL -> TOSA path now lowers tfl.fully_connected to tosa.conv2d, which in turn lowers to linalg.conv_2d_nhwc_fhwc. See this PR: tensorflow/tensorflow#62310

Matching at the TOSA level definitely seems fragile now.

Does StableHLO also do the same thing?

asraa commented

A stablehlo fully connected layer looks like

    %1 = stablehlo.dot_general %0, %cst_12, contracting_dims = [1] x [0], precision = [DEFAULT, DEFAULT] : (tensor<1x1xf32>, tensor<1x16xf32>) -> tensor<1x16xf32>
    %2 = stablehlo.add %1, %cst_2 : tensor<1x16xf32>

and then lowers with "-stablehlo-legalize-to-linalg" to

    %1 = tensor.empty() : tensor<1x16xf32>
    %2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<1x16xf32>) -> tensor<1x16xf32>
    %3 = linalg.matmul ins(%0, %cst_13 : tensor<1x1xf32>, tensor<1x16xf32>) outs(%2 : tensor<1x16xf32>) -> tensor<1x16xf32>
    %4 = tensor.empty() : tensor<1x16xf32>
    %5 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%3, %cst_3 : tensor<1x16xf32>, tensor<1x16xf32>) outs(%4 : tensor<1x16xf32>) {
    ^bb0(%in: f32, %in_14: f32, %out: f32):
      %23 = arith.addf %in, %in_14 : f32
      linalg.yield %23 : f32
    } -> tensor<1x16xf32>

Which is pretty clean!

But I can't actually generate a fully StableHLO IR yet, since there are some tf ops that can't be legalized. The StableHLO tooling is honestly all over the place in terms of support, I'm noticing.

asraa commented

Looking at the above IR, maybe you can target linalg.matmul and we can write rewrite patterns converting some of the other linalg ops we're noticing into linalg.matmuls?

Yeah, I'm starting to think that's a better option, but would linalg.conv_2d_nhwc_fhwc also be transformed to linalg.matmul?

asraa commented

Yeah, I'm starting to think that's a better option, but would linalg.conv_2d_nhwc_fhwc also be transformed to linalg.matmul?

I found a new pathway. Given a TOSA model containing the new tosa.conv2d appearing from tfl.fully_connected, we can run the following MLIR passes:

blaze run -c dbg mlir-opt --  --allow-unregistered-dialect --pass-pipeline='builtin.module(func.func(tosa-optional-decompositions,tosa-to-linalg-named))' hello_world_float.tosa.mlir

Turns out tosa-optional-decompositions converts the tosa.conv2d back to a simplified fully_connected layer, and the resulting linalg model is

    %6 = tensor.empty() : tensor<1x16xf32>
    %transposed = linalg.transpose ins(%5 : tensor<16x1xf32>) outs(%6 : tensor<1x16xf32>) permutation = [1, 0] 
    %7 = tensor.empty() : tensor<1x16xf32>
    %8 = linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["parallel", "parallel"]} ins(%4 : tensor<16xf32>) outs(%7 : tensor<1x16xf32>) {
    ^bb0(%in: f32, %out: f32):
      linalg.yield %in : f32
    } -> tensor<1x16xf32>
    %9 = linalg.matmul ins(%arg0, %transposed : tensor<1x1xf32>, tensor<1x16xf32>) outs(%8 : tensor<1x16xf32>) -> tensor<1x16xf32>

(Note that the bias values are initialized in the out tensor %8; I'm sure I can decompose this further, though.)

So we're back again at a path to linalg.matmul