initialization of PHM layers

Question

dorooddorood606 opened this issue 4 years ago · 4 comments

Hi

I would like to initialize phm_rule and the weights so that the final weight matrix of a PHM layer is distributed as normal(mean=0, std=0.01). Could you kindly suggest how this can be achieved? In particular, which initialization should I use for the phm_rule and weight variables?

thanks

Answer 1 · 2021-05-11T17:40:17.000Z

Hi @dorooddorood606, you can achieve this initialization with the following code:

from benchmarks.utils import set_seed_all
from phc.hypercomplex.layers import PHMLinear
import torch

# Initialize the final weight matrix following a certain distribution
device = "cuda:0" if torch.cuda.is_available() else "cpu"

set_seed_all(seed=43)
phm_lin1 = PHMLinear(in_features=128 // 2, out_features=256 // 2, phm_dim=4, w_init="phm", c_init="standard").to(device)

for w in phm_lin1.W:
    w.data.normal_(mean=0.0, std=0.01)

for w in phm_lin1.W:
    print(w.std())

# tensor(0.0100, device='cuda:0', grad_fn=<StdBackward0>)
# tensor(0.0101, device='cuda:0', grad_fn=<StdBackward0>)
# tensor(0.0099, device='cuda:0', grad_fn=<StdBackward0>)
# tensor(0.0099, device='cuda:0', grad_fn=<StdBackward0>)

If you want to modify the phm rules, you can iterate over phm_lin1.phm_rule and modify the data attribute of each parameter in place, like:

for w in phm_lin1.phm_rule:
    w.data.normal_(mean=0.5, std=0.1)

for w in phm_lin1.phm_rule:
    print(w)

# Parameter containing:
# tensor([[0.6034, 0.5514, 0.4601, 0.7307],
#         [0.5802, 0.4613, 0.4960, 0.6374],
#         [0.6922, 0.5066, 0.5063, 0.4360],
#         [0.5713, 0.3694, 0.5513, 0.4803]], device='cuda:0', requires_grad=True)
# Parameter containing:
# tensor([[0.3592, 0.5751, 0.5850, 0.5287],
#         [0.4716, 0.4622, 0.5230, 0.5109],
#         [0.4808, 0.3467, 0.5735, 0.5904],
#         [0.4408, 0.5532, 0.5885, 0.5192]], device='cuda:0', requires_grad=True)
# Parameter containing:
# tensor([[0.3816, 0.6542, 0.3359, 0.4211],
#         [0.6865, 0.3759, 0.5291, 0.5276],
#         [0.6018, 0.5565, 0.4768, 0.6355],
#         [0.5029, 0.5969, 0.6655, 0.3873]], device='cuda:0', requires_grad=True)
# Parameter containing:
# tensor([[0.5919, 0.5583, 0.3676, 0.5180],
#         [0.5897, 0.3686, 0.4941, 0.6941],
#         [0.6832, 0.6234, 0.3679, 0.2792],
#         [0.4790, 0.4572, 0.4511, 0.5616]], device='cuda:0', requires_grad=True)

Answer 2 · 2021-05-13T20:26:32.000Z

Hi
Thank you for the response, and sorry for the misunderstanding. What I meant was whether we could initialize the components of phm_rule and W in PHM layers so that the final weight matrix, which approximates the linear layer, is close to a normal(mean=0, std=0.01) initialization. So let's assume we compute H = \sum_i (phm_i \otimes W_i): how can we have H initialized as normal by initializing the elements of phm_i and W_i? Thanks a lot in advance for any suggestions.

Answer 3 · 2021-05-14T09:03:43.000Z

Hi @dorooddorood606, I need to think more about how to formulate this problem to get a precise initialization scheme, but you could start with the following code and test different standard deviations for the W tensor, i.e., the weight matrices.

import torch

from benchmarks.utils import set_seed_all
from phc.hypercomplex.layers import PHMLinear
from phc.hypercomplex.kronecker import kronecker_product_einsum_batched


set_seed_all(42)
phm_dim = 4
in_feats = 256
out_feats = 256
in_feats_axis = in_feats // phm_dim
out_feats_axis = out_feats // phm_dim

# fix this (corresponds to the phm-rules, i.e., the C_i in the paper)
C = torch.randn(phm_dim, phm_dim, phm_dim).normal_(0, 0.1)

# try out different std values for W here
W = torch.randn(phm_dim, in_feats_axis, out_feats_axis).normal_(0, 0.05)

H = kronecker_product_einsum_batched(C, W)
HH = H.sum(0)
print(HH.mean())
print(HH.std())
# tensor(2.9075e-06)
# tensor(0.0087)
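
As a rough guide for choosing the std to try: if all entries of the C_i and W_i are independent and zero-mean, each entry of H is a sum of phm_dim products, so its standard deviation should be about sqrt(phm_dim) * std(C) * std(W). For the values above that gives sqrt(4) * 0.1 * 0.05 = 0.01, and the measured 0.0087 is plausibly lower because C holds only 4*4*4 = 64 sampled values. The sketch below is an illustration under those assumptions, written in plain Python so it runs without torch. The function names are hypothetical and not part of the phc library, and the block size is reduced to 16x16 for speed.

```python
import math
import random
import statistics


def expected_h_std(phm_dim, c_std, w_std):
    """Predicted std of H = sum_i kron(C_i, W_i), assuming independent zero-mean entries."""
    return math.sqrt(phm_dim) * c_std * w_std


def w_std_for_target(target_std, phm_dim, c_std):
    """Invert the prediction: the W std that should put H near target_std."""
    return target_std / (math.sqrt(phm_dim) * c_std)


def empirical_h_std(phm_dim, rows, cols, c_std, w_std, seed):
    """Sample C and W once and return the std over all entries of H."""
    rng = random.Random(seed)
    C = [[[rng.gauss(0.0, c_std) for _ in range(phm_dim)]
          for _ in range(phm_dim)] for _ in range(phm_dim)]
    W = [[[rng.gauss(0.0, w_std) for _ in range(cols)]
          for _ in range(rows)] for _ in range(phm_dim)]
    # The entry of H in block (a, b) at offset (c, d) is sum_i C_i[a, b] * W_i[c, d].
    entries = [
        sum(C[i][a][b] * W[i][c][d] for i in range(phm_dim))
        for a in range(phm_dim) for b in range(phm_dim)
        for c in range(rows) for d in range(cols)
    ]
    return statistics.pstdev(entries)


print(expected_h_std(phm_dim=4, c_std=0.1, w_std=0.05))
print(w_std_for_target(target_std=0.01, phm_dim=4, c_std=0.1))
# Single draws scatter around the prediction because C has so few entries.
print([round(empirical_h_std(4, 16, 16, 0.1, 0.05, seed), 4) for seed in range(5)])
```

Note that this only matches the standard deviation: a sum of products of Gaussian values is not itself exactly normally distributed.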

Once you have found a suitable std for initializing the W_i matrices, you can use the code I sent you earlier to initialize the W matrices. As of now, the phm-rules (C_i) are initialized with a fixed standard deviation of 0.1.
Generally, the final standard deviation of the H matrix (the sum of Kronecker products, i.e., the HH object in the code) can be obtained by computing the standard deviation of the vectorized sum of Kronecker products. But I need to think more about it and write down the equations. I hope this helps, so that you can at least experiment and, using this hint, perhaps arrive at the exact answer yourself.
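
One possible way to write those equations down, offered as a sketch rather than a worked-out scheme: once the rules C are fixed, an entry of H in block (a, b) is a linear combination of independent W entries, so its variance is std(W)^2 * sum_i C_i[a, b]^2. Averaged over all blocks, the variance of H is then std(W)^2 * phm_dim * mean(C^2), which can be solved for std(W). The example below assumes zero-mean, independent W entries and uses plain Python lists in place of tensors. The helper names are hypothetical and not part of the phc library.

```python
import math
import random
import statistics


def w_std_for_fixed_rules(C, target_std):
    """Std for W so that H = sum_i kron(C_i, W_i) has roughly target_std.

    C is a nested list of shape (n, n, n) holding the already-sampled rules.
    """
    n = len(C)
    flat = [x for C_i in C for row in C_i for x in row]
    mean_sq = sum(x * x for x in flat) / len(flat)
    return target_std / math.sqrt(n * mean_sq)


def kron_sum_entries(C, W):
    """All entries of H = sum_i kron(C_i, W_i), flattened row by row."""
    n, rows, cols = len(C), len(W[0]), len(W[0][0])
    return [
        sum(C[i][a][b] * W[i][c][d] for i in range(n))
        for a in range(n) for c in range(rows)
        for b in range(n) for d in range(cols)
    ]


def demo(seed=0, n=4, rows=32, cols=32, c_std=0.1, target_std=0.01):
    """Sample rules, derive the W std from them, and measure the std of H."""
    rng = random.Random(seed)
    C = [[[rng.gauss(0.0, c_std) for _ in range(n)]
          for _ in range(n)] for _ in range(n)]
    w_std = w_std_for_fixed_rules(C, target_std)
    W = [[[rng.gauss(0.0, w_std) for _ in range(cols)]
          for _ in range(rows)] for _ in range(n)]
    return w_std, statistics.pstdev(kron_sum_entries(C, W))


w_std, h_std = demo()
print("std used for W:", w_std)
print("measured std of H:", h_std)
```

The resulting value could then be passed as std to the w.data.normal_ loop from the earlier answer. Conditional on fixed rules, entries in different blocks of H have different variances, so the target is met for H as a whole rather than for every block.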

Answer 4 · 2021-05-21T18:15:48.000Z

thanks a lot