arcee-ai/mergekit

Re-Train every block with reduced width

snapo opened this issue · 0 comments

I have a question:
For example, Llama 3.1 405B in bf16 would consume about 0.82 TB of memory.
Is there a compression technique that would let me reduce the width by a factor of 16 (i.e. combining 16 bf16 values into 1)?
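Rough numbers for what I mean (just a back-of-the-envelope calculation, assuming exactly 405e9 parameters and 2 bytes per bf16 value):

```python
# Back-of-the-envelope memory math for the sizes mentioned above.
params = 405e9      # nominal Llama 3.1 405B parameter count (assumption)
bf16_bytes = 2      # bytes per bf16 value

full_tb = params * bf16_bytes / 1e12
print(f"full model:  {full_tb:.2f} TB")      # ~0.81 TB

# hypothetical 16x reduction in parameter count ("combining 16 bf16 into 1")
reduced_gb = params / 16 * bf16_bytes / 1e9
print(f"16x reduced: {reduced_gb:.0f} GB")   # ~51 GB
```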

For example, for Llama 3.1 405B:
[screenshot of the model's weight shapes]
I would like to reduce the token embedding weights to 32k × 4k,
the first block's attn_k to 1k × 1k,
q likewise to 1k × 1k, and so on.

So for each model layer and each block, the idea is to reduce it and train each one separately.
To reduce, we could add a softmax/ReLU over every 16 outputs of the original block and use that as the target for the micro/nano block to learn (see the sketch below)...
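Here is a rough sketch of what I have in mind, in plain PyTorch (not mergekit functionality). I've used average pooling over groups of 16 channels as a simple stand-in for the softmax/ReLU reduction, and all the block names, dimensions, and helper functions are just placeholders:

```python
# Per-block width-reduction sketch: a 16x-narrower student block is trained to
# match the original (teacher) block's outputs after collapsing every group of
# 16 channels into one value. All names and sizes here are hypothetical.
import torch
import torch.nn as nn

TEACHER_DIM = 16384               # e.g. Llama 3.1 405B hidden size
STUDENT_DIM = TEACHER_DIM // 16   # 16x narrower student width

def pool_targets(teacher_hidden: torch.Tensor) -> torch.Tensor:
    """Collapse every group of 16 teacher channels into one target value (average pooling)."""
    b, t, d = teacher_hidden.shape
    return teacher_hidden.view(b, t, d // 16, 16).mean(dim=-1)

class StudentBlock(nn.Module):
    """Stand-in for one narrow transformer block (attention omitted for brevity)."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ff(self.norm(x))

# Each block would be distilled separately, with activations captured from the
# frozen original block providing input and target.
student = StudentBlock(STUDENT_DIM)
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

def train_step(teacher_in: torch.Tensor, teacher_out: torch.Tensor) -> float:
    # teacher_in / teacher_out: activations of the original block, shape (B, T, TEACHER_DIM)
    student_in = pool_targets(teacher_in)    # narrow input for the student
    target = pool_targets(teacher_out)       # narrow target from the teacher output
    loss = nn.functional.mse_loss(student(student_in), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# toy usage with random tensors standing in for captured teacher activations
x_in = torch.randn(2, 8, TEACHER_DIM)
x_out = torch.randn(2, 8, TEACHER_DIM)
print(train_step(x_in, x_out))
```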

I don't know if this is even feasible; it is essentially a knowledge transfer to a smaller model, but keeping all 125 layers, as I think the layers are what carry the logic/reasoning and the block width is more about creativity...

Would appreciate an answer on whether this is achievable.