Update docs for > 2048 token models (SuperHOT)?

Question

Opened this issue a year ago · 12 comments

Hola, forgive me if this is answered elsewhere.

I'm wondering if there are any special considerations for training a 33b (or other) model that has been updated to use the positional embedding compression technique described by the SuperHOT method?

I'd like to start training my LoRAs on the 33B LLaMA variants that support 8K context length, but I'm concerned about starting a training run if the code artificially limits the samples to 2048 tokens.

Are there any special considerations? And, in the spirit of making the repo better, would it make sense to add a note to the README.md to explain this for others who might have the same question?

Cross-posted to reddit: https://www.reddit.com/r/LocalLLaMA/comments/14lr93d/finetuning_with_alpaca_lora_4bit_on_8k_context/

Answer 1 · 2023-06-29T05:35:12.000Z

Thanks for the information! I think the idea is quite simple (although it does not reduce the VRAM cost). All we need to do is replace RotaryEmbedding with ScaledRotaryEmbedding (not sure where the code is) and download the finetuned model (e.g. TheBloke's model). Then we can train the LoRA with the same script as before.

p.s.
Code from https://kaiokendev.github.io/til#extending-context-to-8k

import torch

class ScaledRotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float().to(device) / dim))
        self.register_buffer("inv_freq", inv_freq)
        
        max_position_embeddings = 8192

        # Build here to make `torch.jit.trace` work.
        self.max_seq_len_cached = max_position_embeddings
        t = torch.arange(
            self.max_seq_len_cached,
            device=self.inv_freq.device,
            dtype=self.inv_freq.dtype,
        )

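        # These two lines are the change: position indices are compressed by
        # a factor of 4, so 8192 positions map into the original 0..2048 range.
        self.scale = 1 / 4
        t *= self.scale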
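
To see why those two lines matter, here is a minimal pure-Python sketch (not the code from the linked post) of the rotary angles, position times inverse frequency. The function name `rotary_angles` and `dim=128` are assumptions chosen for illustration. With a scale of 1/4, position 8188 produces the same angles as unscaled position 2047, so an 8192-token window stays inside the position range the base model already knows.

```python
import math

def rotary_angles(position, dim=128, base=10000.0, scale=1.0):
    """Return the rotary angles (position * inv_freq) for one token position.

    `scale` compresses the position index before it is multiplied by the
    inverse frequencies, mirroring `t *= self.scale` in the class above.
    """
    inv_freq = [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]
    t = position * scale
    return [t * f for f in inv_freq]

# Position 8188 with 4x compression lands exactly on position 2047.
scaled = rotary_angles(8188, scale=1 / 4)
unscaled = rotary_angles(2047, scale=1.0)
same = all(math.isclose(a, b) for a, b in zip(scaled, unscaled))
print(same)  # True
```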

Answer 2 · 2023-07-02T02:34:17.000Z

@johnsmith0031 I've tried it, but nothing I do gets the LoRA to perform better than plain LLaMA in ExLlama scaled inference. Am I missing any modules? I'm targeting ["q_proj", "k_proj", "v_proj", "o_proj"], and I've also set bias to 'all'. Maybe I'd have to train the rotary_emb module for each layer? Any ideas?

maybe @kaiokendev ?

Answer 3 · 2023-07-02T03:02:35.000Z

@Jeduh Can you clarify what problem you are facing? I don't think exporting more modules should affect whether it works or not.

Answer 4 · 2023-07-02T03:57:26.000Z

If you have any time, could you please look over the 2 commits I made in this fork? Something's blocking it from learning how to scale.

Answer 5 · 2023-07-02T06:19:30.000Z

It looks mostly fine to me. Is the loss not converging properly?

Answer 6 · 2023-07-02T11:37:16.000Z

Thank you for taking a look :)

Well, it always seems to fall below 1.1 at around 500 steps and converges to an average of about 1.0 at 2-3k steps. It looks pretty normal, but no matter how much I tune the learning rate, decay and so on to get a lower-converging curve, almost no learning gets applied in scaled inference. So weird.

Answer 7 · 2023-07-02T12:21:34.000Z

Do they look strange?

Answer 8 · 2023-07-02T12:43:52.000Z

I think it's due to the LoRA output or intake at inference.
I remembered ExLlama always showing "!! Warning: LoRA zero bias ignored", which comes from this code:
self.bias = tensors[key + ".bias"] if has_bias else None

It seems that the 'all' bias I set isn't present after training?

This is the 2x scale LoRA adapter

I have no idea but I'm all ears.

Answer 9 · 2023-07-02T15:19:53.000Z

Do they look strange?

I'm not sure how to interpret them since there are no labels, but I am assuming the pink line is the loss? So it is definitely learning. As for the bias, you can ignore it, as this repo does not export the bias tensors by default (they will all be 0; there is code you can change, but there's no need, just set bias to none in your LoRA).
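
As a rough sketch of what "set bias to none" could look like, here are some hypothetical LoRA settings. Only `bias="none"` and the module names come from this thread; the `r`, `lora_alpha` and `lora_dropout` values are illustrative assumptions, and whether this repo's training script takes its settings in exactly this form is also an assumption.

```python
# Hypothetical LoRA settings; only bias="none" and the module names come from
# the thread. r, lora_alpha and lora_dropout are illustrative assumptions.
lora_kwargs = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "bias": "none",  # no bias tensors are trained, so none need exporting
    "task_type": "CAUSAL_LM",
}

# If the peft library is installed, settings like these would typically be
# passed as peft.LoraConfig(**lora_kwargs).
```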

I loaded the LoRA from your post in text-generation-webui, set max_seq_len to 4096 and compress_pos_emb to 2, and with 3271 tokens of context it was outputting fine, so I am not sure I see the same problem. Do you have an up-to-date version of ExLlama? I believe the CUDA allocation was causing some problems before version 0.0.4.
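
For reference, the relationship between those two settings can be sketched as below. The helper name `compress_pos_emb_for` is made up for illustration, and it assumes the base model was trained with 2048 positions and that the factor is simply the target length divided by that.

```python
def compress_pos_emb_for(max_seq_len, trained_ctx=2048):
    """Return the position compression factor for a target context length.

    Assumes the base model was trained with `trained_ctx` positions and that
    the factor is simply max_seq_len / trained_ctx.
    """
    return max_seq_len / trained_ctx

factor = compress_pos_emb_for(4096)  # 2.0, matching the settings above
rope_scale = 1 / factor              # 0.5, the multiplier applied to positions
prompt_fits = 3271 <= 4096           # the 3271-token prompt fits in the window
```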

Answer 10 · 2023-07-02T15:24:31.000Z

Oh right, sorry. Yeah, the top is eval/loss and the bottom is train/loss. I'll try updating, thanks for letting me know! That makes me happy :)

Answer 11 · 2023-07-02T17:04:11.000Z

@kaiokendev I've been using test_benchmark_inference.py with updated ExLlama, but the perplexities are still in line with those I get without -ld lora. I tried it out in Oobabooga and it does seem to work, but maybe I'm just not using the right template?

How do you test perplexities, Kaioken?

Answer 12 · 2023-07-04T04:08:52.000Z

I use the perplexity code from Hugging Face, usually with stride = max length. A stride of 1 is the sliding window used in the ALiBi paper, while a stride of 256 is used in Meta's interpolation paper. Stride = max length is the perplexity score usually reported in most papers for the foundation models.

https://huggingface.co/docs/transformers/perplexity#example-calculating-perplexity-with-gpt2-in-transformers
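
To make the stride settings concrete, here is a small sketch of the window bookkeeping used in strided perplexity evaluation, modeled on the loop in the linked guide but with no model attached. The function name `sliding_windows` and the toy lengths are assumptions for illustration.

```python
def sliding_windows(seq_len, max_length, stride):
    """Yield (begin, end, n_scored) windows for strided perplexity.

    Each window feeds tokens[begin:end] to the model, but only the last
    n_scored tokens count toward the loss; earlier tokens are context.
    """
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        yield begin, end, end - prev_end
        prev_end = end
        if end == seq_len:
            break

# stride == max_length: windows do not overlap, every token is scored once.
no_overlap = list(sliding_windows(10, max_length=4, stride=4))
# stride < max_length: windows overlap, later tokens get more context.
overlap = list(sliding_windows(10, max_length=4, stride=2))
print(no_overlap)  # [(0, 4, 4), (4, 8, 4), (8, 10, 2)]
print(overlap)     # [(0, 4, 4), (2, 6, 2), (4, 8, 2), (6, 10, 2)]
```

In both cases the scored counts add up to the sequence length, so each token contributes to the loss exactly once; a smaller stride only changes how much preceding context each scored token sees.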