arcee-ai/mergekit

Evolutionary Merging Method

codelauncher444 opened this issue Β· 19 comments

Hi! Please consider implementing the Evolutionary Merging method:

https://arxiv.org/abs/2403.13187

Was just about to link this paper in case @cg123 hadn't seen it yet!

I'd be interested to know what the norm was for the matrix W they trained to go between the pass-through layers (re: my question a few days ago about the possibility of downscaling the K or Q weight matrices for duplicated layers). They report a 20% drop in performance with it removed.

Also, if anybody is interested in extending their ideas:

For very homogeneous sets of parameters (like their W matrix), the SPSA algorithm will likely be a better choice than CMA-ES and converge much faster:

https://en.m.wikipedia.org/wiki/Simultaneous_perturbation_stochastic_approximation

and has already been shown to work quite well with LLMs/transformers:

https://arxiv.org/abs/2305.17333

It's sad SPSA was named so badly: unless you are dealing with very noisy evaluations and/or heterogeneous parameters, it nearly always outperforms CMA-ES and similar algorithms (the Cross-Entropy Method, etc.).
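
For anyone unfamiliar: the whole SPSA update fits in a few lines. A minimal sketch (illustrative Python; `theta` would be the flat vector of parameters being tuned, `loss_fn` whatever evaluation you plug in, and in practice the gains `a` and `c` should decay over iterations per Spall's schedules):

```python
import numpy as np

def spsa_step(theta, loss_fn, a=0.01, c=0.01):
    # Rademacher perturbation: each coordinate is +1 or -1 with equal probability.
    delta = np.random.choice([-1.0, 1.0], size=theta.shape)
    # Only two loss evaluations per step, regardless of the dimension of theta.
    loss_plus = loss_fn(theta + c * delta)
    loss_minus = loss_fn(theta - c * delta)
    # Simultaneous-perturbation gradient estimate: one scalar difference,
    # divided element-wise by the perturbation direction.
    grad_est = (loss_plus - loss_minus) / (2.0 * c * delta)
    return theta - a * grad_est
```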

Hi @jukofyork, I am from the Arcee-ai R&D team. I like your ideas, and I wonder whether we can discuss them further. We are happy to collaborate on this :)

@mkshing @hardmaru - shall we all give it a go?

@mkshing @hardmaru - have you considered open-sourcing your algo code in addition to the evals in https://github.com/SakanaAI/evolutionary-model-merge?

The community would greatly benefit from iterating on this together

I'm also interested in reproducing Evolutionary Model Merge, if there is a group forming around this I'd love to be included.

> Hi @jukofyork, I am from the Arcee-ai R&D team. I like your ideas, and I wonder whether we can discuss them further. We are happy to collaborate on this :)

Hi @shamanez, sorry I missed your message.

I'm only really into this out of interest - I lack the compute to try anything serious and am not very motivated to learn Python and PyTorch :)

I'm happy to outline some other ideas I've had for anyone who's interested (mostly regarding ensembles):

  1. It seems doubling or halving the RoPE base frequency can have a dramatic effect on a model's output - this is especially noticeable when you double-stack models like I mentioned above: it seems to change the verbosity and "confident wrongness" of the models very dramatically. This makes me wonder if you could "self-ensemble" copies with the RoPE frequency altered like this, so that each copy sees a differently "dilated" version of the same sequence. If this shows promise then it would be worth trying other non-linear scalings of the position embeddings too. It might also be interesting to see if the word embeddings could be altered in a similar way (perhaps via a mutated matrix decomposition, etc.).
  2. The next thing I think might be worth trying would be some kind of "horizontal" merging: take two models with an identical architecture and make all the matrices block diagonal (i.e. model 1 in the top-left block, model 2 in the bottom-right, zeros on the off-diagonal blocks - see the sketch after this list). Untrained, this would act like a "late fusion" ensemble, but with further training it could become an interesting mixture. It would cost 4x the memory though; in theory it might be possible to use diagonal and/or low-rank approximations for the off-diagonal blocks of the block-diagonal matrices.
  3. Ensembles of models with different tokenizers: I made a thread about this a while back and concluded the different tokenizers and prompt templates would make it impossible, but I'm no longer sure it really is. The different-prompt-templates problem can easily be worked around by just making each model see the text in the format it expects (other than perhaps some crazy template like codellama-70b uses). This paper (https://arxiv.org/abs/2401.10491) suggests a way to join different tokenizers, but I think there is actually a much better way to do this dynamically at extra computational cost: basically have each model generate its categorical distribution for the next token and then selectively get one or more of the models to roll out further tokens to generate a tree of probabilities.
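
To make (2) concrete, a toy sketch of the block-diagonal construction (hypothetical shapes; a real merge would also need to handle attention heads, biases, embeddings and the residual stream consistently):

```python
import torch

def block_diag_merge(w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    # Model 1 in the top-left block, model 2 in the bottom-right,
    # zeros on the off-diagonal blocks; the hidden dimension doubles.
    return torch.block_diag(w1, w2)

w1 = torch.randn(4096, 4096)  # e.g. some projection weight from model 1
w2 = torch.randn(4096, 4096)  # the matching weight from model 2
merged = block_diag_merge(w1, w2)  # shape (8192, 8192)
```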

It's very hard to explain (3) here, but I'm pretty sure it is quite possible to implement, and I'm happy to try to better explain the idea to someone seriously interested using an online whiteboard or something... If you have ever played online poker and have seen the graphs of "expected profit" that smooth out the all-in randomness: I was the first person to implement this around 15 years ago, if that means anything, so I do have experience dealing with this kind of thing. I can't be certain it will work, but from exploring the tokenizers of some of the common models it looks like most have a large overlap and often just break up longer words differently ("INSTRUCTION" vs "INST" + ...).

One final thing with regard to (3), also related to poker game trees and beam search: when people first tried to roll out expected-value calculations into the next hand(s), they tried to just roll out the maximum-probability successors at each node. This had the effect of creating biased probability estimates where the most likely sequence of actions was: fold, fold, fold, etc. The current problem of beam search for LLMs creating weird/repeating sequences is likely a manifestation of this same problem! The solution in the case of poker was to roll out the trees probabilistically, so that low-probability successors had a chance of being rolled out near the top of the tree too (see the sketch below). So even if the idea of merging tokenizers in (3) doesn't work, there may well be other interesting avenues of research to explore regarding tokenizer trees.
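
A minimal sketch of that probabilistic rollout choice (illustrative PyTorch; the point is simply to sample which branch to expand next rather than always taking the argmax):

```python
import torch

def pick_successor(logits: torch.Tensor, temperature: float = 1.0) -> int:
    # Sample from the successor distribution so low-probability branches
    # still get rolled out occasionally, avoiding the "fold, fold, fold" bias.
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```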


One other idea that I think might be worth exploring (although unrelated to Merging or Ensembles) would be to look at Tensor Decomposition of instruction-tuned versions of models in a similar way to this:

https://github.com/thomasgauthier/LoRD

uses SVD to create LoRAs from existing models.
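
As a rough sketch of that trick (not necessarily LoRD's exact pipeline, just the basic rank-r factorisation of the fine-tuning delta):

```python
import torch

def extract_lora(w_base: torch.Tensor, w_tuned: torch.Tensor, rank: int = 16):
    # Truncated SVD of the fine-tuning delta gives LoRA-style A/B factors.
    delta = (w_tuned - w_base).float()
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    lora_b = u[:, :rank] * s[:rank]  # (out_features, rank), singular values folded in
    lora_a = vh[:rank, :]            # (rank, in_features)
    return lora_a, lora_b            # lora_b @ lora_a approximates delta
```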

If you search for "randomised tensor decomposition" or "power iteration tensor decomposition" there are lots of recently developed algorithms for this, eg:

https://arxiv.org/abs/1703.09074

There is a ton of research on this and two main variations on the decomposition: "CP", which is like SVD with the singular values multiplied into U or V (i.e. similar to the matrix decompositions created by SGD that are often used in recommendation systems), and "Tucker", which has an actual core tensor similar to S in SVD. The latter might be the most interesting if we want to explore the "dimensions" in a similar way to PCA, etc.
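
For anyone who wants to experiment, tensorly implements both; a toy sketch (tiny illustrative shapes - real weight stacks would be far larger):

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac, tucker

# Stack the "same" weight matrix from N model variants into a 3-way tensor.
weights = tl.tensor(np.random.randn(8, 256, 256))  # (model, out_features, in_features)

# "CP": a sum of rank-one terms, with the singular-value-like weights
# folded into the factors - no core tensor.
cp = parafac(weights, rank=16)

# "Tucker": a small core tensor plus one factor matrix per mode; the core
# plays a role similar to S in SVD, so its dimensions can be explored.
core, factors = tucker(weights, rank=[4, 16, 16])
```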

If it is computationally feasible, it would be really interesting to see if the decomposition could separate off "instruction following" from "alignment" dimensions, and could possibly even improve the performance of the models in the process!

(With regard to the OP.) One thing that bothers me about the general idea of merging (and fine-tuning) models: exactly what are we trying to optimise?

Metrics like perplexity make sense if we just want to optimise the base models' auto-regressive ability to predict the next token, but it seems far from clear what we should be optimising for if we want to merge instruction-tuned models, etc.

It's quite easy to subjectively tell you have destroyed a model or lessened its context, but to actually optimise it we really need some kind of fine-grained objective measure to test against... The actual details of the optimization algorithm are really secondary to having a good optimization criterion to work against, IMO.

@jukofyork Thanks for your ideas. These are great. Do you have any lead on their use of an evolutionary algorithm to optimize the weight-space merging and the data-flow-space merging?

@jukofyork @NolanGC @codelauncher444

I am happy to arrange a meeting. We also have some ideas on how to start this. shamane@arcee.ai

> @jukofyork Thanks for your ideas. These are great. Do you have any lead on their use of an evolutionary algorithm to optimize the weight-space merging and the data-flow-space merging?

There are a couple of ideas I can think of related to serial dependency:

  1. Try to find a way (via fine-tuning, introducing another matrix like this paper's W, etc.) to lessen the impact of reordering the donor transformer blocks within their own LLMs first.
  2. Optimise in a more top-down way instead of all at once (the serial dependency between the transformer blocks might create local minima that are very hard to escape from). This paper better explains the problem and a solution in a different domain:

http://demo.cs.brandeis.edu/papers/icga95.pdf

So the basic idea would be to maintain a "beam" of partial solutions (instead of a population or distribution of full solutions), extend each partial solution by a "block" (somehow), evaluate how well the extended solutions allow for continuation (somehow) and then update the beam and move one more level down...

This should in theory optimise towards nice flat minima and allow you to better control the exploration compute (as opposed to quickly being attracted to a sharp/inescapable minimum).

I've no idea how well it would work in practice, but it would definitely optimise towards a different set of minima that could potentially be interesting and/or complementary (for ensembles, etc.).
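
To make the beam idea a bit more concrete, a rough sketch (everything here is hypothetical: `candidate_blocks` is the pool of donor transformer blocks and `continuation_score` is whatever proxy you trust for how extensible a partial stack is):

```python
BEAM_WIDTH = 8
DEPTH = 32  # target number of transformer blocks in the merged model

def beam_assemble(candidate_blocks, continuation_score):
    beam = [[]]  # start from the single empty partial solution
    for _ in range(DEPTH):
        # Extend every partial stack in the beam by every candidate block.
        extended = [prefix + [block] for prefix in beam for block in candidate_blocks]
        # Keep the partial stacks that look most *extensible*, not the ones
        # that score best as if they were finished models.
        extended.sort(key=continuation_score, reverse=True)
        beam = extended[:BEAM_WIDTH]
    return beam[0]
```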

We (at Arcee) are working on getting ahold of Sakana to join on this

@jukofyork there certainly might be something more direct like (1) - I know this is a view also held by @thomasgauthier and @shamanez

It seems to me like the best course might be for @cg123 to suggest how we might mock the methods + SDK interface, then we should just all kick off a branch together off of that

[Skip to tl;dr to save time]
For the last 8 months, I've had a human-feedback-based evolutionary merge system on GitHub known as Model-REVOLVER:
https://github.com/Digitous/ModelREVOLVER
The user provides a prompt in prompt.txt, selects two models to merge, and sets the number of cycles of random-search optimization to attempt. In each cycle, each layer's merge is an absolutely random % between the two models. When the user-defined num_cycles has finished building and testing every offspring mutation, they are presented with a report containing five inference completions from each cycle, based on their input prompt.txt. When a user decides they like a specific cycle's alignment [gauged via the 5 samples per cycle], they simply input the number of the cycle that best meets their interest and the script rebuilds that model and saves it.

I also have a similar unreleased script that optimizes for the lowest perplexity score against the PTB dataset through randomized inter-layer % merging between two models, with light sampling and a short CTX for speedier cycle completion. This script requires some refining, as my last archive of it isn't where it had been with the latest build. The idea is to adapt it to allow any custom dataset to be optimized against. Eventually, random search would be hybridized with a less random method: Pearson's r can theoretically treat the per-layer merge % samples and the sampled perplexity score of every cycle as data points, and from them estimate the layer merge-ratio set with the lowest perplexity - treating the original method as calibration sampling. Though I need to double-check Pearson's r is the best method for that.
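
As an illustrative sketch of that calibration idea (the random arrays below stand in for the logged per-layer merge percentages and the measured perplexity of each cycle):

```python
import numpy as np
from scipy.stats import pearsonr

ratios = np.random.rand(50, 32)       # (num_cycles, num_layers): per-layer merge %
ppl = 5.0 + 2.0 * np.random.rand(50)  # (num_cycles,): perplexity of each cycle

# Correlate each layer's merge ratio with perplexity across all cycles; layers
# with a strong negative r want more of model 2 (and vice versa). Note this
# only captures linear, per-layer-independent effects - hence the doubt above.
for layer in range(ratios.shape[1]):
    r, p = pearsonr(ratios[:, layer], ppl)
    print(f"layer {layer:2d}: r={r:+.3f} (p={p:.3f})")
```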

tl;dr,
I will begin picking apart the SakanaAI paper as well as their repo to compare and contrast their implementation versus my thoughts on evo-merging. No doubt they have some best practices we can learn from for our implementation. I'll report any findings and share any scripts as soon as I get some traction on how they've accomplished their work.

SakanaAI
Paper: https://arxiv.org/abs/2403.13187
Repo: https://github.com/SakanaAI/evolutionary-model-merge

Much appreciation for everyone working together on this! I was excited when I saw SakanaAI's paper and I hoped that the fine people behind the mergekit project would tackle implementing it.

Thank you for all your hard work on this project. The open LLM ecosystem wouldn't be nearly as diverse or interesting right now if not for your efforts.

> I'm happy to outline some other ideas I've had for anyone who's interested (mostly regarding ensembles):

Just saw this linked on Hacker News (it's from February but somehow it never showed up when searching for LLMs+ensembles, etc):

https://arxiv.org/abs/2402.05120

This already shows quite large improvements and is using a very crude measure of similarity:

https://aclanthology.org/P02-1040.pdf
https://en.wikipedia.org/wiki/BLEU

To put this into context: using n-grams like this is like trying to decide which action to take in a game of poker where all you get to see is short snippets of action between 2-3 players, ignoring all that went before and all that came after! Imagine trying to judge the EV of your actions in a poker hand without knowing what the player after you did (it's not 100% like this, as there are likely lots of "semantic transpositions" in the word order; in a similar way to multiple chess move orders leading to the same position).

If anybody is seriously interested in exploring tree-rollout-based ensembles then I'm definitely interested too! The biggest hurdle for me is that everything is written in Python, which in turn relies so much on opaque libraries underneath. To get this to work, the very first thing would be investigating customised code to allow the KV cache to be rolled back/forward/stored, etc. efficiently.

I think llama.cpp would be a good place for this, but the code seems very unstable: it changes every week as new model architectures get hacked in, and it seems to get ever more complex as time goes on... ☹️

A more stable (even semi-dead) project that just implements the CUDA kernels in unquantized float16 or float32 would probably be best - it needs to be reasonably efficient but not necessarily targeted at running large quantized models to start with.

If that's not possible then I think even a crappy proof-of-concept CPU based implementation might be worthwhile: if it shows promise using small sub-7b models then it will likely attract more interest anyway and expand from there.

> @jukofyork there certainly might be something more direct like (1)

I'm pretty sure I read a paper a few weeks ago about shuffling the order of the blocks during training to break the serial dependency and encourage them to become more independent of their surrounding blocks, but I can't seem to find it and don't seem to have saved it anywhere :(

I think the paper and/or the architecture they created had "sort" in the title, and it was likely linked to this paper:

https://arxiv.org/abs/2402.02622

This is the paper:

https://arxiv.org/abs/2309.08968

I haven't had time to read it again, but I can't see anything about shuffling the block order, so it could actually be a paper it references that mentioned this (I see it references "LayerDrop", etc.).

Hi, this might be a bit of a late reply, but I've been working on something similar in my repository, which you can find here: https://github.com/Aratako/Task-Vector-Merge-Optimzier
As I am Japanese, the repository and all comments are in Japanese, but I hope it can still be of some help or provide useful insights for your work.
My work focuses on optimizing the addition ratios of Task Vectors, utilizing Optuna for optimization (supporting TPE and CMA-ES).
For evaluation, I use task scores from lm-evaluation-harness and LLM-as-a-judge scores from MT-Bench.
I believe there might be some similarities to what you are aiming for, and I would be glad if any part of my work could assist you in your endeavors.
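
For reference, the Optuna side of this is quite compact - a minimal sketch (here `merge_and_eval` is a hypothetical function that builds the task-vector merge from the suggested ratios and returns a benchmark score):

```python
import optuna

NUM_MODELS = 2  # illustrative: number of task vectors being mixed

def objective(trial: optuna.Trial) -> float:
    # One addition ratio per task vector, each searched over [0, 1].
    ratios = [trial.suggest_float(f"ratio_{i}", 0.0, 1.0) for i in range(NUM_MODELS)]
    return merge_and_eval(ratios)  # hypothetical: merge, then score with the harness

# CmaEsSampler requires the `cmaes` package; TPESampler is Optuna's default otherwise.
study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.CmaEsSampler())
study.optimize(objective, n_trials=100)
print(study.best_params)
```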

Hello, if anyone is actively working on the problem - I'd be happy to join in

Is anyone able to reply to this thread regarding where things left off or what came up for them?

I see there is a local implementation at https://github.com/arcee-ai/mergekit/blob/main/docs/evolve.md but it doesn't seem to be mentioned anywhere yet.

Has anything supplanted mergekit or evolutionary merging in the past few months, given that activity here has waned?