The DARE-TIES experiment.
I just wanted to pass on some "lab" results using DARE-TIES and Mistral Nemo.
I created a triple DARE-TIES merge of 3 pass-through "instruct/fine" models.
Each instruct/fine-tune model uses the same merge format:
slices:
  - sources:
      - model: g:/11b/Mistral-Nemo-Instruct-2407-12B
        layer_range: [0, 14]
  - sources:
      - model: G:/11B/Rocinante-12B-v1.1
        layer_range: [8, 24]
        parameters:
          scale:
            - filter: o_proj
              value: 1
            - filter: down_proj
              value: 1
            - value: 1
  - sources:
      - model: g:/11b/Mistral-Nemo-Instruct-2407-12B
        layer_range: [14, 22]
        parameters:
          scale:
            - filter: o_proj
              value: .5
            - filter: down_proj
              value: .5
            - value: 1
  - sources:
      - model: g:/11b/Mistral-Nemo-Instruct-2407-12B
        layer_range: [22, 31]
        parameters:
          scale:
            - filter: o_proj
              value: .75
            - filter: down_proj
              value: .75
            - value: 1
  - sources:
      - model: G:/11B/Rocinante-12B-v1.1
        layer_range: [24, 40]
        parameters:
          scale:
            - filter: o_proj
              value: 1
            - filter: down_proj
              value: 1
            - value: 1
merge_method: passthrough
dtype: bfloat16
THE DARE-TIES:
models:
  - model: E:/MN-Rocinante-12B-v1.1-Instruct
  - model: E:/MN-magnum-v2.5-12b-kto-Instruct
    parameters:
      weight: .6
      density: .8
  - model: E:/MN-12B-Celeste-V1.9-Instruct
    parameters:
      weight: .38
      density: .6
merge_method: dare_ties
tokenizer_source: union
base_model: E:/MN-Rocinante-12B-v1.1-Instruct
dtype: bfloat16
What is interesting here is that EACH TIME I run the "dare-ties" merge it creates a slightly different or VERY DIFFERENT model, despite no changes in the models or the settings.
This shows up in PPL and "real world" tests.
PPL range of 7.7327 to 7.8024 ... and that is on just 10 generations.
In real-world testing, the "core" changes -> wow.
Attribute, scale, word choice, sentence structure ... changes across the board.
I am not sure if this is a Mistral Nemo artifact or not.
I merged some of these 10 using breadcrumbs; wow.
All I can say.
When everything is F32 ... they shine even brighter.
Enough generations + merging of the "best DNA" could create truly legendary model(s).
Just saying - job well done and then some!!!
NOTE: Models for "fine/instruct" and "DARE-TIES" supermerges are posted at my repo.
If DARE-Ties gives dramatically different results each time, maybe I don't understand it correctly, but that sounds less like a good thing and more like a bad thing.
This all depends... in my first case it was bad, because I deleted the source and found out the hard way... and it was a great version.
That being said, by creating 10+ versions, the "DNA" of each model can be mapped, and these can be combined to create stronger models with specific attributes while reducing the negative ones.
One of the open questions is: Does this apply to other archs too? Llama2? 3? 3.1? ...
And some of the other mergekit methods also involve this same type of "random pruning".
I mapped these out after looking at the program code to verify operations.
A more interesting method or change may be pruning controls for DARE-TIES, which limit the range.
Thanks for sharing your results here!
DARE-TIES does have a randomized element, yeah - it's part of the algorithm by design. If you want more reproducible merges you can set a random seed by passing --random-seed <N> on the command line. I usually do when I'm iterating on a recipe that involves DARE.
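To illustrate why a seed matters here: below is a minimal sketch of DARE's random drop-and-rescale step in plain Python. It is not mergekit's actual code. The function name dare_prune and the toy delta values are made up for illustration, and density is assumed to mean the fraction of delta weights kept, as in the config in this thread.

```python
import random


def dare_prune(delta, density, seed=None):
    """Keep each delta entry with probability `density`; zero out the rest.

    Survivors are rescaled by 1/density so the expected value of each
    entry is unchanged. `seed=None` means a fresh random mask on every call.
    (Hypothetical helper for illustration, not a mergekit function.)
    """
    rng = random.Random(seed)
    return [d / density if rng.random() < density else 0.0 for d in delta]


# Toy "task vector" (fine-tuned weights minus base weights); values are made up.
delta = [0.10, -0.20, 0.05, 0.30, -0.15, 0.25, -0.05, 0.40]

run_a = dare_prune(delta, density=0.6, seed=1)
run_b = dare_prune(delta, density=0.6, seed=1)
run_c = dare_prune(delta, density=0.6, seed=2)

print(run_a == run_b)  # True: same seed gives the same mask
print(run_a == run_c)  # a different seed will usually give a different mask
```

With no seed, every call draws a new mask, which is consistent with each unseeded merge run coming out as a different model. A lower density drops more entries, so there is more room for runs to differ.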
Thank you; that was one of the questions I had. Thanks again... I think there is so much untapped potential in mergekit yet to be discovered.