kohya-ss/sd-scripts

Learning-Rate-Free Learning Algorithm

BootsofLagrangian opened this issue · 51 comments

Hi, how about D-Adaptation?

This is a kind of algorithm where the end user doesn't need to set a specific learning rate.

In short, D-Adaptation uses boundedness to find a proper learning rate.

So it might be useful to anyone who has a hard time finding hyperparameters.

Before writing this issue, I implemented the D-Adaptation optimizer (Adam) for LoRA. It works!

Only a few code changes are needed. But since I don't know the whole sd-scripts codebase, there is some hard-coding.

The only requirements for D-Adaptation are torch>=1.5.1 and pip install dadaptation.

Here are the changes.

In train_network.py
import torch.optim as optim # used for a plain learning rate scheduler
import dadaptation
and I hard-coded the optimizer change, replacing
optimizer = optimizer_class(trainable_params, lr=args.learning_rate)
with
optimizer = dadaptation.DAdaptAdam(trainable_params, lr=1.0, decouple=True, weight_decay=1.0)
Setting decouple=True means the optimizer behaves like AdamW rather than Adam, and weight_decay applies an L2 penalty.

The other arguments are (probably) not meant for end users.
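In isolation the optimizer is a drop-in replacement. Here is a standalone toy sketch (assuming pip install dadaptation; the linear model and data are made up purely for illustration) showing the wiring and how the adapted d*lr can be read back:

import torch
import dadaptation  # pip install dadaptation

# toy model and data, only to demonstrate the optimizer wiring
model = torch.nn.Linear(4, 1)
optimizer = dadaptation.DAdaptAdam(
    model.parameters(), lr=1.0, decouple=True, weight_decay=1.0
)

x, y = torch.randn(64, 4), torch.randn(64, 1)
for step in range(3):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    # d * lr is the effective, automatically estimated learning rate
    d_lr = optimizer.param_groups[0]["d"] * optimizer.param_groups[0]["lr"]
    print(f"step {step}: loss={loss.item():.4f}, d*lr={d_lr:.3e}")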

Since trainable_params no longer needs specific learning rates, replace
trainable_params = network.prepare_optimizer_params(args.text_encoder_lr, args.unet_lr)
with
trainable_params = network.prepare_optimizer_params(None, None)

In sd-scripts, lr_scheduler is returned by the get_scheduler_fix function.

I don't know why, but using get_scheduler_fix interferes with D-Adaptation,

so I overrode lr_scheduler with a LambdaLR. Sorry for the hard-coding again :)

lr_scheduler = optim.lr_scheduler.LambdaLR(optimizer=optimizer, lr_lambda=[lambda epoch: 1, lambda epoch: 1], last_epoch=-1, verbose=False)

To monitor the d*lr value,

logs['lr/d*lr'] = optimizer.param_groups[0]['d']*optimizer.param_groups[0]['lr']

might be needed. That's everything.


The attached image is a d*lr vs. step graph from a run with D-Adaptation.

I trained a LoRA using D-Adaptation; the result is here.

Thank you!

@BootsofLagrangian Would you be able to fork the repo and commit your changes so it would be easier for plebs like me to follow your changes?


Sorry, I'm not familiar with GitHub. It would take me a long time to make a fork or repo.

Also, these changes include hard-coded parts; does that matter?

When you fork, the code becomes your own, and you can hard code changes into your own copy. But that's ok. Maybe I'll do it, and @ you if I have any problems.

(Also, unless you're on mobile, forking is fast and easy, click fork, click done, and boom)


I made a fork.

I just changed train_network.py and requirements.txt

Cool stuff

This should be added as an official feature to the project. Like it!

@BootsofLagrangian I see that neither the TE LR nor the UNet LR is specified anymore. Do you know if D-Adaptation sets both to the same value? And if it does, is it possible to set them to different values? For LoRA, it used to be that setting the TE to a smaller LR than the UNet gave better results. I'm not sure how this handles each.

@bmaltais wouldn't you have to proc it twice with lr=1.0 for UNet and <1 for TE? Since in essence you have two different training problems going on at once?

From the source repo
Set the LR parameter to 1.0. This parameter is not ignored, rather, setting it larger to smaller will directly scale up or down the D-Adapted learning rate.
Sounds like 1.0 and 0.5 would match the settings commonly used (1e-4 and 5e-5)

And maybe D-Adaptation is best suited for the UNet, since underfitting the text encoder is often desirable.

@BootsofLagrangian you're awesome! Can't wait to play with it!


@bmaltais Yes, using the lr arguments you can make the TE LR and UNet LR different. @AI-Casanova's comment is also right.

I'm not sure, but properly using the get_scheduler_fix function in train_network.py would be the right way to apply different LRs,

or directly
lr_scheduler = optim.lr_scheduler.LambdaLR(optimizer=optimizer, lr_lambda=[lambda epoch: 0.5, lambda epoch: 1], last_epoch=-1, verbose=False)
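To be explicit about what those lambdas do: LambdaLR multiplies each parameter group's base lr by its own lambda, so with the D-Adaptation convention of lr=1.0 they act as per-group scaling factors. A tiny standalone illustration (plain SGD and dummy parameters, purely to show the mechanism, not sd-scripts code):

import torch

# two parameter groups (think: text encoder and U-Net), both with base lr=1.0
w_te = torch.nn.Parameter(torch.zeros(1))
w_unet = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([{"params": [w_te]}, {"params": [w_unet]}], lr=1.0)

# one lambda per group: the first group is scaled to 0.5, the second stays at 1.0
lr_scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=[lambda epoch: 0.5, lambda epoch: 1])

print([g["lr"] for g in optimizer.param_groups])  # [0.5, 1.0]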

I also discovered there are two other adaptive methods. I was shocked at how high the SGD method ramped the LR up (1.03e+00), but the results were still good. My god!

Sample from SGD training:

grid-0432

Link to python module for reference: https://pypi.org/project/dadaptation/

I intuitively knew that there must be a way of adjusting learning rate in a context dependent manner, but knew I was far too uninformed to come up with one. This is definitely cool stuff.

Quick comparison results from DAdaptAdam with TE:0.5 and UNet:1.0:

DAdaptAdam-1-1: loss: 0.125, dlr: 4.02e-05
DAdaptAdam-0.5-1: Loss: 0.124, dlr: 4.53e-05

DAdaptAdam-1-1:
grid-0436

DAdaptAdam-0.5-1:
grid-0434

I think the winner is clear. The TE LR needs to be half of the UNet LR... but there might be more optimal settings.

Optimizer config for both was: optimizer = dadaptation.DAdaptAdam(trainable_params, lr=1.0, decouple=True, weight_decay=0, d0=1e-6)

I will redo the same test but with an optimizer config of: optimizer = dadaptation.DAdaptSGD(trainable_params, lr=1.0, weight_decay=0, d0=1e-6)

@bmaltais how did you implement the split learning rate? Or did you run it twice?

@AI-Casanova I did it with

lr_scheduler = optim.lr_scheduler.LambdaLR(optimizer=optimizer, lr_lambda=[lambda epoch: 0.5, lambda epoch: 1], last_epoch=-1, verbose=False)

@bmaltais awesome! I should have pulled on that thread, but my self-taught lr for all things Python and ML is already through the roof. 😅

Here is an interesting finding. For DAdaptSGD, having both the TE and UNet lambdas at 1 is better than 0.5/1...

DAdaptSGD-1-1:
grid-0438

DAdaptSGD-0.5-1:
grid-0437

I wonder if having a weaker UNet with DAdaptSGD might be even better... like DAdaptSGD-1-0.5

Also, I have not been able to get anything out of DAdaptAdaGrad yet.

And here are the results of DAdaptSGD-1-0.5:

grid-0439

I think DAdaptSGD-1-1 is still the best config for that method.

Well... I am looking at the results and I am not so sure anymore... Maybe DAdaptSGD-1-0.5 is better...

SGD is stochastic gradient descent, right? Is that the same concept as SGD with batch size 1?

Or is SGD scheduling about not having weight decay like Adam does?

Is batch size 1 even SGD when using Adam?

Primary sources are impenetrable and secondary sources so unreliable on this stuff.
th-3506931634

Good question... I don't really know. But DAdaptAdam-0.5-1 appears to produce the best likeness of all the methods... so I might stick with that for now...

Published 1st model made with this new technique: https://civitai.com/models/8337/kim-wilde-1980s-pop-star

I'm experiencing what I think is a way overtrained TE, even at 0.5. All styling goes out the window before my UNet catches up.

I have to figure out how to log what the learning rates are independently.

It turns out @BootsofLagrangian was outputting the TE learning rate to the progress bar and logs, so what I thought was a suspiciously high UNet lr was actually an insanely high TE lr.

Dropped my scale to .25 .5 and trying again.

Unfortunately it's starting to look to me like I've replaced one grid search with another, with scaling factor in the place of lr

@AI-Casanova, you might need a different learning rate scheduler. My fork only uses LambdaLR (identity or scalar scaling).

This is a limitation that comes from not using the get_scheduler_fix function in sd-scripts.

Usually, Transformer models use an LR scheduler with warmup.

According to the dadaptation repo, applying the LR scheduler you used before also works fine.

@BootsofLagrangian basically what I was seeing is very good likenesses being made, but they were so inflexible.

I think I might have hit the sweet spot at 0.125 0.25 though.

It still adjusts to my datasets, and is in a similar range as before.

Now I'm gonna add a few other ideas to this fork.

@BootsofLagrangian
I have tried the fork and it seems to behave wrongly when the value of network_alpha is not equal to the value of network_dim.
Is it expected behavior that the smaller the network_alpha, the higher the learning rate?

With network_dim=128 and network_alpha=1, the model was destroyed after about 50 steps.

image


D-Adaptation uses the inverse of the model's subgradient. If you want the equations, the details are in the dadaptation paper.

A LoRA module is the product of two low-rank (r) matrices, B and A.

In the LoRA paper, alpha and rank appear as an external scaling factor on the module's output.

The output is multiplied by alpha and divided by rank.

So the alpha/rank ratio acts directly and sensitively on the subgradient.

In the destroyed case, alpha=1 and rank=128, so the alpha/rank ratio is 1/128. This makes the subgradient smaller.
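A rough numerical sketch of that scaling (generic PyTorch with made-up matrices, not the sd-scripts LoRA code):

import torch

def lora_delta(x, A, B, alpha, rank):
    # LoRA adds (alpha / rank) * B @ A @ x on top of the frozen weight's output
    return (alpha / rank) * (B @ (A @ x))

d, rank = 64, 128
A = torch.randn(rank, d) * 0.01
B = torch.randn(d, rank) * 0.01
x = torch.randn(d)

for alpha in (128, 1):
    print(alpha, lora_delta(x, A, B, alpha, rank).norm().item())
# with alpha=1 the update (and its gradient) is 128x smaller than with alpha=128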

Now, back to D-Adaptation: a small subgradient makes the learning rate higher, and a high learning rate blows the model up.

Therefore, it is highly recommended to set alpha and rank to the same value, especially when using a large rank.

Thank you for the comment and the experiments! :)


Understood.
In that case, it would be better to emit a warning when a small alpha is specified, etc., when the code is actually incorporated.

Thanks for your reply!

Screenshot 2023-02-16 at 23-51-52 TensorBoard dev
Screenshot 2023-02-16 at 23-50-59 TensorBoard dev

Tried out @BootsofLagrangian's fork; it works really well IMO. Green is D-Adaptation and orange is a 1e-4 learning rate (5e-5 for the text encoder). I also added anime regularization images to the green runs. Results shown below after 2000 steps, with --noise_offset=0.1.
grid-0246


The usage of my fork has changed.

Now the --use_dadaptation_optimzer argument activates D-Adaptation.

Learning rate, UNet LR, and TE LR are still available as arguments, but they are not used as ordinary LRs.

Single-digit float values are appropriate for LR, UNet LR, and TE LR, e.g. 1.0, 1.0, 0.5.

D-Adaptation optimizer is finally implemented. Thank you to @BootsofLagrangian for the PR and thank you all for the great research!

So glad I can ignore setting the LR now, thanks!! @BootsofLagrangian @kohya-ss
Had to look a bit, but it looks like it is under the "optimizer" setting.
image

So does this mean that if I set this option, I don't need to set an LR value / it ignores it?

Glad you found where the option was. It is indeed nice not to have to specify the LR.

Really appreciate your efforts; however, I seem to have had more success with the original post than with the current implementation. With the original one the results were great, and I'm still tweaking it. Do the above posts imply that the LR increases when applying a dampening factor (like network alpha), rather than decreasing as it normally does?

With the new one, the LR seems really low and can't produce results resembling the input images, even using 5 times the steps I normally use. But I may have mis-tweaked it. Have you had success?

Interesting. I had a feeling the new D-Adaptation was different from the one in the branch... Some day I will see if there is a way to enable the original method without having to use the old branch.

I have also been having issues with the d-adaptation implementation. I originally used it in PR form and it was working well, but when I tried it recently in various tests I couldn't get it to stop exploding and causing loss=nan. Also, learning seems very slow, even though it has a decent dlr (lower average magnitude/average strength). I tried upping the unet_lr and text_encoder_lr to 2, 1.15, 1.25, or lowering them to 0.75 (which I know isn't a multiplier), but still had poor results. I also tried optimizer_args of "decouple=True" and/or weight_decay from 0.2, 0.1, 0.01, 1e-4, 1e-5, 1e-6, and nothing seemed to help it improve.

I will try some tests on the code from BootsofLagrangian's branch to compare. I can also try to compare the code and see if there is something that stands out to me.

@rockerBOO
I found some changes between the first fork and the current D-Adaptation implementation.

The current version only uses one learning rate for the UNet and the text encoder.

I think this is the correct method, following the reference.

So it is recommended to use a lower UNet learning rate (e.g. 0.5) and the optimizer args "decouple=True" and "weight_decay=1.0".

Because D-Adaptation relies on boundedness, it will push d*lr as high as it can (it effectively uses the maximal learning rate).

Anyway, try "weight_decay=1.0" in optimizer_args and a lower UNet/TE lr coefficient (e.g. 0.5).

@BootsofLagrangian thanks for taking a look!

Settings:

network_dim=16
network_alpha=8
unet_lr=0.5
text_encoder_lr=0.5
optimizer_type="DAdaptation"
optimizer_args=["decouple=True", "weight_decay=1.0"]
lr_scheduler="constant"
min_snr_gamma=5

tried weight decay 0.5, 1.5 as well.

Screenshot 2023-04-21 at 15-23-14 TensorBoard
Screenshot 2023-04-21 at 15-33-43 TensorBoard

And the last light blue is without min_snr_gamma but has the same problem

Screenshot 2023-04-21 at 15-36-17 TensorBoard

Once the loss starts going up it starts producing noise, climbs fast, and never recovers. Using the same dataset and settings with AdamW (changing only the learning rate, with low or no weight decay) produces good results, so the problem lies within the reported settings being changed.

Example of the noise:
landscape-5-2023-04-21-152608-ce5def3a_20230421153540_000050_03_2661845567
landscape-5-2023-04-21-151628-dfe93462_20230421152026_000020_07_2661845567

Edit: Also noting I'm running batch size: 2, gradient accumulation steps: 24 in these tests. Maybe this is impacting how it is working?

@rockerBOO

First, rank (dimension) and alpha should be the same value with D-Adaptation. The α/r ratio has a direct impact on the learning rate and the weights (the model). d*lr will increase when α/r decreases. So controlling the α and r values is an important and sensitive thing.

Second, I haven't experimented with min_snr_gamma, but I think min_snr_gamma accelerates training, and so does D-Adaptation. Combining the two methods makes the model explode at an earlier step. (There is also some math needed to deeply understand the assumptions behind D-Adaptation. It supposes the model is a kind of Lipschitz function, but the SD model isn't. So, mathematically, D-Adaptation does not guarantee that the automatically chosen lr leads the model to convergence, and D-Adaptation combined with other speed-up methods can blow the model up.)

Third, the lr scheduler may also matter. Most Transformer models (including Stable Diffusion) use a learning rate scheduler with warmup or restarts. It helps the model update with small weight changes (ΔW) and reach a good minimum. You might want to consider an lr scheduler with warmup and restarts (I recommend lr_scheduler=cosine_with_restarts and lr_warmup=[5-10% of total steps]).

Thanks for these suggestions @BootsofLagrangian . I am still working through the different permutations and having varying results. Trying to isolate it to specific parameters that may be having a larger impact. I will try to assess and report back.

network_dim=16
network_alpha=16 # match dim
unet_lr=0.5 # the highest value of these will be the learning rate
text_encoder_lr=0.5 # the highest value of these will be the learning rate
optimizer_type="DAdaptAdam"
optimizer_args=["decouple=True", "weight_decay=1.0"] # weight decay may not be necessary, can help with overfitting, play with different values and look up for more info
lr_scheduler="cosine_with_restarts"
lr_warmup_steps=350 # 5-10% of total steps

In my initial findings, 0.5 LR, matching rank and alpha, no min_snr_gamma (mostly to remove a variable), and using warmup with cosine_with_restarts seemed to work a lot better. But it's not consistently working better with these options, and I'm trying other options as well, so I haven't pinned anything down.

I would say a warmup is ideal with d-adaptation in my experimentation, as it tempers the dynamic learning rate. If the warmup is too long or too short it can drastically affect the dynamic learning rate, in my limited experience (needs more testing). The cycling learning rate also seems to help temper, or let expand, the dynamic learning rate somewhat.
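For anyone who wants to see the warmup mechanics outside of sd-scripts' built-in lr_warmup_steps option, a generic sketch (toy model; the 350-step warmup mirrors the 5-10% suggestion above):

import torch
import dadaptation

model = torch.nn.Linear(4, 1)
optimizer = dadaptation.DAdaptAdam(model.parameters(), lr=1.0, decouple=True)

warmup_steps = 350  # roughly 5-10% of total steps, per the suggestion above

# linear warmup from ~0 to the full auto-estimated rate, then constant;
# call lr_scheduler.step() once after each optimizer.step() during training
lr_scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))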

Noob here...
Do you see better results at lower rates? Setting 0.5, 0.5, 0.25? ... 0.25, 0.25, 0.125? I feel like I do. What about doing 5 epochs at 1, 1, 0.5, then stopping and continuing at half the rate, and repeating? I am training in kohya and am unfamiliar with writing my own step-down code, so I am doing it manually. Any thoughts on this process? Is it placebo, or does it give amazing results?

@DarksealStudios

After some epochs, stepping the learning rate down is a useful method, not a placebo effect. Most learning rate schedulers do exactly that.
If you are interested in this effect, search for the keywords 'local minimum', 'learning rate scheduling', and 'decaying learning rate'.
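For example, PyTorch's StepLR does this kind of step-down automatically, so there is no need to stop and restart training by hand (a generic sketch with a dummy model, not kohya-specific):

import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)

# halve the learning rate every 5 epochs instead of restarting runs manually
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

for epoch in range(15):
    # ... one epoch of training would go here ...
    lr_scheduler.step()

print(optimizer.param_groups[0]["lr"])  # 0.125 after 15 epochs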

I only asked BootsofLagrangian because of the language kohya uses when training begins. Its wording confused me... that, and once I read up on the schedulers, it seemed like they already do what I was trying to mimic manually. Thank you for letting me know kohya is not overruling the settings (right?). For example, when I set 1, .5, 1... the text learning rate is .5, but the wording makes it sound like all settings were changed to .5... I'll have to copy it next time, but I'm sure you know the text I'm talking about, something about using only the "first" setting. Anyway, thank you!

Hi all - I was wondering how you are specifying different LRs for the UNet and the text encoder?
Unless I specify the same values in the UI, I just get the following error:

RuntimeError: Setting different lr values in different parameter groups is only supported for values of 0

Was this something that was changed in a recent update?
There are some other recent reports of this, e.g. #555

@phasiclabs I do believe that the newest version of dadaptation only allows 0 or 1 for each of the TE and UNet LRs.

This was the original implementation that allowed for a scalar

Ah, ok thanks for the info - just found this post too #274
But I'm not seeing that (more informative) message!



@rockerBOO Have you solved this issue (the loss blowing up and producing noise)? I'm running into the same thing.

@wuliebucha I think he fixed that issue by fixing the alpha value.

The D-Adaptation optimizer is very sensitive to the ratio of alpha and rank.

You need to set alpha to the same value as rank. If alpha is lower than rank, the model can easily blow up.