[Help] How can I avoid saving early checkpoints during fine-tuning?

Question

ybdesire opened this issue a year ago · 3 comments

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

For example, with the following fine-tuning configuration:

    --max_steps 3000 \
    --save_steps 5 \

With this configuration, checkpoints are saved at steps 5, 10, 15, 20, ..., 3000, which is far too many.

I want to skip every step below 2000 and save checkpoints only at steps 2000, 2005, 2010, ..., 3000. How should I configure this?
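
To put numbers on the difference, here is a small sketch that only counts the steps at which a checkpoint would be written. The helper `checkpoint_steps` and its `min_save_step` parameter are hypothetical names for illustration; they are not Transformers options.

```python
def checkpoint_steps(max_steps, save_steps, min_save_step=0):
    """Return the steps at which a checkpoint is written for a fixed save interval.

    `min_save_step` is a hypothetical threshold: steps below it are skipped.
    """
    return [
        step
        for step in range(save_steps, max_steps + 1, save_steps)
        if step >= min_save_step
    ]


# Current behavior: --max_steps 3000 --save_steps 5
current = checkpoint_steps(3000, 5)
# Desired behavior: same interval, but nothing before step 2000
wanted = checkpoint_steps(3000, 5, min_save_step=2000)

print(len(current), len(wanted))  # 600 201
```

So the current configuration writes 600 checkpoints, while the desired behavior would write 201.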

Expected Behavior

No response

Steps To Reproduce

    --max_steps 3000 \
    --save_steps 5 \

Environment

OS: Ubuntu 20.04
Python: 3.8
Transformers: 4.26.1
PyTorch: 1.12
CUDA Support: True

Anything else?

No response

Answer 1 · 2024-01-20T13:30:19.000Z

(1) First train a run up to step 2000, with:
--max_steps 2000
--save_steps 2000
(2) Then continue training from that checkpoint, with:
--max_steps 3000
--save_steps 5

Answer 2 · 2024-01-21T02:33:37.000Z

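Thanks for the reply, that is one approach.
Is there a way to do this in a single training run? On some platforms a submitted training job cannot be interrupted and then resumed like this.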

Answer 3 · 2024-01-21T06:22:08.000Z

I'm not sure about that, sorry.
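
For the single-run case, one possible approach that nobody in this thread has confirmed is a custom `TrainerCallback` that cancels the save decision for early steps. The sketch below assumes the fine-tuning script builds a Transformers `Trainer` (or a subclass), and that user callbacks run after the built-in flow callback that sets `control.should_save` from `--save_steps`. The class name `SkipEarlyCheckpoints` and the parameter `min_save_step` are made up for illustration.

```python
from types import SimpleNamespace

try:
    from transformers import TrainerCallback
except ImportError:
    # Fallback stub so this sketch can be run without transformers installed.
    class TrainerCallback:
        pass


class SkipEarlyCheckpoints(TrainerCallback):
    """Hypothetical callback: cancel checkpoint saves before `min_save_step`."""

    def __init__(self, min_save_step):
        self.min_save_step = min_save_step

    def on_step_end(self, args, state, control, **kwargs):
        # Assumption: the default flow callback has already set
        # control.should_save based on --save_steps; clear it for early steps.
        if state.global_step < self.min_save_step:
            control.should_save = False
        return control


# Quick check with stand-in objects instead of a real Trainer.
callback = SkipEarlyCheckpoints(min_save_step=2000)
early = callback.on_step_end(
    None, SimpleNamespace(global_step=1995), SimpleNamespace(should_save=True)
)
late = callback.on_step_end(
    None, SimpleNamespace(global_step=2000), SimpleNamespace(should_save=True)
)
print(early.should_save, late.should_save)

# With a real Trainer instance (assumed to be named `trainer`):
# trainer.add_callback(SkipEarlyCheckpoints(min_save_step=2000))
```

The original `--max_steps 3000 --save_steps 5` flags would stay unchanged; the callback only vetoes saves before step 2000. A related option is `--save_total_limit`, but it deletes older checkpoints after they have been written rather than skipping them, so the early checkpoints still cost I/O.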