Training pipeline for insurance-domain Large Language Models: pretraining, instruction learning, reward learning, reinforcement learning. Environment setup: Docker.