```bash
git clone https://github.com/zw-SIMM/SFTLLMs_for_chemtext_mining
cd SFTLLMs_for_chemtext_mining
```
Preprocessed data, fine-tuning code, and README workflows are placed in the corresponding folders:

- Paragraph2Compound/
- Paragraph2RXNRole/prod/ and Paragraph2RXNRole/role/
- Paragraph2MOFInfo/
- Paragraph2NMR/
- Paragraph2Action/ (the dataset is derived from the Pistachio dataset, which is available upon request)
```bash
pip install openai
pip install pandas
```
Note: the fine-tuning code differs slightly now that the openai package has been updated to v1.0.0+. Here, we provide the latest code.
Specific scripts for each task are in the corresponding folders.
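For reference, launching a fine-tuning job with the v1.0.0+ openai client looks roughly like this (a minimal sketch; the file name and base model are placeholders, see the task folders for the actual scripts):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the prepared chat-format JSONL training file
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Create the fine-tuning job on a chat model
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```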
All notebooks for fine-tuning and prompt-engineering GPTs (GPT-4, GPT-3.5), as well as for evaluation on each task, have been released!
Here, we give an example notebook of fine-tuning ChatGPT on 25 Paragraph2NMR data points in demo/fine-tuning_chatgpt_on_25_paragraph2NMR_data.ipynb, including (a preprocessing sketch follows this list):
- Preprocessing
- Training
- Inferencing
- Evaluating
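The preprocessing step converts paragraph/label pairs into the chat-format JSONL that the OpenAI fine-tuning endpoint expects. A minimal sketch, assuming a hypothetical CSV with `paragraph` and `nmr` columns (the real file and column names live in the demo notebook):

```python
import json
import pandas as pd

# Hypothetical input file and columns; adapt to the actual Paragraph2NMR data.
df = pd.read_csv("paragraph2nmr_train.csv")

with open("train.jsonl", "w") as f:
    for _, row in df.iterrows():
        record = {"messages": [
            {"role": "system", "content": "Extract the NMR data from the paragraph."},
            {"role": "user", "content": row["paragraph"]},
            {"role": "assistant", "content": row["nmr"]},
        ]}
        f.write(json.dumps(record) + "\n")
```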
```bash
mamba create -n llm python=3.10
mamba activate llm
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pandas numpy ipywidgets tqdm
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple torch==2.1.2 transformers==4.38.2 datasets tiktoken wandb==0.11 openpyxl
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple peft==0.8.0 accelerate bitsandbytes safetensors jsonlines
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple vllm==0.3.1
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple trl==0.7
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple tensorboardX tensorboard
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple textdistance nltk matplotlib seaborn seqeval
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple modelscope
```
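After installation, a quick sanity check (a minimal sketch) can confirm that the pinned versions resolved and that torch sees the GPU:

```python
import torch
import transformers

print("torch:", torch.__version__)                 # expect 2.1.2
print("transformers:", transformers.__version__)   # expect 4.38.2
print("CUDA available:", torch.cuda.is_available())
```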
Open-source pretrained models (Llama3, Llama2, Mistral, Bart, T5) can be downloaded from Hugging Face or ModelScope.
Here is an example of downloading pretrained models via a script on a Linux server from ModelScope:
```python
from modelscope import snapshot_download

model_dir = snapshot_download("LLM-Research/Meta-Llama-3-8B-Instruct", revision='master', cache_dir='/home/pretrained_models')
model_dir = snapshot_download('AI-ModelScope/Mistral-7B-Instruct-v0.2', revision='master', cache_dir='/home/pretrained_models')
```
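Once downloaded, the returned `model_dir` can be passed straight to transformers. A minimal loading sketch (`device_map="auto"` relies on the accelerate package installed above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype="auto",   # use the checkpoint's native dtype
    device_map="auto",    # spread layers across available GPUs
)
```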
The code and tutorials for fine-tuning language models (ChatGPT, Llama3, Llama2, Mistral, Bart, T5) on each task are in the corresponding folders.
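Since vllm==0.3.1 is part of the environment, batched inference with a downloaded (or fine-tuned and merged) model might look like this; a minimal sketch, with the model path and prompt as placeholders:

```python
from vllm import LLM, SamplingParams

# Placeholder path: where snapshot_download cached the model above
llm = LLM(model="/home/pretrained_models/LLM-Research/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=512)

outputs = llm.generate(["Extract the compounds from: ..."], params)
print(outputs[0].outputs[0].text)
```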