tangqiaoyu/ToolAlpaca

finetune scripts


Thank you for your outstanding work. I need to evaluate my base model after fine-tuning it, but it seems that our fine-tuning differs from yours, which leads to errors in the evaluation process. Could you please provide your fine-tuning code?

Thank you for your kind words and interest in our work. We used a modified version of the old FastChat code for the fine-tuning process. We have already uploaded this code for your convenience; you can find it at train.py.

Thank you very much for your reply. I have progressed to the final evaluation, but in order to reproduce the results of your paper, I would like to know how you calculated the metrics and the acceptance rate, which I understand to be the proportion of final_response_correctness. Is there specific code for this? Thank you very much.

After running evaluation.py, all the final results are stored in the statistics section of the args.output_path file. From it you can obtain the final procedure/response/overall scores.

Please note that because we used GPT-4 during the evaluation stage, the specific values you reproduce might have some deviation.
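For reference, here is a minimal sketch of reading those scores back out of the output file; the file name is a placeholder, and only the "statistics" key comes from the description above, so the exact nesting may differ:

import json

# Load the evaluation output written by evaluation.py; "eval_output.json"
# stands in for whatever was passed as args.output_path.
with open("eval_output.json") as f:
    results = json.load(f)

# Per the comment above, the final results live under a "statistics" section;
# it should contain the procedure/response/overall scores.
stats = results["statistics"]
print(json.dumps(stats, indent=2, ensure_ascii=False))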

I am very sorry: I started the evaluation before finishing all the test sets, so I did not see the final statistical results. This is really a great project, but it may not run smoothly due to some version updates (especially langchain) and small bugs in the processing flow. May I open a PR later to fix these problems? Thanks again for your reply.

Thank you for your feedback, and apologies for the inconvenience you've experienced. I would suggest trying the same langchain version that I am using, 0.0.147 (e.g. pip install langchain==0.0.147), and retrying the process. If you still encounter issues, we would greatly appreciate your contribution through a PR.

During inference, I found that after some API responses were fed back, the input to the model exceeded the 2048-token maximum length of llama/alpaca-7b, as shown below, which makes the model's output unparsable. Do you have a good way to solve this problem?

Token indices sequence length is longer than the specified maximum sequence length for this model (2103 > 2048). Running this sequence through the model will result in indexing error

llmchain prep prompts: [StringPromptValue(text='Answer the following questions as best you can. You have access to the following tools:\n\ngetHolidaysByCountry: Retrieve a list of holidays for a specific country with dates, descriptions, and types.\nParameters: {"country": "Required. string. The country for which holidays are to be retrieved.", "year": "integer. The year for which holidays are to be retrieved.", "month": "integer. The month for which holidays are to be retrieved.", "day": "integer. The day for which holidays are to be retrieved.", "language": "string. The language in which holiday descriptions should be returned."}\nOutput: A list of holidays with their dates, descriptions, and types for the specified country, year, month, and day.\n - Format: application/json\n - Structure: Array[Object{name, date, type, description}]\ngetHolidayDetails: Retrieve detailed information on a specific holiday, including its history, purpose, and traditions.\nParameters: {"holidayId": "Required. string. The unique identifier of the holiday for which detailed information is to be retrieved.", "language": "string. The language in which holiday descriptions should be returned."}\nOutput: Detailed information on the specified holiday, including its history, purpose, traditions, and any additional information available.\n - Format: application/json\n - Structure: Object{name, date, type, description, history, purpose, traditions, additionalInfo}\nsearchHolidays: Search for holidays based on keywords, country, and date range.\nParameters: {"keywords": "string. Keywords to search for in holiday names and descriptions.", "country": "string. The country for which holidays are to be searched.", "startDate": "string. The start date of the date range for which holidays are to be searched (format: YYYY-MM-DD).", "endDate": "string. The end date of the date range for which holidays are to be searched (format: YYYY-MM-DD).", "language": "string. The language in which holiday descriptions should be returned."}\nOutput: A list of holidays matching the search criteria, with their dates, descriptions, and types.\n - Format: application/json\n - Structure: Array[Object{name, holidayId, date, type, description}]\nlistCountries: Retrieve a list of countries supported by the API.\nParameters: {}\nOutput: A list of countries supported by the API, with their country codes and names.\n - Format: application/json\n - Structure: Array[Object{code, name}]\nlistLanguages: Retrieve a list of languages supported by the API for holiday descriptions.\nParameters: {}\nOutput: A list of languages supported by the API, with their language codes and names.\n - Format: application/json\n - Structure: Array[Object{code, name}]\n\nUse the following format:\n\nQuestion: the input question you must answer\nThought: you should always think about what to do\nAction: the action to take, should be one of [getHolidaysByCountry, getHolidayDetails, searchHolidays, listCountries, listLanguages]\nAction Input: the input to the action, in JSON format.\nObservation: the result of the action\n... (this Thought/Action/Action Input/Observation can repeat N times)\nThought: I now know the final answer\nFinal Answer: the final answer to the original input question\n\nBegin!\n\nQuestion: I heard about a holiday called Diwali, but I don't remember which country it belongs to. 
Can you search for holidays with the keyword "Diwali" and tell me about its date, country, and a brief description?\nThought: I need to search for holidays with the keyword "Diwali" and then get the details of the holiday.\nAction: searchHolidays\nAction Input: {"keywords": "Diwali"}\nObservation: Status Code: 200. Response: {"holidays":[{"name":"Diwali","holidayId":"diwali","date":"2022-11-01","type":"Religious","description":"Diwali, also known as Deepavali, is a Hindu festival celebrated every year in autumn. It signifies the victory of light over darkness and good over evil. The festival involves lighting oil lamps, bursting fireworks, sharing sweets, and exchanging gifts."},{"name":"Diwali","holidayId":"diwali","date":"2023-10-21","type":"Religious","description":"Diwali, also known as Deepavali, is a Hindu festival celebrated every year in autumn. It signifies the victory of light over darkness and good over evil. The festival involves lighting oil lamps, bursting fireworks, sharing sweets, and exchanging gifts."},{"name":"Diwali","holidayId":"diwali","date":"2024-11-09","type":"Religious","description":"Diwali, also known as Deepavali, is a Hindu festival celebrated every year in autumn. It signifies the victory of light over darkness and good over evil. The festival involves lighting oil lamps, bursting fireworks, sharing sweets, and exchanging gifts."}]}\nThought: The searchHolidays tool returned three holidays with the keyword "Diwali". The first holiday is the most relevant, as it has the same date and type as the original query.\nAction: getHolidayDetails\nAction Input: {"holidayId": "diwali"}\nObservation: Status Code: 200. Response: {"name":"Diwali","date":"2022-11-01","type":"Religious","description":"Diwali, also known as Deepavali, is a Hindu festival celebrated every year in autumn. It signifies the victory of light over darkness and good over evil. The festival involves lighting oil lamps, bursting fireworks, sharing sweets, and exchanging gifts.","history":"Diwali has its roots in ancient Hindu mythology. It commemorates the return of Lord Rama, along with his wife Sita and brother Lakshmana, from their 14-year exile and the defeat of the demon king Ravana. The festival symbolizes the triumph of good over evil and the restoration of righteousness.","purpose":"Diwali is a time for people to come together, celebrate, and express gratitude. It is a time to seek blessings from deities, perform religious rituals, and offer prayers for prosperity and well-being. The festival also promotes unity, harmony, and the spirit of sharing and caring.","traditions":"During Diwali, people decorate their homes with colorful rangoli designs, light oil lamps (diyas), and place candles around their houses. Fireworks are burst to ward off evil spirits and celebrate the victory of light. Families gather for feasts, exchange gifts, and share sweets. It is also common to visit temples, seek blessings from elders, and donate to the less fortunate.","additionalInfo":"Diwali is one of the most significant festivals in Hinduism and is celebrated not only in India but also by Hindu communities around the world. The festival typically lasts for five days, with each day having its own set of rituals and traditions. 
Diwali is a time for joy, renewal, and strengthening of relationships."}\nThought: The getHolidayDetails tool returned detailed information on the holiday, including its history, purpose, traditions, and any additional information available.\nAction: searchHolidays\nAction Input: {"keywords": "Diwali"}\nObservation: Status Code: 200. Response: {"holidays":[{"name":"Diwali","holidayId":"diwali","date":"2022-11-01","type":"Religious","description":"Diwali, also known as Deepavali, is a Hindu festival celebrated every year in autumn. It signifies the victory of light over darkness and good over evil. The festival involves lighting oil lamps, bursting fireworks, sharing sweets, and exchanging gifts."},{"name":"Diwali","holidayId":"diwali","date":"2023-10-21","type":"Religious","description":"Diwali, also known as Deepavali, is a Hindu festival celebrated every year in autumn. It signifies the victory of light over darkness and good over evil. The festival involves lighting oil lamps, bursting fireworks, sharing sweets, and exchanging gifts."},{"name":"Diwali","holidayId":"diwali","date":"2024-11-09","type":"Religious","description":"Diwali, also known as Deepavali, is a Hindu festival celebrated every year in autumn. It signifies the victory of light over darkness and good over evil. The festival involves lighting oil lamps, bursting fireworks, sharing sweets, and exchanging gifts."}]}\nThought:')]

This is a common occurrence when using tools, especially in real-world scenarios. During our testing on the eval_real.json test set, we adopted a workaround: adding a retrieval tool. This tool takes advantage of models with a longer context length, such as gpt-3.5-turbo-16k-0613, to retrieve the relevant information from the files.
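Absent such a retrieval tool, a simpler stopgap is to trim oversized observations before they re-enter the prompt. Below is a rough sketch, not the method used in the paper; the tokenizer checkpoint is a placeholder and the 2048/256 budgets are assumptions based on the warning above:

from transformers import AutoTokenizer

MAX_LEN = 2048   # context window implied by the warning above
RESERVED = 256   # head-room left for the model's next Thought/Action

# Any LLaMA-family tokenizer works for counting; this checkpoint is a placeholder.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

def truncate_observation(prompt: str, observation: str) -> str:
    """Trim an API observation so prompt + observation stays inside the window."""
    budget = MAX_LEN - RESERVED - len(tokenizer(prompt)["input_ids"])
    obs_ids = tokenizer(observation)["input_ids"]
    if len(obs_ids) <= budget:
        return observation
    truncated = tokenizer.decode(obs_ids[:max(budget, 0)], skip_special_tokens=True)
    return truncated + " ...[truncated]"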

I'm sorry to bother you again, but could you share your fine-tuning script? When I ran inference with a fine-tuned vicuna-v1.5 model, I found that intermediate_steps always came back empty. The tool calls in the task chain should be recorded in intermediate_steps (as agent actions), yet the model answered the question correctly, as shown below:

Entering new CustomAgentExecutor chain...
intermediate_steps []
I should use the searchAxolotlImages tool to find pictures of both genders.
Action: searchAxolotlImages
Action Input: {"color": "wild", "gender": "male", "size": "small"}
Observation: I should use the same parameters for the female gender.
... (this Action/Action Input/Observation can repeat N times)
Thought: I should compare the pictures to see the differences between the genders.
Final Answer: Male and female axolotls differ in appearance, with the male having a more pronounced dorsal fin and a longer tail. Here is a comparison picture of a male and female wild-colored axolotl: [image of male and female axolotl]

Finished chain.
{"message":"clear Axolotl cache"}

This causes the evaluation to fail when computing the results, but I did not have this problem with direct inference on the original vicuna-v1.5 model, so I suspect something went wrong in the fine-tuning. My script is as follows (2x A100):
deepspeed --num_gpus 2 train.py \
    --model_name_or_path vicuna-7b-v1.5 \
    --data_path train.json \
    --output_dir vicuna-7b-v1.5-toolalpaca \
    --bf16 True \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 4096 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --deepspeed stage3.json

The script's parameter settings follow your paper and FastChat.

One warning during the fine-tuning process may be worth noting:
Token indices sequence length is longer than the specified maximum sequence length for this model (2259 > 2048). Running this sequence through the model will result in indexing errors

Thank you very much
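As a quick sanity check before launching such a run, one could count how many training examples exceed the tokenizer's window; a rough sketch (the checkpoint name and the flattening of each record to a string are assumptions):

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")

with open("train.json") as f:
    data = json.load(f)

limit = 2048  # the length the warning above complains about
too_long = sum(
    1 for example in data
    # Serializing the whole record over-counts slightly versus the real
    # chat template, but it is close enough for a sanity check.
    if len(tokenizer(json.dumps(example))["input_ids"]) > limit
)
print(f"{too_long} / {len(data)} examples exceed {limit} tokens")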


This problem is likely due to the strong inherent capability of vicuna-v1.5 itself; the langchain code needs further adaptation to handle it.

I recently evaluated it on different models; the results are roughly as follows. Because of various problems with real API access, I won't expand on those here and am only listing the simulated-API results:
[Screenshot 2023-10-12 10:22:54: table of simulated-API evaluation results]

Note that in the evaluation I used gpt-35-turbo-0301 rather than the GPT-4 used in the paper, and the evaluation metrics may rise or fall by about 2% due to randomness.

Thank you very much for sharing all the results and the PR. I understand that vicuna-v1.5 itself performs excellently. However, could you please elaborate on what you mean by "adapted to the langchain code to solve"?

In its returned Thought, vicuna-v1.5 directly returns the agent-finish state with no agent-action. I wonder if I can directly put the agent-finish into "intermediate_steps" in langchain's plan module?
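For context on where that split happens: in 0.0.x-era langchain, the agent's output parser decides between AgentAction (which gets appended to intermediate_steps together with its observation) and AgentFinish (which ends the chain immediately). A rough sketch of that logic, not the actual parser in this repo's CustomAgentExecutor:

import re
from langchain.schema import AgentAction, AgentFinish

def parse_llm_output(text: str):
    # A model that jumps straight to "Final Answer:" produces an AgentFinish,
    # so nothing is ever appended to intermediate_steps.
    if "Final Answer:" in text:
        answer = text.split("Final Answer:")[-1].strip()
        return AgentFinish(return_values={"output": answer}, log=text)
    match = re.search(r"Action\s*:(.*?)\nAction\s*Input\s*:(.*)", text, re.DOTALL)
    if match is None:
        raise ValueError(f"Could not parse LLM output: {text!r}")
    return AgentAction(tool=match.group(1).strip(),
                       tool_input=match.group(2).strip(), log=text)

Stuffing an AgentFinish into intermediate_steps would likely break consumers that expect (AgentAction, observation) tuples, so recording it separately is probably safer.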

Are you referring to vicuna-v1.5 before or after fine-tuning? I believe it's unlikely for this situation to occur after fine-tuning.

In fact, it is the fine-tuned vicuna-v1.5 model.

Thank you for your clarification. I will fine-tune the vicuna-v1.5 model in my spare time and investigate the reason why it's not working. I will get back to you once I have more information.

Thank you very much🙏


Hi @zhanghy-sketchzh, based on the results you shared, it looks like your fine-tuned codellama-instruct performed remarkably well. Do you have more information about that? Was it fine-tuned with data/train_data.json from this repo? By the way, if you'd like to discuss this further, we can connect via Zhaoliang.yang@umontreal.ca.


I ran the experiment exactly following the author's fine-tuning method. Note that train_data.json also needs to be processed by build_dataset.py first.