OpenDFM/Rememberer

Recreate results found in table 1

Hi, I wanted to check whether running launchw.sh is the command that reproduces the numbers in Table 1. I'm trying to rerun REMEMBERER with gpt-3.5-instruct-0913, because davinci-003 is no longer accessible on the OpenAI platform.

But the results I got are quite low, with a success rate of only 0.07:

[2024-05-02 12:45:31,856 INFO webshop/186-MainProcess] END! TaskIdx: 99, TaskId: 99, #Steps: 4(0), Reward: 0.50, Succeds: False
[2024-05-02 12:45:31,856 INFO webshop/189-MainProcess] ──────────8.44──────────0.254──────────0.070──────────
[2024-05-02 12:45:31,857 INFO webshop/497-MainProcess] ━━━━━━━━━━━━━━━━━━━Epoch 0━━━━━━━━━━━━━━━━━━━━
[2024-05-02 12:45:31,857 INFO webshop/498-MainProcess] Size: 4, Avg AD Size: 1
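
If I'm reading the summary line correctly, the three dashed numbers are the average step count, average reward, and success rate over the 100 test tasks, i.e. 8.44 / 0.254 / 0.070.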

I was wondering whether there are any params I didn't get right in launchw.sh.

This was the command found in launchw.sh:

python webshop.py --log-dir logs \
                  --observation-mode text_rich \
                  --load-replay history-pools/init_pool.wq.yaml \
                  --load-replay history-pools/init_pool.wq.yaml \
                  --save-replay history-pools/init_pool.wqu."$date_str".%d.a.yaml \
                  --save-replay history-pools/init_pool.wqu."$date_str".%d.b.yaml \
                  --item-capacity 500 \
                  --action-capacity 20 \
                  --matcher pgpat+insrel \
                  --prompt-template prompts/ \
                  --max-tokens 200 \
                  --stop "Discouraged" \
                  --request-timeout 10. \
                  --starts-from 0 \
                  --epochs 3 \
                  --trainseta 0 \
                  --trainsetb 10 \
                  --testseta 0 \
                  --testsetb 100

Hello, thanks for your question. Our recent results on another task set also reveal a performance decrease of gpt-3.5-instruct compared to text-davinci-003 on decision-making tasks. This may be attributable to variation in the base capability of the GPT models. Besides, it is weird that your history memory size is only 4 after an epoch of training on 10 tasks. Could you please double-check your training process? Currently, I don't see any unusual arguments in your launch command.

@zdy023 It seems I didn't run with the --train arguments; now I get a much larger history memory size. However, with the new arguments I don't get a higher success rate (0.022 compared to 0.070). Is this normal on your side?
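
For reference, the rerun was essentially the launch command above with the training switch added. A sketch of what I ran — I'm assuming here that webshop.py accepts a bare --train flag; if the actual argument name differs, python webshop.py --help should list it:

# same arguments as in launchw.sh, plus --train to enable the training phase
python webshop.py --train \
                  --log-dir logs \
                  --observation-mode text_rich \
                  --load-replay history-pools/init_pool.wq.yaml \
                  --load-replay history-pools/init_pool.wq.yaml \
                  --save-replay history-pools/init_pool.wqu."$date_str".%d.a.yaml \
                  --save-replay history-pools/init_pool.wqu."$date_str".%d.b.yaml \
                  --item-capacity 500 \
                  --action-capacity 20 \
                  --matcher pgpat+insrel \
                  --prompt-template prompts/ \
                  --max-tokens 200 \
                  --stop "Discouraged" \
                  --request-timeout 10. \
                  --starts-from 0 \
                  --epochs 3 \
                  --trainseta 0 \
                  --trainsetb 10 \
                  --testseta 0 \
                  --testsetb 100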

Hello, I don't think this is a normal result. I haven't conducted experiments on WebShop with gpt-instruct yet. I will follow your settings and try to reproduce the results in the coming weeks when I'm free.

@theblackcat102 Hello, just to be sure, are you using the model gpt-3.5-turbo-instruct? I don't see a model named gpt-3.5-instruct-0913 in OpenAI's online documentation.

Hello, we conducted experiments with gpt-3.5-turbo-instruct and obtained an average score of 0.54 and a success rate of 0.22. This is about half the performance of text-davinci-003, which is consistent with our observation on the WikiHow task set.

We plan to test more recent models in the following weeks. Once the results are ready, we will update them in the repository.