Cannot replicate the results in the paper
tongzhoumu opened this issue · 50 comments
Hi GATA authors,
I am trying to replicate the experiment results presented in your paper. Specifically, I tried GATA-OG with text observation and GATA-GTF with text observation. The setting I used is 20 training games + difficulty level = 1 (actually 3 in your code).
- For GATA-OG with text observation, I ran the experiments three times and got around 0.3-0.3125 eval normalized game points. In the paper, the result is 0.662.
- For GATA-GTF with text observation, I ran the experiments three times and got around 0.25-0.4 eval normalized game points. In the paper, the result is 0.487.
My questions are:
- How many random seeds did you use to get the results in Table 1, and what are those random seeds?
- How should I replicate the results presented in your paper? Following the instructions in the README does not seem to work.
Any responses or suggestions are appreciated. Thank you!
Hi @tongzhoumu, thanks for reporting! We used 123/321/666 as random seeds. Let me rerun the two experiments you mentioned and see what is going on.
@tongzhoumu Just to make sure I understand your experiment settings correctly: in our paper, the 48.7% experiment is pre-trained with SP (Table 6 on page 26) but without text observation. Is that the setting you were referring to?
No, I am referring to GATA-GTF's result on difficulty level 1, which is shown in Table 2 on page 7.
Yeah, I believe that's the same thing. In Table 2, we chose the row (from Table 6) with the highest relative improvement (81.6% in this case). On difficulty 1, that setting got a 48.7% test score. I'll rerun this and try to see what has changed, and let you know.
Thanks for your responses!
After carefully reading your experiment settings, I found that the GATA-GTF experiments I ran are GATA-GTF + N/A + with text observation. This setting should get 92.5 according to Table 6, so the gap is even larger.
For the GATA-OG experiments I ran, the setting is GATA-OG + with text observation.
Yes the gap is quite large so I suspect something is wrong here. To help me figure out the problem, could you let me know how the normalized training scores change in these experiments?
In GATA-GTF + N/A + with text observation, the "train normalized game points" from three runs are 0.513, 0.3325, 0.526, and the "eval normalized game points" are 0.4, 0.35, 0.5875. A figure showing the curves is attached.
In GATA-OG + with text observation, the "train normalized game points" from three runs are 0.398, 0.357, 0.454, and the "eval normalized game points" are 0.3, 0.475, 0.325. A figure showing the curves is attached.
Thanks, these are very helpful! I'll let you know when I find anything.
In the GTF runs where you got 0.3325, 0.874, 0.526 training scores, what are the best validation scores corresponding to those three seeds?
I assume the "0.25-0.4" at the very top of this issue were scores on the test set. Now I'm a bit worried: if a 0.874 training score gives a ~0.4 test score, could the data splits have a mismatch?
Did some quick retrieval of our experiment logs (GATA-GTF + N/A + with text observation):
seed / best train score / best valid score
321 / 0.912 / 0.7625
123 / 0.873 / 0.5875
666 / 0.9675 / 0.8375
Will continue to look into this.
Hi, the results I reported at the very top were not well logged, so I re-ran the experiments. I just updated the comment above (#21 (comment)) to add more numbers.
In addition, the 0.874 in my previous comment was added there by mistake; it is from another experiment. Please refer to the updated comment. Thanks!
I am also running GATA-GTF + N/A + with text observation with the random seeds you provided. However, it will take ~12 hours for me to get the final results. I will update you after I get them.
In addition, I am not sure why the experiments are so slow on my side. I am using a V100, so it should not be too bad.
After some simple inspection, I found the environment FPS is only around 115 on my machine. Is this expected?
Some of the components in TextWorld may be overkill for this specific task. I think @MarcCote is working on a new version, which includes removing some slow dependencies.
Great, thanks! But why are the experiments so fast on your side? Are you using an internal version of TextWorld?
No. On my side, GTF takes 2 days to converge (I'm using a P100, so probably slower than yours).
The scores I sent you were from our logs, not from what I'm running now.
Thanks for the information. It seems the results you provided are higher than what I got, but lower than the number reported in the paper (0.925). I am looking forward to any further details. Thank you so much!
Yeah, in the paper we reported the test scores obtained by the seed that performed best on the validation set (666 in the above case).
I see. Then what is the command to evaluate the checkpoint on the test set instead of the validation set? Thanks!
I will update the script in a later version (probably after fixing the bug you're facing), but you can make a new eval env like this, with valid_or_test="test". After loading the pre-trained agent, use that eval env by doing this, which will go over all test games once and report the scores.
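For reference, a rough sketch of those two steps (the helper names match the ones used elsewhere in this repo; the checkpoint path and difficulty level below are placeholders):

import generic
import evaluate
import reinforcement_learning_dataset
from agent import Agent

config = generic.load_config()
agent = Agent(config)

# 1) build an eval env over the *test* games instead of the validation ones
eval_env, num_eval_game = reinforcement_learning_dataset.get_evaluation_game_env(
    config['rl']['data_path'],
    3,                                   # difficulty level (placeholder)
    agent.select_additional_infos(),
    agent.eval_max_nb_steps_per_episode,
    agent.eval_batch_size,
    valid_or_test="test")

# 2) load the pre-trained agent and run one pass over all test games
agent.load_pretrained_model("path/to/checkpoint.pt", load_partial_graph=False)
print(evaluate.evaluate(eval_env, agent, num_eval_game))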
After some simple inspection, I found the environment FPS is only around 115 on my machine. Is this expected?
@tongzhoumu how did you benchmark this? Does it also include time spent in the model (i.e. inference)?
@MarcCote No, only environment simulation time. My test code is attached below, adapted from here.
import os
import glob
import time

import gym
import numpy as np
import textworld.gym
from textworld import EnvInfos

GAMES_DIR = '../GATA-public/rl.0.2'

REQUEST_INFOS = EnvInfos()
REQUEST_INFOS.admissible_commands = True
REQUEST_INFOS.description = True
REQUEST_INFOS.location = True
REQUEST_INFOS.facts = True
REQUEST_INFOS.last_action = True
REQUEST_INFOS.game = True


def get_one_env():
    env, _ = get_training_game_env(data_dir=GAMES_DIR,
                                   difficulty_level=3,
                                   training_size=1,
                                   requested_infos=REQUEST_INFOS,
                                   max_episode_steps=50,
                                   batch_size=None)
    return env


def get_training_game_env(data_dir, difficulty_level, training_size, requested_infos, max_episode_steps, batch_size):
    assert difficulty_level in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 99]
    assert training_size in [1, 20, 100]
    # collect the training games
    game_file_names = []
    game_path = data_dir + "/train_" + str(training_size) + "/difficulty_level_" + str(difficulty_level)
    if os.path.isdir(game_path):
        game_file_names += glob.glob(os.path.join(game_path, "*.z8"))
    else:
        game_file_names.append(game_path)
    env_id = textworld.gym.register_games(sorted(game_file_names), request_infos=requested_infos,
                                          max_episode_steps=max_episode_steps, batch_size=batch_size,
                                          name="training", asynchronous=False, auto_reset=False)
    env = gym.make(env_id)
    num_game = len(game_file_names)
    return env, num_game


def test_fps():
    env = get_one_env()
    obs, infos = env.reset()
    n = 3000
    cnt = 0
    st = time.time()
    while cnt < n:
        cnt += 1
        possible_actions = infos['admissible_commands']
        action = np.random.choice(possible_actions)
        obs, score, done, infos = env.step(action)
        if done:
            # re-assign so the next iteration uses fresh infos from the new episode
            obs, infos = env.reset()
    env.close()
    dur = time.time() - st
    print('Dur: {:.2f}'.format(dur))
    print('FPS: {:.2f}'.format(n / dur))


if __name__ == '__main__':
    test_fps()
@xingdi-eric-yuan Hi, I ran the GATA-GTF + N/A + with text observation experiments with the seeds you provided, and I picked the best validation scores from the training logs.
seed / final train score / final valid score / best valid score
123 / 0.339 / 0.3 / 0.325
321 / 0.3645 / 0.3 / 0.375
666 / 0.5185 / 0.5875 / 0.5875
It seems my results differ from yours even when using the same seeds.
Another finding is that the random seed does not seem to fully control the training process. I ran the experiment with seed=123 twice but got different training progress. I set the random seed by modifying this line.
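For reference, the seeding I have in mind is roughly the sketch below (my own illustration, not the repo's exact code); even with all of these set, some CUDA kernels can remain nondeterministic, which may explain part of the run-to-run variance.

import random
import numpy as np
import torch

def seed_everything(seed: int):
    # Seed every RNG the training loop may touch. Note that some CUDA ops stay
    # nondeterministic unless cuDNN is forced into deterministic mode (at a speed cost).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(123)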
Marc and I are looking into this. I'll keep you updated.
Some updates:
The issue may be caused by a change related to the h5py package. A few days ago there was a PR fixing an error HERE by splitting the GloVe vocab using bytes b'\n' instead of '\n'. However, this caused a mismatch between the GloVe vocab and our vocabulary.
A quick pdb session at HERE suggests that words in words_list (GATA vocab) are not in self.word2id (GloVe vocab), which means that in this version we are training the network with randomly initialized word embeddings, and they are kept fixed.
A quick fix is to add the line self.id2word = [item.decode("utf-8") for item in self.id2word] after THIS. I will start an experiment to see if this fixes the reproducibility issue and will let you know.
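To make the fix concrete, here is a self-contained sketch of the decoding step (illustrative only; the names mirror the repo's id2word/word2id but this is not the exact patch):

def decode_vocab(id2word_bytes):
    # h5py returns the GloVe vocab as bytes (e.g. b'apple'), while the GATA
    # vocab is plain str, so membership checks silently fail and the word
    # embeddings stay randomly initialized. Decoding restores the match.
    id2word = [w.decode("utf-8") if isinstance(w, bytes) else w for w in id2word_bytes]
    word2id = {w: i for i, w in enumerate(id2word)}
    return id2word, word2id

# e.g. b'apple' from the embedding file vs 'apple' in the GATA vocab
id2word, word2id = decode_vocab([b"apple", b"banana"])
assert "apple" in word2id  # would fail without the decode step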
@xingdi-eric-yuan Thanks for pointing this out! However, the version I am using was cloned around 20 days ago and does not contain this bug. I used pdb at HERE and found that most entries in in_vocab are true. I guess the performance gap between my experiments and yours is caused by something else. And as I mentioned in earlier comments, I am concerned about why the random seed cannot control the training process.
@tongzhoumu Good to know, let me dig deeper.
Re: the env speed.
If you don't need it, you can set request_infos.description = False to get a decent ~2x speedup. The reason is that when it is True, it is equivalent to issuing a "look" command after every command sent to the game.
Having request_infos.admissible_commands = True also has computational overhead.
EDIT: I forgot to mention that your FPS is comparable to what I got on my machine, so nothing's wrong on your end.
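To make the speedup concrete, a trimmed-down request could look like the sketch below (illustrative; keep only the fields your agent actually consumes):

from textworld import EnvInfos

# Request only what the agent needs: dropping `description` avoids the
# implicit "look" after every command (~2x faster), and `admissible_commands`
# can be turned off too if your agent does not use it.
request_infos = EnvInfos(
    admissible_commands=True,   # keep if the agent scores candidate commands
    description=False,          # ~2x speedup when the text observation is enough
    facts=True,
    last_action=True,
)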
@tongzhoumu what version of textworld are you using exactly?
Thanks for your response! Disabling some of the information does make it run faster. I am using textworld==1.4.3; should I upgrade or downgrade it?
Version 1.3+ is fine, I think. I'm trying to look into the random seed issue.
@tongzhoumu I managed to track down the source of randomness in the different training runs (see #23).
I noticed here a comment that difficulty level 3 in the code = level 1 from the paper. Can this be confirmed, and further, what's the relation then to the 10 levels of difficulty in the code vs. the 5 in the paper?
@MathieuTuli Difficulty levels 3/7/5/9 in the code correspond to levels 1/2/3/4 in the paper.
@tongzhoumu I managed to track down the source of randomness in the different training runs (see #23).
That is great! After the new commit is merged, can we run the experiments with the same random seeds and compare the results from both sides?
@tongzhoumu I need more time to dig deeper. I'm now getting training curves similar to the ones you provided, which are far from what we got last year. I'm still trying different things to see what the differences are. Currently I'm reverting my PyTorch back to 1.4 to see if there's anything interesting.
@xingdi-eric-yuan Thanks! Here is some information that might be helpful for your debugging. I feel the performance gap might come from two aspects:
- Training efficiency. By default, your code trains the agent for 100k episodes, but the agent does not seem to converge to near-perfect performance on the training set within 100k episodes, even when GT graphs are provided (GATA-GTF).
- The gap between the training and validation/test environments. According to your paper, GATA-GTF can achieve 92.5 in test environments, which means the gap between training and test environments is at most 7.5. However, in my experiments the gap is much larger.
The following figures show an agent (GATA-GTF + N/A + with text observation) trained for 500k episodes. The agent can achieve near-optimal performance in the training environments given more time, but the generalization gap is still large.
I hope this information helps you debug, and I am glad to provide more if anything would be useful. Looking forward to your updates!
Thanks for the info, this is very helpful!
I might have figured out the issue (not tested yet, but it's very likely to be the reason)
I checked some of our logs from when we ran these experiments last year: in all RL training, we update the network every 50 steps. However, the 50 steps count all steps performed within a batch. This is actually mentioned in footnote 7 of our paper, page 34.
In our public repo, we use a batch size of 25 HERE; in this case, we should update the network every 2 steps HERE, instead of every 50.
I think this bug aligns well with @tongzhoumu's observation that the agent improves super slowly. With this particular bug, we are updating about 4% as frequently as we did in the NeurIPS paper.
I haven't tested yet, will update you guys later.
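For clarity, the intended update schedule is roughly as follows (a sketch with illustrative variable names, not the repo's exact code):

# Each environment step collects `batch_size` transitions, so to update once
# per 50 *transitions* (as in the paper) with batch_size = 25, the network
# must be updated every 2 environment steps, not every 50.
batch_size = 25
transitions_per_update = 50
steps_per_update = max(1, transitions_per_update // batch_size)  # -> 2

for step in range(1000):
    # ... act in the batched env, push `batch_size` transitions to the buffer ...
    if step % steps_per_update == 0:
        pass  # perform one DQN update here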
@xingdi-eric-yuan Yes, I agree that the DQN update frequency is too low in the current version (I actually noticed this but thought it was intended). I am glad to see the agent can fit the training environments well now! Thank you!
However, as I mentioned in the comment above, the generalization performance is still not as good as what is reported in the paper. In the paper, it is 92.5, but it is hard to exceed 60 in my experiments running your code (and your figure shows similar results). With the bugs fixed, can you reproduce the number reported in the paper now?
I know that the 92.5 is the score on the test environments while the figures and logs show scores on the validation environments. But my assumption is that the difficulties/distributions of the test and validation environments are close; otherwise it would make no sense to use the validation environments to pick the checkpoint.
I need to run this experiment longer (in the paper we mentioned 2 days for GATA-GTF); 20k episodes is only 1/5 of the training. We'll see then. (If it's still 0.6, there might be other issues I'll hunt down.)
@xingdi-eric-yuan Hi, did you get any results?
@tongzhoumu Not yet. I've been distracted by another project; I'm still slowly trying things out and will update you in a few days.
In continuation with replicating results, I didn't see any explicit mention of how to run the TR-DQN/TR-DRQN/TR-DRQN+ baselines. Am I correct that these baselines are obtained by simply toggling enable_graph_input: False and only including text inputs in the training config?
@MathieuTuli That's right. For DRQN/DRQN+, one needs to enable THIS; for DRQN+ (with count-based rewards), please set THIS to 1.0.
@tongzhoumu I managed to run the code once more. Similar to what you observed, the eval score was around 0.6, with peaks of about 0.635. I agree with you that in this setting, where the train/valid/test sets are small, the distribution shift might be non-negligible. A better way to go is to have the game engine generate a new game at every episode (we did some estimation in another paper; in total there can be about 10^40 different games, which we call an unlimited training set), and in that setting even validation and test sets would be less necessary (because you never see the same game twice).
For now, my plan is to fix the few things/bugs we discussed in this issue so people don't suffer from them anymore. I will continue to look into the code to see if there's anything else that behaves strangely (although 0.635 is within the range of the 0.7625/0.5875/0.8375 validation scores we got last year, it's still quite low compared to 0.83).
Thanks again to @tongzhoumu and @MathieuTuli for trying the code out and helping track down the bugs.
@xingdi-eric-yuan Thanks so much for your effort! What makes me a little concerned is that the number reported in Table 6 of the paper for the setting "GATA + N/A + text observation + training on 20 environments" is 92.5, which seems really high compared to what we got from running the code.
I hope you find the issues soon; please keep me updated. Thank you!
@tongzhoumu I have retrieved the "GATA + N/A + text observation + training on 20 env" checkpoint we used in the NeurIPS paper, and I've uploaded it HERE.
Running the testing script below (__test.py) with:
python __test.py configs/train_gata_gtf_rl.yaml
from agent import Agent
import generic
import evaluate
import reinforcement_learning_dataset


def test():
    config = generic.load_config()
    agent = Agent(config)
    output_dir = "."
    data_dir = "."

    # make game environments
    requested_infos = agent.select_additional_infos()
    games_dir = "./"
    eval_env, num_eval_game = reinforcement_learning_dataset.get_evaluation_game_env(
        games_dir + config['rl']['data_path'],
        3,  # difficulty level
        requested_infos,
        agent.eval_max_nb_steps_per_episode,
        agent.eval_batch_size,
        valid_or_test="test")

    agent.load_pretrained_model(data_dir + "/gtf_text_lv1.pt.pt", load_partial_graph=False)
    eval_game_points, eval_game_points_normalized, eval_game_step, _, detailed_scores = evaluate.evaluate(
        eval_env, agent, num_eval_game)
    print(eval_game_points, eval_game_points_normalized, eval_game_step)


if __name__ == '__main__':
    test()
I got the below scores:
EVAL: rewards: 3.700 | normalized reward: 0.925 | used steps: 7.700
game name: 3ZMbUk7oiJW3sav6iQ0Yd, reward: 4.000, normalized reward: 1.000, steps: 10.000
game name: ZW9Eu55OC3MbHvn1hD0XJ, reward: 4.000, normalized reward: 1.000, steps: 6.000
game name: 2OyWcRqouyb5HLbaHLOJ, reward: 4.000, normalized reward: 1.000, steps: 18.000
game name: pE1dCOXKi7B7Fq0YSOPD, reward: 4.000, normalized reward: 1.000, steps: 6.000
game name: a6EdHb3bimbjuqaaIrag, reward: 4.000, normalized reward: 1.000, steps: 18.000
game name: DK38HLJVtRRyi3d7t2DXZ, reward: 4.000, normalized reward: 1.000, steps: 6.000
game name: BGbRCRVDsM3Lt71mho5N6, reward: 4.000, normalized reward: 1.000, steps: 8.000
game name: 0nQyHWbvh6dXFPmhLKX, reward: 4.000, normalized reward: 1.000, steps: 6.000
game name: PB3dhaRmhbZ5C6ERCeED, reward: 4.000, normalized reward: 1.000, steps: 8.000
game name: r69eHo3vFnk6srk2soaB, reward: 4.000, normalized reward: 1.000, steps: 10.000
game name: qnxpTadcKEKHqyeixvK, reward: 1.000, normalized reward: 0.250, steps: 4.000
game name: Y8oktyOLCkBLTglqh7pJN, reward: 4.000, normalized reward: 1.000, steps: 5.000
game name: QZ7WiY3bF36Jt7jbiKog, reward: 4.000, normalized reward: 1.000, steps: 7.000
game name: QY9MSo7dhR0YhX73FNOE3, reward: 4.000, normalized reward: 1.000, steps: 8.000
game name: O7yoUjMDtn8pSQK8IYYr, reward: 1.000, normalized reward: 0.250, steps: 5.000
game name: 0vYxsEKdIW0VIMx2SJnO, reward: 4.000, normalized reward: 1.000, steps: 8.000
game name: rNyeUk5rU5Mbcbk1cdvnV, reward: 4.000, normalized reward: 1.000, steps: 5.000
game name: ZO5gUy8LC2ejHRO9irkB, reward: 4.000, normalized reward: 1.000, steps: 5.000
game name: E0qyI5kvcjbjUBj2SOqj, reward: 4.000, normalized reward: 1.000, steps: 6.000
game name: adleiR2mfoMJCKr9CYgg, reward: 4.000, normalized reward: 1.000, steps: 5.000
It is clear that, as you mentioned, the train/valid/test sets are too small, so the distribution shift may be a non-negligible problem, but it's also nice to re-confirm that the testing scores were not obtained by a bug. We plan to provide the full set of experiment checkpoints we used to produce the testing scores at NeurIPS in the near future.
@xingdi-eric-yuan I tried the provided checkpoint and got the same result as you. But how to use the current code to train such a checkpoint is still unclear to me; it seems a little bit of luck is needed.
I am glad to see you will release the checkpoints in the near future. Thank you so much, and looking forward to it!
@tongzhoumu thank you for reporting this issue. I'll close it for now and let you know when we release the checkpoints.