Cannot replicate the results in the paper
tongzhoumu opened this issue · 50 comments
Hi GATA authors,
I am trying to replicate the experiment results presented in your paper. Specifically, I tried GATA-OG with text observation and GATA-GTF with text observation. The setting I used is 20 training games + difficulty level = 1 (actually 3 in your code).
- For GATA-OG with text observation, I ran the experiments three times and got around 0.3-0.3125 eval normalized game points. In the paper, the result is 0.662.
- For GATA-GTF with text observation, I ran the experiments three times and got around 0.25-0.4 eval normalized game points. In the paper, the result is 0.487.
My questions are:
- How many random seeds did you use to get the results in Table 1, and what are those random seeds?
- How should I replicate the results presented in your paper? Following the instructions in the README does not seem to work.
Any responses or suggestions are appreciated. Thank you!
Hi @tongzhoumu, thanks for reporting! We used 123/321/666 as random seeds. Let me rerun the two experiments you mentioned and see what is going on.
@tongzhoumu Just to make sure I understand your experiment settings correctly: in our paper, the 48.7% experiment is pre-trained with SP (Table 6 on page 26) but without text observation. Is that the setting you were referring to?
No, I am referring to GATA-GTF's result on difficulty level 1, which is shown in Table 2 on page 7.
Yeah, I believe that's the same thing. In Table 2, we chose the row (from Table 6) with the highest relative improvement (81.6% in this case). On difficulty 1, that setting got a 48.7% test score. I'll rerun this and try to see what has changed, and let you know.
Thanks for your responses!
After carefully reading your experiment settings, I found that the GATA-GTF experiments I ran are GATA-GTF + N/A + with text observation. This setting should get 92.5 according to Table 6, so the gap is even larger.
For the GATA-OG experiments I ran, the setting is GATA-OG + with text observation.
Yes the gap is quite large so I suspect something is wrong here. To help me figure out the problem, could you let me know how the normalized training scores change in these experiments?
In GATA-GTF + N/A + with text observation, the "train normalized game points" from three runs are 0.513, 0.3325, 0.526, and the "eval normalized game points" are 0.4, 0.35, 0.5875. A figure showing the curves is attached.
In GATA-OG + with text observation, the "train normalized game points" from three runs are 0.398, 0.357, 0.454, and the "eval normalized game points" are 0.3, 0.475, 0.325. A figure showing the curves is attached.
Thanks, these are very helpful! I'll let you know when I find anything.
In the GTF runs where you got 0.3325, 0.874, 0.526 training scores, what are the best validation scores corresponding to those three seeds?
I assume the "0.25-0.4" at the very top of this issue were scores on the test set. Now I'm a bit worried: if a 0.874 training score gives a ~0.4 test score, could the data splits have a mismatch?
Did some quick retrieval of our experiment logs (GATA-GTF + N/A + with text observation):
seed / best train score / best valid score
321 / 0.912 / 0.7625
123 / 0.873 / 0.5875
666 / 0.9675 / 0.8375
Will continue to look into this.
Hi, the results I reported at the very top were not well logged, so I re-ran the experiments. I just updated the comment above (#21 (comment)) to add more numbers.
In addition, the 0.874 in my previous comment was added there by mistake; it is from another experiment. Please refer to the updated comment. Thanks!
I am also running GATA-GTF + N/A + with text observation with the random seeds you provided. However, it will take ~12 hours for me to get the final results. I will update you after I get them.
In addition, I am not sure why the experiments are so slow on my side. I am using a V100, so it should not be too bad.
After some simple inspection, I found the environment FPS is only around 115 on my machine. Is this expected?
Some of the components in TextWorld may be overkill for this specific task. I think @MarcCote is working on a new version, which includes removing some slow dependencies.
Great, thanks! But why are the experiments so fast on your side? Are you using an internal version of TextWorld?
No. On my side, GTF takes 2 days to converge (I'm using a P100, so probably slower than yours).
The scores I sent you were from our logs, not from what I'm running now.
Thanks for the information. It seems the results you provided are higher than what I got, but lower than the number reported in the paper (0.925). I am looking forward to any further details. Thank you so much!
Yeah, in the paper we reported the test scores obtained by the seed that performed best on the validation set (666 in the above case).
I see. Then what is the command to evaluate the checkpoint on the test set instead of the validation set? Thanks!
I will update the script in a later version (probably after fixing the bug you're facing), but you can make a new eval env like this, with valid_or_test="test". After loading the pre-trained agent, use that eval env by doing this, which will go over all test games once and report the scores.
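For reference, a rough sketch of those two steps (the helper names match the ones used elsewhere in this repo; the checkpoint path and difficulty level below are placeholders):

import generic
import evaluate
import reinforcement_learning_dataset
from agent import Agent

config = generic.load_config()
agent = Agent(config)

# 1) build an eval env over the *test* games instead of the validation ones
eval_env, num_eval_game = reinforcement_learning_dataset.get_evaluation_game_env(
    config['rl']['data_path'],
    3,                                   # difficulty level (placeholder)
    agent.select_additional_infos(),
    agent.eval_max_nb_steps_per_episode,
    agent.eval_batch_size,
    valid_or_test="test")

# 2) load the pre-trained agent and run one pass over all test games
agent.load_pretrained_model("path/to/checkpoint.pt", load_partial_graph=False)
print(evaluate.evaluate(eval_env, agent, num_eval_game))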
After some simple inspection, I found the environment FPS is only around 115 on my machine. Is this expected?
@tongzhoumu how did you benchmark this? Does it also include time spent in the model (i.e. inference)?
@MarcCote No, only environment simulation time. My test code is attached below, adapted from here.
import os
import glob
import time

import gym
import numpy as np
import textworld.gym
from textworld import EnvInfos

GAMES_DIR = '../GATA-public/rl.0.2'

REQUEST_INFOS = EnvInfos()
REQUEST_INFOS.admissible_commands = True
REQUEST_INFOS.description = True
REQUEST_INFOS.location = True
REQUEST_INFOS.facts = True
REQUEST_INFOS.last_action = True
REQUEST_INFOS.game = True


def get_one_env():
    env, _ = get_training_game_env(data_dir=GAMES_DIR,
                                   difficulty_level=3,
                                   training_size=1,
                                   requested_infos=REQUEST_INFOS,
                                   max_episode_steps=50,
                                   batch_size=None)
    return env


def get_training_game_env(data_dir, difficulty_level, training_size, requested_infos, max_episode_steps, batch_size):
    assert difficulty_level in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 99]
    assert training_size in [1, 20, 100]
    # collect the training games
    game_file_names = []
    game_path = data_dir + "/train_" + str(training_size) + "/difficulty_level_" + str(difficulty_level)
    if os.path.isdir(game_path):
        game_file_names += glob.glob(os.path.join(game_path, "*.z8"))
    else:
        game_file_names.append(game_path)
    env_id = textworld.gym.register_games(sorted(game_file_names), request_infos=requested_infos,
                                          max_episode_steps=max_episode_steps, batch_size=batch_size,
                                          name="training", asynchronous=False, auto_reset=False)
    env = gym.make(env_id)
    num_game = len(game_file_names)
    return env, num_game


def test_fps():
    env = get_one_env()
    obs, infos = env.reset()
    n = 3000
    cnt = 0
    st = time.time()
    while cnt < n:
        cnt += 1
        possible_actions = infos['admissible_commands']
        action = np.random.choice(possible_actions)
        obs, score, done, infos = env.step(action)
        if done:
            # re-assign so the next iteration uses fresh infos from the new episode
            obs, infos = env.reset()
    env.close()
    dur = time.time() - st
    print('Dur: {:.2f}'.format(dur))
    print('FPS: {:.2f}'.format(n / dur))


if __name__ == '__main__':
    test_fps()
@xingdi-eric-yuan Hi, I ran the GATA-GTF + N/A + with text observation experiments with the seeds you provided, and I picked the best validation scores from the training logs.
seed / final train score / final valid score / best valid score
123 / 0.339 / 0.3 / 0.325
321 / 0.3645 / 0.3 / 0.375
666 / 0.5185 / 0.5875 / 0.5875
It seems my results differ from yours even when using the same seeds.
Another finding is that the random seed does not seem to fully control the training process. I ran the experiment with seed=123 twice but got different training progress. I set the random seed by modifying this line.
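For reference, the seeding I have in mind is roughly the sketch below (my own illustration, not the repo's exact code); even with all of these set, some CUDA kernels can remain nondeterministic, which may explain part of the run-to-run variance.

import random
import numpy as np
import torch

def seed_everything(seed: int):
    # Seed every RNG the training loop may touch. Note that some CUDA ops stay
    # nondeterministic unless cuDNN is forced into deterministic mode (at a speed cost).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(123)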
Marc and I are looking into this. I'll keep you updated.
Some updates:
The issue may be caused by a change related to the h5py package. A few days ago there was a PR fixing an error HERE by splitting the GloVe vocab using bytes b'\n' instead of '\n'. However, this caused a mismatch between the GloVe vocab and our vocabulary.
A quick pdb session at HERE suggests that words in words_list (GATA vocab) are not in self.word2id (GloVe vocab), which means that in this version we are training the network with randomly initialized word embeddings, and they are kept fixed.
A quick fix is to add the line self.id2word = [item.decode("utf-8") for item in self.id2word] after THIS. I will start an experiment to see if this fixes the reproducibility issue and will let you know.
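To make the fix concrete, here is a self-contained sketch of the decoding step (illustrative only; the names mirror the repo's id2word/word2id but this is not the exact patch):

def decode_vocab(id2word_bytes):
    # h5py returns the GloVe vocab as bytes (e.g. b'apple'), while the GATA
    # vocab is plain str, so membership checks silently fail and the word
    # embeddings stay randomly initialized. Decoding restores the match.
    id2word = [w.decode("utf-8") if isinstance(w, bytes) else w for w in id2word_bytes]
    word2id = {w: i for i, w in enumerate(id2word)}
    return id2word, word2id

# e.g. b'apple' from the embedding file vs 'apple' in the GATA vocab
id2word, word2id = decode_vocab([b"apple", b"banana"])
assert "apple" in word2id  # would fail without the decode step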
@xingdi-eric-yuan Thanks for pointing this out! However, the version I am using was cloned around 20 days ago and does not contain this bug. I used pdb at HERE and found that most entries in in_vocab are true. I guess the performance gap between my experiments and yours is caused by something else. And as I mentioned in earlier comments, I am concerned about why the random seed cannot control the training process.
@tongzhoumu Good to know, let me dig deeper.
Re: the env speed.
If you don't need it, you can set request_infos.description = False to get a decent ~2x speedup. The reason is that when it is True, it is equivalent to issuing a "look" command after every command sent to the game.
Having request_infos.admissible_commands = True also has computational overhead.
EDIT: I forgot to mention that your FPS is comparable to what I got on my machine, so nothing's wrong on your end.
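To make the speedup concrete, a trimmed-down request could look like the sketch below (illustrative; keep only the fields your agent actually consumes):

from textworld import EnvInfos

# Request only what the agent needs: dropping `description` avoids the
# implicit "look" after every command (~2x faster), and `admissible_commands`
# can be turned off too if your agent does not use it.
request_infos = EnvInfos(
    admissible_commands=True,   # keep if the agent scores candidate commands
    description=False,          # ~2x speedup when the text observation is enough
    facts=True,
    last_action=True,
)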
@tongzhoumu what version of textworld are you using exactly?
Thanks for your response! Disabling some of the information does make it run faster. I am using textworld==1.4.3; should I upgrade or downgrade it?
Version 1.3+ is fine, I think. I'm trying to look into the random seed issue.
@tongzhoumu I managed to track down the source of randomness in the different training runs (see #23).
I noticed here a comment that difficulty level 3 in the code = level 1 from the paper. Can this be confirmed, and further, what's the relation then to the 10 levels of difficulty in the code vs. the 5 in the paper?
@MathieuTuli Difficulty levels 3/7/5/9 in the code correspond to levels 1/2/3/4 in the paper.
@tongzhoumu I managed to track down the source of randomness in the different training runs (see #23).
That is great! After the new commit is merged, can we run the experiments with the same random seeds and compare the results from both sides?
@tongzhoumu I need more time to dig deeper. I'm now getting training curves similar to the ones you provided, which are far from what we got last year. I'm still trying different things to see what the differences are. Currently I'm reverting my PyTorch back to 1.4 to see if there's anything interesting.
@xingdi-eric-yuan Thanks! Here is some information that might be helpful for your debugging. I feel the performance gap might come from two aspects:
- Training efficiency. By default, your code trains the agent for 100k episodes, but the agent does not seem to converge to near-perfect performance on the training set within 100k episodes, even when GT graphs are provided (GATA-GTF).
- The gap between the training and validation/test environments. According to your paper, GATA-GTF can achieve 92.5 in test environments, which means the gap between training and test environments is at most 7.5. However, in my experiments the gap is much larger.
The following figures show an agent (GATA-GTF + N/A + with text observation) trained for 500k episodes. The agent can achieve near-optimal performance in the training environments given more time, but the generalization gap is still large.
I hope this information helps you debug, and I am glad to provide more if anything would be useful. Looking forward to your updates!
Thanks for the info, this is very helpful!
I might have figured out the issue (not tested yet, but it's very likely to be the reason)
I checked some of our logs from when we ran these experiments last year: in all RL training, we update the network every 50 steps. However, the 50 steps count all steps performed within a batch. This is actually mentioned in footnote 7 of our paper, page 34.
In our public repo, we use a batch size of 25 HERE; in this case, we should update the network every 2 steps HERE, instead of every 50.
I think this bug aligns well with @tongzhoumu's observation that the agent improves super slowly. With this particular bug, we are updating about 4% as frequently as we did in the NeurIPS paper.
I haven't tested yet, will update you guys later.
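For clarity, the intended update schedule is roughly as follows (a sketch with illustrative variable names, not the repo's exact code):

# Each environment step collects `batch_size` transitions, so to update once
# per 50 *transitions* (as in the paper) with batch_size = 25, the network
# must be updated every 2 environment steps, not every 50.
batch_size = 25
transitions_per_update = 50
steps_per_update = max(1, transitions_per_update // batch_size)  # -> 2

for step in range(1000):
    # ... act in the batched env, push `batch_size` transitions to the buffer ...
    if step % steps_per_update == 0:
        pass  # perform one DQN update here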
@xingdi-eric-yuan Yes, I agree that the DQN update frequency is too low in the current version (I actually noticed this but thought it was intended). I am glad to see the agent can fit the training environments well now! Thank you!
However, as I mentioned in the comment above, the generalization performance is still not as good as what is reported in the paper. In the paper, it is 92.5, but it is hard to exceed 60 in my experiments running your code (and your figure shows similar results). With the bugs fixed, can you reproduce the number reported in the paper now?
I know that the 92.5 is the score on the test environments while the figures and logs show scores on the validation environments. But my assumption is that the difficulties/distributions of the test and validation environments are close; otherwise it would make no sense to use the validation environments to pick the checkpoint.
I need to run this experiment longer (in the paper we mentioned 2 days for GATA-GTF); 20k episodes is only 1/5 of the training. We'll see then. (If it's still 0.6, there might be other issues I'll hunt down.)
@xingdi-eric-yuan Hi, did you get any results?
@tongzhoumu Not yet. I've been distracted by another project; I'm still slowly trying things out and will update you in a few days.
In continuation with replicating results, I didn't see any explicit mention of how to run the TR-DQN/TR-DRQN/TR-DRQN+ baselines. Am I correct that these baselines are obtained by simply toggling enable_graph_input: False and only including text inputs in the training config?
@MathieuTuli That's right. For DRQN/DRQN+, one needs to enable THIS; for DRQN+ (with count-based rewards), please set THIS to 1.0.
@tongzhoumu I managed to run the code once more. Similar to what you observed, the eval score was around 0.6, with peaks of about 0.635. I agree with you that in this setting, where the train/valid/test sets are small, the distribution shift might be non-negligible. A better way to go is to have the game engine generate a new game at every episode (we did some estimation in another paper; in total there can be about 10^40 different games, which we call an unlimited training set), and in that setting even validation and test sets would be less necessary (because you never see the same game twice).
For now, my plan is to fix the few things/bugs we discussed in this issue so people don't suffer from them anymore. I will continue to look into the code to see if there's anything else that behaves strangely (although 0.635 is within the range of the 0.7625/0.5875/0.8375 validation scores we got last year, it's still quite low compared to 0.83).
Thanks again to @tongzhoumu and @MathieuTuli for trying the code out and helping track down the bugs.
@xingdi-eric-yuan Thanks so much for your effort! What makes me a little concerned is that the number reported in Table 6 of the paper for the setting "GATA + N/A + text observation + training on 20 environments" is 92.5, which seems really high compared to what we got from running the code.
I hope you find the issues soon; please keep me updated. Thank you!
@tongzhoumu I have retrieved the "GATA + N/A + text observation + training on 20 env" checkpoint we used in the NeurIPS paper, and I've uploaded it HERE.
Running the testing script below (__test.py) with:
python __test.py configs/train_gata_gtf_rl.yaml
from agent import Agent
import generic
import evaluate
import reinforcement_learning_dataset


def test():
    config = generic.load_config()
    agent = Agent(config)
    output_dir = "."
    data_dir = "."

    # make game environments
    requested_infos = agent.select_additional_infos()
    games_dir = "./"
    eval_env, num_eval_game = reinforcement_learning_dataset.get_evaluation_game_env(
        games_dir + config['rl']['data_path'],
        3,  # difficulty level
        requested_infos,
        agent.eval_max_nb_steps_per_episode,
        agent.eval_batch_size,
        valid_or_test="test")

    agent.load_pretrained_model(data_dir + "/gtf_text_lv1.pt.pt", load_partial_graph=False)
    eval_game_points, eval_game_points_normalized, eval_game_step, _, detailed_scores = evaluate.evaluate(
        eval_env, agent, num_eval_game)
    print(eval_game_points, eval_game_points_normalized, eval_game_step)


if __name__ == '__main__':
    test()
I got the below scores:
EVAL: rewards: 3.700 | normalized reward: 0.925 | used steps: 7.700
game name: 3ZMbUk7oiJW3sav6iQ0Yd, reward: 4.000, normalized reward: 1.000, steps: 10.000
game name: ZW9Eu55OC3MbHvn1hD0XJ, reward: 4.000, normalized reward: 1.000, steps: 6.000
game name: 2OyWcRqouyb5HLbaHLOJ, reward: 4.000, normalized reward: 1.000, steps: 18.000
game name: pE1dCOXKi7B7Fq0YSOPD, reward: 4.000, normalized reward: 1.000, steps: 6.000
game name: a6EdHb3bimbjuqaaIrag, reward: 4.000, normalized reward: 1.000, steps: 18.000
game name: DK38HLJVtRRyi3d7t2DXZ, reward: 4.000, normalized reward: 1.000, steps: 6.000
game name: BGbRCRVDsM3Lt71mho5N6, reward: 4.000, normalized reward: 1.000, steps: 8.000
game name: 0nQyHWbvh6dXFPmhLKX, reward: 4.000, normalized reward: 1.000, steps: 6.000
game name: PB3dhaRmhbZ5C6ERCeED, reward: 4.000, normalized reward: 1.000, steps: 8.000
game name: r69eHo3vFnk6srk2soaB, reward: 4.000, normalized reward: 1.000, steps: 10.000
game name: qnxpTadcKEKHqyeixvK, reward: 1.000, normalized reward: 0.250, steps: 4.000
game name: Y8oktyOLCkBLTglqh7pJN, reward: 4.000, normalized reward: 1.000, steps: 5.000
game name: QZ7WiY3bF36Jt7jbiKog, reward: 4.000, normalized reward: 1.000, steps: 7.000
game name: QY9MSo7dhR0YhX73FNOE3, reward: 4.000, normalized reward: 1.000, steps: 8.000
game name: O7yoUjMDtn8pSQK8IYYr, reward: 1.000, normalized reward: 0.250, steps: 5.000
game name: 0vYxsEKdIW0VIMx2SJnO, reward: 4.000, normalized reward: 1.000, steps: 8.000
game name: rNyeUk5rU5Mbcbk1cdvnV, reward: 4.000, normalized reward: 1.000, steps: 5.000
game name: ZO5gUy8LC2ejHRO9irkB, reward: 4.000, normalized reward: 1.000, steps: 5.000
game name: E0qyI5kvcjbjUBj2SOqj, reward: 4.000, normalized reward: 1.000, steps: 6.000
game name: adleiR2mfoMJCKr9CYgg, reward: 4.000, normalized reward: 1.000, steps: 5.000
It is clear that, as you mentioned, the train/valid/test sets are too small, so the distribution shift may be a non-negligible problem, but it's also nice to re-confirm that the testing scores were not obtained by a bug. We plan to provide the full set of experiment checkpoints we used to produce the testing scores at NeurIPS in the near future.
@xingdi-eric-yuan I tried the provided checkpoint and got the same result as you. But how to use the current code to train such a checkpoint is still unclear to me; it seems a little bit of luck is needed.
I am glad to see you will release the checkpoints in the near future. Thank you so much, and looking forward to it!
@tongzhoumu thank you for reporting this issue. I'll close it for now and let you know when we release the checkpoints.