rpryzant/delete_retrieve_generate

Training issue with another dataset

zhendong3wang opened this issue · 7 comments

Hi Reid,

I'm currently trying to train the 'delete' model on another dataset - I have manually collected 6000 tweets and 4000 news texts for informal/formal text style transfer.

Firstly I followed the steps for data preparation:

python tools/make_vocab.py [entire corpus file (src + tgt cat'd)] [vocab size] > vocab.txt
python tools/make_attribute_vocab.py vocab.txt [corpus src file] [corpus tgt file] [salience ratio] > attribute_vocab.txt
python tools/make_ngram_attribute_vocab.py vocab.txt [corpus src file] [corpus tgt file] [salience ratio] > attribute_vocab.txt

(I'm not sure how to choose a good 'salience ratio', but I saw you used '15' in the example file ngram.15.attribute, so I adopted '15' in my test. Maybe you have a better suggestion?)

Then I simply modified the data section in the config file as follows:

...
"data": {
    "src": "data/sports_text/train.tweets",
    "tgt": "data/sports_text/train.news",
    "src_test": "data/sports_text/test.tweets",
    "tgt_test": "data/sports_text/test.news",
    "src_vocab": "data/sports_text/vocab.txt",
    "tgt_vocab": "data/sports_text/vocab.txt",
    "share_vocab": true,
    "attribute_vocab": "data/sports_text/ngram_attribute_vocab15.txt",
    "ngram_attributes": true,
    "batch_size": 256,
    "max_len": 50,
    "working_dir": "working_dir2"
  },
...

After that, I ran the training script and encountered the following error:

Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.7/3.7.9/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python@3.7/3.7.9/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/zhwa/.vscode/extensions/ms-python.python-2020.9.114305/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/Users/zhwa/.vscode/extensions/ms-python.python-2020.9.114305/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/Users/zhwa/.vscode/extensions/ms-python.python-2020.9.114305/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 267, in run_file
    runpy.run_path(options.target, run_name=compat.force_str("__main__"))
  File "/usr/local/Cellar/python@3.7/3.7.9/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/usr/local/Cellar/python@3.7/3.7.9/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/local/Cellar/python@3.7/3.7.9/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/zhwa/Projects/delete_retrieve_generate/train.py", line 209, in <module>
    model, src_test, tgt_test, config)
  File "/Users/zhwa/Projects/delete_retrieve_generate/src/evaluation.py", line 183, in evaluate_lpp
    is_test=True)
  File "/Users/zhwa/Projects/delete_retrieve_generate/src/data.py", line 293, in minibatch
    out_dataset['data'], out_dataset['tok2id'], idx, batch_size, max_len, idx=inputs[-1])
  File "/Users/zhwa/Projects/delete_retrieve_generate/src/data.py", line 261, in get_minibatch
    lens = [lens[j] for j in idx]
  File "/Users/zhwa/Projects/delete_retrieve_generate/src/data.py", line 261, in <listcomp>
    lens = [lens[j] for j in idx]
IndexError: list index out of range

I ran it in debug mode and found that j was 217 while lens only contained 198 items in minibatch(). But after several tries I couldn't figure out why this happened.

It feels like something is wrong with my config settings for "batch_size": 256 or "max_len": 50, but I'm not sure. Could you provide some insight into how to fix this issue?

Thanks in advance,
//Zhendong

Thanks for reaching out!

That's the salience ratio they used in the paper, and it's a good place to start. Intuitively, that ratio says how strongly associated with each class you want the attribute ngrams to be. Higher numbers mean that the attribute vocab will be more strongly associated with each class, but also that you will have fewer vocab items because the threshold is tighter.
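For intuition, here is a rough sketch of how the salience score is defined in the paper (the variable names and the smoothing constant lam are my own, and tools/make_attribute_vocab.py may differ in the details):

    from collections import Counter

    def build_attribute_vocab(src_sents, tgt_sents, salience_ratio=15.0, lam=1.0):
        """Keep tokens whose smoothed count ratio between the two corpora
        exceeds the salience threshold in either direction."""
        src_counts = Counter(w for s in src_sents for w in s.split())
        tgt_counts = Counter(w for s in tgt_sents for w in s.split())
        attribute_vocab = set()
        for w in set(src_counts) | set(tgt_counts):
            src_salience = (src_counts[w] + lam) / (tgt_counts[w] + lam)
            tgt_salience = (tgt_counts[w] + lam) / (src_counts[w] + lam)
            if max(src_salience, tgt_salience) > salience_ratio:
                attribute_vocab.add(w)
        return attribute_vocab

So with a ratio of 15, a token stays in the attribute vocab only if it is roughly 15x more frequent in one corpus than the other, which is why a higher ratio gives you a smaller but more class-specific vocab.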

About your error, it's hard to tell exactly what's going on, but it looks like the system is trying to grab examples past the edge of your data. How large is each of your datasets? Did you try a different batch size? Does this happen right away, or after it's been working for a little while?

Thank you for your swift reply, and for the good explanation of the salience ratio! It helped me understand it better.

The files are not that large, roughly 100kb~400kb each (6000 tweets and 4000 news sentences for training, and ~1000 sentences each for testing). The error happened in the evaluation step right after training the first epoch (I assume that counts as 'right away'). No, I haven't tried a different batch size yet, and I'm not sure whether that would be the problem. I will give it a try now and see if it helps fix the issue.

//Z

Hi, I have just tried different batch sizes (80, 128, 512 and 1024)... but still no luck: it keeps failing on the line lens = [lens[j] for j in idx], called from outputs = get_minibatch(out_dataset['data'], out_dataset['tok2id'], idx, batch_size, max_len, idx=inputs[-1]) in the method evaluate_lpp(model, src, tgt, config).

It seems that with every batch size there is always a j in idx that is out of range for the list lens... I'm confused now, and not sure whether it's a potential bug in the code or something wrong with my data or configuration.

Try cutting it down so that # tweets = # news articles? I think this might be a known issue where the datasets have to be the same size. If that doesn't help, feel free to email me (email on my website) and I will help debug :)
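For example, something like the following would trim both test files down to the same number of lines (paths copied from your config above; adjust to your own setup):

    # Even out the two test sets by truncating the larger one.
    src_path = "data/sports_text/test.tweets"
    tgt_path = "data/sports_text/test.news"

    with open(src_path) as f:
        src_lines = f.readlines()
    with open(tgt_path) as f:
        tgt_lines = f.readlines()

    n = min(len(src_lines), len(tgt_lines))
    with open(src_path, "w") as f:
        f.writelines(src_lines[:n])
    with open(tgt_path, "w") as f:
        f.writelines(tgt_lines[:n])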

Thanks for the information Reid!

I just cut down the size of the tweets test data, so now # tweets and # news are equal. And it worked! I ran a quick local test and it proceeded smoothly to the second epoch (at least with no complaints in the first evaluation). Just thinking out loud, it seems the issue comes from here: the idx was carried over from the # news side and couldn't fit the # tweets side:

        inputs = get_minibatch(
            in_dataset['content'], in_dataset['tok2id'], idx, batch_size, max_len, sort=True)
        outputs = get_minibatch(
            out_dataset['data'], out_dataset['tok2id'], idx, batch_size, max_len, idx=inputs[-1])
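If I read the snippet right, get_minibatch sorts the first batch by length and returns those sort indices as inputs[-1]; reusing them to index a shorter batch then walks off the end. A minimal sketch of what I think is happening (not the actual data.py code, just an illustration):

    # The sort order is computed on the larger batch...
    src_lens = list(range(220))   # e.g. 220 examples on one side
    tgt_lens = list(range(198))   # only 198 examples on the other side

    sort_idx = sorted(range(len(src_lens)), key=lambda j: src_lens[j], reverse=True)

    # ...and then reused to reorder the smaller batch.
    reordered = [tgt_lens[j] for j in sort_idx]   # IndexError: list index out of range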

But I guess you already know about this issue, as you mentioned. Anyway, I will try to run the whole training process on the server with a GPU tomorrow to see whether the program is still happy there. I will get back to this thread with an update (or email you more details if the same bug occurs again). Have a nice day :)

//Z

Hi, just wanted to follow up on the thread - the training process worked smoothly on the server after equalizing the sizes of the two test datasets. (Though the results were not that good. I guess it has something to do with model convergence, and I will spend more time to see whether some tweaking of the method would help.)

Thanks again for your implementation and for the helpful explanations, really appreciated 👍

//Zhendong

No problem!