hemingkx/ChineseNMT

No valid references for a sentence

wwong31 opened this issue · 8 comments

Thank you very much for your code.

I have downloaded your code and executed get_corpus.py and tokenize.py successfully.
But when I tried to execute main.py, I got the following error:
EOFError: No valid references for a sentence!
Do you know why I am getting this error?

Also, why do you set the res variable as a list?
In train.py:
res = [res]

Any help you can provide, in Chinese or English, will be greatly appreciated. Thank you.

Hi! Sorry for the late reply!
I re-downloaded the code and ran it on my PC, but I did not get the error you mentioned. Could you paste your detailed training log (including the error information) here? You may also want to check that the paths to tokenize.py and get_corpus.py are correct.

In train.py, res = [res] is set to satisfy sacrebleu's input format.

Hehe, I hope your project goes well~

Hello! Thank you for re-downloading your code and re-running it on your PC. I thought the issue might be with my packages, so I updated Torch and sacrebleu to:
Torch: 1.8.1+cu102
sacrebleu: 1.5.1

I re-ran tokenize.py and get_corpus.py successfully, with one small change: I added encoding="utf-8" to lines 16 and 19 of get_corpus.py:
with open(ch_path, "w", encoding="utf-8") as fch:
with open(en_path, "w", encoding="utf-8") as fen:
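The encoding argument matters because, when it is omitted, open() falls back to the platform's preferred encoding, which on Windows is often a legacy code page that cannot represent Chinese characters. A minimal sketch of an explicit UTF-8 round trip, using a temporary file (the helper name and the sample sentence are illustrative, not taken from the repo):

```python
import locale
import os
import tempfile


def roundtrip_utf8(text):
    """Write text to a temporary file as UTF-8 and read it back."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8") as fout:
            fout.write(text)
        with open(path, "r", encoding="utf-8") as fin:
            return fin.read()
    finally:
        os.remove(path)


# Encoding that open() uses when none is given (platform dependent).
print("default encoding:", locale.getpreferredencoding(False))

sample = "狗咬了人。"  # illustrative sentence, not from the dataset
assert roundtrip_utf8(sample) == sample
```

Passing the encoding explicitly on both the write and the read side makes the behaviour the same on Linux and Windows.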

But I still get an error when I run the code. I traced it back to line 170 of train.py, where "res" does not seem to be passed to sacrebleu correctly:
bleu = sacrebleu.corpus_bleu(trg, res, tokenize='zh')

Here's the error log from the command line:
-------- Dataset Build! --------
-------- Get Dataloader! --------
0%| | 0/5530 [00:00<?, ?it/s]
100%|██████████| 5530/5530 [39:37<00:00, 2.33it/s]
Epoch: 1, loss: 7.671135425567627
100%|██████████| 790/790 [02:19<00:00, 5.67it/s]
100%|██████████| 790/790 [1:09:28<00:00, 5.28s/it]sys_stream length: 25278

Traceback (most recent call last):
File "...\ChineseNMT\main.py", line 106, in <module>
run()
File " ...\ChineseNMT\main.py", line 76, in run
train(train_dataloader, dev_dataloader, model, model_par, criterion, optimizer)
File " ...\ChineseNMT\train.py", line 43, in train
bleu_score = evaluate(dev_data, model)
File "...\ChineseNMT\train.py", line 170, in evaluate
bleu = sacrebleu.corpus_bleu(trg, res, tokenize='zh')
File "....conda\envs\my_env\lib\site-packages\sacrebleu\compat.py", line 36, in corpus_bleu
sys_stream, ref_streams, use_effective_order=use_effective_order)
File "....conda\envs\my_env\lib\site-packages\sacrebleu\metrics\bleu.py", line 279, in corpus_score
print('ref_stream length:', len(ref_stream))

NameError: name 'ref_stream' is not defined

Here's the info from the log:
2021-03-31 04:51:49,951:INFO: -------- Dataset Build! --------
2021-03-31 04:51:49,956:INFO: -------- Get Dataloader! --------
2021-03-31 05:31:29,517:INFO: Epoch: 1, loss: 7.671135425567627

Thank you very much!

Hi, the log shows that the error comes from your sacrebleu package. Please make sure sacrebleu runs correctly on its own before running the whole NMT project. We suggest testing your sacrebleu package with the example from the official sacrebleu repo.

You can simply run the code below:

import sacrebleu

refs = [['The dog bit the man.', 'It was not unexpected.', 'The man bit him first.'],
        ['The dog had bit the man.', 'No one was surprised.', 'The man had bitten the dog.']]
sys = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']

bleu = sacrebleu.corpus_bleu(sys, refs)
print(bleu.score)

By the way, the log shows that the variable 'ref_stream' is not defined in the corpus_score function in sacrebleu\metrics\bleu.py. If you use PyCharm, you can view the sacrebleu source by selecting the function name and pressing Ctrl+B.
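Without an IDE, one way to confirm which copy of sacrebleu is actually being imported is to ask Python for the package's file location and installed version. A small sketch using only the standard library (the helper name describe_package is made up for illustration):

```python
import importlib.metadata
import importlib.util


def describe_package(name):
    """Return where `name` would be imported from and its installed version.

    Returns None if the package cannot be found in this environment.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    try:
        version = importlib.metadata.version(name)
    except importlib.metadata.PackageNotFoundError:
        version = None  # importable, but no distribution metadata found
    return {"origin": spec.origin, "version": version}


print(describe_package("sacrebleu"))
```

Opening the reported file and comparing it with the upstream source should show whether the installed copy differs from the released one.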

I created a new virtual environment and installed the newest version, sacrebleu==1.5.1, with the command:

pip install sacrebleu

I ran the example above successfully and did not hit the error you got. Here is the original code of the corpus_score function in sacrebleu\metrics\bleu.py:

def corpus_score(self, sys_stream: Union[str, Iterable[str]],
                 ref_streams: Union[str, List[Iterable[str]]],
                 use_effective_order: bool = False) -> BLEUScore:
        """Produces BLEU scores along with its sufficient statistics from a source against one or more references.
    
        :param sys_stream: The system stream (a sequence of segments)
        :param ref_streams: A list of one or more reference streams (each a sequence of segments)
        :param use_effective_order: Account for references that are shorter than the largest n-gram.
        :return: a `BLEUScore` object containing everything you'd want
        """

        # Add some robustness to the input arguments
        if isinstance(sys_stream, str):
                sys_stream = [sys_stream]

        if isinstance(ref_streams, str):
                ref_streams = [[ref_streams]]

        sys_len = 0
        ref_len = 0

        correct = [0 for n in range(self.NGRAM_ORDER)]
        total = [0 for n in range(self.NGRAM_ORDER)]

        # look for already-tokenized sentences
        tokenized_count = 0

        # sanity checks
        if any(len(ref_stream) != len(sys_stream) for ref_stream in ref_streams):
                raise EOFError("System and reference streams have different lengths!")
        if any(line is None for line in sys_stream):
                raise EOFError("Undefined line in system stream!")

        for output, *refs in zip(sys_stream, *ref_streams):
                # remove undefined/empty references (i.e. we have fewer references for this particular sentence)
                # but keep empty hypothesis (it's always defined thanks to the sanity check above)
                lines = [output] + [x for x in refs if x is not None and x != ""]
                if len(lines) < 2:  # we need at least hypothesis + 1 defined & non-empty reference
                        raise EOFError("No valid references for a sentence!")

                if self.lc:
                        lines = [x.lower() for x in lines]

                if not (self.force or self.tokenizer.signature() == 'none') and lines[0].rstrip().endswith(' .'):
                        tokenized_count += 1

                        if tokenized_count == 100:
                                sacrelogger.warning('That\'s 100 lines that end in a tokenized period (\'.\')')
                                sacrelogger.warning(
                                        'It looks like you forgot to detokenize your test data, which may hurt your score.')
                                sacrelogger.warning(
                                        'If you insist your data is detokenized, or don\'t care, you can suppress this message with \'--force\'.')

                output, *refs = [self.tokenizer(x.rstrip()) for x in lines]

                output_len = len(output.split())
                ref_ngrams, closest_diff, closest_len = BLEU.reference_stats(refs, output_len)

                sys_len += output_len
                ref_len += closest_len

                sys_ngrams = BLEU.extract_ngrams(output)
                for ngram in sys_ngrams.keys():
                        n = len(ngram.split())
                        correct[n - 1] += min(sys_ngrams[ngram], ref_ngrams.get(ngram, 0))
                        total[n - 1] += sys_ngrams[ngram]

        # Get BLEUScore object
        score = self.compute_bleu(
                correct, total, sys_len, ref_len,
                smooth_method=self.smooth_method, smooth_value=self.smooth_value,
                use_effective_order=use_effective_order)

        return score

You can check the original code yourself and find out what causes the error 😊.
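For reference, the check that raises "No valid references for a sentence!" in the code above fires when every reference for some segment is None or an empty string. The sketch below reimplements just that filter in plain Python so such segments can be located before scoring; the function name and the sample data are illustrative and not part of sacrebleu.

```python
def segments_without_references(sys_stream, ref_streams):
    """Return indices of segments whose references are all None or empty.

    Mirrors the filter in the pasted corpus_score: a reference is dropped
    if it is None or "", and at least one must remain per segment.
    """
    if any(len(ref_stream) != len(sys_stream) for ref_stream in ref_streams):
        raise ValueError("System and reference streams have different lengths!")

    bad = []
    for i, (_output, *refs) in enumerate(zip(sys_stream, *ref_streams)):
        valid = [x for x in refs if x is not None and x != ""]
        if not valid:
            bad.append(i)
    return bad


hyps = ["the dog bit the man", "", "hello"]
refs = [["the dog bit the man", "it was expected", ""]]
print(segments_without_references(hyps, refs))  # [2]
```

Note that the empty hypothesis at index 1 is not reported: only empty references trigger the error.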

Thank you very much for your detailed explanation and suggestions.

After investigating the sacrebleu package files and re-creating my environment a few times, I found that the problem was caused by the way I installed sacrebleu, which was:
pip install sacrebleu

Instead of pip, I tried conda (because I use Anaconda to create my environment), and your ChineseNMT code ran successfully:
conda install -c conda-forge sacrebleu

One more note: as with get_corpus.py, to get main.py to run I added encoding="utf-8" to line 165 of train.py:
with open(config.output_path, "w", encoding="utf-8") as fp:

Thank you again for your help. And thank you for sharing your excellent code on GitHub!

Glad to hear that you solved your problems! Yes, the released code has so far only been tested on Linux. If you want to run it on Windows, you should add encoding="utf-8" as you mentioned. Thank you very much for pointing out all these issues! Hope your project goes well~😊

There is one more thing that I changed. In train.py (lines 169 and 170), instead of:
res = [res]
bleu = sacrebleu.corpus_bleu(trg, res, tokenize='zh')

I changed it to:
trg = [trg]
bleu = sacrebleu.corpus_bleu(res, trg, tokenize='zh')

Because in the example from https://github.com/mjpost/sacrebleu:

import sacrebleu
refs = [['The dog bit the man.', 'It was not unexpected.', 'The man bit him first.'],
        ['The dog had bit the man.', 'No one was surprised.', 'The man had bitten the dog.']]
sys = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']
bleu = sacrebleu.corpus_bleu(sys, refs)
print(bleu.score)

refs refers to the reference translations, and sys refers to the translation predicted by the NMT.

And in your ChineseNMT, trg is the reference translation, and res is the translation predicted by NMT, right?

Technically, since you only have one set of reference translations, your way can work, and it did work for me a couple of times. But for sacrebleu to work every time, this change seems necessary. I hope this helps others who are debugging your ChineseNMT code.
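To make the argument order explicit: in the README example quoted above, corpus_bleu takes the system output first and a list of reference streams second. A small helper along these lines keeps the two from being swapped (the helper name is hypothetical; predictions and references stand in for res and trg in train.py). One plausible reason the original order failed only some of the time, offered as an assumption rather than a confirmed diagnosis: with the arguments swapped, model predictions sit in the reference slot, and an empty prediction would trip the "No valid references" check, whereas an empty hypothesis is allowed.

```python
def to_corpus_bleu_args(predictions, references):
    """Arrange one set of references into (sys_stream, ref_streams).

    predictions: list of model outputs, one string per sentence
    references:  list of gold translations, aligned with predictions
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must be the same length")
    # One reference set -> a list containing a single reference stream.
    return list(predictions), [list(references)]


sys_stream, ref_streams = to_corpus_bleu_args(
    ["狗咬了人", ""],                 # an empty prediction is acceptable here
    ["狗咬了人。", "这并不意外。"],    # references should be non-empty
)
# With sacrebleu installed, the call would then be:
#     sacrebleu.corpus_bleu(sys_stream, ref_streams, tokenize='zh')
print(len(sys_stream), len(ref_streams), len(ref_streams[0]))
```

The sample sentences are illustrative only and are not taken from the dataset.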

Unrelated to this issue, I was just wondering: where did you get your train, dev and test datasets? Any reference would be greatly appreciated.

Thank you very much for your contribution to this repo! I will check the issues you mentioned above and fix the bugs ~. I asked our TA, and he said the dataset is from the WMT 2018 Chinese-English track (news domain only), which you can find here. I will update the source of the dataset in README.md tomorrow. I appreciate your awesome work again ~!