thammegowda/mtdata

Cannot Download wmt21 en2zh test data

Pzzzzz5142 opened this issue · 5 comments

Here is my mtdata.recipes.wmt22-constrained.yaml config:

- id: wmt22-zhen-t
  langs: zho-eng
  desc: WMT 22 General MT
  url: https://www.statmt.org/wmt22/translation-task.html
  dev:
  test:
    - Statmt-newstest_enzh-2021-eng-zho
  train:

When I download the test set using the following command,

mtdata get-recipe -ri wmt22-zhen-t -o .

it raises an error. Here is the error log:

2022-06-07 15:19:36 data.add_parts_sequential:329 ERROR:: Unable to add Statmt-newstest_enzh-2021-eng-zho: /Users/pzzzzz/.mtdata/data.statmt.org/1df0/c1646dcf67bf017db12b47b5c987/wmt21tests.tgz-extracted/test/newstest2021.en-zh.xml has unequal number of segs: 1845 == 2847?

It seems that the 2021 en2zh test set has multiple reference sentences for each source sentence, so the assert statement raises the error above.


The code that causes this issue is at sgm.py line 79:

srcs = list(xpath_all(tree.getroot(), xpath=".//src//seg"))
tgts = list(xpath_all(tree.getroot(), xpath=".//ref//seg"))
assert len(srcs) == len(tgts), f'{data} has unequal number of segs: {len(srcs)} == {len(tgts)}?'
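To illustrate why the counts diverge, here is a minimal sketch that counts source segments and reference segments per translator. It assumes the WMT21 XML layout is `doc > src/ref > p > seg`, with a `translator` attribute on each `ref`; the sample document and the `count_segs` helper are hypothetical and not part of mtdata.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature of the assumed WMT21 layout: one doc, two translators,
# where translator B translated only the first segment.
SAMPLE = """<dataset>
  <collection>
    <doc id="doc1" origlang="en">
      <src lang="en"><p><seg id="1">Hello.</seg><seg id="2">Goodbye.</seg></p></src>
      <ref lang="de" translator="A"><p><seg id="1">Hallo.</seg><seg id="2">Auf Wiedersehen.</seg></p></ref>
      <ref lang="de" translator="B"><p><seg id="1">Guten Tag.</seg></p></ref>
    </doc>
  </collection>
</dataset>"""


def count_segs(root):
    """Return (number of source segs, Counter of ref segs keyed by translator)."""
    n_src = sum(1 for src in root.iter("src") for _ in src.iter("seg"))
    per_translator = Counter()
    for ref in root.iter("ref"):
        # Refs without a translator attribute are grouped under the empty string.
        per_translator[ref.get("translator", "")] += sum(1 for _ in ref.iter("seg"))
    return n_src, per_translator


root = ET.fromstring(SAMPLE)
n_src, per_translator = count_segs(root)
print(n_src, dict(per_translator), sum(per_translator.values()))
```

Selecting every `seg` under every `ref` pools all translators together, so the total can exceed the number of source segments even though no single translator does.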

Just wanted to note that this affects more than just enzh. German-English is also affected when using the default scripts provided by WMT.

Thanks for reporting this.
Sorry for the delay; I was on vacation and away from GitHub.
I will try to fix this issue soon and release a new version.

Thanks, @khayrallah for the pointer!

You are right: WMT21 test refs have multiple translators, which differs from previous years.

What is causing the delay is that not all files have multiple refs, and when we do have multiple refs, not all translators translate every segment. I will need a bit more time to fix it properly.
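One possible way to handle this, sketched below, is to pick a single translator and align references to source segments by segment `id` within each document, leaving a gap where that translator has no translation. This is only an illustration under the assumption that the XML layout is `doc > src/ref > p > seg` with a `translator` attribute on `ref`; the sample data and the `read_parallel` helper are hypothetical and do not describe how mtdata resolves the issue.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the assumed WMT21 layout.
SAMPLE = """<dataset>
  <collection>
    <doc id="doc1" origlang="en">
      <src lang="en"><p><seg id="1">Hello.</seg><seg id="2">Goodbye.</seg></p></src>
      <ref lang="de" translator="A"><p><seg id="1">Hallo.</seg><seg id="2">Auf Wiedersehen.</seg></p></ref>
      <ref lang="de" translator="B"><p><seg id="1">Guten Tag.</seg></p></ref>
    </doc>
  </collection>
</dataset>"""


def read_parallel(root, translator):
    """Yield (source, reference) pairs for one translator.

    The reference is None when the chosen translator did not translate
    that segment (or did not translate the document at all).
    """
    for doc in root.iter("doc"):
        src = doc.find("src")
        if src is None:
            continue
        # Collect this translator's segments for the current doc, keyed by seg id.
        refs = {}
        for ref in doc.findall("ref"):
            if ref.get("translator") == translator:
                for seg in ref.iter("seg"):
                    refs[seg.get("id")] = "".join(seg.itertext()).strip()
        for seg in src.iter("seg"):
            yield "".join(seg.itertext()).strip(), refs.get(seg.get("id"))


root = ET.fromstring(SAMPLE)
pairs_a = list(read_parallel(root, "A"))
pairs_b = list(read_parallel(root, "B"))
print(pairs_a)
print(pairs_b)
```

Whether missing references should be dropped, kept as empty lines, or filled from another translator is a design decision this sketch leaves open.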

$ for i in ~/.mtdata/data.statmt.org/1df0/c1646dcf67bf017db12b47b5c987/wmt21tests.tgz-extracted/test/newstest2021.*xml; 
  do basename $i; grep -o 'translator="[^"]*"' $i | sort | uniq -c ;  done 
  
newstest2021.cs-en.xml
    167 translator="A"
     62 translator="B"
newstest2021.de-en.xml
     67 translator="A"
     61 translator="B"
newstest2021.de-fr.xml
     61 translator="A"
newstest2021.en-cs.xml
    201 translator="A"
     68 translator="B"
newstest2021.en-de.xml
     74 translator="A"
     68 translator="C"
     68 translator="D"
newstest2021.en-ha.xml
   3524 translator="A"
newstest2021.en-is.xml
     65 translator="A"
newstest2021.en-ja.xml
     65 translator="A"
newstest2021.en-ru.xml
     77 translator="A"
     68 translator="B"
newstest2021.en-zh.xml
     77 translator="A"
     68 translator="B"
newstest2021.fr-de.xml
     74 translator="A"
newstest2021.ha-en.xml
   3559 translator="A"
newstest2021.is-en.xml
     47 translator="A"
newstest2021.ja-en.xml
     81 translator="A"
newstest2021.ru-en.xml
    116 translator="A"
    107 translator="B"
newstest2021.zh-en.xml
    165 translator="A"

Thanks for the update! It might be a good idea to add a note to the main WMT page, since it links to mtdata as the way to download the WMT data.

Thanks for the suggestion! I have sent a pull request to the WMT22 page. Once it is merged, a note will appear under the “limitations” section.