Anuvaad-zee-30042021-eng-ben ERROR:: Unable to add Anuvaad-zee-30042021-eng-ben: en-bn/*.en matched []; expected one file
XapaJIaMnu opened this issue · 1 comments
XapaJIaMnu commented
mtdata get -l bn-en -tr Anuvaad-zee-30042021-eng-ben -o Anuvaad-zee-30042021-eng-ben --compress
2022-02-28 14:43:08 entry.lang_pair:24 INFO:: Suggestion: Use codes ben-eng instead of bn-en. Let's make a little space for all languages of our planet 😢.
2022-02-28 14:43:08 main.get_data:32 WARNING:: Args are ignored: {'verbose': False, 'reindex': False, 'task': 'get'}
2022-02-28 14:43:08 __init__.get_instance:48 INFO:: Loading index from cache /home/nikolay/.mtdata/mtdata.index.0.3.3.pkl
2022-02-28 14:43:10 cache.__post_init__:34 INFO:: Local cache is at /home/nikolay/.mtdata
2022-02-28 14:43:10 data.add_parts:280 ERROR:: Unable to add Anuvaad-zee-30042021-eng-ben: en-bn/*.en matched []; expected one file
2022-02-28 14:43:10 data.add_parts:283 WARNING:: en-bn/*.en matched []; expected one file
This seems to be an issue for a few of the Anuvaad* datasets. Also confirmed for Anuvaad-toi-20210320-eng-ben, Anuvaad-anuvaad_general-corpus-eng-ben, mtdata_Anuvaad-prothomalo_2014-2020-eng-ben, and Anuvaad-ik_2021-v1-eng-ben.
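For anyone trying to reproduce this, the message in the log reads like a glob pattern being checked against the file names inside the downloaded archive, with exactly one match required. The following is only a minimal sketch of that kind of check, not mtdata's actual code: `match_one` is a hypothetical helper, and the archive layout (a zip with a `.eng` file where `.en` was expected) is invented purely to show how an empty match list can arise.

```python
import fnmatch
import io
import zipfile


def match_one(names, pattern):
    """Return the single name matching `pattern`, or raise if there is not exactly one."""
    matched = fnmatch.filter(names, pattern)
    if len(matched) != 1:
        raise ValueError(f"{pattern} matched {matched}; expected one file")
    return matched[0]


# Hypothetical archive layout, built in memory for illustration only.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("en-bn/train.bn", "...")
    zf.writestr("en-bn/train.eng", "...")  # assumed mismatch: .eng instead of .en

# List what the archive really contains before trusting the pattern.
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()

print(names)
try:
    match_one(names, "en-bn/*.en")
except ValueError as err:
    print(err)
```

Listing the real member names of a failing dataset's archive in the same way should show which part of the pattern (directory name or extension) does not line up.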
thammegowda commented
Thanks for reporting. The Anuvaad corpus has inconsistent formats and IDs.
I notified them (project-anuvaad/anuvaad-parallel-corpus#1) but got no reply.
I ended up adding the datasets on a best-effort basis, fixing the inconsistencies where I could, so a few dataset IDs are failing.
Here is the relevant code: mtdata/mtdata/index/anuvaad.py, lines 27 to 55 in 18aa5ac
If you find a simple fix, please send a pull request.
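One possible direction for such a fix, offered only as a sketch: instead of a single fixed pattern per dataset, try a short list of candidate patterns and accept the first one that matches exactly one file. Everything below is an assumption for illustration. `match_first` and `candidate_patterns` are hypothetical helpers, not part of mtdata, and the file names are invented; the real inconsistencies in the Anuvaad archives would need to be checked before choosing candidates.

```python
import fnmatch


def match_first(names, patterns):
    """Return the unique match of the first pattern that matches exactly one name."""
    for pattern in patterns:
        matched = fnmatch.filter(names, pattern)
        if len(matched) == 1:
            return matched[0]
    raise ValueError(f"none of {patterns} matched exactly one file")


def candidate_patterns(src, tgt, ext):
    """Hypothetical candidates: both directory orders for one file extension."""
    return [f"{src}-{tgt}/*.{ext}", f"{tgt}-{src}/*.{ext}"]


# Invented layout where the directory uses the reverse language order.
names = ["bn-en/corpus.en", "bn-en/corpus.bn"]
print(match_first(names, candidate_patterns("en", "bn", "en")))
```

Trying candidates in a fixed order keeps the behaviour predictable: datasets that already work keep matching their original pattern first, and only the failing ones fall through to the alternatives.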