thammegowda/mtdata

Anuvaad-zee-30042021-eng-ben ERROR:: Unable to add Anuvaad-zee-30042021-eng-ben: en-bn/*.en matched []; expected one file

XapaJIaMnu opened this issue · 1 comments

mtdata get -l bn-en -tr Anuvaad-zee-30042021-eng-ben -o Anuvaad-zee-30042021-eng-ben --compress
2022-02-28 14:43:08 entry.lang_pair:24 INFO:: Suggestion: Use codes ben-eng instead of bn-en. Let's make a little space for all languages of our planet 😢.
2022-02-28 14:43:08 main.get_data:32 WARNING:: Args are ignored: {'verbose': False, 'reindex': False, 'task': 'get'}
2022-02-28 14:43:08 __init__.get_instance:48 INFO:: Loading index from cache /home/nikolay/.mtdata/mtdata.index.0.3.3.pkl
2022-02-28 14:43:10 cache.__post_init__:34 INFO:: Local cache is at /home/nikolay/.mtdata
2022-02-28 14:43:10 data.add_parts:280 ERROR:: Unable to add Anuvaad-zee-30042021-eng-ben:  en-bn/*.en matched []; expected one file
2022-02-28 14:43:10 data.add_parts:283 WARNING::  en-bn/*.en matched []; expected one file

This seems to be an issue for a few of the Anuvaad* datasets. Also confirmed for Anuvaad-toi-20210320-eng-ben, Anuvaad-anuvaad_general-corpus-eng-ben, mtdata_Anuvaad-prothomalo_2014-2020-eng-ben, and Anuvaad-ik_2021-v1-eng-ben.

Thanks for reporting. The Anuvaad corpus has inconsistent formats and IDs.
I notified them: project-anuvaad/anuvaad-parallel-corpus#1
but I got no reply.

I ended up adding them on a best-effort basis, fixing the inconsistencies where I could, so a few dataset IDs are failing.

Here is the relevant code:

assert url.startswith('http') and url.endswith('.zip')
file_name = url.split('/')[-1]
file_name = file_name[:-4]  # .zip
char_count = coll.Counter(list(file_name))
n_hyps = char_count.get('-', 0)
n_unders = char_count.get('_', 0)
if n_hyps > n_unders:
    parts = file_name.split('-')
else:
    assert '_' in file_name
    parts = file_name.split('_')
name, version = '?', '?'
l1, l2 = 'en', '?'
if parts[-2] == l1 and parts[-1] in langs:
    l2 = parts[-1]
    version = parts[-3]
elif parts[-3] == l1 and parts[-2] in langs:
    l2 = parts[-2]
    version = parts[-1]
else:
    log.warn(f"Unable to parse {file_name} :: {parts}")
    continue
name = '_'.join(parts[:-3])
name = name.replace('-', '_')
f1 = f'{l1}-{l2}/*.{l1}'
f2 = f'{l1}-{l2}/*.{l2}'
if name == 'wikipedia':
    f1 = f'{l1}-{l2}/{l1}.txt'
    f2 = f'{l1}-{l2}/{l2}.txt'
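
To make the logic above easier to experiment with, here is a rough standalone sketch of the same parsing steps. The function name `parse_file_name`, the returned dict, and the example file names are hypothetical and not part of mtdata; the length guard is also an addition for this sketch, and where the snippet asserts or logs a warning, the sketch simply returns None.

```python
from collections import Counter


def parse_file_name(file_name, langs, l1="en"):
    """Hypothetical helper: derive name, version and glob patterns
    from a zip base name (without the .zip suffix)."""
    counts = Counter(file_name)
    # Split on whichever separator is more frequent, as the snippet does.
    sep = "-" if counts["-"] > counts["_"] else "_"
    parts = file_name.split(sep)
    if len(parts) < 3:
        return None  # guard added for this sketch: too few fields
    if parts[-2] == l1 and parts[-1] in langs:
        l2, version = parts[-1], parts[-3]
    elif parts[-3] == l1 and parts[-2] in langs:
        l2, version = parts[-2], parts[-1]
    else:
        return None  # the snippet logs a warning and skips the URL here
    name = "_".join(parts[:-3]).replace("-", "_")
    if name == "wikipedia":
        f1, f2 = f"{l1}-{l2}/{l1}.txt", f"{l1}-{l2}/{l2}.txt"
    else:
        f1, f2 = f"{l1}-{l2}/*.{l1}", f"{l1}-{l2}/*.{l2}"
    return {"name": name, "version": version, "l2": l2, "f1": f1, "f2": f2}


# Hypothetical file names, chosen only to exercise both branches.
print(parse_file_name("zee-30042021-en-bn", {"bn"}))
print(parse_file_name("wikipedia-en-bn-20210201", {"bn"}))
```

Note that the patterns f1 and f2 are derived from the file name alone. If an archive uses a different directory name or file extension inside, the pattern would match nothing, which would be consistent with the "matched []" error in the log.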

If you find a simple fix, please send a pull request.
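
One possible starting point for such a fix is to list which members of a downloaded archive actually match the expected pattern. The sketch below is only an illustration: `matching_members` and `make_zip` are hypothetical helpers, the archive layouts are invented (the real Anuvaad archive contents are not shown in this thread), and `fnmatch` is a stand-in that may not behave exactly like the matching mtdata does internally.

```python
import fnmatch
import io
import zipfile


def matching_members(zip_file, pattern):
    """Return the archive member names that match a glob-style pattern."""
    with zipfile.ZipFile(zip_file) as zf:
        return [n for n in zf.namelist() if fnmatch.fnmatchcase(n, pattern)]


def make_zip(names):
    """Build a small in-memory zip with the given member names (demo only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for n in names:
            zf.writestr(n, "placeholder\n")
    return buf


# Invented layout that the pattern from the error message would match.
ok = make_zip(["en-bn/train.en", "en-bn/train.bn"])
print(matching_members(ok, "en-bn/*.en"))

# Invented layout with a different top-level directory: nothing matches.
odd = make_zip(["zee-en-bn/train.en", "zee-en-bn/train.bn"])
print(matching_members(odd, "en-bn/*.en"))
```

Running the same check against a real downloaded zip (passing its path instead of an in-memory buffer) would show whether the directory name or the file extension is what differs from the expected `en-bn/*.en`.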