Anuvaad-zee-30042021-eng-ben ERROR:: Unable to add Anuvaad-zee-30042021-eng-ben: en-bn/*.en matched []; expected one file

mtdata get -l bn-en -tr Anuvaad-zee-30042021-eng-ben -o Anuvaad-zee-30042021-eng-ben --compress
2022-02-28 14:43:08 entry.lang_pair:24 INFO:: Suggestion: Use codes ben-eng instead of bn-en. Let's make a little space for all languages of our planet 😢.
2022-02-28 14:43:08 main.get_data:32 WARNING:: Args are ignored: {'verbose': False, 'reindex': False, 'task': 'get'}
2022-02-28 14:43:08 __init__.get_instance:48 INFO:: Loading index from cache /home/nikolay/.mtdata/mtdata.index.0.3.3.pkl
2022-02-28 14:43:10 cache.__post_init__:34 INFO:: Local cache is at /home/nikolay/.mtdata
2022-02-28 14:43:10 data.add_parts:280 ERROR:: Unable to add Anuvaad-zee-30042021-eng-ben:  en-bn/*.en matched []; expected one file
2022-02-28 14:43:10 data.add_parts:283 WARNING::  en-bn/*.en matched []; expected one file

This seems to affect several of the Anuvaad* datasets. I also confirmed it for Anuvaad-toi-20210320-eng-ben, Anuvaad-anuvaad_general-corpus-eng-ben, mtdata_Anuvaad-prothomalo_2014-2020-eng-ben, and Anuvaad-ik_2021-v1-eng-ben.

Thanks for reporting. The Anuvaad corpus has inconsistent formats and IDs.
I notified them at project-anuvaad/anuvaad-parallel-corpus#1, but I got no reply.

I ended up adding them on a best-effort basis, working around the inconsistencies where I could, so a few dataset IDs are failing.

Here is the relevant code:

mtdata/mtdata/index/anuvaad.py

Lines 27 to 55 in 18aa5ac

    assert url.startswith('http') and url.endswith('.zip')
    file_name = url.split('/')[-1]
    file_name = file_name[:-4]  # .zip
    char_count = coll.Counter(list(file_name))
    n_hyps = char_count.get('-', 0)
    n_unders = char_count.get('_', 0)
    if n_hyps > n_unders:
        parts = file_name.split('-')
    else:
        assert '_' in file_name
        parts = file_name.split('_')
    name, version= '?', '?'
    l1, l2 = 'en', '?'
    if parts[-2] == l1 and parts[-1] in langs:
        l2 = parts[-1]
        version = parts[-3]
    elif parts[-3] == l1 and parts[-2] in langs:
        l2 = parts[-2]
        version = parts[-1]
    else:
        log.warn(f"Unable to parse {file_name} :: {parts}")
        continue
    name = '_'.join(parts[:-3])
    name = name.replace('-', '_')
    f1 = f'{l1}-{l2}/*.{l1}'
    f2 = f'{l1}-{l2}/*.{l2}'
    if name == 'wikipedia':
        f1 = f'{l1}-{l2}/{l1}.txt'
        f2 = f'{l1}-{l2}/{l2}.txt'
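
For experimenting outside the indexer, the parsing logic above can be pulled into a standalone function. This is only a sketch: `parse_anuvaad_url` is a hypothetical name, the `example.com` URLs are made up, and it returns `None` where the original logs a warning and continues. It also adds a length guard that the original does not have.

```python
import collections as coll
from typing import Optional, Set, Tuple


def parse_anuvaad_url(url: str, langs: Set[str]) -> Optional[Tuple[str, str, str, str, str]]:
    """Return (name, version, l2, f1, f2), or None if the file name cannot be parsed.

    Hypothetical helper mirroring the snippet quoted above; not part of mtdata.
    """
    assert url.startswith('http') and url.endswith('.zip')
    file_name = url.split('/')[-1][:-4]  # drop the '.zip' suffix
    char_count = coll.Counter(file_name)
    n_hyps = char_count.get('-', 0)
    n_unders = char_count.get('_', 0)
    # Split on whichever separator is more frequent ('_' wins ties, as in the original)
    parts = file_name.split('-' if n_hyps > n_unders else '_')
    if len(parts) < 3:  # extra guard (not in the original) against IndexError
        return None

    l1 = 'en'
    if parts[-2] == l1 and parts[-1] in langs:
        l2, version = parts[-1], parts[-3]
    elif parts[-3] == l1 and parts[-2] in langs:
        l2, version = parts[-2], parts[-1]
    else:
        return None  # the original logs a warning and skips the entry

    name = '_'.join(parts[:-3]).replace('-', '_')
    # Glob patterns for the two sides of the parallel corpus inside the zip
    f1 = f'{l1}-{l2}/*.{l1}'
    f2 = f'{l1}-{l2}/*.{l2}'
    if name == 'wikipedia':
        f1 = f'{l1}-{l2}/{l1}.txt'
        f2 = f'{l1}-{l2}/{l2}.txt'
    return name, version, l2, f1, f2


# Made-up URL, for illustration only
print(parse_anuvaad_url('http://example.com/zee-30042021-en-bn.zip', {'bn'}))
```

Note that the patterns `f1` and `f2` are derived purely from the file name, so they can only match if the directory layout inside the archive follows the same convention.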

If you find a simple fix, please send a pull request.
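
For anyone looking into a fix, one way to see why a pattern such as `en-bn/*.en` matches nothing is to list the archive members and test the pattern against them. The sketch below is a minimal diagnostic that uses Python's `fnmatch` for matching, which may differ from how mtdata matches internally. `match_members` and `demo_archive` are hypothetical helpers, and the archive layout in the demo is invented, not taken from a real Anuvaad zip.

```python
import fnmatch
import io
import zipfile


def match_members(archive, pattern):
    """Return (all member names, members matching a shell-style pattern).

    `archive` may be a file path or a file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
    return names, fnmatch.filter(names, pattern)


def demo_archive():
    """Build a tiny in-memory zip with an invented layout, for illustration only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as zf:
        zf.writestr('bn-en/train.en', 'hello\n')
        zf.writestr('bn-en/train.bn', 'placeholder\n')
    buf.seek(0)
    return buf


names, hits = match_members(demo_archive(), 'en-bn/*.en')
print(names)  # ['bn-en/train.en', 'bn-en/train.bn']
print(hits)   # [] because the directory is 'bn-en', not 'en-bn'
```

To inspect a real download, pass the path of the cached zip instead of `demo_archive()` and compare the printed member names with the pattern shown in the error message.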
