nlp-uoregon/trankit

download_missing_files seems not to provide all missing files, in particular *_mwt_expander.pt (dutch)

tcbrouwer opened this issue · 1 comments

After training on a separate machine we got some promising results, and we are now looking to move our model into production. However, we encountered an issue. Downloading missing files and verifying the model like this:

# First we download any missing files and verify the pipeline

import trankit

# Download any missing files
trankit.download_missing_files(
    category='customized-mwt-ner',
    save_dir='./trankit_model',
    embedding_name='xlm-roberta-base',
    language='dutch'
)

# Verify the pipeline
trankit.verify_customized_pipeline(
    category='customized-mwt-ner', # pipeline category
    save_dir='./trankit_model', # directory used for saving models in previous steps
    embedding_name='xlm-roberta-base' # embedding version that we use for training our customized pipeline, by default, it is `xlm-roberta-base`
)

Leads to the following output and error:

Missing ./trankit_model/xlm-roberta-base/customized-mwt-ner/customized-mwt-ner_mwt_expander.pt
Missing ./trankit_model/xlm-roberta-base/customized-mwt-ner/customized-mwt-ner_lemmatizer.pt
Missing ./trankit_model/xlm-roberta-base/customized-mwt-ner/customized-mwt-ner.ner.mdl
Missing ./trankit_model/xlm-roberta-base/customized-mwt-ner/customized-mwt-ner.ner-vocab.json
http://nlp.uoregon.edu/download/trankit/v1.0.0/xlm-roberta-base/dutch.zip
Downloading: 100%|██████████| 46.3M/46.3M [01:07<00:00, 682kiB/s] 
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[7], line 6
      3 import trankit
      5 # Download any missing files
----> 6 trankit.download_missing_files(
      7 	category='customized-mwt-ner', 
      8 	save_dir='./trankit_model', 
      9 	embedding_name='xlm-roberta-base', 
     10 	language='dutch'
     11 )
     13 # Verify the pipeline
     14 trankit.verify_customized_pipeline(
     15     category='customized-mwt-ner', # pipeline category
     16     save_dir='./trankit_model', # directory used for saving models in previous steps
     17     embedding_name='xlm-roberta-base' # embedding version that we use for training our customized pipeline, by default, it is `xlm-roberta-base`
     18 )

File ~/Projects/UDParserEvaluation/venv/lib/python3.10/site-packages/trankit/__init__.py:71, in download_missing_files(category, save_dir, embedding_name, language)
     69 tgt_dir = os.path.join(save_dir, embedding_name, category)
     70 for fname in missing_filenamess:
---> 71     copyfile(os.path.join(src_dir, fname.format(language)), os.path.join(tgt_dir, fname.format(category)))
     72     print('Copying {} to {}'.format(
     73         os.path.join(src_dir, fname.format(language)),
     74         os.path.join(tgt_dir, fname.format(category))
     75     ))
     76 remove_with_path(src_dir)

File /usr/lib/python3.10/shutil.py:254, in copyfile(src, dst, follow_symlinks)
    252     os.symlink(os.readlink(src), dst)
    253 else:
--> 254     with open(src, 'rb') as fsrc:
    255         try:
    256             with open(dst, 'wb') as fdst:
    257                 # macOS

FileNotFoundError: [Errno 2] No such file or directory: './trankit_model/xlm-roberta-base/dutch/dutch_mwt_expander.pt'

No file matching *_mwt_expander.pt seems to be present among the downloaded files.
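For illustration, a defensive variant of the copy loop from the traceback could look like the sketch below. This is not trankit's code: `copy_available` and its parameters are hypothetical names. The only thing carried over from the traceback is the assumption that filenames are `{}`-style templates, formatted with the language on the source side and the category on the target side.

```python
import os
import shutil


def copy_available(src_dir, tgt_dir, templates, language, category):
    """Copy the files that exist in src_dir and report the ones that do not.

    Hypothetical helper, not part of trankit. `templates` are strings such as
    '{}_mwt_expander.pt', mirroring fname.format(...) in the traceback.
    """
    copied, skipped = [], []
    os.makedirs(tgt_dir, exist_ok=True)
    for template in templates:
        src = os.path.join(src_dir, template.format(language))
        dst = os.path.join(tgt_dir, template.format(category))
        if os.path.isfile(src):
            shutil.copyfile(src, dst)
            copied.append(dst)
        else:
            # Record the missing source instead of raising FileNotFoundError.
            skipped.append(src)
    return copied, skipped
```

Calling it with the directories from the error message would return the list of copied targets plus the list of source files that were not found, which makes it easier to see exactly which files the download did not provide.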

I tried to download a few zips from http://nlp.uoregon.edu/download/trankit/ and its subfolders, but had no luck finding any mwt_expander file.
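One way to check an archive without unpacking it by hand is to list its members and filter by filename. The sketch below uses only the standard library; `members_matching` is a hypothetical helper name, and the archive path passed to it is whatever local copy of a downloaded zip is available.

```python
import fnmatch
import zipfile


def members_matching(zip_path, pattern):
    """Return archive member names whose basename matches a shell-style pattern.

    Hypothetical helper for inspecting a downloaded zip; `zip_path` may be a
    filesystem path or a file-like object.
    """
    with zipfile.ZipFile(zip_path) as archive:
        return [
            name for name in archive.namelist()
            # Compare against the basename so nested folders do not matter.
            if fnmatch.fnmatch(name.rsplit('/', 1)[-1], pattern)
        ]
```

For example, `members_matching('dutch.zip', '*_mwt_expander.pt')` on a locally saved copy of the archive from the URL in the output above would show whether it ships an expander file at all. An empty list would mean the archive itself lacks it.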

Am I missing something?

The model was trained like this:

import trankit

# initialize a trainer for the task
trainer = trankit.TPipeline(
    training_config={
        'category': 'customized-mwt-ner', # pipeline category
        'task': 'posdep', # task name
        'save_dir': './trankit_model', # directory for saving trained model
        'train_conllu_fpath': './corpus/split-conllu/train.conllu', # annotations file in CONLLU format for training
        'dev_conllu_fpath': './corpus/split-conllu/dev.conllu' # annotations file in CONLLU format for development
    }
)

# start training
trainer.train()

For now, we have chosen to run a model of the "customized" type instead of the "customized-mwt-ner" type. For the "customized" category, all missing files seem to be downloaded correctly.

https://trankit.readthedocs.io/en/latest/training.html