problem with VocabAugmentor
janekzimoch opened this issue · 3 comments
There seems to be a problem with the VocabAugmentor class.
When I run:
new_tokens = augmentor.get_new_tokens(ft_corpus_train)
I get this error:
TypeError: Can’t convert <tokenizers.trainers.PreTrainedTokenizerFast object at 0x7f8641325570> to Sequence
Solution: I resolved this by swapping the order of the arguments on line 91 of vocab_augmentor.py. (This fix for the above error came from a quick Google search.)
From:
self.rust_tokenizer.train(self.trainer, train_files)
to:
self.rust_tokenizer.train(train_files, self.trainer)
After this change, the program runs as expected.
I didn't look into this further, since the quick fix worked, but you may want to check whether there are other bugs related to it.
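For context, newer versions of the Hugging Face `tokenizers` library expect `Tokenizer.train(files, trainer)`, with the file list first, which is why the swapped call works. A minimal standalone sketch of the corrected argument order (the corpus file and trainer settings here are illustrative, not from VocabAugmentor):

```python
# Sketch of the fixed call order in tokenizers: files first, trainer second.
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# A tiny throwaway training file standing in for ft_corpus_train.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("hello world hello tokenizer vocab augmentation\n")
    train_files = [f.name]

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(vocab_size=100, special_tokens=["[UNK]"])

# Correct order: train(files, trainer) -- passing the trainer first
# raises the "Can't convert ... to Sequence" TypeError from the issue.
tokenizer.train(train_files, trainer)
print(len(tokenizer.get_vocab()))
```

Passing the trainer as the first positional argument makes the library try to interpret it as the sequence of file paths, which produces the `TypeError` reported above.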
I had a similar issue, it fixed my problem :)
My issue was: TypeError: Can't convert <tokenizers.trainers.WordPieceTrainer object at 0x17ceef250> to Sequence
You will need to download the `class VocabAugmentor(BaseEstimator):` from the SBERT website and change
From:
self.rust_tokenizer.train(self.trainer, train_files)
to:
self.rust_tokenizer.train(train_files, self.trainer)