Cannot override tokenizer with TextBlock (tok_func does not exist as an argument anymore)
shimsan opened this issue · 1 comment
Please confirm you have the latest versions of fastai, fastcore, fastscript, and nbdev prior to reporting a bug (delete one): YES
fastai2 0.0.25
fastcore-0.1.30
sentencepiece-0.1.86
Describe the bug
The ability to override the tokenizer is missing in 0.0.25. Previously this was done with the tok_func argument, like below:
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True, tok_func=SentencePieceTokenizer),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
To Reproduce
Colab examples are provided below.
Example of how we used to override the tokenizer and train in fastai2-0.0.20 (but it fails at inference, same as #424):
https://colab.research.google.com/drive/1Typ_xZWg5Jds-WDP8v2lEwPAoB2EKbn2?usp=sharing
Failing example with fastai2-0.0.25:
https://colab.research.google.com/drive/1m7eq3sC8pJBIi79j_hoe-8QOWOfGp1-9?usp=sharing
Expected behavior
Expected to be able to override the tokenizer function and run inference.
Error with full stack trace
--------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-cf7c257943cd> in <module>()
----> 1 imdb = DataBlock(blocks=(TextBlock.from_folder(path, tok_func=SentencePieceTokenizer), CategoryBlock),
2 get_items=get_text_files,
3 get_y=parent_label,
4 splitter=GrandparentSplitter(valid_name='test'))
/usr/local/lib/python3.6/dist-packages/fastai2/text/data.py in from_folder(cls, path, vocab, is_lm, seq_len, backwards, min_freq, max_vocab, **kwargs)
210 def from_folder(cls, path, vocab=None, is_lm=False, seq_len=72, backwards=False, min_freq=3, max_vocab=60000, **kwargs):
211 "Build a `TextBlock` from a `path`"
--> 212 return cls(Tokenizer.from_folder(path, **kwargs), vocab=vocab, is_lm=is_lm, seq_len=seq_len,
213 backwards=backwards, min_freq=min_freq, max_vocab=max_vocab)
214
/usr/local/lib/python3.6/dist-packages/fastai2/text/core.py in from_folder(cls, path, tok, rules, **kwargs)
274 path = Path(path)
275 if tok is None: tok = WordTokenizer()
--> 276 output_dir = tokenize_folder(path, tok=tok, rules=rules, **kwargs)
277 res = cls(tok, counter=(output_dir/fn_counter_pkl).load(),
278 lengths=(output_dir/fn_lengths_pkl).load(), rules=rules, mode='folder')
/usr/local/lib/python3.6/dist-packages/fastai2/text/core.py in tokenize_folder(path, extensions, folders, output_dir, skip_if_exists, **kwargs)
182 files = get_files(path, extensions=extensions, recurse=True, folders=folders)
183 def _f(i,output_dir): return output_dir/files[i].relative_to(path)
--> 184 return _tokenize_files(_f, files, path, skip_if_exists=skip_if_exists, **kwargs)
185
186 # Cell
TypeError: _tokenize_files() got an unexpected keyword argument 'tok_func'
When using tok:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-6-31c08ee3d74f> in <module>()
----> 1 imdb = DataBlock(blocks=(TextBlock.from_folder(path, tok=SentencePieceTokenizer), CategoryBlock),
2 get_items=get_text_files,
3 get_y=parent_label,
4 splitter=GrandparentSplitter(valid_name='test'))
/usr/local/lib/python3.6/dist-packages/fastai2/text/core.py in setup(self, items, rules)
355 from sentencepiece import SentencePieceProcessor
356 if rules is None: rules = []
--> 357 if self.tok is not None: return {'sp_model': self.sp_model}
358 raw_text_path = self.cache_dir/'texts.out'
359 with open(raw_text_path, 'w') as f:
AttributeError: 'L' object has no attribute 'tok'
Additional context
#424
Fixed in master. Note that it should be tok=SentencePieceTokenizer() (i.e. with parens) now, since you pass a tok, not a tok_func: you pass a tokenizer instance, not a tokenizer class.
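For reference, a minimal sketch of the updated calls against master. The IMDB path and the get_imdb helper are assumptions taken from the notebooks linked above, not part of the fix itself; this is untested here:

from fastai2.text.all import *

path = untar_data(URLs.IMDB)
# get_imdb as defined in the linked notebooks (assumption): grab texts from these folders
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

# Pass a tokenizer instance via tok=, not a class via tok_func=
tok = SentencePieceTokenizer()

# Language-model DataLoaders
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True, tok=tok),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

# Classification DataBlock
imdb = DataBlock(
    blocks=(TextBlock.from_folder(path, tok=tok), CategoryBlock),
    get_items=get_text_files,
    get_y=parent_label,
    splitter=GrandparentSplitter(valid_name='test'))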