fastai/fastai2

Cannot override tokenizer with TextBlock (tok_func does not exist as an argument anymore)

shimsan opened this issue · 1 comment

Please confirm you have the latest versions of fastai, fastcore, fastscript, and nbdev prior to reporting a bug: YES

fastai2 0.0.25
fastcore 0.1.30
sentencepiece 0.1.86

Describe the bug

The functionality to override the tokenizer is missing in 0.0.25.
Previously this was done via the tok_func argument, like below:

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True, tok_func=SentencePieceTokenizer),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

To Reproduce

Colab examples are provided below.

Example of how we used to override the tokenizer and train in fastai2 0.0.20 (but it fails at inference, the same as #424):

https://colab.research.google.com/drive/1Typ_xZWg5Jds-WDP8v2lEwPAoB2EKbn2?usp=sharing

Failing example with fastai2 0.0.25:
https://colab.research.google.com/drive/1m7eq3sC8pJBIi79j_hoe-8QOWOfGp1-9?usp=sharing

Expected behavior

Expected to be able to override the tokenizer function and run inference.

Error with full stack trace

--------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-cf7c257943cd> in <module>()
----> 1 imdb = DataBlock(blocks=(TextBlock.from_folder(path, tok_func=SentencePieceTokenizer), CategoryBlock),
      2                  get_items=get_text_files,
      3                  get_y=parent_label,
      4                  splitter=GrandparentSplitter(valid_name='test'))

/usr/local/lib/python3.6/dist-packages/fastai2/text/data.py in from_folder(cls, path, vocab, is_lm, seq_len, backwards, min_freq, max_vocab, **kwargs)
    210     def from_folder(cls, path, vocab=None, is_lm=False, seq_len=72, backwards=False, min_freq=3, max_vocab=60000, **kwargs):
    211         "Build a `TextBlock` from a `path`"
--> 212         return cls(Tokenizer.from_folder(path, **kwargs), vocab=vocab, is_lm=is_lm, seq_len=seq_len,
    213                    backwards=backwards, min_freq=min_freq, max_vocab=max_vocab)
    214 

/usr/local/lib/python3.6/dist-packages/fastai2/text/core.py in from_folder(cls, path, tok, rules, **kwargs)
    274         path = Path(path)
    275         if tok is None: tok = WordTokenizer()
--> 276         output_dir = tokenize_folder(path, tok=tok, rules=rules, **kwargs)
    277         res = cls(tok, counter=(output_dir/fn_counter_pkl).load(),
    278                   lengths=(output_dir/fn_lengths_pkl).load(), rules=rules, mode='folder')

/usr/local/lib/python3.6/dist-packages/fastai2/text/core.py in tokenize_folder(path, extensions, folders, output_dir, skip_if_exists, **kwargs)
    182     files = get_files(path, extensions=extensions, recurse=True, folders=folders)
    183     def _f(i,output_dir): return output_dir/files[i].relative_to(path)
--> 184     return _tokenize_files(_f, files, path, skip_if_exists=skip_if_exists, **kwargs)
    185 
    186 # Cell

TypeError: _tokenize_files() got an unexpected keyword argument 'tok_func'

When using tok (passing the SentencePieceTokenizer class rather than an instance):

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-6-31c08ee3d74f> in <module>()
----> 1 imdb = DataBlock(blocks=(TextBlock.from_folder(path, tok=SentencePieceTokenizer), CategoryBlock),
      2                  get_items=get_text_files,
      3                  get_y=parent_label,
      4                  splitter=GrandparentSplitter(valid_name='test'))

/usr/local/lib/python3.6/dist-packages/fastai2/text/core.py in setup(self, items, rules)
    355         from sentencepiece import SentencePieceProcessor
    356         if rules is None: rules = []
--> 357         if self.tok is not None: return {'sp_model': self.sp_model}
    358         raw_text_path = self.cache_dir/'texts.out'
    359         with open(raw_text_path, 'w') as f:

AttributeError: 'L' object has no attribute 'tok'
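
For what it's worth, the traceback is consistent with the class itself being used where an instance is expected: calling setup on the class rather than on an instance makes the item list bind as self, hence 'L' object has no attribute 'tok'. A minimal, fastai-independent sketch of that Python behavior (the Tok class here is hypothetical, purely for illustration):

class Tok:
    def __init__(self): self.tok = "ready"
    def setup(self, items): return (self.tok, items)  # expects self to be a Tok instance

Tok().setup(["a"])    # OK: self is a Tok instance
Tok.setup(["a"], [])  # AttributeError: 'list' object has no attribute 'tok'
                      # (self is the list itself, mirroring the 'L' error above)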

Additional context
#424

jph00 commented

Fixed in master. Note that it should be tok=SentencePieceTokenizer() (i.e. with parens) now, since you pass a tok, not a tok_func.
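
For anyone landing here, a minimal sketch of the fixed usage, adapted from the language-model example above (assumes fastai2 master where the fix landed; path and get_imdb are as in the original report):

from fastai2.text.all import *

# Pass a tokenizer *instance* via `tok`; the old `tok_func` argument is gone
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True, tok=SentencePieceTokenizer()),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)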