shiba

PyTorch implementation and pre-trained Japanese model for CANINE, the efficient character-level transformer.


Forked Repo!

This repo is a fork of https://github.com/octanove/shiba, edited to allow fine-tuning for NER and word segmentation on MasakhaNER, with ClearML experiment tracking.

Files edited:

  • training/train.py
  • training/helpers.py
  • README.md (this file)
  • requirements.txt: added the clearml dependency (see the sketch below)
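
For reference, ClearML tracking usually amounts to initializing a Task at the top of the training script; a minimal sketch of that pattern (the project and task names here are hypothetical, not taken from this fork's train.py):

from clearml import Task

# Hypothetical project/task names; the fork's train.py may use different ones.
task = Task.init(project_name='shiba-fork', task_name='pretraining')

# Hyperparameters connected this way appear in the ClearML web UI.
task.connect({'learning_rate': 1e-4, 'batch_size': 32})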

Files added:

  • training/finetune_ner.py
  • training/finetune_word_segmentation_on_masakhaner.py
  • masakhaner_fork_loading_script.py
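
The loading script plugs into Hugging Face datasets as a local script; a minimal sketch (the 'swa' Swahili config name is an assumption based on MasakhaNER's language codes):

from datasets import load_dataset

# Load MasakhaNER through the forked local loading script; 'swa' (Swahili)
# is assumed here to match the Swahili pretraining data used below.
dataset = load_dataset('./masakhaner_fork_loading_script.py', 'swa')
print(dataset['train'][0])  # tokens and NER tags for one sentence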

Running word segmentation on MasakhaNER using a pretrained pytorch_model.bin trained on hf_swahili_no_spaces:

python finetune_word_segmentation_on_masakhaner.py --output_dir ./runs/wordseg \
  --resume_from_checkpoint ./hf_swahili_no_spaces_5k_steps/pytorch_model.bin \
  --num_train_epochs 3 \
  --save_strategy no
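
To sanity-check the pretrained checkpoint before (or after) fine-tuning, the weights can be loaded through the shiba package's API from the original repo; a minimal sketch, assuming the checkpoint holds plain encoder weights (strict=False tolerates any extra task-head keys):

import torch
from shiba import Shiba, CodepointTokenizer

model = Shiba()
state_dict = torch.load('./hf_swahili_no_spaces_5k_steps/pytorch_model.bin', map_location='cpu')
model.load_state_dict(state_dict, strict=False)  # ignore any task-head keys
model.eval()

# Codepoint-level tokenization: no vocabulary, raw characters in.
tokenizer = CodepointTokenizer()
inputs = tokenizer.encode_batch(['mfano wa maandishi'])
with torch.no_grad():
    outputs = model(**inputs)  # per-character contextual embeddings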

Fine-tuning NER:

python finetune_ner.py --output_dir ./runs/masakhaner \
  --resume_from_checkpoint ./hf_swahili_no_spaces_5k_steps/pytorch_model.bin \
  --num_train_epochs 2 \
  --logging_steps 50 \
  --debug underflow_overflow \
  --report_to tensorboard \
  --save_strategy epoch
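
The flags above (--num_train_epochs, --logging_steps, --debug underflow_overflow, --report_to, --save_strategy) are standard Hugging Face TrainingArguments; a minimal sketch of how a script like finetune_ner.py would typically parse them, assuming it uses HfArgumentParser as Trainer-based scripts commonly do:

from transformers import HfArgumentParser, TrainingArguments

# Turns the CLI flags shown above into a TrainingArguments dataclass
# that can be handed straight to transformers.Trainer.
parser = HfArgumentParser(TrainingArguments)
(training_args,) = parser.parse_args_into_dataclasses()
print(training_args.output_dir, training_args.save_strategy)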

Licensing notice from the original repo

The code and contents of the original repository are provided under the Apache License 2.0. The pretrained model weights are provided under the CC BY-SA 4.0 license.

How to cite this work

There is no paper associated with SHIBA, but the repository can be cited like this:

@misc{shiba,
  author = {Joshua Tanner and Masato Hagiwara},
  title = {SHIBA: Japanese CANINE model},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/octanove/shiba}},
}

Please also cite the original CANINE paper:

@misc{clark2021canine,
  title = {CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation},
  author = {Jonathan H. Clark and Dan Garrette and Iulia Turc and John Wieting},
  year = {2021},
  eprint = {2103.06874},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}