Word Segmentation for Sanskrit (unsandhied output from sandhied sentences)

Sanskrit Word Segmentation with CopyNet + nmt (TensorFlow)

Requirements:

  • Python 2.7.
  • subword-nmt
pip install subword-nmt
  • TensorFlow > 1.5.0 (consider using the latest Anaconda release)
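
To confirm the environment matches these requirements, a quick check along these lines can help (a minimal sketch; the exact versions installed may differ):

import sys
import tensorflow as tf

# The codebase targets Python 2.7 and TensorFlow above 1.5.0.
print(sys.version)       # expect a 2.7.x interpreter
print(tf.__version__)    # expect a version above 1.5.0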

Instructions to generate Byte Pair Encoded (BPE) data

  • Change directory to nmt/BPE_2ch
  • Extract "data_split.tar.gz" and "combined.tar.gz"
  • Run the "inps.sh" file (a Python sketch of the underlying BPE steps follows the command below)
./inps.sh
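
The script wraps standard subword-nmt preprocessing. For reference, here is a minimal Python sketch of the same steps via the subword_nmt API; the file names (combined.txt, bpe.codes, train.src, train_BPE.src) and the merge count are illustrative assumptions, not necessarily the script's actual values:

import codecs
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn BPE merge operations from the training text.
with codecs.open('combined.txt', encoding='utf-8') as infile, \
        codecs.open('bpe.codes', 'w', encoding='utf-8') as outfile:
    learn_bpe(infile, outfile, num_symbols=10000)

# Apply the learned merges to each data split (train/valid/test, src and trg).
with codecs.open('bpe.codes', encoding='utf-8') as codes_file:
    bpe = BPE(codes_file)
with codecs.open('train.src', encoding='utf-8') as fin, \
        codecs.open('train_BPE.src', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(bpe.process_line(line))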

Testing while Training

python2.7 -m nmt.nmt \
    --copynet=True --share_vocab=True --attention=scaled_luong \
    --src=src --tgt=trg \
    --vocab_prefix=nmt/BPE_2ch/vocab \
    --train_prefix=nmt/BPE_2ch/train_BPE \
    --test_prefix=nmt/BPE_2ch/test_BPE \
    --dev_prefix=nmt/BPE_2ch/valid_BPE \
    --out_dir=nmt/models_2ch/ \
    --num_train_steps=80000 --steps_per_stats=100 \
    --encoder_type=bi --num_layers=4 --num_units=128 --dropout=0.4 \
    --metrics=bleu --num_keep_ckpts=80 --decay_scheme=luong10 \
    --batch_size=64 --src_max_len=150 --tgt_max_len=150

Only Training

python2.7 -m nmt.nmt \
    --copynet=True --share_vocab=True --attention=scaled_luong \
    --src=src --tgt=trg \
    --vocab_prefix=nmt/BPE_2ch/vocab \
    --train_prefix=nmt/BPE_2ch/train_BPE \
    --dev_prefix=nmt/BPE_2ch/valid_BPE \
    --out_dir=nmt/models_2ch/ \
    --num_train_steps=80000 --steps_per_stats=100 \
    --encoder_type=bi --num_layers=4 --num_units=128 --dropout=0.4 \
    --metrics=bleu --num_keep_ckpts=80 --decay_scheme=luong10 \
    --batch_size=64 --src_max_len=150 --tgt_max_len=150

Only Testing

python2.7 -m nmt.nmt \
    --out_dir=nmt/models_2ch \
    --inference_input_file=nmt/infer_file.L1 \
    --inference_output_file=nmt/models_2ch/output_infer
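
Note that the inference output is still BPE-segmented: subword units carry the default "@@" continuation marker used by subword-nmt. Undoing the segmentation yields the final unsandhied words; a minimal post-processing sketch (the output file name output_words is an assumption):

import codecs
import re

# Remove the "@@ " BPE separators to merge subword units back into words.
with codecs.open('nmt/models_2ch/output_infer', encoding='utf-8') as fin, \
        codecs.open('nmt/models_2ch/output_words', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(re.sub(r'@@( |$)', '', line))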

Vocabulary Setting

Since in CopyNet scenarios the target sequence contains words from the source sentence, the best choice is to use a shared vocabulary for the source and the target. We also use a generated-vocabulary-size parameter, namely the number of target vocabulary entries excluding words copied from the source sequence, to indicate that the first N (= generated vocabulary size) words in the shared vocabulary are produced in generate mode, while target words beyond the first N are copied.

In this codebase, vocab_size and gen_vocab_size are the variables representing the shared vocabulary size and the generated vocabulary size, respectively.
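
A small illustration of this convention (the sizes below are hypothetical example values, not the ones used in training):

vocab_size = 30000      # shared vocabulary size (hypothetical example)
gen_vocab_size = 20000  # generated vocabulary size (hypothetical example)

def decode_mode(target_index):
    # The first gen_vocab_size entries are produced by the generation
    # softmax; any index beyond them corresponds to a word copied from
    # the source sequence.
    if target_index < gen_vocab_size:
        return 'generate'
    elif target_index < vocab_size:
        return 'copy'
    else:
        raise ValueError('index outside the shared vocabulary')

print(decode_mode(15))     # -> generate
print(decode_mode(25000))  # -> copy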