/splitter-transliteration

Python script to split the text generated by 'wikipedia parallel title extractor' into separate text files (separate file for each language)

Primary language: Python. License: MIT.

splitter for generating transliteration corpus

Description

  • This is a Python script that takes as input the text file generated by the 'Wikipedia Parallel Title Extractor' (https://github.com/clab/wikipedia-parallel-titles).
  • The script processes that input file to generate a parallel corpus.
  • The output of the script (the parallel corpus) can be used to train a transliteration model on MOSES.
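
The splitting step described above can be sketched roughly as follows. This is an illustrative sketch, not the actual contents of splitter.py: it assumes each input line holds a title pair separated by ' ||| ', and the names split_parallel_titles and write_corpus, as well as the file paths passed to them, are hypothetical.

```python
DELIMITER = " ||| "  # assumed separator between the two titles on each line


def split_parallel_titles(lines, delimiter=DELIMITER):
    """Split 'left ||| right' lines into two aligned lists of titles.

    Lines that do not contain exactly one delimiter, or that have an
    empty side, are skipped so the two lists stay line-aligned.
    """
    left_titles, right_titles = [], []
    for line in lines:
        parts = line.rstrip("\n").split(delimiter)
        if len(parts) != 2:
            continue  # malformed line: skip it
        left, right = parts[0].strip(), parts[1].strip()
        if not left or not right:
            continue  # one side is empty: skip it
        left_titles.append(left)
        right_titles.append(right)
    return left_titles, right_titles


def write_corpus(input_path, left_out, right_out):
    """Read the title-pair file and write one output file per language."""
    with open(input_path, encoding="utf-8") as f:
        left_titles, right_titles = split_parallel_titles(f)
    with open(left_out, "w", encoding="utf-8") as f:
        f.writelines(title + "\n" for title in left_titles)
    with open(right_out, "w", encoding="utf-8") as f:
        f.writelines(title + "\n" for title in right_titles)
    return len(left_titles)  # number of title pairs written
```

Skipping malformed lines keeps line N of one output file aligned with line N of the other, which is what a parallel corpus requires.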

Author

Acknowledgement

Special thanks to Dr. Rao Muhammad Adeel Nawab and Sir Muhammad Sharjeel for their continuous support.

Usage

  • Download the script file (splitter.py)
  • Copy the input file (generated by the Wikipedia parallel titles script) into the same directory
  • Run the terminal/cmd command 'python splitter.py '
  • Two output files will be generated, one for each language.

Caution