A tokenizer that converts a Kannada string into a list of Kannada akṣaras. Along with splitting, it also segregates each akṣara into its constituent parts.
A regular expression is used to match Kannada akṣaras. Regions of interest are extracted using capturing groups. The blog explains the synthesis of the regular expression and the meaning of its parts in detail.
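To make the capturing-group idea concrete, here is a minimal sketch of such a pattern. It is a simplified illustration, not the regular expression this project actually uses: the names `AKSHARA_SKETCH` and `sketchTokenize` are hypothetical, the group names are borrowed from the token fields described further down, and the order of the groups (core consonant first, then conjuncts) is an assumption.

```javascript
// Simplified, hypothetical pattern for illustration only; the library's
// actual regular expression may differ.
const SVARA = "[\\u0C85-\\u0C94]";    // independent vowels
const VYAMJANA = "[\\u0C95-\\u0CB9]"; // consonants
const GUNITA = "[\\u0CBE-\\u0CCC]";   // dependent vowel signs
const VIRAMA = "\\u0CCD";             // virama sign
const YOGAWAHA = "[\\u0C82\\u0C83]";  // anusvara and visarga

const AKSHARA_SKETCH = new RegExp(
  "(?:" +
    `(?<svara>${SVARA})` +
    "|" +
    `(?<vyamjana>${VYAMJANA})` +
    `(?<samyukta>(?:${VIRAMA}${VYAMJANA})*)` +
    `(?<gunita>${GUNITA})?` +
    `(?<virama>${VIRAMA})?` +
  ")" +
  `(?<yogawaha>${YOGAWAHA})?`,
  "gu"
);

// Returns one object per match: the matched text plus every named group.
// Groups that did not take part in a match are undefined.
function sketchTokenize(text) {
  return Array.from(text.matchAll(AKSHARA_SKETCH), (m) => ({
    text: m[0],
    ...m.groups,
  }));
}

// "ಕನ್ನಡ" splits into three matches: ಕ, ನ್ನ, ಡ
console.log(sketchTokenize("ಕನ್ನಡ").map((t) => t.text));
```

Characters that the pattern does not cover (spaces, digits, punctuation) are simply skipped by `matchAll` in this sketch.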
The decoder is written as an ES6 module, which you need to import in your code. This can be done as follows.
<script type="module">
import {AksharaTokenizerKannada} from 'https://cdn.jsdelivr.net/gh/vinayakakv/akshara_tokenizer@1.0.0/akshara_tokenizer.js'
.
.
.
</script>
First of all, you have to create an AksharaTokenizerKannada instance.
const tokenizer = new AksharaTokenizerKannada();
You can then call its tokenize(string) method:
- Input is a string in Kannada script
- Output is a list of tokens, where each token is a dictionary of the form
{
  "svara": // svara part of the akṣara
  "samyukta": // saṃyukta part of the vyaṃjana
  "vyamjana": // core vyaṃjana
  "gunita": // guṇita of the vyaṃjana
  "virama": // virāma of the whole vyaṃjana
  "yogawaha": // yogavāha of the whole akṣara
}
Note that not all fields of the token dictionary need to have values at the same time. In particular,
- the presence of svara means the absence of all other fields, with yogawaha being an exception in some cases
- the presence of vyamjana means the absence of svara
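The two rules above can be checked mechanically on any token. The sketch below is a hypothetical helper, not part of the library; `classifyToken` is an illustrative name, and the code assumes that an absent field is falsy (undefined, null, or an empty string), which this README does not specify.

```javascript
// Hypothetical helper, not part of the library.
// Assumption: a field that is absent from a token is falsy.
function classifyToken(token) {
  const has = (key) => Boolean(token[key]);

  // svara and vyamjana are mutually exclusive according to the rules above.
  if (has("svara") && has("vyamjana")) {
    throw new Error("token has both svara and vyamjana");
  }
  if (has("svara")) return "svara";
  if (has("vyamjana")) return "vyamjana";
  return "other";
}

// Hand-built tokens in the shape described above.
console.log(classifyToken({ svara: "ಅ" }));                   // "svara"
console.log(classifyToken({ vyamjana: "ಕ", gunita: "ಾ" }));   // "vyamjana"
```

A check like this can be run over the output of tokenize(string) to confirm that each token follows the svara/vyamjana rule before further processing.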
This project is being used in
If you are using it in your project, you can submit a pull request to include it in the list!
Contributions are welcome in the form of bug fixes to the core decoder and extensions to other scripts without using any transcription service (e.g., by passing a script identifier to the constructor).