/bangla-hunspell

A project to build new bn-in spellchecker .dic and .aff files using the Bangla Akademi word list and the suddho words from Wikisource.

Primary language: Python

Suddo Bangla-hunspell

To Do

  • Generate word frequency lists from the corpus of old books on Wikisource (in progress).
  • Understand the .dic/.aff format: chromium developers | Ubuntu manpages | Source documentation.
  • Find a way to test word coverage, preferably in Firefox or LibreOffice Writer.
  • Use Wikisource to classify words into parts of speech (helps with suffixes).
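For the coverage item above, one offline way to get a first number (not the in-browser test the list asks for) is to compare a word frequency table against a plain word list. The sketch below is only an illustration: the function name `coverage` and its inputs are hypothetical, and it ignores affix rules entirely, so it will understate what a real .dic/.aff pair accepts.

```python
from collections import Counter

def coverage(freq, known_words):
    """Return (type_coverage, token_coverage) as fractions between 0 and 1.

    freq        -- Counter mapping each corpus word to its count
    known_words -- set of words the dictionary accepts (plain list, no affixes)
    """
    total_types = len(freq)
    total_tokens = sum(freq.values())
    if total_tokens == 0:
        return 0.0, 0.0
    # Distinct corpus words that the word list contains
    hit_types = sum(1 for word in freq if word in known_words)
    # Running-text occurrences that the word list contains
    hit_tokens = sum(count for word, count in freq.items() if word in known_words)
    return hit_types / total_types, hit_tokens / total_tokens
```

Token coverage weights frequent words more heavily, so it is usually the more useful figure when judging how often a reader would see a false "misspelled" underline.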

Progress

  1. Generate word frequency lists from the books proofread by bn.wikisource.
    1. Download the epub files by hand from Wikisource to here (machine downloads are not permitted).
    2. Convert them to txt using epub_to_txt.sh.
    3. Generate the most frequent words using word_frequency.py.
  2. Test word coverage using analyze like this.
  3. A post was made at Wikisource requesting help to transcribe dictionaries.
  4. To view Bangla with joint glyphs (jukthakhor) in a terminal, use Konsole. Use a suitable font (I use MesloLGS NF) and enable Brahmic script characters as follows: Menu > Settings > Configure Konsole > Profiles > New Profile > Edit > Appearance > Complex Text Layout, then check Brahmic Script Characters.
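The frequency step in item 1 can be sketched roughly as follows. This is not the repository's word_frequency.py, only a minimal illustration under stated assumptions: the name `word_frequencies` is hypothetical, the input is plain text already converted from epub, and a "word" is taken to be any run of Bengali letters and signs, with Bengali digits and the danda excluded.

```python
import re
import unicodedata
from collections import Counter

# Assumption: a word is a run of code points in U+0980..U+09E5 (Bengali letters
# and vowel signs) plus U+09F0/U+09F1; ZWNJ/ZWJ are allowed inside a word
# because some conjunct spellings use them. Bengali digits (U+09E6..U+09EF)
# and the danda (U+0964) fall outside this class.
BENGALI_WORD = re.compile(r"[\u0980-\u09E5\u09F0\u09F1\u200C\u200D]+")

def word_frequencies(text):
    """Count Bengali words in text after Unicode NFC normalization."""
    text = unicodedata.normalize("NFC", text)
    counts = Counter()
    for token in BENGALI_WORD.findall(text):
        token = token.strip("\u200c\u200d")  # drop stray joiners at the edges
        if token:
            counts[token] += 1
    return counts
```

Calling `word_frequencies(text).most_common(1000)` on the concatenated book text would then give a ranked list to compare against existing dictionaries.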

Resources

Online Resources
Description

Most of the .dic and .aff files have been extracted and placed in the resources folder. To open any such plugin for Firefox, Thunderbird or LibreOffice, use any archive manager. The Bangla Akademi word list published by SNLTR is in .doc format; it has been converted to .csv for easier use. Beyond that, their dictionaries rely mainly on the .dic file, so they do not take advantage of the .aff file for compression and hence have very low coverage. However, I am not well versed enough in Java to understand what they are doing with that plugin. In any case, the most important resource of all is the .dic and .aff files from Bangla Type Foundry. They have done a tremendous job of embedding the grammar rules of the Bangla language into the dic-aff format. The idea is to create a bn-in dictionary following those methods, taking into account the old words (suddo).
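To make the compression point concrete, here is a small sketch of how suffix rules in an .aff file let one .dic entry stand for several surface forms. It is a deliberately reduced reading of the Hunspell format and rests on several assumptions: only SFX rules, single-character flags, no cross products or twofold affixes, and conditions treated as ordinary regular expressions anchored at the end of the stem. The function names and the two Bangla rules are illustrative and are not taken from the Bangla Type Foundry files.

```python
import re
from collections import defaultdict

def parse_suffix_rules(aff_text):
    """Map each flag to a list of (strip, add, condition) suffix rules."""
    rules = defaultdict(list)
    seen_header = set()
    for line in aff_text.splitlines():
        parts = line.split()
        if len(parts) < 4 or parts[0] != "SFX":
            continue
        flag = parts[1]
        if flag not in seen_header:
            # First SFX line per flag is the header: SFX flag cross_product count
            seen_header.add(flag)
            continue
        strip = "" if parts[2] == "0" else parts[2]
        add = parts[3].split("/")[0]  # ignore continuation flags after "/"
        add = "" if add == "0" else add
        cond = parts[4] if len(parts) > 4 else "."
        rules[flag].append((strip, add, re.compile(cond + "$")))
    return rules

def expand(dic_text, rules):
    """Return the set of surface forms generated by a .dic text."""
    words = set()
    for line in dic_text.splitlines()[1:]:  # first line is the entry count
        fields = line.split()
        if not fields:
            continue
        stem, _, flags = fields[0].partition("/")
        words.add(stem)
        for flag in flags:
            for strip, add, cond in rules.get(flag, []):
                if cond.search(stem) and stem.endswith(strip):
                    base = stem[: len(stem) - len(strip)] if strip else stem
                    words.add(base + add)
    return words

# Illustrative rules only: attach the article and a plural marker to any noun.
AFF = "SFX A Y 2\nSFX A 0 টি .\nSFX A 0 গুলো .\n"
DIC = "2\nবই/A\nকলম/A\n"
forms = expand(DIC, parse_suffix_rules(AFF))
print(len(forms), "surface forms from 2 dictionary entries")
```

A dictionary that lists only bare words has to enumerate every inflected form by hand, which is why a .dic-only list tends to cover far less running text than one paired with a well-designed .aff file.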