Persian soundex algorithm for phonetic spell correction.
Consider two Persian words, شنبه and شمبه. They are the same word spelled differently, and they are pronounced almost the same. The key is in the phonetics.
In the soundex algorithm, letters that sound alike are assigned the same code. For example, the letters م and ن are both assigned code 5. Words that sound alike therefore end up with matching codes, which makes them recognizable as variants of each other and can be used for phonetic spell correction.
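The idea can be illustrated with a minimal sketch. It is not the code in main.go: the names `phoneticCodes` and `encode` are hypothetical, the table contains only the م/ن entry mentioned above, and letters without an entry are simply kept as they are, which is an assumption made for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// phoneticCodes is a deliberately incomplete, illustrative table.
// Only the م/ن -> 5 entry comes from this README; a real table
// would cover the other letter groups as well.
var phoneticCodes = map[rune]rune{
	'م': '5',
	'ن': '5',
}

// encode replaces every letter that has an entry in phoneticCodes
// with its code and keeps all other runes unchanged (an assumption).
func encode(word string) string {
	var b strings.Builder
	for _, r := range word {
		if c, ok := phoneticCodes[r]; ok {
			b.WriteRune(c)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	// Both spellings map to the same code, so they compare equal.
	fmt.Println(encode("شنبه") == encode("شمبه")) // prints "true"
}
```

Because both spellings share the same code, a lookup keyed on the code treats them as one candidate word.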
In the sample_data directory there is a JSON file containing 5000 Persian tweets. The program in main.go tokenizes the tweets and groups the words into clusters that share the same soundex code, i.e. words that are phonetically similar.
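The clustering step might look roughly like the sketch below. It makes several assumptions: the tweets are already loaded as plain strings (the layout of the JSON file is not shown here), tokens are split on whitespace, and `soundexKey` is a hypothetical stand-in that only knows the م/ن rule. The actual main.go may differ on all three points.

```go
package main

import (
	"fmt"
	"strings"
)

// soundexKey is a hypothetical stand-in for the real coding step.
// It only applies the م/ن -> 5 rule and leaves other runes untouched.
func soundexKey(word string) string {
	return strings.Map(func(r rune) rune {
		if r == 'م' || r == 'ن' {
			return '5'
		}
		return r
	}, word)
}

// clusterByCode splits each text on whitespace and groups the
// distinct tokens by their soundexKey.
func clusterByCode(texts []string) map[string][]string {
	clusters := make(map[string][]string)
	seen := make(map[string]bool)
	for _, text := range texts {
		for _, tok := range strings.Fields(text) {
			if seen[tok] {
				continue // record each distinct token only once
			}
			seen[tok] = true
			key := soundexKey(tok)
			clusters[key] = append(clusters[key], tok)
		}
	}
	return clusters
}

func main() {
	// Two made-up example texts, not taken from the sample data.
	texts := []string{"امروز شنبه است", "شمبه میام"}
	clusters := clusterByCode(texts)
	// The cluster for شنبه also contains the variant شمبه.
	fmt.Println(len(clusters[soundexKey("شنبه")])) // prints "2"
}
```

Clusters with more than one member are the interesting ones for spell correction, since they collect spelling variants of what is likely the same word.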
$ go run .
The output is written to soundex.txt in the output directory.
MIT