poundex

Description

Persian soundex algorithm for phonetic spell correction.

Soundex

Consider the two Persian words شنبه and شمبه. They are the same word spelled two different ways, and they are pronounced almost identically. The key is in the phonetics.

In the soundex algorithm, letters that sound alike are assigned the same code. For example, the letters م and ن are both assigned code 5. Words that sound alike therefore map to the same code and can be recognized as matches, which is what makes phonetic spell correction possible.
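As a rough illustration of the idea, here is a minimal sketch of a soundex-style encoder in Go. It is not the code from this repository: the `Encode` function and the `codes` table are hypothetical, and only the pairing of م and ن under code 5 comes from the description above. The remaining table entries and the "first letter plus three digits" shape are assumptions borrowed from classic English Soundex.

```go
package main

import "fmt"

// codes is a HYPOTHETICAL, deliberately tiny code table. Only the pairing of
// م and ن under code '5' comes from the README text; every other entry is a
// placeholder modeled on classic English Soundex groups.
var codes = map[rune]byte{
	'م': '5', 'ن': '5', // from the README text
	'ب': '1', 'پ': '1', 'ف': '1', // assumed
	'ر': '6', // assumed
}

// Encode keeps the first letter, replaces each later letter with its code,
// skips letters that have no code, and collapses adjacent letters that share
// a code. The result is padded with '0' or truncated to one letter plus
// three digits.
func Encode(word string) string {
	runes := []rune(word)
	if len(runes) == 0 {
		return ""
	}
	out := []rune{runes[0]}
	prev := codes[runes[0]] // zero value if the first letter has no code
	for _, r := range runes[1:] {
		c, ok := codes[r]
		if !ok {
			prev = 0 // an uncoded letter breaks a run of equal codes
			continue
		}
		if c != prev {
			out = append(out, rune(c))
		}
		prev = c
	}
	for len(out) < 4 {
		out = append(out, '0')
	}
	return string(out[:4])
}

func main() {
	// Both spellings should receive the same code under this table.
	fmt.Println(Encode("شنبه") == Encode("شمبه"))
}
```

Because م and ن share a code, the two spellings collapse to one key, and a spell corrector could look up that key to find candidate corrections.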

Sample

The sample_data directory contains a JSON file with 5000 Persian tweets. The program in main.go tokenizes the tweets and groups the words into clusters that are phonetically similar (by soundex).
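The clustering step can be sketched roughly as follows. This is an assumption about the general approach, not a copy of main.go: the `Cluster` function, the whitespace tokenizer, and the `phoneticKey` stand-in (which only folds ن into م instead of computing a full soundex code) are all hypothetical, and the input is hard-coded rather than read from the JSON file.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// phoneticKey is a HYPOTHETICAL stand-in for a real soundex encoder.
// It only folds ن into م so that the example stays short.
func phoneticKey(word string) string {
	return strings.ReplaceAll(word, "ن", "م")
}

// Cluster splits each text on whitespace and groups the distinct tokens
// by the key that the given function assigns to them.
func Cluster(texts []string, key func(string) string) map[string][]string {
	seen := make(map[string]bool)
	clusters := make(map[string][]string)
	for _, text := range texts {
		for _, tok := range strings.Fields(text) {
			if seen[tok] {
				continue // each distinct token is listed once
			}
			seen[tok] = true
			k := key(tok)
			clusters[k] = append(clusters[k], tok)
		}
	}
	return clusters
}

func main() {
	// Hypothetical input; the real sample reads tweets from a JSON file.
	tweets := []string{"شنبه", "شمبه شنبه"}
	clusters := Cluster(tweets, phoneticKey)

	// Sort the keys so the output order is stable.
	keys := make([]string, 0, len(clusters))
	for k := range clusters {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, clusters[k])
	}
}
```

Passing the key function as a parameter keeps the grouping logic independent of the encoding, so a full soundex encoder could be swapped in without touching `Cluster`.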

Run Sample Program

$ go run .

The output is written to soundex.txt in the output directory.

License

MIT