Arabic Word-Embedding (Word2vec) model training from Wikipedia articles

Primary language: Python. License: MIT.

Arabic Word2Vec from Wikipedia

Steps to start training:

1- Go to the Arabic Wikipedia articles data dump at this URL:

https://dumps.wikimedia.org/arwiki/latest/

2- Download only the articles dump. The file name looks like this:

arwiki-latest-pages-articles-multistream.xml.bz2

The file is approximately 1 GB.

3- Use WikiExtractor to extract the articles to JSON files:

https://github.com/attardi/wikiextractor

4- Run arabic_word2vec.py to train your model.

Enjoy Arabic Word-Embedding (Word2vec) ;-)

5- Use my repository https://github.com/rozester/Arabic-Word-Embeddings-Word2vec to visualize it in action.
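
Steps 3 and 4 above can be sketched roughly as follows. This is a minimal illustration, not the contents of arabic_word2vec.py: it assumes WikiExtractor was run with its `--json` option, so that each output file named `wiki_NN` holds one JSON object per line with a `text` field, and it assumes gensim 4.x is installed (older gensim releases call the `vector_size` parameter `size`). The names `WikiSentences`, `train_word2vec`, and the default model path are hypothetical, and the hyperparameters are placeholders rather than the author's settings.

```python
import json
from pathlib import Path


class WikiSentences:
    """Restartable iterable that yields one token list per extracted article.

    Word2Vec makes several passes over the corpus, so a plain generator
    is not enough; this class re-reads the files on every iteration.
    """

    def __init__(self, extracted_dir):
        self.extracted_dir = Path(extracted_dir)

    def __iter__(self):
        # WikiExtractor writes files such as AA/wiki_00, AA/wiki_01, ...
        for path in sorted(self.extracted_dir.rglob("wiki_*")):
            if not path.is_file():
                continue
            with path.open(encoding="utf-8") as handle:
                for line in handle:
                    line = line.strip()
                    if not line:
                        continue
                    article = json.loads(line)
                    # Whitespace tokenization only; a real pipeline would
                    # apply Arabic-specific cleansing here.
                    tokens = article.get("text", "").split()
                    if tokens:
                        yield tokens


def train_word2vec(extracted_dir, model_path="arabic_word2vec.model"):
    """Train and save a Word2Vec model (hypothetical helper)."""
    # Imported here so the reader above works without gensim installed.
    from gensim.models import Word2Vec

    model = Word2Vec(
        sentences=WikiSentences(extracted_dir),
        vector_size=100,  # embedding dimension (assumed value)
        window=5,         # context window (assumed value)
        min_count=5,      # ignore rare words (assumed value)
        workers=4,
    )
    model.save(model_path)
    return model
```

Treating each article as a single "sentence" keeps the sketch short; splitting articles into real sentences would bound the context window more naturally. After training, `model.wv.most_similar(word)` returns the nearest neighbours of a word in the embedding space.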

Thanks to Abed Khooli for his ArTokenizer function, which was very helpful for Arabic text cleansing:

https://github.com/abedkhooli
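
To give a sense of what Arabic text cleansing can involve, here is a small sketch. It is not Abed Khooli's ArTokenizer; the function name `clean_arabic` and the specific normalization rules (dropping diacritics and tatweel, folding alef variants, discarding non-Arabic characters) are assumptions chosen for illustration.

```python
import re

# Tashkeel marks (U+064B-U+0652), dagger alef (U+0670), and tatweel (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")
# Alef with hamza above, hamza below, or madda.
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")
# Anything that is neither an Arabic letter nor whitespace.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")


def clean_arabic(text):
    """Return a list of normalized Arabic tokens (hypothetical helper)."""
    text = DIACRITICS.sub("", text)            # strip vowel marks and elongation
    text = ALEF_VARIANTS.sub("\u0627", text)   # fold alef variants to bare alef
    text = NON_ARABIC.sub(" ", text)           # drop digits, Latin, punctuation
    return text.split()
```

Folding alef variants merges spellings that writers often use interchangeably, which reduces vocabulary size at the cost of losing a distinction that is occasionally meaningful.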

Watch it in action:

Arabic Word Embeddings Word2vec