PhoW2V: Pre-trained Word2Vec syllable and word embeddings for Vietnamese

PhoW2V provides pre-trained Word2Vec syllable- and word-level embeddings for Vietnamese, trained on a 20GB corpus of Vietnamese texts and used in our EMNLP-2020 Findings paper "A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese":

@inproceedings{phow2v_vitext2sql,
    title     	= {{A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese}},
    author    	= {Anh Tuan Nguyen and Mai Hoang Dao and Dat Quoc Nguyen},
    booktitle   = {Findings of the Association for Computational Linguistics: EMNLP 2020},
    year      	= {2020},
    pages       = {4079--4085}
}  
Pre-trained embeddings      Syllable/Word     Embedding size   Download mirror
PhoW2V_syllables_100dims    Syllable-level    100              Mirror
PhoW2V_syllables_300dims    Syllable-level    300              Mirror
PhoW2V_words_100dims        Word-level        100              Mirror
PhoW2V_words_300dims        Word-level        300              Mirror
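
The released files are expected to follow the plain-text Word2Vec format (one token per line followed by its vector values, optionally preceded by a "<vocab_size> <dim>" header line). The sketch below loads such a file into a Python dictionary; the file name "PhoW2V_words_300dims.txt" is only a placeholder for whichever embedding file was downloaded, and gensim's KeyedVectors.load_word2vec_format is an alternative when the header line is present.

    import numpy as np

    def load_phow2v(path):
        """Load plain-text Word2Vec vectors: one token per line followed by its values."""
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split(" ")
                if len(parts) == 2 and parts[0].isdigit() and parts[1].isdigit():
                    continue  # skip an optional "<vocab_size> <dim>" header line
                token, values = parts[0], parts[1:]
                vectors[token] = np.asarray(values, dtype=np.float32)
        return vectors

    # "PhoW2V_words_300dims.txt" is a placeholder for the downloaded file name.
    word_vectors = load_phow2v("PhoW2V_words_300dims.txt")
    print(len(word_vectors), next(iter(word_vectors.values())).shape)

Word-level tokens come from a Vietnamese word segmenter, so multi-syllable words are typically underscore-joined (e.g. Hà_Nội); inspecting a few lines of the downloaded file confirms the exact convention.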

By downloading the PhoW2V embeddings, USER agrees:

  • To use PhoW2V for research or educational purposes only.
  • Not to distribute PhoW2V or part of PhoW2V in any original or modified form.
  • To cite our EMNLP-2020 Findings paper above when PhoW2V is employed to help produce published results.

Note

  • Users should perform Vietnamese tone normalization on downstream tasks' data, as this preprocessing step was also applied to the 20GB Vietnamese pre-training corpus. A Python script for Vietnamese tone normalization is available HERE (an illustrative sketch is shown below).
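
For illustration only: Vietnamese tone marks can sit on different vowels of the same syllable (e.g. "hòa" vs. "hoà"), and downstream data should use the same placement convention as the pre-training corpus. The minimal sketch below uses an assumed, partial variant table and an assumed normalization direction; the authoritative mapping is the one in the linked script.

    import unicodedata

    # Illustrative subset of tone-placement variants; the official script defines
    # the full table and the normalization direction actually used for PhoW2V.
    TONE_VARIANTS = {
        "òa": "oà", "óa": "oá", "ỏa": "oả", "õa": "oã", "ọa": "oạ",
        "òe": "oè", "óe": "oé", "ỏe": "oẻ", "õe": "oẽ", "ọe": "oẹ",
        "ùy": "uỳ", "úy": "uý", "ủy": "uỷ", "ũy": "uỹ", "ụy": "uỵ",
    }

    def normalize_tones(text):
        """Compose to NFC, then rewrite tone-mark placement to one convention."""
        text = unicodedata.normalize("NFC", text)
        for old, new in TONE_VARIANTS.items():
            text = text.replace(old, new)
        return text

    print(normalize_tones("thủy thủ hòa nhã"))  # -> "thuỷ thủ hoà nhã"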