This software performs part-of-speech tagging, word segmentation, and phoneme analysis for Vietnamese (Homepage).
The toolkit solves these tasks by applying conditional random fields (CRFs) as implemented in CRFSuite.
It has been tested on Ubuntu 14.04 LTS.
In order to compile the program, you need to install the following software:
- boost:
sudo apt-get install libboost-all-dev
- cmake:
sudo apt-get install cmake
The script install_depend.sh will automatically install CRFSuite and liblbfgs-1.10.
- Install boost C++
- ./install_depend.sh
- cd build && cmake ../src && make
The model and the training scripts can be found in vita-model.
PoS tagging, word segmentation, dictionary generation
echo "Hai nghi phạm Nguyễn Hải Dương và Vũ Văn Tiến" | vita -m model_dir
Output:
Hai M,B-NP,0,h a iz
nghi_phạm M,I-NP,0_5,ng i_ph a mz
Nguyễn_Hải_Dương Nu,I-NP,2_3_0,ng w ie nz_h a iz_d wa ngz
và Cc,B-VP,1,v a
Vũ_Văn_Tiến V,I-VP,2_0_4,v u_v aw nz_t ie nz
Output format:
word PoS,chunking info,tone(s),phoneme(s)
Run vita -h for more options.
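The output format above can be read back into structured records for downstream processing. The following Python sketch is not part of vita; the names `Token` and `parse_line` are hypothetical. It assumes that the word is separated from the tag fields by whitespace, that syllables are joined by `_` in the word, tone, and phoneme fields, and that phonemes within a syllable are separated by spaces, as the sample output suggests.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    word: str                  # surface form; syllables joined by "_"
    pos: str                   # part-of-speech tag
    chunk: str                 # chunking label, e.g. "B-NP"
    tones: List[str]           # one tone code per syllable (assumed)
    phonemes: List[List[str]]  # one phoneme list per syllable (assumed)


def parse_line(line: str) -> Token:
    """Parse one line of the form 'word PoS,chunk,tones,phonemes'."""
    # The word comes first; everything after the first whitespace run
    # is the comma-separated tag block.
    word, rest = line.strip().split(None, 1)
    # Split into exactly four fields; the phoneme field keeps its spaces.
    pos, chunk, tones, phonemes = rest.split(",", 3)
    return Token(
        word=word,
        pos=pos,
        chunk=chunk,
        tones=tones.split("_"),
        phonemes=[syllable.split() for syllable in phonemes.split("_")],
    )


token = parse_line("nghi_phạm M,I-NP,0_5,ng i_ph a mz")
print(token.word, token.pos, token.chunk, token.tones, token.phonemes)
```

A word that is itself a comma or contains whitespace would need extra handling; this sketch does not cover that case.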
Phoneme analysis (mainly used for text-to-speech)
echo "Hai nghi phạm Nguyễn Hải Dương và Vũ Văn Tiến" | vita_ana -m model_dir
Output:
xx^xx-sil+h=a@0-0/A:xx_0/B:xx-1@0-0&0-0/C:0+3/D:xx-0/E:xx-1/F:M-3/G:0-0/H:1=1@0=3/I:18-3/J:30+5-2
xx^sil-h+a=iz@0-2/A:xx_1/B:0-3@0-0&0-2/C:0+2/D:xx-1/E:M-3/F:M-5/G:1-1/H:18=3@1=2/I:10-2/J:30+5-2
sil^h-a+iz=ng@1-1/A:xx_1/B:0-3@0-0&0-2/C:0+2/D:xx-1/E:M-3/F:M-5/G:1-1/H:18=3@1=2/I:10-2/J:30+5-2
h^a-iz+ng=i@2-0/A:xx_1/B:0-3@0-0&0-2/C:0+2/D:xx-1/E:M-3/F:M-5/G:1-1/H:18=3@1=2/I:10-2/J:30+5-2
a^iz-ng+i=ph@0-1/A:0_3/B:0-2@0-1&1-2/C:5+3/D:M-3/E:M-5/F:Nu-10/G:1-1/H:18=3@1=2/I:10-2/J:30+5-2
iz^ng-i+ph=a@1-0/A:0_3/B:0-2@0-1&1-2/C:5+3/D:M-3/E:M-5/F:Nu-10/G:1-1/H:18=3@1=2/I:10-2/J:30+5-2
ng^i-ph+a=mz@0-2/A:0_2/B:5-3@1-0&2-1/C:2+4/D:M-3/E:M-5/F:Nu-10/G:1-1/H:18=3@1=2/I:10-2/J:30+5-2
i^ph-a+mz=ng@1-1/A:0_2/B:5-3@1-0&2-1/C:2+4/D:M-3/E:M-5/F:Nu-10/G:1-1/H:18=3@1=2/I:10-2/J:30+5-2
ph^a-mz+ng=w@2-0/A:0_2/B:5-3@1-0&2-1/C:2+4/D:M-3/E:M-5/F:Nu-10/G:1-1/H:18=3@1=2/I:10-2/J:30+5-2
a^mz-ng+w=ie@0-3/A:5_3/B:2-4@0-2&2-2/C:3+3/D:M-5/E:Nu-10/F:Cc-2/G:1-1/H:18=3@1=2/I:10-2/J:30+5-2
mz^ng-w+ie=nz@1-2/A:5_3/B:2-4@0-2&2-2/C:3+3/D:M-5/E:Nu-10/F:Cc-2/G:1-1/H:18=3@1=2/I:10-2/J:30+5-2
ng^w-ie+nz=h@2-1/A:5_3/B:2-4@0-2&2-2/C:3+3/D:M-5/E:Nu-10/F:Cc-2/G:1-1/H:18=3@1=2/I:10-2/J:30+5-2
w^ie-nz+h=a@3-0/A:5_3/B:2-4@0-2&2-2/C:3+3/D:M-5/E:Nu-10/F:Cc-2/G:1-1/H:18=3@1=2/I:10-2/J:30+5-2
ie^nz-h+a=iz@0-2/A:2_4/B:3-3@1-1&3-1/C:0+3/D:M-5/E:Nu-10/F:Cc-2/G:1-1/H:18=3@1=2/I:10-2/J:30+5-2
nz^h-a+iz=d@1-1/A:2_4/B:3-3@1-1&3-1/C:0+3/D:M-5/E:Nu-10/F:Cc-2/G:1-1/H:18=3@1=2/I:10-2/J:30+5-2
h^a-iz+d=wa@2-0/A:2_4/B:3-3@1-1&3-1/C:0+3/D:M-5/E:Nu-10/F:Cc-2/G:1-1/H:18=3@1=2/I:10-2/J:30+5-2
a^iz-d+wa=ngz@0-2/A:3_3/B:0-3@2-0&4-0/C:1+2/D:M-5/E:Nu-10/F:Cc-2/G:1-1/H:18=3@1=2/I:10-2/J:30+5-2
iz^d-wa+ngz=v@1-1/A:3_3/B:0-3@2-0&4-0/C:1+2/D:M-5/E:Nu-10/F:Cc-2/G:1-1/H:18=3@1=2/I:10-2/J:30+5-2
d^wa-ngz+v=a@2-0/A:3_3/B:0-3@2-0&4-0/C:1+2/D:M-5/E:Nu-10/F:Cc-2/G:1-1/H:18=3@1=2/I:10-2/J:30+5-2
wa^ngz-v+a=v@0-1/A:0_3/B:1-2@0-0&0-1/C:2+2/D:Nu-10/E:Cc-2/F:V-8/G:18-3/H:10=2@2=1/I:1-1/J:30+5-2
ngz^v-a+v=u@1-0/A:0_3/B:1-2@0-0&0-1/C:2+2/D:Nu-10/E:Cc-2/F:V-8/G:18-3/H:10=2@2=1/I:1-1/J:30+5-2
v^a-v+u=v@0-1/A:1_2/B:2-2@0-2&1-2/C:0+3/D:Cc-2/E:V-8/F:xx-1/G:18-3/H:10=2@2=1/I:1-1/J:30+5-2
a^v-u+v=aw@1-0/A:1_2/B:2-2@0-2&1-2/C:0+3/D:Cc-2/E:V-8/F:xx-1/G:18-3/H:10=2@2=1/I:1-1/J:30+5-2
v^u-v+aw=nz@0-2/A:2_2/B:0-3@1-1&2-1/C:4+3/D:Cc-2/E:V-8/F:xx-1/G:18-3/H:10=2@2=1/I:1-1/J:30+5-2
u^v-aw+nz=t@1-1/A:2_2/B:0-3@1-1&2-1/C:4+3/D:Cc-2/E:V-8/F:xx-1/G:18-3/H:10=2@2=1/I:1-1/J:30+5-2
v^aw-nz+t=ie@2-0/A:2_2/B:0-3@1-1&2-1/C:4+3/D:Cc-2/E:V-8/F:xx-1/G:18-3/H:10=2@2=1/I:1-1/J:30+5-2
aw^nz-t+ie=nz@0-2/A:0_3/B:4-3@2-0&3-0/C:xx+1/D:Cc-2/E:V-8/F:xx-1/G:18-3/H:10=2@2=1/I:1-1/J:30+5-2
nz^t-ie+nz=sil@1-1/A:0_3/B:4-3@2-0&3-0/C:xx+1/D:Cc-2/E:V-8/F:xx-1/G:18-3/H:10=2@2=1/I:1-1/J:30+5-2
t^ie-nz+sil=xx@2-0/A:0_3/B:4-3@2-0&3-0/C:xx+1/D:Cc-2/E:V-8/F:xx-1/G:18-3/H:10=2@2=1/I:1-1/J:30+5-2
ie^nz-sil+xx=xx@0-0/A:4_3/B:xx-1@0-0&0-0/C:xx+0/D:V-8/E:xx-1/F:xx-0/G:10-2/H:1=1@3=0/I:0-0/J:30+5-2
Output format:
Please read doc/lab_format.pdf
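For a quick sanity check of the labels, the current phone of each line can be extracted with a few lines of Python. This is only a sketch: doc/lab_format.pdf remains the authoritative description, and the layout used here (`left2^left1-current+right1=right2@...`) is an assumption inferred from the sample output, where the field between `-` and `+` advances by one phone per line. The names `QUINPHONE`, `current_phone`, and `phone_sequence` are hypothetical.

```python
import re

# Assumed layout of the leading context of each label:
#   left2^left1-current+right1=right2@...
QUINPHONE = re.compile(
    r"^(?P<ll>[^\^]+)\^(?P<l>[^-]+)-(?P<c>[^+]+)\+(?P<r>[^=]+)=(?P<rr>[^@]+)@"
)


def current_phone(label: str) -> str:
    """Return the phone between '-' and '+' in the leading context."""
    match = QUINPHONE.match(label)
    if match is None:
        raise ValueError(f"unrecognised label: {label!r}")
    return match.group("c")


def phone_sequence(labels):
    """Collect the current phone of every non-empty label line."""
    return [current_phone(label) for label in labels if label.strip()]


labels = [
    "xx^xx-sil+h=a@0-0/A:xx_0/B:xx-1@0-0&0-0/C:0+3/D:xx-0/E:xx-1/F:M-3/G:0-0/H:1=1@0=3/I:18-3/J:30+5-2",
    "xx^sil-h+a=iz@0-2/A:xx_1/B:0-3@0-0&0-2/C:0+2/D:xx-1/E:M-3/F:M-5/G:1-1/H:18=3@1=2/I:10-2/J:30+5-2",
    "sil^h-a+iz=ng@1-1/A:xx_1/B:0-3@0-0&0-2/C:0+2/D:xx-1/E:M-3/F:M-5/G:1-1/H:18=3@1=2/I:10-2/J:30+5-2",
]
print(phone_sequence(labels))
```

Reading the phones back this way makes it easy to confirm that the label sequence matches the phonemes reported by vita for the same sentence.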
Please use the following BibTeX entry when citing vita:
@misc{truong_vita,
author = {Quoc Truong Do},
title = {Vita: A Toolkit for Vietnamese segmentation, chunking, part of speech tagging and morphological analyzer},
url = {http://truongdo.com/vita/},
year = {2015}
}