How to run VBx on another dataset?
ooobsidian opened this issue · 13 comments
Hi there,
I am very interested in VBx, and I want to run it on other datasets to get speaker diarization results.
What do I need to prepare, and what code should I run to do this?
Thank you very much for your reply.
Hi,
I suggest you take a look at run_example.sh
You will see that there are basically two steps: produce x-vectors and cluster x-vectors.
The code assumes you have already produced voice activity detection labels (segments with silence will not be used), which are passed to predict.py via --in-lab-dir.
You can find simple code to run VAD here, but I recommend searching for other tools with the keywords "voice activity detection" or "speech activity detection", since there are more sophisticated methods that should work better.
If your data has an 8 kHz sampling rate, you'll need to replace ResNet101_16kHz with ResNet101_8kHz.
I think the script is more or less self-explanatory but reach out if something is not clear.
Hi fnlandini,
Following your suggestion, I first ran VBx using this script on the VoxConverse.
I set INSTRUCTION to VAD, xvectors, VBx, and score in turn, and got a DER of 49.16%. Following the process in the paper, I then tried re-clustering to get better performance: I set INSTRUCTION to global_xvectors, recluster, and score_recluster in turn, and the DER was 49.12%. Finally, I used OVD: I set INSTRUCTION to OV_heuristic and score_heuristic in turn, and got a DER of 49.05%. Please let me know whether there is a problem with these results or with the order in which I ran the steps, because I didn't get results similar to those in the README.
Hi @ooobsidian
The errors you mention are already too high for the VBx step so there must be some problem at that point. The script you mentioned uses by default the final_VAD files. For example, for the recording abjxc, the segments are
0.17 7.05 speech
8.59 63.98 speech
If I run the x-vector extraction and VBx on that file, the two segments are assigned to the same speaker and the score corresponds to a DER of 0.67%.
Could you verify that if you use that VAD you obtain that error?
Sadly @fnlandini, instead of using final_VAD results in the VAD phase, I used the energy-based VAD. For example, for the recording abjxc, the segments are
0.240 6.370 speech
8.580 19.190 speech
19.350 22.440 speech
22.810 35.880 speech
35.940 44.280 speech
44.350 44.520 speech
44.620 47.470 speech
47.850 57.910 speech
58.020 58.520 speech
58.970 62.420 speech
62.550 63.740 speech
64.260 64.840 speech
What confuses me is that it is mentioned in issue and papers that using energy-based VADs will have better performance. Hope to get your reply.
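For reference, a quick way to compare two VAD outputs for the same recording is to count the segments and sum the speech time in each. Below is a minimal sketch in Python using the abjxc segments quoted above. The helper names `read_lab` and `total_speech` are made up for illustration and are not part of VBx.

```python
def read_lab(text):
    """Parse 'start end label' lines into (start, end) tuples in seconds."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            segments.append((float(parts[0]), float(parts[1])))
    return segments


def total_speech(segments):
    """Sum the duration of all segments, in seconds."""
    return sum(end - start for start, end in segments)


# Segments for abjxc as quoted in this thread.
final_vad = read_lab("""0.17 7.05 speech
8.59 63.98 speech""")

energy_vad = read_lab("""0.240 6.370 speech
8.580 19.190 speech
19.350 22.440 speech
22.810 35.880 speech
35.940 44.280 speech
44.350 44.520 speech
44.620 47.470 speech
47.850 57.910 speech
58.020 58.520 speech
58.970 62.420 speech
62.550 63.740 speech
64.260 64.840 speech""")

for name, segs in (("final_VAD", final_vad), ("energy VAD", energy_vad)):
    print(f"{name}: {len(segs)} segments, {total_speech(segs):.2f} s of speech")
```

This only shows how much audio each VAD marks as speech and how finely it is split. It does not say which VAD is more accurate.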
Hi @fnlandini, I used the default VAD and experimented again. The scores for VBx are as follows
File DER JER B3-Precision B3-Recall B3-F1 GKT(ref, sys) GKT(sys, ref) H(ref|sys) H(sys|ref) MI NMI
--------------- ------ ----- -------------- ----------- ------- --------------- --------------- ------------ ------------ ---- -----
abjxc 0.67 1.57 0.98 0.97 0.97 0.56 0.56 0.08 0.10 0.11 0.56
afjiv 66.73 94.00 0.24 0.92 0.38 0.30 0.05 2.30 0.20 0.15 0.16
ahnss 57.06 88.32 0.24 1.00 0.39 0.00 0.00 2.40 0.00 0.00 0.00
aisvi 50.87 94.19 0.39 0.97 0.55 0.38 0.05 1.81 0.06 0.10 0.18
akthc 18.53 60.55 0.66 0.94 0.78 0.31 0.09 0.88 0.15 0.10 0.21
ampme 17.16 72.85 0.70 0.93 0.80 0.56 0.18 0.87 0.21 0.21 0.31
asxwr 49.43 82.79 0.37 1.00 0.54 0.56 0.01 1.63 0.01 0.03 0.12
atgpi 0.45 2.19 0.97 0.97 0.97 0.15 0.15 0.09 0.12 0.03 0.20
... ...
ywcwr 29.16 65.28 0.57 0.95 0.71 0.36 0.04 0.96 0.17 0.09 0.16
zajzs 15.68 57.04 0.70 0.99 0.82 0.00 0.00 0.83 0.03 0.00 0.01
zcdsd 71.67 94.09 0.21 0.99 0.34 0.08 0.00 2.43 0.03 0.01 0.03
zfkap 0.54 5.36 0.97 0.97 0.97 0.94 0.94 0.10 0.10 1.40 0.93
zidwg 0.88 10.44 0.93 0.94 0.93 0.92 0.89 0.25 0.18 1.99 0.90
zmndm 0.57 1.34 0.98 0.98 0.98 0.28 0.28 0.09 0.05 0.04 0.35
zrlyl 5.72 13.12 0.84 0.86 0.85 0.69 0.62 0.49 0.41 0.71 0.61
ztzzr 4.33 7.07 0.89 0.87 0.88 0.63 0.72 0.31 0.34 0.51 0.61
zvmyn 0.12 4.94 0.91 0.94 0.93 0.52 0.52 0.27 0.14 0.21 0.51
zyffh 2.82 36.89 0.91 0.93 0.92 0.87 0.83 0.31 0.22 0.97 0.79
*** OVERALL *** 46.21 85.86 0.42 0.96 0.59 0.96 0.42 1.80 0.12 7.53 0.89
From the results, it can be found that the DER for abjxc is indeed 0.67%, but the DER for other recordings can even reach ~80%. Is this the expected result?
Hi @ooobsidian , there is something fishy here. These are the scores I obtain:
File DER JER B3-Precision B3-Recall B3-F1 GKT(ref, sys) GKT(sys, ref) H(ref|sys) H(sys|ref) MI NMI
--------------- ------ ----- -------------- ----------- ------- --------------- --------------- ------------ ------------ ---- -----
abjxc 0.67 1.57 0.98 0.97 0.97 0.56 0.56 0.08 0.10 0.11 0.56
afjiv 3.02 10.15 0.86 0.89 0.88 0.86 0.83 0.35 0.31 2.09 0.86
ahnss 5.32 10.55 0.81 0.93 0.86 0.91 0.74 0.69 0.20 1.71 0.80
aisvi 0.86 10.53 0.93 0.95 0.94 0.92 0.89 0.22 0.17 1.69 0.90
akthc 1.48 4.14 0.91 0.94 0.92 0.82 0.75 0.28 0.17 0.70 0.76
ampme 1.54 3.42 0.94 0.93 0.93 0.82 0.83 0.19 0.21 0.89 0.82
asxwr 0.43 2.83 0.94 0.98 0.96 0.97 0.91 0.22 0.04 1.45 0.92
atgpi 0.45 2.19 0.97 0.97 0.97 0.15 0.15 0.09 0.12 0.03 0.20
... ...
ywcwr 1.50 4.21 0.95 0.94 0.95 0.87 0.89 0.17 0.20 0.88 0.83
zajzs 2.44 8.61 0.92 0.97 0.94 0.88 0.73 0.26 0.10 0.57 0.76
zcdsd 1.46 3.39 0.94 0.98 0.96 0.97 0.93 0.24 0.08 2.20 0.93
zfkap 0.54 5.36 0.97 0.97 0.97 0.94 0.94 0.10 0.10 1.40 0.93
zidwg 0.88 10.44 0.93 0.94 0.93 0.92 0.89 0.25 0.18 1.99 0.90
zmndm 0.57 1.34 0.98 0.98 0.98 0.28 0.28 0.09 0.05 0.04 0.35
zrlyl 5.72 13.12 0.84 0.86 0.85 0.69 0.62 0.49 0.41 0.71 0.61
ztzzr 4.33 7.07 0.89 0.87 0.88 0.63 0.72 0.31 0.34 0.51 0.61
zvmyn 0.12 4.94 0.91 0.94 0.93 0.52 0.52 0.27 0.14 0.21 0.51
zyffh 2.82 36.89 0.91 0.93 0.92 0.87 0.83 0.31 0.22 0.97 0.79
*** OVERALL *** 4.41 19.61 0.88 0.93 0.90 0.93 0.88 0.41 0.23 8.93 0.97
There are quite a few files for which we have the same error, so I suspect that perhaps some of the x-vector extractions failed. One way of checking this could be to count how many segments were produced (I ran wc -l xvectors/segments/*):
250 xvectors/segments/abjxc
501 xvectors/segments/afjiv
2763 xvectors/segments/ahnss
1871 xvectors/segments/aisvi
424 xvectors/segments/akthc
510 xvectors/segments/ampme
976 xvectors/segments/asxwr
479 xvectors/segments/atgpi
... ...
513 xvectors/segments/ywcwr
780 xvectors/segments/zajzs
2377 xvectors/segments/zcdsd
410 xvectors/segments/zfkap
684 xvectors/segments/zidwg
1123 xvectors/segments/zmndm
1863 xvectors/segments/zrlyl
832 xvectors/segments/ztzzr
447 xvectors/segments/zvmyn
939 xvectors/segments/zyffh
Could you check whether you have the same counts? If not, it is possible that the extraction failed for a file (for example, because the script ran out of memory) and continued with the next one. The diarization step will not fail, but you will have far fewer segments, and the missing ones will count as missed speech when calculating DER.
Let me know if this was the problem. If so, you can rerun the x-vector extraction for the failed files and then rerun the diarization step.
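The count check can also be scripted. The following is a rough sketch, not part of the VBx repository: `find_mismatches` is a hypothetical helper, and the expected counts are just the first three values from the listing above. It counts lines roughly the way `wc -l` does and reports recordings whose segments file is missing or has a different number of lines.

```python
from pathlib import Path


def count_lines(path):
    """Count the lines in a file (close to what `wc -l` reports)."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def find_mismatches(seg_dir, expected):
    """Return {recording: (expected, found)} for every count that differs.

    `found` is None when the segments file does not exist at all.
    """
    mismatches = {}
    for rec, want in expected.items():
        path = Path(seg_dir) / rec
        got = count_lines(path) if path.is_file() else None
        if got != want:
            mismatches[rec] = (want, got)
    return mismatches


if __name__ == "__main__":
    # Reference counts copied from the listing in this thread (first three only).
    expected = {"abjxc": 250, "afjiv": 501, "ahnss": 2763}
    for rec, (want, got) in find_mismatches("xvectors/segments", expected).items():
        print(f"{rec}: expected {want} segments, found {got}")
```

Any recording it prints would be a candidate for rerunning the x-vector extraction.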
Hello @fnlandini, I checked the number of segments of the x-vector and the result is as follows:
$ wc -l xvectors/segments/*
250 xvectors/segments/abjxc
501 xvectors/segments/afjiv
2763 xvectors/segments/ahnss
1871 xvectors/segments/aisvi
424 xvectors/segments/akthc
510 xvectors/segments/ampme
976 xvectors/segments/asxwr
479 xvectors/segments/atgpi
700 xvectors/segments/aufkn
802 xvectors/segments/azisu
1792 xvectors/segments/bauzd
3677 xvectors/segments/bdopb
199 xvectors/segments/bkwns
1050 xvectors/segments/blwmj
1593 xvectors/segments/bravd
1308 xvectors/segments/bspxd
266 xvectors/segments/bwzyf
1655 xvectors/segments/bxpwa
1143 xvectors/segments/bydui
766 xvectors/segments/ccokr
2522 xvectors/segments/cjfer
1904 xvectors/segments/cmfyw
2366 xvectors/segments/cmhsm
311 xvectors/segments/cobal
694 xvectors/segments/cqaec
1128 xvectors/segments/crixb
547 xvectors/segments/cwryz
501 xvectors/segments/cyyxp
3835 xvectors/segments/czlvt
3191 xvectors/segments/dbugl
1110 xvectors/segments/dhorc
677 xvectors/segments/djngn
1848 xvectors/segments/djqif
820 xvectors/segments/dscgs
1755 xvectors/segments/dvngl
2178 xvectors/segments/eapdk
1271 xvectors/segments/edixl
559 xvectors/segments/ehpau
1853 xvectors/segments/epdpg
667 xvectors/segments/eqttu
815 xvectors/segments/esrit
2158 xvectors/segments/evtyi
358 xvectors/segments/exymw
624 xvectors/segments/eziem
830 xvectors/segments/ezsgk
1530 xvectors/segments/falxo
687 xvectors/segments/femmv
2149 xvectors/segments/fkvvo
785 xvectors/segments/fsaal
2669 xvectors/segments/fvyvb
211 xvectors/segments/fxgvy
504 xvectors/segments/ggvel
1004 xvectors/segments/gocbm
1561 xvectors/segments/gofnj
2574 xvectors/segments/goyli
748 xvectors/segments/gpjne
555 xvectors/segments/gqbvk
1371 xvectors/segments/gqdxy
1618 xvectors/segments/grzbb
209 xvectors/segments/gwtwd
874 xvectors/segments/gzvkx
3507 xvectors/segments/hgdez
1753 xvectors/segments/hgeec
347 xvectors/segments/hiyis
3719 xvectors/segments/hkzpa
1135 xvectors/segments/houcx
86 xvectors/segments/hqyok
958 xvectors/segments/hycgx
402 xvectors/segments/ikgcq
1398 xvectors/segments/imbqf
544 xvectors/segments/imtug
1262 xvectors/segments/ioasm
1350 xvectors/segments/ipqqq
749 xvectors/segments/iqbww
451 xvectors/segments/iqtde
910 xvectors/segments/irvat
699 xvectors/segments/iwdjy
3802 xvectors/segments/jcako
581 xvectors/segments/jhdav
299 xvectors/segments/jiqvr
574 xvectors/segments/jnivh
398 xvectors/segments/jsdmu
429 xvectors/segments/jsmbi
949 xvectors/segments/jtagk
1799 xvectors/segments/jyflp
346 xvectors/segments/jyirt
2421 xvectors/segments/jynhe
481 xvectors/segments/kbkon
1417 xvectors/segments/kckqn
423 xvectors/segments/kctgl
3112 xvectors/segments/kdfqk
1472 xvectors/segments/kefgo
2949 xvectors/segments/kiadt
2468 xvectors/segments/kkghn
3209 xvectors/segments/kklpv
1901 xvectors/segments/kkwkn
651 xvectors/segments/kszpd
3917 xvectors/segments/ktzmw
2195 xvectors/segments/kuduk
2015 xvectors/segments/ldkmv
4317 xvectors/segments/ldnro
1115 xvectors/segments/lfzib
290 xvectors/segments/lknjp
757 xvectors/segments/luvfz
2368 xvectors/segments/mdbod
3560 xvectors/segments/mekog
652 xvectors/segments/mesob
365 xvectors/segments/mevkw
2242 xvectors/segments/mgpok
714 xvectors/segments/migzj
796 xvectors/segments/mjgil
2404 xvectors/segments/mkrcv
324 xvectors/segments/mpvoh
2525 xvectors/segments/mqxsf
1475 xvectors/segments/mvjuk
1791 xvectors/segments/mwfmq
451 xvectors/segments/nctdh
2898 xvectors/segments/ndkwv
2924 xvectors/segments/nfqjx
553 xvectors/segments/ngyrk
600 xvectors/segments/nnqfq
1591 xvectors/segments/nrogz
1598 xvectors/segments/ntchr
693 xvectors/segments/nxgad
2357 xvectors/segments/odkzj
580 xvectors/segments/oekmc
291 xvectors/segments/oenox
3795 xvectors/segments/oklol
938 xvectors/segments/onpra
1171 xvectors/segments/ooxnm
1542 xvectors/segments/oxxwk
1366 xvectors/segments/paibn
1208 xvectors/segments/pgkde
1959 xvectors/segments/pilgb
302 xvectors/segments/plbbw
1374 xvectors/segments/pnook
1750 xvectors/segments/pnyir
1575 xvectors/segments/ppgjx
92 xvectors/segments/pqmho
2013 xvectors/segments/praxo
1301 xvectors/segments/qfdpp
557 xvectors/segments/qhesr
466 xvectors/segments/qjgpl
4252 xvectors/segments/qouur
268 xvectors/segments/qppll
104 xvectors/segments/qpylu
158 xvectors/segments/qrzjk
591 xvectors/segments/qsfzo
773 xvectors/segments/qvtia
458 xvectors/segments/qydmg
1721 xvectors/segments/qygfk
880 xvectors/segments/qzwxa
736 xvectors/segments/rcxzg
168 xvectors/segments/rtvuw
849 xvectors/segments/rxgun
2601 xvectors/segments/sduml
256 xvectors/segments/sikkm
538 xvectors/segments/sldwj
1289 xvectors/segments/sosnj
636 xvectors/segments/spzmn
448 xvectors/segments/sqkup
944 xvectors/segments/suuxu
248 xvectors/segments/syiwe
192 xvectors/segments/szsyz
1500 xvectors/segments/tcwsn
111 xvectors/segments/tfvyr
976 xvectors/segments/tguxv
579 xvectors/segments/tiams
2534 xvectors/segments/tjkfn
575 xvectors/segments/tlprc
943 xvectors/segments/tplwz
54 xvectors/segments/tucrg
1751 xvectors/segments/txcok
477 xvectors/segments/uatlu
2088 xvectors/segments/udjij
1233 xvectors/segments/uexjc
865 xvectors/segments/ufpel
1912 xvectors/segments/ulriv
236 xvectors/segments/usbgm
680 xvectors/segments/uvnmy
1324 xvectors/segments/vbjlx
2901 xvectors/segments/vmaiq
2568 xvectors/segments/vmbga
347 xvectors/segments/vysqj
2481 xvectors/segments/wbqza
781 xvectors/segments/wdjyj
502 xvectors/segments/wewoz
180 xvectors/segments/whmpa
472 xvectors/segments/willh
340 xvectors/segments/wjhgf
382 xvectors/segments/wmori
940 xvectors/segments/wnfoi
2525 xvectors/segments/wspbh
372 xvectors/segments/xiglo
671 xvectors/segments/xmfzh
940 xvectors/segments/xvllq
2609 xvectors/segments/xxwgv
1190 xvectors/segments/xypdm
545 xvectors/segments/ycxxe
803 xvectors/segments/ydlfw
2178 xvectors/segments/yfcmz
2153 xvectors/segments/ylnza
966 xvectors/segments/ypwjd
2427 xvectors/segments/yrsve
417 xvectors/segments/ysgbf
2615 xvectors/segments/yuzyu
513 xvectors/segments/ywcwr
780 xvectors/segments/zajzs
2377 xvectors/segments/zcdsd
410 xvectors/segments/zfkap
684 xvectors/segments/zidwg
1123 xvectors/segments/zmndm
1863 xvectors/segments/zrlyl
832 xvectors/segments/ztzzr
447 xvectors/segments/zvmyn
939 xvectors/segments/zyffh
273235 total
It's worth mentioning that I used GPU acceleration when extracting the x-vectors, by changing this line to DEVICE = gpu, and I don't know if this will affect the results.
Using GPU should not be a problem, I did the same. Could you please compare your files for afjiv with the following?
afjiv.tar.gz
Also, please run the diarization step from my files to see if you get the same error as me.
I used the x-vector results you extracted and experimented, and got the following results:
File DER JER B3-Precision B3-Recall B3-F1 GKT(ref, sys) GKT(sys, ref) H(ref|sys) H(sys|ref) MI NMI
--------------- ----- ----- -------------- ----------- ------- --------------- --------------- ------------ ------------ ---- -----
afjiv 3.02 10.15 0.86 0.89 0.88 0.86 0.83 0.35 0.31 2.09 0.86
*** OVERALL *** 3.02 10.15 0.86 0.89 0.88 0.86 0.83 0.35 0.31 2.09 0.86
I checked and found that the segments of afjiv are the same as mine, and the size of the ark file is also the same. Is there a problem when extracting x-vector?
I guess there is some problem when extracting the x-vectors for some of the files. One question is whether you used the same GPU for all extractions. If not, it could be that there was a problem on one of the machines. You can also rerun the extraction for the files that have a high error.
I am afraid that I cannot help much more here since the code does work properly for some of the files. You will have to find what can be different for the different files. As I said, maybe the machine, or the environment are different. You can also try extracting x-vectors for one file on CPU. It will take longer but you can verify if using CPU gives you reasonable errors.
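To pick out which recordings to re-extract, one option is to filter the per-file score table by DER. This is a small sketch under assumptions: the table text has the layout pasted earlier in this thread (file name in the first column, DER in the second), `high_der_files` is a hypothetical helper, and the threshold of 10.0 is an arbitrary value chosen for illustration.

```python
def high_der_files(table_text, threshold):
    """Return (recording, DER) pairs whose DER exceeds `threshold`."""
    flagged = []
    for line in table_text.splitlines():
        fields = line.split()
        # Skip the header, the dashed rule, and the *** OVERALL *** summary row.
        if len(fields) < 2 or line.startswith(("File", "-", "***")):
            continue
        try:
            der = float(fields[1])
        except ValueError:
            continue  # e.g. the elided "... ..." row
        if der > threshold:
            flagged.append((fields[0], der))
    return flagged


# A few rows copied from the scores posted above (only the first columns matter).
sample = """File DER JER
--------------- ------ -----
abjxc 0.67 1.57
afjiv 66.73 94.00
ahnss 57.06 88.32
atgpi 0.45 2.19
... ...
*** OVERALL *** 46.21 85.86"""

for rec, der in high_der_files(sample, threshold=10.0):
    print(f"{rec}: DER {der:.2f}")
```

The flagged recordings are the ones worth re-extracting first, for example on CPU as suggested above.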
I would like to confirm two points with you: 1. Did you experiment with https://github.com/BUTSpeechFIT/VBx/blob/v1.1_VoxConverse2020/VoxConverse2020_run.sh? 2. Did you change the pre-trained model or the parameters when extracting the x-vectors?
I suspect the problem is the different parameter settings at runtime, because I get the rttm result as follows:
SPEAKER afjiv 1 5.140000 78.530000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 84.810000 1.200000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 86.970000 1.050000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 88.620000 11.790000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 101.040000 3.510000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 105.350000 10.840000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 116.890000 7.300000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 125.180000 4.700000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 130.830000 3.000000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 135.570000 8.340000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 144.570000 0.870000 <NA> <NA> 1 <NA> <NA>
The rttm result from your extracted x-vector is as follows:
SPEAKER afjiv 1 5.140000 35.880000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 41.020000 39.840000 <NA> <NA> 2 <NA> <NA>
SPEAKER afjiv 1 80.860000 2.810000 <NA> <NA> 3 <NA> <NA>
SPEAKER afjiv 1 84.810000 1.200000 <NA> <NA> 3 <NA> <NA>
SPEAKER afjiv 1 86.970000 1.050000 <NA> <NA> 3 <NA> <NA>
SPEAKER afjiv 1 88.620000 5.880000 <NA> <NA> 3 <NA> <NA>
SPEAKER afjiv 1 94.500000 5.910000 <NA> <NA> 4 <NA> <NA>
SPEAKER afjiv 1 101.040000 3.510000 <NA> <NA> 4 <NA> <NA>
SPEAKER afjiv 1 105.350000 10.840000 <NA> <NA> 4 <NA> <NA>
SPEAKER afjiv 1 116.890000 2.040000 <NA> <NA> 4 <NA> <NA>
SPEAKER afjiv 1 118.930000 5.260000 <NA> <NA> 6 <NA> <NA>
SPEAKER afjiv 1 125.180000 4.700000 <NA> <NA> 6 <NA> <NA>
SPEAKER afjiv 1 130.830000 3.000000 <NA> <NA> 6 <NA> <NA>
SPEAKER afjiv 1 135.570000 8.340000 <NA> <NA> 6 <NA> <NA>
SPEAKER afjiv 1 144.570000 0.870000 <NA> <NA> 6 <NA> <NA>
It can be found that my rttm contains only one speaker.
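A quick way to see this for every recording is to total the speech time per speaker label in each RTTM file. The sketch below assumes the standard RTTM field order seen in the lines above (duration in the fifth field, speaker label in the eighth). `speaker_durations` is a hypothetical helper, not a VBx function.

```python
from collections import defaultdict


def speaker_durations(rttm_text):
    """Map each speaker label to its total speech duration in seconds."""
    totals = defaultdict(float)
    for line in rttm_text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        totals[fields[7]] += float(fields[4])
    return dict(totals)


# First lines of the two RTTM outputs quoted above.
mine = """SPEAKER afjiv 1 5.140000 78.530000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 84.810000 1.200000 <NA> <NA> 1 <NA> <NA>"""

reference = """SPEAKER afjiv 1 5.140000 35.880000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 41.020000 39.840000 <NA> <NA> 2 <NA> <NA>
SPEAKER afjiv 1 80.860000 2.810000 <NA> <NA> 3 <NA> <NA>"""

for name, text in (("mine", mine), ("reference", reference)):
    durations = speaker_durations(text)
    print(f"{name}: {len(durations)} speaker(s) -> {durations}")
```

A recording that collapses to a single speaker label, when the reference has several, points at the x-vectors or the clustering settings rather than at the scoring.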
When we worked for the challenge we used different models and hyperparameters but the setting that corresponds to our best results is the one shared in that script. I am assuming you are using the exact same x-vector extractor model and hyperparameters. If not, then it is expected that the x-vectors will be different and, thus, the diarization result can be different.
In brief, the answer to both questions is: No, that script only has the setting of our best system. The idea being that if you run it as is, you should get the results we reported.
Thank you very much for your patience!