BUTSpeechFIT/VBx

How to run VBx on other dataset?

ooobsidian opened this issue · 13 comments

Hi there,
I am very interested in VBx, and I want to run VBx on other datasets to get the results of the speaker diarization.
I want to know what to prepare, or what code to run to achieve the above task?
Thank you very much for your reply.

Hi,
I suggest you take a look at run_example.sh.
You will see that there are basically two steps: produce x-vectors and cluster x-vectors.
The code assumes you have already produced voice activity detection (VAD) labels (segments with silence will not be used), which are passed to predict.py under --in-lab-dir.
You can find simple code to run VAD here, but I recommend you search for other tools with the keywords "voice activity detection" or "speech activity detection", since there are more sophisticated methods that should work better.
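
As a rough illustration of preparing VAD labels, the sketch below renders speech segments in the tab-separated "start, end, speech" layout that the label files quoted in this thread use. The function name format_lab and the decimals parameter are hypothetical helpers, not part of VBx, and the exact file naming expected under --in-lab-dir should be checked against run_example.sh.

```python
def format_lab(segments, decimals=2):
    """Render (start, end) speech segments, in seconds, as label-file text.

    Hypothetical helper: one "start<TAB>end<TAB>speech" line per segment.
    """
    lines = []
    for start, end in sorted(segments):
        if end <= start:
            raise ValueError(f"empty or inverted segment: start={start}, end={end}")
        lines.append(f"{start:.{decimals}f}\t{end:.{decimals}f}\tspeech\n")
    return "".join(lines)


if __name__ == "__main__":
    # Example segments in seconds; any VAD tool could produce these.
    print(format_lab([(0.17, 7.05), (8.59, 63.98)]), end="")
```

The returned text can then be written to one label file per recording, whatever VAD tool produced the segments.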

If your data has an 8kHz sampling rate, you'll need to replace ResNet101_16kHz with ResNet101_8kHz.

I think the script is more or less self-explanatory but reach out if something is not clear.

Hi fnlandini,
Following your suggestion, I first ran VBx on VoxConverse using this script.

I set INSTRUCTION to VAD, xvectors, VBx, and score in turn, and got a DER of 49.16%. Following the process in the paper, I then re-clustered to try to improve performance: I set INSTRUCTION to global_xvectors, recluster, and score_recluster in turn, which gave a DER of 49.12%. Finally, I applied OVD by setting INSTRUCTION to OV_heuristic and score_heuristic in turn, and got a DER of 49.05%. Please let me know if there is a problem with the results or with the order in which I ran the steps, because I didn't get results similar to those in the README.

Hi @ooobsidian
The errors you mention are already too high for the VBx step, so there must be some problem at that point. The script you mentioned uses the final_VAD files by default. For example, for the recording abjxc, the segments are

0.17	7.05	speech
8.59	63.98	speech

If I run the x-vector extraction and VBx on that file, the two segments are assigned to the same speaker and the score corresponds to a DER of 0.67%.
Could you verify that if you use that VAD you obtain that error?

Sadly @fnlandini, instead of using the final_VAD results in the VAD phase, I used the energy-based VAD. For example, for the recording abjxc, the segments are

0.240	6.370	speech
8.580	19.190	speech
19.350	22.440	speech
22.810	35.880	speech
35.940	44.280	speech
44.350	44.520	speech
44.620	47.470	speech
47.850	57.910	speech
58.020	58.520	speech
58.970	62.420	speech
62.550	63.740	speech
64.260	64.840	speech

What confuses me is that it is mentioned in an issue and in the papers that using an energy-based VAD gives better performance. I hope to get your reply.

Hi @fnlandini, I used the default VAD and ran the experiment again. The scores for VBx are as follows:

File                DER    JER    B3-Precision    B3-Recall    B3-F1    GKT(ref, sys)    GKT(sys, ref)    H(ref|sys)    H(sys|ref)    MI    NMI
---------------  ------  -----  --------------  -----------  -------  ---------------  ---------------  ------------  ------------  ----  -----
abjxc              0.67   1.57            0.98         0.97     0.97             0.56             0.56          0.08          0.10  0.11   0.56
afjiv             66.73  94.00            0.24         0.92     0.38             0.30             0.05          2.30          0.20  0.15   0.16
ahnss             57.06  88.32            0.24         1.00     0.39             0.00             0.00          2.40          0.00  0.00   0.00
aisvi             50.87  94.19            0.39         0.97     0.55             0.38             0.05          1.81          0.06  0.10   0.18
akthc             18.53  60.55            0.66         0.94     0.78             0.31             0.09          0.88          0.15  0.10   0.21
ampme             17.16  72.85            0.70         0.93     0.80             0.56             0.18          0.87          0.21  0.21   0.31
asxwr             49.43  82.79            0.37         1.00     0.54             0.56             0.01          1.63          0.01  0.03   0.12
atgpi              0.45   2.19            0.97         0.97     0.97             0.15             0.15          0.09          0.12  0.03   0.20
... ...
ywcwr             29.16  65.28            0.57         0.95     0.71             0.36             0.04          0.96          0.17  0.09   0.16
zajzs             15.68  57.04            0.70         0.99     0.82             0.00             0.00          0.83          0.03  0.00   0.01
zcdsd             71.67  94.09            0.21         0.99     0.34             0.08             0.00          2.43          0.03  0.01   0.03
zfkap              0.54   5.36            0.97         0.97     0.97             0.94             0.94          0.10          0.10  1.40   0.93
zidwg              0.88  10.44            0.93         0.94     0.93             0.92             0.89          0.25          0.18  1.99   0.90
zmndm              0.57   1.34            0.98         0.98     0.98             0.28             0.28          0.09          0.05  0.04   0.35
zrlyl              5.72  13.12            0.84         0.86     0.85             0.69             0.62          0.49          0.41  0.71   0.61
ztzzr              4.33   7.07            0.89         0.87     0.88             0.63             0.72          0.31          0.34  0.51   0.61
zvmyn              0.12   4.94            0.91         0.94     0.93             0.52             0.52          0.27          0.14  0.21   0.51
zyffh              2.82  36.89            0.91         0.93     0.92             0.87             0.83          0.31          0.22  0.97   0.79
*** OVERALL ***   46.21  85.86            0.42         0.96     0.59             0.96             0.42          1.80          0.12  7.53   0.89

From the results, the DER for abjxc is indeed 0.67%, but the DER for other recordings can even reach ~80%. Is this the expected result?

Hi @ooobsidian, there is something fishy here. These are the scores I obtain:

File                DER    JER    B3-Precision    B3-Recall    B3-F1    GKT(ref, sys)    GKT(sys, ref)    H(ref|sys)    H(sys|ref)    MI    NMI
---------------  ------  -----  --------------  -----------  -------  ---------------  ---------------  ------------  ------------  ----  -----
abjxc              0.67   1.57            0.98         0.97     0.97             0.56             0.56          0.08          0.10  0.11   0.56
afjiv              3.02  10.15            0.86         0.89     0.88             0.86             0.83          0.35          0.31  2.09   0.86
ahnss              5.32  10.55            0.81         0.93     0.86             0.91             0.74          0.69          0.20  1.71   0.80
aisvi              0.86  10.53            0.93         0.95     0.94             0.92             0.89          0.22          0.17  1.69   0.90
akthc              1.48   4.14            0.91         0.94     0.92             0.82             0.75          0.28          0.17  0.70   0.76
ampme              1.54   3.42            0.94         0.93     0.93             0.82             0.83          0.19          0.21  0.89   0.82
asxwr              0.43   2.83            0.94         0.98     0.96             0.97             0.91          0.22          0.04  1.45   0.92
atgpi              0.45   2.19            0.97         0.97     0.97             0.15             0.15          0.09          0.12  0.03   0.20
... ...
ywcwr              1.50   4.21            0.95         0.94     0.95             0.87             0.89          0.17          0.20  0.88   0.83
zajzs              2.44   8.61            0.92         0.97     0.94             0.88             0.73          0.26          0.10  0.57   0.76
zcdsd              1.46   3.39            0.94         0.98     0.96             0.97             0.93          0.24          0.08  2.20   0.93
zfkap              0.54   5.36            0.97         0.97     0.97             0.94             0.94          0.10          0.10  1.40   0.93
zidwg              0.88  10.44            0.93         0.94     0.93             0.92             0.89          0.25          0.18  1.99   0.90
zmndm              0.57   1.34            0.98         0.98     0.98             0.28             0.28          0.09          0.05  0.04   0.35
zrlyl              5.72  13.12            0.84         0.86     0.85             0.69             0.62          0.49          0.41  0.71   0.61
ztzzr              4.33   7.07            0.89         0.87     0.88             0.63             0.72          0.31          0.34  0.51   0.61
zvmyn              0.12   4.94            0.91         0.94     0.93             0.52             0.52          0.27          0.14  0.21   0.51
zyffh              2.82  36.89            0.91         0.93     0.92             0.87             0.83          0.31          0.22  0.97   0.79
*** OVERALL ***    4.41  19.61            0.88         0.93     0.90             0.93             0.88          0.41          0.23  8.93   0.97
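
To see at a glance which recordings differ between two score tables like the ones above, a small script can parse the File and DER columns and list the rows that disagree. This is only a sketch: parse_der and diverging are hypothetical names, and the parser assumes the whitespace-separated layout shown in this thread, with the recording name in the first column and DER in the second.

```python
def parse_der(table_text):
    """Map recording name -> DER from a score table pasted as text."""
    der = {}
    for line in table_text.splitlines():
        tokens = line.split()
        # Skip blank lines and the "*** OVERALL ***" summary row.
        if len(tokens) < 2 or tokens[0].startswith("***"):
            continue
        try:
            der[tokens[0]] = float(tokens[1])
        except ValueError:
            # Header, separator, and "... ..." elision rows are not numeric.
            continue
    return der


def diverging(mine, reference, tolerance=0.01):
    """Return (name, my_der, reference_der) for recordings whose DER differs."""
    rows = []
    for name in sorted(mine.keys() & reference.keys()):
        if abs(mine[name] - reference[name]) > tolerance:
            rows.append((name, mine[name], reference[name]))
    return rows
```

Recordings that appear in only one of the two tables are ignored, so a missing file has to be checked separately.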

There are quite a few files for which we get the same error, so I suspect that some of the x-vector extractions failed. One way of checking this is to count how many segments were produced (I ran wc -l xvectors/segments/*):

     250 xvectors/segments/abjxc
     501 xvectors/segments/afjiv
    2763 xvectors/segments/ahnss
    1871 xvectors/segments/aisvi
     424 xvectors/segments/akthc
     510 xvectors/segments/ampme
     976 xvectors/segments/asxwr
     479 xvectors/segments/atgpi
     ... ...
     513 xvectors/segments/ywcwr
     780 xvectors/segments/zajzs
    2377 xvectors/segments/zcdsd
     410 xvectors/segments/zfkap
     684 xvectors/segments/zidwg
    1123 xvectors/segments/zmndm
    1863 xvectors/segments/zrlyl
     832 xvectors/segments/ztzzr
     447 xvectors/segments/zvmyn
     939 xvectors/segments/zyffh

Could you check whether you have the same counts? If not, it is possible that the extraction failed (for example, because the script ran out of memory) and continued with the next file. The diarization step will not fail, but you will have far fewer segments, which will count as missed speech when calculating DER.
Let me know if this was the problem. If so, you can rerun the x-vector extraction for the failed files and then rerun the diarization step.
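
The count check can also be scripted. The sketch below counts lines per file in a segments directory and reports the recordings whose count differs from a reference; count_segments and find_mismatches are hypothetical helper names, and the reference counts would have to be filled in from a run known to be good.

```python
from pathlib import Path


def count_segments(segments_dir):
    """Map file name -> number of lines for every file in segments_dir."""
    counts = {}
    for path in sorted(Path(segments_dir).iterdir()):
        if path.is_file():
            with path.open() as handle:
                # Counts lines; a final line without a newline is included,
                # which `wc -l` would not count.
                counts[path.name] = sum(1 for _ in handle)
    return counts


def find_mismatches(counts, expected):
    """Return name -> (found, expected) where the counts disagree.

    `found` is None when the recording has no segments file at all.
    """
    return {
        name: (counts.get(name), want)
        for name, want in expected.items()
        if counts.get(name) != want
    }


if __name__ == "__main__":
    # Reference counts taken from the listing above (first two recordings).
    expected = {"abjxc": 250, "afjiv": 501}
    print(find_mismatches(count_segments("xvectors/segments"), expected))
```

An empty result means every listed recording has the expected number of segments; it says nothing about whether the x-vectors themselves are correct.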

Hello @fnlandini, I checked the number of x-vector segments, and the result is as follows:

$ wc -l xvectors/segments/*
 250 xvectors/segments/abjxc
 501 xvectors/segments/afjiv
2763 xvectors/segments/ahnss
1871 xvectors/segments/aisvi
 424 xvectors/segments/akthc
 510 xvectors/segments/ampme
 976 xvectors/segments/asxwr
 479 xvectors/segments/atgpi
 700 xvectors/segments/aufkn
 802 xvectors/segments/azisu
1792 xvectors/segments/bauzd
3677 xvectors/segments/bdopb
 199 xvectors/segments/bkwns
1050 xvectors/segments/blwmj
1593 xvectors/segments/bravd
1308 xvectors/segments/bspxd
 266 xvectors/segments/bwzyf
1655 xvectors/segments/bxpwa
1143 xvectors/segments/bydui
 766 xvectors/segments/ccokr
2522 xvectors/segments/cjfer
1904 xvectors/segments/cmfyw
2366 xvectors/segments/cmhsm
 311 xvectors/segments/cobal
 694 xvectors/segments/cqaec
1128 xvectors/segments/crixb
 547 xvectors/segments/cwryz
 501 xvectors/segments/cyyxp
3835 xvectors/segments/czlvt
3191 xvectors/segments/dbugl
1110 xvectors/segments/dhorc
 677 xvectors/segments/djngn
1848 xvectors/segments/djqif
 820 xvectors/segments/dscgs
1755 xvectors/segments/dvngl
2178 xvectors/segments/eapdk
1271 xvectors/segments/edixl
 559 xvectors/segments/ehpau
1853 xvectors/segments/epdpg
 667 xvectors/segments/eqttu
 815 xvectors/segments/esrit
2158 xvectors/segments/evtyi
 358 xvectors/segments/exymw
 624 xvectors/segments/eziem
 830 xvectors/segments/ezsgk
1530 xvectors/segments/falxo
 687 xvectors/segments/femmv
2149 xvectors/segments/fkvvo
 785 xvectors/segments/fsaal
2669 xvectors/segments/fvyvb
 211 xvectors/segments/fxgvy
 504 xvectors/segments/ggvel
1004 xvectors/segments/gocbm
1561 xvectors/segments/gofnj
2574 xvectors/segments/goyli
 748 xvectors/segments/gpjne
 555 xvectors/segments/gqbvk
1371 xvectors/segments/gqdxy
1618 xvectors/segments/grzbb
 209 xvectors/segments/gwtwd
 874 xvectors/segments/gzvkx
3507 xvectors/segments/hgdez
1753 xvectors/segments/hgeec
 347 xvectors/segments/hiyis
3719 xvectors/segments/hkzpa
1135 xvectors/segments/houcx
  86 xvectors/segments/hqyok
 958 xvectors/segments/hycgx
 402 xvectors/segments/ikgcq
1398 xvectors/segments/imbqf
 544 xvectors/segments/imtug
1262 xvectors/segments/ioasm
1350 xvectors/segments/ipqqq
 749 xvectors/segments/iqbww
 451 xvectors/segments/iqtde
 910 xvectors/segments/irvat
 699 xvectors/segments/iwdjy
3802 xvectors/segments/jcako
 581 xvectors/segments/jhdav
 299 xvectors/segments/jiqvr
 574 xvectors/segments/jnivh
 398 xvectors/segments/jsdmu
 429 xvectors/segments/jsmbi
 949 xvectors/segments/jtagk
1799 xvectors/segments/jyflp
 346 xvectors/segments/jyirt
2421 xvectors/segments/jynhe
 481 xvectors/segments/kbkon
1417 xvectors/segments/kckqn
 423 xvectors/segments/kctgl
3112 xvectors/segments/kdfqk
1472 xvectors/segments/kefgo
2949 xvectors/segments/kiadt
2468 xvectors/segments/kkghn
3209 xvectors/segments/kklpv
1901 xvectors/segments/kkwkn
 651 xvectors/segments/kszpd
3917 xvectors/segments/ktzmw
2195 xvectors/segments/kuduk
2015 xvectors/segments/ldkmv
4317 xvectors/segments/ldnro
1115 xvectors/segments/lfzib
 290 xvectors/segments/lknjp
 757 xvectors/segments/luvfz
2368 xvectors/segments/mdbod
3560 xvectors/segments/mekog
 652 xvectors/segments/mesob
 365 xvectors/segments/mevkw
2242 xvectors/segments/mgpok
 714 xvectors/segments/migzj
 796 xvectors/segments/mjgil
2404 xvectors/segments/mkrcv
 324 xvectors/segments/mpvoh
2525 xvectors/segments/mqxsf
1475 xvectors/segments/mvjuk
1791 xvectors/segments/mwfmq
 451 xvectors/segments/nctdh
2898 xvectors/segments/ndkwv
2924 xvectors/segments/nfqjx
 553 xvectors/segments/ngyrk
 600 xvectors/segments/nnqfq
1591 xvectors/segments/nrogz
1598 xvectors/segments/ntchr
 693 xvectors/segments/nxgad
2357 xvectors/segments/odkzj
 580 xvectors/segments/oekmc
 291 xvectors/segments/oenox
3795 xvectors/segments/oklol
 938 xvectors/segments/onpra
1171 xvectors/segments/ooxnm
1542 xvectors/segments/oxxwk
1366 xvectors/segments/paibn
1208 xvectors/segments/pgkde
1959 xvectors/segments/pilgb
 302 xvectors/segments/plbbw
1374 xvectors/segments/pnook
1750 xvectors/segments/pnyir
1575 xvectors/segments/ppgjx
  92 xvectors/segments/pqmho
2013 xvectors/segments/praxo
1301 xvectors/segments/qfdpp
 557 xvectors/segments/qhesr
 466 xvectors/segments/qjgpl
4252 xvectors/segments/qouur
 268 xvectors/segments/qppll
 104 xvectors/segments/qpylu
 158 xvectors/segments/qrzjk
 591 xvectors/segments/qsfzo
 773 xvectors/segments/qvtia
 458 xvectors/segments/qydmg
1721 xvectors/segments/qygfk
 880 xvectors/segments/qzwxa
 736 xvectors/segments/rcxzg
 168 xvectors/segments/rtvuw
 849 xvectors/segments/rxgun
2601 xvectors/segments/sduml
 256 xvectors/segments/sikkm
 538 xvectors/segments/sldwj
1289 xvectors/segments/sosnj
 636 xvectors/segments/spzmn
 448 xvectors/segments/sqkup
 944 xvectors/segments/suuxu
 248 xvectors/segments/syiwe
 192 xvectors/segments/szsyz
1500 xvectors/segments/tcwsn
 111 xvectors/segments/tfvyr
 976 xvectors/segments/tguxv
 579 xvectors/segments/tiams
2534 xvectors/segments/tjkfn
 575 xvectors/segments/tlprc
 943 xvectors/segments/tplwz
  54 xvectors/segments/tucrg
1751 xvectors/segments/txcok
 477 xvectors/segments/uatlu
2088 xvectors/segments/udjij
1233 xvectors/segments/uexjc
 865 xvectors/segments/ufpel
1912 xvectors/segments/ulriv
 236 xvectors/segments/usbgm
 680 xvectors/segments/uvnmy
1324 xvectors/segments/vbjlx
2901 xvectors/segments/vmaiq
2568 xvectors/segments/vmbga
 347 xvectors/segments/vysqj
2481 xvectors/segments/wbqza
 781 xvectors/segments/wdjyj
 502 xvectors/segments/wewoz
 180 xvectors/segments/whmpa
 472 xvectors/segments/willh
 340 xvectors/segments/wjhgf
 382 xvectors/segments/wmori
 940 xvectors/segments/wnfoi
2525 xvectors/segments/wspbh
 372 xvectors/segments/xiglo
 671 xvectors/segments/xmfzh
 940 xvectors/segments/xvllq
2609 xvectors/segments/xxwgv
1190 xvectors/segments/xypdm
 545 xvectors/segments/ycxxe
 803 xvectors/segments/ydlfw
2178 xvectors/segments/yfcmz
2153 xvectors/segments/ylnza
 966 xvectors/segments/ypwjd
2427 xvectors/segments/yrsve
 417 xvectors/segments/ysgbf
2615 xvectors/segments/yuzyu
 513 xvectors/segments/ywcwr
 780 xvectors/segments/zajzs
2377 xvectors/segments/zcdsd
 410 xvectors/segments/zfkap
 684 xvectors/segments/zidwg
1123 xvectors/segments/zmndm
1863 xvectors/segments/zrlyl
 832 xvectors/segments/ztzzr
 447 xvectors/segments/zvmyn
 939 xvectors/segments/zyffh
  273235 total

It's worth mentioning that I used GPU acceleration when extracting the x-vectors (by changing this line to DEVICE = gpu); I don't know whether this affects the results.

Using GPU should not be a problem, I did the same. Could you please compare your files for afjiv with the following?
afjiv.tar.gz
Also, please run the diarization step from my files to see if you get the same error as me.

I used the x-vector results you extracted and experimented, and got the following results:

File               DER    JER    B3-Precision    B3-Recall    B3-F1    GKT(ref, sys)    GKT(sys, ref)    H(ref|sys)    H(sys|ref)    MI    NMI
---------------  -----  -----  --------------  -----------  -------  ---------------  ---------------  ------------  ------------  ----  -----
afjiv             3.02  10.15            0.86         0.89     0.88             0.86             0.83          0.35          0.31  2.09   0.86
*** OVERALL ***   3.02  10.15            0.86         0.89     0.88             0.86             0.83          0.35          0.31  2.09   0.86

I checked and found that your segments for afjiv are the same as mine, and the ark file is also the same size. Is there a problem in my x-vector extraction?

I guess there is some problem when extracting the x-vectors for some of the files. One question is whether you used the same GPU for all extractions. If not, there could have been a problem on one of the machines. You can also rerun the extraction for the files that have high error.
I am afraid that I cannot help much more here, since the code does work properly for some of the files. You will have to find what is different for the failing files. As I said, maybe the machine or the environment is different. You can also try extracting x-vectors for one file on CPU. It will take longer, but you can verify whether using CPU gives you reasonable errors.

I would like to confirm two points with you: 1. Did you experiment with https://github.com/BUTSpeechFIT/VBx/blob/v1.1_VoxConverse2020/VoxConverse2020_run.sh? 2. Did you change the pre-trained model and parameters when extracting x-vectors?

I suspect the problem is different parameter settings at runtime, because my RTTM output is as follows:

SPEAKER afjiv 1 5.140000 78.530000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 84.810000 1.200000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 86.970000 1.050000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 88.620000 11.790000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 101.040000 3.510000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 105.350000 10.840000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 116.890000 7.300000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 125.180000 4.700000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 130.830000 3.000000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 135.570000 8.340000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 144.570000 0.870000 <NA> <NA> 1 <NA> <NA>

The RTTM output from your extracted x-vectors is as follows:

SPEAKER afjiv 1 5.140000 35.880000 <NA> <NA> 1 <NA> <NA>
SPEAKER afjiv 1 41.020000 39.840000 <NA> <NA> 2 <NA> <NA>
SPEAKER afjiv 1 80.860000 2.810000 <NA> <NA> 3 <NA> <NA>
SPEAKER afjiv 1 84.810000 1.200000 <NA> <NA> 3 <NA> <NA>
SPEAKER afjiv 1 86.970000 1.050000 <NA> <NA> 3 <NA> <NA>
SPEAKER afjiv 1 88.620000 5.880000 <NA> <NA> 3 <NA> <NA>
SPEAKER afjiv 1 94.500000 5.910000 <NA> <NA> 4 <NA> <NA>
SPEAKER afjiv 1 101.040000 3.510000 <NA> <NA> 4 <NA> <NA>
SPEAKER afjiv 1 105.350000 10.840000 <NA> <NA> 4 <NA> <NA>
SPEAKER afjiv 1 116.890000 2.040000 <NA> <NA> 4 <NA> <NA>
SPEAKER afjiv 1 118.930000 5.260000 <NA> <NA> 6 <NA> <NA>
SPEAKER afjiv 1 125.180000 4.700000 <NA> <NA> 6 <NA> <NA>
SPEAKER afjiv 1 130.830000 3.000000 <NA> <NA> 6 <NA> <NA>
SPEAKER afjiv 1 135.570000 8.340000 <NA> <NA> 6 <NA> <NA>
SPEAKER afjiv 1 144.570000 0.870000 <NA> <NA> 6 <NA> <NA>

As you can see, my RTTM contains only one speaker.
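
A collapsed clustering like this is easy to detect automatically by counting the distinct speaker labels in each RTTM file. The sketch below sums the speech duration per speaker, assuming the standard RTTM field order seen in the lines above (recording in field 2, duration in field 5, speaker label in field 8); speaker_durations is a hypothetical helper name.

```python
from collections import defaultdict


def speaker_durations(rttm_text):
    """Map recording -> {speaker label -> total duration in seconds}."""
    totals = defaultdict(lambda: defaultdict(float))
    for line in rttm_text.splitlines():
        fields = line.split()
        # Only SPEAKER records carry diarization turns.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        recording, duration, speaker = fields[1], float(fields[4]), fields[7]
        totals[recording][speaker] += duration
    return {rec: dict(spk) for rec, spk in totals.items()}


if __name__ == "__main__":
    rttm = (
        "SPEAKER afjiv 1 5.140000 35.880000 <NA> <NA> 1 <NA> <NA>\n"
        "SPEAKER afjiv 1 41.020000 39.840000 <NA> <NA> 2 <NA> <NA>\n"
    )
    for recording, speakers in speaker_durations(rttm).items():
        print(recording, len(speakers), "speakers")
```

A recording that reports a single speaker when the reference has several is a good candidate for re-extracting its x-vectors.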

When we worked on the challenge, we used different models and hyperparameters, but the setting that corresponds to our best results is the one shared in that script. I am assuming you are using the exact same x-vector extractor model and hyperparameters. If not, then it is expected that the x-vectors will be different and, thus, the diarization result can be different.

In brief, the answer to both questions is: No, that script only has the setting of our best system. The idea is that if you run it as is, you should get the results we reported.

Thank you very much for your patience!