srvk/srvk-eesen-offline-transcriber

Recreating the lang_phn_test_test_newlm LM

prashantserai opened this issue · 5 comments

I'm very grateful for this package, for the tools and documentation, and especially for the help I'm receiving on this forum, both through my own questions and through others'.

I followed the instructions on http://speechkitchen.org/kaldi-language-model-building/ and tried building a language model using the instructions under "Adapting your own Language Model for EESEN-tedlium". Following those instructions without touching example_txt or wordlist.txt in that folder didn't seem to result in the same language model as was originally present in "lang_phn_test_test_newlm", though.

The one I got in "lang_phn_test_test_newlm" seems significantly inferior to the original one that was in "lang_phn_test_test_newlm" before I overwrote those files. For one specific recording, I was getting WERs around 24% with the original "lang_phn_test" and 25% with the original "lang_phn_test_test_newlm". Now, after running the adaptation scripts (but not modifying either example_txt or wordlist.txt), I got WERs around 35%!

After I actually adapted example_txt and wordlist.txt, I could improve on the 35%, but it's still considerably worse than 25% (the gap being inversely proportional to the extent of cheating done).

If I could figure out how to get to 25% without adaptation and use adaptation to improve on top of that, it might be beneficial.

Thanks!

Following those instructions without touching example_txt or wordlist.txt in that folder didn't seem to result into the same language model as was originally present in "lang_phn_test_test_newlm", though.

You're quite right: this gets built from a different, smaller example training text. In some cases this method is actually preferable; for example, it requires less memory to decode.

The reason is that training an LM on a very large example text can take a lot of RAM: over 100 GB. If you have access to that kind of memory, then maybe we could go with the original LM, which I believe was provided by Cantab Research and trained on an enormous-RAM machine on AWS (or a supercomputing cluster).

Another improvement would be training a 4-gram rather than a tri-gram model, without pruning, but the resulting graph is so huge that decoding with it would then ALSO require a huge amount of memory. A trick is to use just the grammar portion of such a very large model for rescoring, which is outlined in this commented-out section of the Vagrantfile (it includes downloading the 3.5 GB G.fst.projected grammar):

    # Uncomment for optional large language model rescoring
    # produces generally 2% better Word Error Rates at the expense of longer
    # decoding time and memory requirements (Requires guest VM setting
    # of at least vbox.memory = 15360, just barely fitting in a 16GB host
    # computer - with warnings)   Substitute "make -f Makefile.rescore"
    # for "make" in run scripts (speech2text.sh and friends) to use this.
    # 
    # cd /home/${user}
    # wget http://speechkitchen.org/vms/Data/rescore-eesen.tgz
    # tar zxvf rescore-eesen.tgz
    # rm rescore-eesen.tgz  
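
For reference, here is roughly what those steps look like once enabled. This is only a sketch based on the comment above: the download URL and the Makefile.rescore target come from that comment, the /home/vagrant path assumes the default vagrant user, and the exact way the make call gets swapped into speech2text.sh may differ in your setup.

    # download and unpack the large rescoring grammar (about a 3.5 GB download)
    cd /home/vagrant    # the Vagrantfile comment uses /home/${user}
    wget http://speechkitchen.org/vms/Data/rescore-eesen.tgz
    tar zxvf rescore-eesen.tgz
    rm rescore-eesen.tgz

    # then, in speech2text.sh and friends, call the rescoring Makefile
    # instead of plain "make":
    make -f Makefile.rescore

Remember this also needs the larger VM memory setting (vbox.memory = 15360) mentioned in the comment.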

Here is a breakdown of three possible decoding graphs, based on different LM build techniques:

vagrant@vagrant:~/eesen/asr_egs/tedlium$ ls -l `find . -name TLG.fst`
-rw-rw-r-- 1 vagrant vagrant  38828494 Jun  6  2016 ./v2-30ms/data/lang_phn_test_test_newlm/TLG.fst
-rw-r--r-- 1 vagrant vagrant 642791394 Jun  6  2016 ./v2-30ms/data/lang_phn_test/TLG.fst
-rw-rw-r-- 1 vagrant vagrant  41789230 Dec 14 19:59 ./v2-30ms/lm_build/data/lang_phn_test/TLG.fst
  1. 38828494 2016-06-06 14:23 v2-30ms/data/lang_phn_test_test_newlm/TLG.fst
    This was built from an earlier version of the example lm_build/ instructions and the adaptation example_txt data, specifically by running train_lms.sh directly. It's small, and can therefore be used to decode on a VM with very low RAM requirements.

  2. 642791394 2016-06-06 14:15 v2-30ms/data/lang_phn_test/TLG.fst
    Bigger, built from the CANTAB language model training text on a bigger machine, and requires more RAM to decode. You likely can't make one of these yourself in a VM unless you run on a beefy AWS EC2 instance. We don't have a script to automatically generate this in lm_build/, but the code to make it might be in the Eesen tedlium experiment - pay attention to tedlium_decode_graph.sh in this part of run_ctc_phn.sh:

  echo =====================================================================
  echo "             Data Preparation and FST Construction                 "
  echo =====================================================================
  # If you have downloaded the data (e.g., for Kaldi systems), then you can
  # simply link the db directory to here and skip this step
  local/tedlium_download_data.sh || exit 1;

  # Use the same data preparation script from Kaldi
  local/tedlium_prepare_data.sh --data-dir db/TEDLIUM_release2 || exit 1

  # Construct the phoneme-based lexicon
  local/tedlium_prepare_phn_dict.sh || exit 1;

  # Compile the lexicon and token FSTs
  utils/ctc_compile_dict_token.sh data/local/dict_phn data/local/lang_phn_tmp data/lang_phn || exit 1;

  # Compose the decoding graph
  local/tedlium_decode_graph.sh data/lang_phn || exit 1;
  3. 41789230 Dec 14 19:59 ./v2-30ms/lm_build/data/lang_phn_test/TLG.fst
    This one should be the result of running run_adapt.sh yourself, and it should not overwrite 1. or 2. above.
    It should be comparable to 1. above, though since we updated the way data is held out when generating the LM, it is no longer identical. THIS is where your observation is useful: maybe, built this way, the results are no longer as good as, or comparable to, 1.

Now that you bring this to our attention, it is worth verifying that the provided downloadable decoding graph built from the provided LM (1.) is comparable to, if not the same as, one generated by following the lm_build instructions (3.). If you need to download it again, by the way, the URL is http://speechkitchen.org/vms/Data/v2-30ms.tgz

The LM building process should not have overwritten that. It may be a result of an older version, in which case we suggest you start in a new VM or update from https://github.com/srvk/lm_build

Good luck, and thanks for your feedback!

Another observation: in (3.) above, the language model is actually a bit bigger, because 10% of the example_txt data was being held out of the LM even though it could have been included... so we updated the scripts to include it. This should make it better than (1.), however, not worse.
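
To make the held-out point concrete, here is a hypothetical sketch of the kind of 90/10 split involved; the real handling lives inside the lm_build scripts, and the file names here (example_txt.train, example_txt.heldout) are just placeholders:

    # hypothetical 90/10 split of the LM training text
    total=$(wc -l < example_txt)
    train=$(( total * 9 / 10 ))
    head -n "$train" example_txt > example_txt.train
    tail -n +"$(( train + 1 ))" example_txt > example_txt.heldout

The updated scripts now fold that last 10% back into the LM training data instead of leaving it out.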

And for your example, well, it's entirely possible that you found audio whose text consists of words occurring in orders that just aren't covered by the example_txt. What would be interesting (but is, in a way, 'cheating') is to include your text with the example_txt, as sketched below. However, including it only once may produce no change in results, and repeating it numerous times (to increase statistical likelihoods in the decoding graph) is definitely cheating. :-/
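
If you do want to try that experiment, the "include it once" variant is just appending your transcript to the adaptation text before rebuilding. A rough sketch, where my_transcript.txt is a placeholder for your own text file and run_adapt.sh is assumed to be run from lm_build with no arguments, as elsewhere in this thread:

    # append your transcript once to the LM adaptation text, then rebuild
    cd ~/eesen/asr_egs/tedlium/v2-30ms/lm_build
    cat /path/to/my_transcript.txt >> example_txt
    ./run_adapt.sh   # rebuild the adapted LM

Just keep in mind, as noted above, that doing this inflates the results for that particular recording.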

So I was earlier using the instructions under the heading "Adapting your own Language Model for EESEN-tedlium" here:

cd ~/eesen/asr_egs/tedlium/v2-30ms/lm_build
./train_lms.sh example_txt local_lm
cd ..
lm_build/utils/decode_graph_newlm.sh data/lang_phn_test

Was that a bad idea?

Anyway, I now tried run_adapt.sh with the original files, and the resulting language model gave me transcriptions very close to what I had gotten with the pre-existing language model at v2-30ms/data/lang_phn_test_test_newlm/TLG.fst.

So essentially I could sort of recreate that LM, and I can now hope to improve over that baseline with adaptation, so I'm happy about that.

That wasn't a bad idea, just an old one. run_adapt.sh builds upon it with a simple means of adding new vocabulary words and automatic pronunciation lookup, with the side effect of producing a list of the most frequent OOV words you might wish to add pronunciations for, either manually or automatically.
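
In other words, the adapted workflow stays simple. A minimal sketch, assuming run_adapt.sh is run from the lm_build directory with no arguments and picks up example_txt and wordlist.txt from that directory, as described in the adaptation instructions:

    cd ~/eesen/asr_egs/tedlium/v2-30ms/lm_build
    # edit example_txt (adaptation text) and wordlist.txt (extra vocabulary), then:
    ./run_adapt.sh
    # among its outputs is a list of the most frequent OOV words, for which
    # you can add pronunciations manually or automatically (see the note below)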

We've had some folks who tried very hard to add custom (non-dictionary) words and had difficulty getting them recognized; what finally worked was manually adding the right phonetic pronunciation to the dictionary - because pronunciation matters. :)
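
As a purely hypothetical illustration of what that looks like - the word, the phone sequence, and the lexicon path below are made up, so substitute the dictionary file your lm_build setup actually uses and the phone set of the tedlium recipe:

    # append a custom word with an explicit, CMUdict-style pronunciation
    # to the lexicon, then rebuild the decoding graph so it takes effect
    echo "SPEECHKITCHEN S P IY CH K IH CH AH N" >> path/to/lexicon.txt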

You're exactly right: you should have been able to produce nearly identical results by following the steps in run_adapt.sh - though given the stochastic nature of LM building, perhaps not IDENTICAL. Good to see you're able to reproduce the baseline.

I think the key issue I had with the old idea was that it was not reproducing the baseline. The results were not nearly identical; they were much worse. I just wanted to point that out once again. Having said that, as long as I'm able to get things working using run_adapt.sh, I don't care much whether the older recipe works or not.

And yeah, thanks for reminding me about the pronunciation point.

PS: I created a separate thread about another issue I faced with run_adapt.sh, and even with the older recipe, here.