
Using the Alpha release (0.1) code for English audio to Russian text translation

I am trying to translate English audio into Russian text using the Alpha release (0.1) code.

I am working in the experiments/btec_speech folder of the Alpha release code.

I provide the audio and text aligned chapter by chapter (each chapter contains around 1500 Russian words or 50000 MFCC values).

I have split the chapters so that 16 are used for train, 2 for dev and 5 for test.

When I generate the MFCC features for each set (i.e., train, dev or test), I concatenate them into a single file, so the first 4 bytes of the train MFCC feature file will be 16, those of dev will be 2 and those of test will be 5.
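
For reference, the header of such a concatenated feature file could be sanity-checked with a minimal sketch like the one below. It only reads the first 4 bytes described above; the byte order and signedness (little-endian, signed 32-bit) are assumptions, and read_sequence_count is a hypothetical helper, not part of the release code.

```python
import struct


def read_sequence_count(path):
    """Return the integer stored in the first 4 bytes of a feature file."""
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) < 4:
        raise ValueError("file is shorter than the 4-byte header")
    # Assumption: the count is a little-endian signed 32-bit integer.
    (count,) = struct.unpack("<i", header)
    return count
```

For the split described here, the train file would be expected to return 16, dev 2 and test 5.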

I am generating the vocab files from the Russian text.
Please find attached the file that I am using.

I modified baseline-mono.yaml for training the network.

The modifications that I made in baseline-mono.yaml are:

  1. For encoder:
    binary: True  # since the input is MFCC values
  2. For decoder:
  3. vocab_prefix: vocab

The modified baseline-mono.yaml file is pasted below.

I modified the max_len field in both the encoder and the decoder because the number of words in the Russian text for one chapter, and the number of MFCC coefficients in one frame (one frame of MFCC coefficients corresponds to one line of Russian text), are much larger than the max_seq_len values originally given (25 and 600). The function read_dataset() checks against max_seq_len.
To bypass this check, I changed the __init__() of class TranslationModel to use 'self.max_len=False' instead of 'self.max_len = dict(zip(self.extensions, self.max_input_len + self.max_output_len))'.

When I try to train the network with these modifications, I get the following error:

11/26 15:56:17 files: experiments/btec_speech/data/ experiments/btec_speech/data/
11/26 15:56:18 size: 2
11/26 15:56:31 starting training
Traceback (most recent call last):
File "/usr/lib/python3.5/", line 184, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.5/", line 85, in _run_code
exec(code, run_globals)
File "/home/sharvinv/work/speec_proc/seq2seq-0.1/translate/", line 229, in
File "/home/sharvinv/work/speec_proc/seq2seq-0.1/translate/", line 221, in main
model.train(sess=sess, **config)
File "/home/sharvinv/work/speec_proc/seq2seq-0.1/translate/", line 368, in train
self.train_step(sess=sess, loss_function=loss_function, use_baseline=use_baseline, **kwargs)
File "/home/sharvinv/work/speec_proc/seq2seq-0.1/translate/", line 412, in train_step
File "/home/sharvinv/work/speec_proc/seq2seq-0.1/translate/", line 200, in step
res =, input_feed)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/", line 889, in run
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/", line 1096, in _run
% (np_val.shape,, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (32, 0) for Tensor 'encoder_en:0', which has shape '(?, ?, 41)'

Could you please tell me whether I am doing this correctly, or whether I should make any other modifications in order to train the network to translate English audio to Russian text?

label: 'baseline-mono'
description: "mono-speaker baseline on BTEC"

dropout_rate: 0.5
cell_size: 256
attn_size: 256
embedding_size: 256

layers: 2
bidir: True
use_lstm: True
weight_scale: null

data_dir: experiments/btec_speech/data
model_dir: experiments/btec_speech/model
batch_size: 64

train_prefix: hbfn.train # 'easy' mono-speaker settings

optimizer: 'adam'
learning_rate: 0.001

steps_per_checkpoint: 1000
steps_per_eval: 1000

max_gradient_norm: 5.0
max_steps: 30000
batch_mode: 'standard'
read_ahead: 10
vocab_prefix: vocab

  - name: en
    embedding_size: 41
    layers: 3
    time_pooling: [2, 2]
    pooling_avg: True
    binary: True
    attention_filters: 1
    attention_filter_length: 25
    max_len: False
    input_layers: [256, 256]
    concat_last_states: True
    bidir_projection: True
    trainable_initial_states: False


  - name: ru
    layers: 2
    max_len: False
    maxout: False
    input_attention: False
    use_previous_word: False
    vanilla: False
    state_zero: True
    use_lstm_state: False
    output_extra_proj: False
    attn_prev_word: False
    maxout_stride: null
    convolutions: null

data_dir=${speech_dir}/data # output directory for the processed files (text and audio features)

mkdir -p ${raw_audio_dir} ${data_dir}

scripts/speech/ ${raw_audio_dir}/hbfn_wav16_en/train/* --output ${data_dir}/hbfn.train.en
scripts/speech/ ${raw_audio_dir}/hbfn_wav16_en/dev/* --output ${data_dir}/
scripts/speech/ ${raw_audio_dir}/hbfn_wav16_en/test/* --output ${data_dir}/hbfn.test.en

scripts/ ${data_dir}/hbfn.train ru ${data_dir} --max 0 --lowercase --output vocab --mode vocab


The error you're getting is due to your change inside "". You don't need to change this line. Setting "max_len" to 0 inside the config files is enough.
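
To illustrate what the original line computes, here is a standalone sketch. The extension names and the limits come from this thread (assigning 600 to the audio side and 25 to the text side is an assumption about which number belongs where). The helper is_too_long is hypothetical and is not the library's code; treating 0 as "no limit" is an assumption that matches the advice above.

```python
extensions = ["en", "ru"]
max_input_len = [600]   # assumed limit for the encoder (audio frames)
max_output_len = [25]   # assumed limit for the decoder (words)

# The original line builds one limit per file extension.
max_len = dict(zip(extensions, max_input_len + max_output_len))
# max_len is now {"en": 600, "ru": 25}


def is_too_long(ext, length, limits):
    """Hypothetical length check: a falsy limit (0) disables the check."""
    limit = limits[ext]
    return bool(limit) and length > limit
```

With the original limits, is_too_long("ru", 1500, max_len) is True, while with limits of 0 for both extensions the same call returns False and the per-extension dictionary stays intact.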

However, I'm sorry to tell you this, but there is no way you'll be able to train the model with full chapters of length 50000. I'm already having trouble because of memory constraints with sequences of length 1500 (the longer the sequence, the more GPU memory seq2seq requires).
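
To give a rough sense of scale, the sketch below estimates only the memory needed for the attention weights kept for backpropagation (one float per batch item, target step and encoder step). It is a back-of-the-envelope illustration, not a measurement. It assumes that the 50000 MFCC values are 50000 frames, that time_pooling: [2, 2] shortens the encoder output by a factor of 4, and that values are 32-bit floats; attention_weights_bytes is a hypothetical helper.

```python
def attention_weights_bytes(src_frames, tgt_words, batch_size,
                            pooling=(2, 2), bytes_per_float=4):
    """Rough size of the attention weights for one batch, in bytes."""
    encoder_steps = src_frames
    for factor in pooling:
        # Assumption: each pooling layer divides the sequence length.
        encoder_steps //= factor
    return batch_size * tgt_words * encoder_steps * bytes_per_float


# Full chapters: 50000 frames, 1500 words, batch size 64 (from the config).
chapter_bytes = attention_weights_bytes(50000, 1500, 64)
# Sentence-sized inputs: 600 frames, 25 words (the original limits).
sentence_bytes = attention_weights_bytes(600, 25, 64)

print(f"{chapter_bytes / 1e9:.1f} GB")   # 4.8 GB
print(f"{sentence_bytes / 1e6:.2f} MB")  # 0.96 MB
```

Under these assumptions the chapter-level input needs about 5000 times more memory for this one tensor alone, before counting encoder and decoder activations.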

Moreover, having so few samples (e.g., 5 samples for test) is not a realistic setting. For example, BTEC (which is a very small corpus by deep learning standards) has 2000/1500/900 samples (for train/dev/test).

You'll need to find a way to split your paragraphs into smaller segments (e.g., sentences).
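
On the text side, a minimal sketch of such a split could look like the following. It is a naive rule (break on whitespace that follows ., !, ? or …) and will mis-split abbreviations; split_sentences is a hypothetical helper, not part of the toolkit. The audio would still need to be cut at matching points, which this sketch does not address.

```python
import re

# Split on whitespace that directly follows sentence-final punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")


def split_sentences(text):
    """Return the non-empty sentences of `text`, one string per sentence."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


sentences = split_sentences("Привет, мир! Как дела? Хорошо.")
# sentences == ["Привет, мир!", "Как дела?", "Хорошо."]
```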
