harvardnlp/seq2seq-attn

bad argument #2 to '?' (end index out of bound) error

hanskrupakar opened this issue · 5 comments

I followed all the steps in your README.md to train a baseline RNN attention encoder-decoder model (Luong et al., 2015) and had no problems until the actual training step. When I run the train.lua script, I get the error below. How do I fix it so the model trains as it should?

I have an Nvidia GeForce 650M GPU with 2 GB of memory and 384 cores, CUDA 7.5 and cuDNN 4. Please help.

th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 -savefile demo-model
using CUDA on GPU 1...  
loading data... 
done!   
Source vocab size: 50004, Target vocab size: 50004  
Source max sent len: 50, Target max sent len: 52    
Number of additional features on source side: 0 
Switching on memory preallocation   
Number of parameters: 54338004 (active: 54338004)   
/home/hans/torch/install/bin/luajit: bad argument #2 to '?' (end index out of bound)
stack traceback:
    [C]: at 0x7f558238c530
    [C]: in function '__index'
    train.lua:394: in function 'train_batch'
    train.lua:745: in function 'train'
    train.lua:1071: in function 'main'
    train.lua:1074: in main chunk
    [C]: in function 'dofile'
    ...hans/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00405d50

can you try passing in -max_batch_l when you run train.lua?
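
For example, something like the following (keeping the rest of your command the same; 32 is just an illustrative value):

th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 -savefile demo-model -max_batch_l 32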

With -max_batch_l 32 the same error persists; only the first line of the stack traceback differs:

stack traceback:
    [C]: at 0x7faa6b4f2530

I should also mention that I added custom word2vec embeddings for both the encoder and the decoder side, and the arrays in those HDF5 files are 0-indexed. But even when I remove these files from -pre_word_vecs_enc and -pre_word_vecs_dec, I get the same error, with only this line different:

stack traceback:
    [C]: at 0x7f0395f28530

Should I make changes here? This is how I created the two hdf5 files:

import os

import h5py
import numpy as np

embed_size = 300

# modeleng / modeltam are the pre-trained word2vec models for each language.

# Build the source (English) embedding matrix, one row per entry in the
# source dictionary.
vec = []
with open("demo.src.dict", 'r') as f:
    for line in f:
        t = np.array([modeleng[line.split(' ')[0].strip()]])
        vec.append(t)

english = np.reshape(np.array(vec), (-1, embed_size))
# gives a (50004, 300) numpy array, embed_size = 300
# (I reduced it to run a minimal example first)

# Build the target (Tamil) embedding matrix the same way.
vec = []
with open("demo.targ.dict", 'r') as f:
    for line in f:
        t = np.array([modeltam[line.split(' ')[0].strip().decode('utf-8')]])
        vec.append(t)

tamil = np.reshape(np.array(vec), (-1, embed_size))
# gives a (50004, 500) numpy array

# Write each matrix to its own hdf5 file under the 'word_vecs' dataset.
if not os.path.isfile('src_wv_%d.hdf5' % (embed_size)):
    with h5py.File('src_wv_%d.hdf5' % (embed_size), 'w') as hf:
        hf.create_dataset('word_vecs', data=english)

if not os.path.isfile('targ_wv_%d.hdf5' % (embed_size)):
    with h5py.File('targ_wv_%d.hdf5' % (embed_size), 'w') as hf:
        hf.create_dataset('word_vecs', data=tamil)
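
For reference, this is how I sanity-check what ended up in those files (just an illustrative sketch, assuming embed_size = 300 and the 'word_vecs' dataset name used above):

import h5py

# Print the shape of each saved embedding matrix; it should be
# (vocab_size, embed_size) for the corresponding dictionary.
for path in ('src_wv_300.hdf5', 'targ_wv_300.hdf5'):
    with h5py.File(path, 'r') as hf:
        print(path, hf['word_vecs'].shape)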


hmm, did you use --batchsize 32 when running preprocess.py?
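
i.e., something along the lines of

python preprocess.py ... --batchsize 32

(with the rest of your preprocess.py arguments unchanged), so that the batch size baked into the hdf5 data is consistent with the -max_batch_l you pass to train.lua.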

Okay, I forgot to change it in preprocess.py. After changing it there as well, it works. Silly mistake. Thanks so much. I will close the issue.

Can you explain the reason behind the error, why setting the batch size fixed it?

I also wanted to ask: is there any way to save the model to file at intervals smaller than one epoch (preferably in terms of batches)? My training is slow, about an hour per epoch, and I would like to save a checkpoint once every 50 batches.

@hanskrupakar - we are working on intermediate saving - it will be based on a time period (every N minutes), with a corresponding "restart" option so that runs can be restarted easily from any checkpoint. We will be pushing the feature soon!