Error in evaluation of all three datasets

Question

uditsaxena opened this issue 6 years ago · 15 comments

I'm now updating the issue after a bit of debugging.

This is the problem while training:

Traceback (most recent call last):
  File "src/models/main.py", line 90, in <module>
    main(args)
  File "src/models/main.py", line 62, in main
    evpi(word_embeddings, vocab_size, word_emb_dim, freeze, args, train, test)
  File "/home/usaxena/work/rcq/ranking_clarification_questions/src/models/evpi.py", line 273, in evpi
    validate(train_fn, 'TRAIN', epoch, train, args)
  File "/home/usaxena/work/rcq/ranking_clarification_questions/src/models/evpi.py", line 206, in validate
    out = val_fn(p, pm, q, qm, a, am, l)
  File "/home/usaxena/anaconda3/envs/rcq/lib/python2.7/site-packages/theano/compile/function_module.py", line 618, in __call__
    storage_map=self.fn.storage_map)
  File "/home/usaxena/anaconda3/envs/rcq/lib/python2.7/site-packages/theano/gof/link.py", line 297, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/home/usaxena/anaconda3/envs/rcq/lib/python2.7/site-packages/theano/compile/function_module.py", line 607, in __call__
    outputs = self.fn()
ValueError: Input dimension mis-match. (input[1].shape[1] = 128, input[2].shape[1] = 1)
Apply node that caused the error: Elemwise{Composite{((i0 * i1 * i2) + (i0 * i3 * i4))}}(TensorConstant{(1, 1) of -1.0}, InplaceDimShuffle{x,0}.0, HostFromGpu.0, Elemwise{sub,no_inplace}.0, HostFromGpu.0)
Toposort index: 8029
Inputs types: [TensorType(float64, (True, True)), TensorType(int32, row), TensorType(float32, matrix), TensorType(float64, row), TensorType(float32, matrix)]
Inputs shapes: [(1, 1), (1, 128), (128, 1), (1, 128), (128, 1)]
Inputs strides: [(8, 8), (5120, 40), (4, 4), (1024, 8), (4, 4)]
Inputs values: [array([[-1.]]), 'not shown', 'not shown', 'not shown', 'not shown']
Outputs clients: [[Sum{acc_dtype=float64}(Elemwise{Composite{((i0 * i1 * i2) + (i0 * i3 * i4))}}.0)]]

This error occurs during training and is the root cause of the issue I reported previously.

Answer 1 · 2018-08-07T04:43:52.000Z

I've updated this issue with the correct error message, which caused the error I encountered earlier during evaluation.

@raosudha89

Answer 2 · 2018-08-07T14:10:19.000Z

Hi Udit,

Can you share with me the configuration with which you are running this training? i.e. the values in your run_main.sh script? I am able to run the training successfully on the three datasets on my side.

Answer 3 · 2018-08-08T05:22:45.000Z

Hi Sudha,

This is my run_main.sh script

#!/bin/bash

DATA_DIR=data
EMB_DIR=embeddings
SITE_NAME=askubuntu.com
#SITE_NAME=unix.stackexchange.com
#SITE_NAME=superuser.com
#SITE_NAME=askubuntu_unix_superuser

OUTPUT_DIR=output
SCRIPTS_DIR=src/models
#MODEL=baseline_pq
#MODEL=baseline_pa
#MODEL=baseline_pqa
MODEL=evpi

mkdir -p $OUTPUT_DIR

THEANO_FLAGS=floatX=float32,device=gpu python $SCRIPTS_DIR/main.py \
--post_ids_train $DATA_DIR/$SITE_NAME/post_ids_train.p \
--post_vectors_train $DATA_DIR/$SITE_NAME/post_vectors_train.p \
--ques_list_vectors_train $DATA_DIR/$SITE_NAME/ques_list_vectors_train.p \
--ans_list_vectors_train $DATA_DIR/$SITE_NAME/ans_list_vectors_train.p \
--post_ids_test $DATA_DIR/$SITE_NAME/post_ids_test.p \
--post_vectors_test $DATA_DIR/$SITE_NAME/post_vectors_test.p \
--ques_list_vectors_test $DATA_DIR/$SITE_NAME/ques_list_vectors_test.p \
--ans_list_vectors_test $DATA_DIR/$SITE_NAME/ans_list_vectors_test.p \
--word_embeddings $EMB_DIR/word_embeddings.p \
--batch_size 128 --no_of_epochs 20 --no_of_candidates 10 \
--test_predictions_output $DATA_DIR/$SITE_NAME/test_predictions_${MODEL}.out \
--stdout_file $OUTPUT_DIR/${SITE_NAME}.${MODEL}.out \
--model $MODEL \

I had to correct the model_predictions_filename in src/evaluation/run_evaluation.sh - it ended with *.model.out.epoch13, but everything else is running as it is in the repo.

Other details :

When I run it on my cluster, I load my python2.7 env from anaconda. I can try just loading certain modules. Let me know if you need any more information from my end.
Edit: I just ran it as part of slurm, without the anaconda env loaded - it gave the same error. Didn't think that would have been the issue, but I wanted to be thorough.

Answer 4 · 2018-08-08T06:38:35.000Z

To figure out what was going wrong, I also ran with the following theano flags:

theano.config.exception_verbosity='high'
theano.config.optimizer='None'

And this was the error:

Traceback (most recent call last):
  File "src/models/main.py", line 90, in <module>
    main(args)
  File "src/models/main.py", line 62, in main
    evpi(word_embeddings, vocab_size, word_emb_dim, freeze, args, train, test)
  File "/home/usaxena/work/rcq/ranking_clarification_questions/src/models/evpi.py", line 276, in evpi
    validate(train_fn, 'TRAIN', epoch, train, args)
  File "/home/usaxena/work/rcq/ranking_clarification_questions/src/models/evpi.py", line 209, in validate
    out = val_fn(p, pm, q, qm, a, am, l)
  File "/home/usaxena/.local/lib/python2.7/site-packages/theano/compile/function_module.py", line 618, in __call__
    storage_map=self.fn.storage_map)
  File "/home/usaxena/.local/lib/python2.7/site-packages/theano/gof/link.py", line 297, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/home/usaxena/.local/lib/python2.7/site-packages/theano/compile/function_module.py", line 607, in __call__
    outputs = self.fn()
ValueError: Input dimension mis-match. (input[1].shape[1] = 128, input[2].shape[1] = 1)
Apply node that caused the error: Elemwise{Composite{((i0 * i1 * i2) + (i0 * i3 * i4))}}(TensorConstant{(1, 1) of -1.0}, InplaceDimShuffle{x,0}.0, HostFromGpu.0, Elemwise{sub,no_inplace}.0, HostFromGpu.0)
Toposort index: 8041
Inputs types: [TensorType(float64, (True, True)), TensorType(int32, row), TensorType(float32, matrix), TensorType(float64, row), TensorType(float32, matrix)]
Inputs shapes: [(1, 1), (1, 128), (128, 1), (1, 128), (128, 1)]
Inputs strides: [(8, 8), (5120, 40), (4, 4), (1024, 8), (4, 4)]
Inputs values: [array([[-1.]]), 'not shown', 'not shown', 'not shown', 'not shown']
Outputs clients: [[Sum{acc_dtype=float64}(Elemwise{Composite{((i0 * i1 * i2) + (i0 * i3 * i4))}}.0)]]

Debugprint of the apply node:
Elemwise{Composite{((i0 * i1 * i2) + (i0 * i3 * i4))}} [@A] <TensorType(float64, matrix)> ''
 |TensorConstant{(1, 1) of -1.0} [@B] <TensorType(float64, (True, True))>
 |InplaceDimShuffle{x,0} [@C] <TensorType(int32, row)> ''
 | |Subtensor{::, int64} [@D] <TensorType(int32, vector)> ''
 |   |<TensorType(int32, matrix)> [@E] <TensorType(int32, matrix)>
 |   |Constant{0} [@F] <int64>
 |HostFromGpu [@G] <TensorType(float32, matrix)> ''
 | |GpuElemwise{Composite{scalar_softplus((-i0))},no_inplace} [@H] <CudaNdarrayType(float32, matrix)> ''
 |   |GpuElemwise{Add}[(0, 0)] [@I] <CudaNdarrayType(float32, matrix)> ''
 |     |GpuDot22 [@J] <CudaNdarrayType(float32, matrix)> ''
 |     | |GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace} [@K] <CudaNdarrayType(float32, matrix)> ''
 |     | | |CudaNdarrayConstant{[[ 0.5]]} [@L] <CudaNdarrayType(float32, (True, True))>
 |     | | |GpuElemwise{Add}[(0, 0)] [@M] <CudaNdarrayType(float32, matrix)> ''
 |     | |   |GpuDot22 [@N] <CudaNdarrayType(float32, matrix)> ''
 |     | |   | |GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace} [@O] <CudaNdarrayType(float32, matrix)> ''
 |     | |   | | |CudaNdarrayConstant{[[ 0.5]]} [@L] <CudaNdarrayType(float32, (True, True))>
 |     | |   | | |GpuElemwise{Add}[(0, 0)] [@P] <CudaNdarrayType(float32, matrix)> ''

My console output is the following:

Namespace(
ans_list_vectors_test='data/askubuntu.com/ans_list_vectors_test.p',
ans_list_vectors_train='data/askubuntu.com/ans_list_vectors_train.p', 
ans_max_len=40, batch_size=128, hidden_dim=100, learning_rate=0.001, 
model='evpi', no_of_candidates=10, no_of_epochs=20,
post_ids_test='data/askubuntu.com/post_ids_test.p',
post_ids_train='data/askubuntu.com/post_ids_train.p', post_max_len=300,
post_vectors_test='data/askubuntu.com/post_vectors_test.p',
post_vectors_train='data/askubuntu.com/post_vectors_train.p',
ques_list_vectors_test='data/askubuntu.com/ques_list_vectors_test.p',
ques_list_vectors_train='data/askubuntu.com/ques_list_vectors_train.p', 
ques_max_len=40, rho=1e-05,
stdout_file='output/askubuntu.com.evpi.out',
test_predictions_output='data/askubuntu.com/test_predictions_evpi.out',
word_embeddings='embeddings/word_embeddings.p'
)

word emb dim:  200
vocab_size  253440 , post_max_len  300  ques_max_len  40  ans_max_len  40
generating data
done! Time taken:  2.29615902901
Size of training data:  19945
Size of test data:  2493
Compiling graph...
done! Time taken:  2179.79267192

I hope this helps...

Answer 5 · 2018-08-12T04:31:20.000Z

Hi @raosudha89
I was wondering if you had any updates on this.
Thanks

Answer 6 · 2018-08-14T16:18:45.000Z

Hi Udit,

(Apologies for the delayed response)

I am unable to reproduce this error on my side. I have a few follow-up questions which might help debug this issue:

1. Does the data generation script run into any error? Perhaps the data is not generated correctly?
2. I notice that compiling the graph is taking you 2180 secs whereas it takes only 600 secs on my side to compile the graph. Are you running this script on GPU or CPU?
3. Can you try rerunning the script with "SITE_NAME=askubuntu_unix_superuser" instead of "SITE_NAME=askubuntu.com" in the run_main.sh script? I want to understand if you run into this error for the combined data as well.

Thanks
Sudha

Answer 7 · 2018-08-14T17:42:23.000Z

Hi Sudha,

These are the answers:

1. I don't use the data generation script - I use the data you have provided. Similarly for the embeddings.
2. Right now, for debugging the code, I have the flag theano.config.optimizer='None' turned on. Maybe that is the reason. I am running this on the GPU.
3. I ran into this for all three datasets, run separately. I haven't yet tried this for the combined dataset, but I will and let you know.

Thanks.

Answer 8 · 2018-08-14T17:47:02.000Z

Regarding 1., do you get any error when running sh src/models/run_load_data.sh? As per src/models/README, you need to run this before running the main script.

Answer 9 · 2018-08-14T17:50:09.000Z

No, I don't get any error there. The shape mismatch error is thrown only when I run the training code.

What operation could this message be pointing to:

Elemwise{Composite{((i0 * i1 * i2) + (i0 * i3 * i4))}} [@A] <TensorType(float64, matrix)> ''
Especially looking at this:

Inputs shapes: [(1, 1), (1, 128), (128, 1), (1, 128), (128, 1)]
Inputs strides: [(8, 8), (5120, 40), (4, 4), (1024, 8), (4, 4)]
Inputs values: [array([[-1.]]), 'not shown', 'not shown', 'not shown', 'not shown']
Outputs clients: [[Sum{acc_dtype=float64}(Elemwise{Composite{((i0 * i1 * i2) + (i0 * i3 * i4))}}.0)]]
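
For anyone reading the message above: Theano only stretches a dimension that the input's type declares broadcastable. A `row` is broadcastable on its first axis, but a `matrix` of shape (128, 1) is not broadcastable on either axis, so it cannot be combined elementwise with a (1, 128) row. Below is a minimal pure-Python sketch of that rule. It is a simplified model, not Theano's actual implementation, and `check_elemwise` is a hypothetical helper name. The shapes and types are copied from the traceback. The second call only illustrates one way the shapes could agree (labels supplied as a column); it is an assumption, not a fix confirmed in this thread.

```python
def check_elemwise(shapes, broadcastable):
    """Simplified model of an elementwise shape check.

    shapes: one shape tuple per input.
    broadcastable: one tuple of bools per input; True marks a dimension
    declared broadcastable (length 1, free to stretch).
    Returns the output shape, or raises ValueError on a mismatch.
    """
    ndim = len(shapes[0])
    out = []
    for d in range(ndim):
        base = None  # (input index, length) of first non-broadcastable input
        for i, (shape, pattern) in enumerate(zip(shapes, broadcastable)):
            if pattern[d]:
                continue  # declared broadcastable: matches any length
            if base is None:
                base = (i, shape[d])
            elif shape[d] != base[1]:
                raise ValueError(
                    "Input dimension mis-match. "
                    "(input[%d].shape[%d] = %d, input[%d].shape[%d] = %d)"
                    % (base[0], d, base[1], i, d, shape[d]))
        out.append(1 if base is None else base[1])
    return tuple(out)


# Shapes and types copied from the traceback.
shapes = [(1, 1), (1, 128), (128, 1), (1, 128), (128, 1)]
patterns = [
    (True, True),    # TensorType(float64, (True, True))
    (True, False),   # row
    (False, False),  # matrix
    (True, False),   # row
    (False, False),  # matrix
]

try:
    check_elemwise(shapes, patterns)
except ValueError as err:
    message = str(err)
    print(message)

# Hypothetical alignment: the two row inputs supplied as columns instead.
aligned_shapes = [(1, 1), (128, 1), (128, 1), (128, 1), (128, 1)]
aligned_patterns = [
    (True, True),
    (False, True),   # column
    (False, False),
    (False, True),   # column
    (False, False),
]
print(check_elemwise(aligned_shapes, aligned_patterns))
```

Under this model, the first call fails on axis 1 between input 1 (the int32 row) and input 2 (the float32 matrix), which are the same two inputs named in the error message. That suggests looking at wherever a per-example vector is combined with an (N, 1) network output.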

Answer 10 · 2018-08-14T18:12:28.000Z

My guess is that it is pointing to this line: loss = pq_a_loss + pqa_loss
It adds two values, and each of those values involves a multiplication with TensorType(float64, (True, True)), which corresponds to the labels.

Could you replace this line: "loss = pq_a_loss + pqa_loss" with " loss = pq_a_loss" and see if you still get the error?

Answer 11 · 2018-08-14T18:26:46.000Z

Okay, I will do that and get back to you soon.

Answer 12 · 2018-08-15T07:22:22.000Z

Hi Sudha,

I tried your suggestion. No luck. Same problem and error message. I tried both loss = pq_a_loss and loss=pqa_loss. It didn't work.

Where else do you think this kind of input occurs: [(1, 1), (1, 128), (128, 1), (1, 128), (128, 1)]?
As for the code - do you happen to have a PyTorch version of it? TensorFlow? Could you point me to any other reliable implementations?

Could you describe your environment? I'm running Python 2.7.12, Lasagne 0.1, and Theano 0.7.

Also, in the file src/models/evpi.py, under the function answer_model, if you look at line 42 -

l_post_ques_dense = lasagne.layers.DenseLayer(l_post_ques_denses[-1], num_units=1,\
										nonlinearity=lasagne.nonlinearities.sigmoid)

it looks like this variable is not used before. Is that supposed to happen?
@raosudha89

Answer 13 · 2018-09-20T07:34:33.000Z

@uditsaxena
I have the same problem as you. When I ran sh src/models/run_main.sh, I got:

ValueError: Input dimension mis-match. (input[1].shape[1] = 128, input[2].shape[1] = 1)
Apply node that caused the error: Elemwise{Composite{((i0 * i1 * scalar_softplus((-i2))) + (i3 * i4 * scalar_softplus(i2)))}}(TensorConstant{(1, 1) of -1.0}, InplaceDimShuffle{x,0}.0, Elemwise{add,no_inplace}.0, TensorConstant{(1, 1) of -1.0}, Elemwise{sub,no_inplace}.0)
Toposort index: 1197
Inputs types: [TensorType(float64, (True, True)), TensorType(int32, row), TensorType(float32, matrix), TensorType(float64, (True, True)), TensorType(float64, row)]
Inputs shapes: [(1, 1), (1, 128), (128, 1), (1, 1), (1, 128)]
Inputs strides: [(8, 8), (5120, 40), (4, 4), (8, 8), (1024, 8)]
Inputs values: [array([[-1.]]), 'not shown', 'not shown', array([[-1.]]), 'not shown']
Outputs clients: [[Sum{acc_dtype=float64}(Elemwise{Composite{((i0 * i1 * scalar_softplus((-i2))) + (i3 * i4 * scalar_softplus(i2)))}}.0)]]

I have the configuration:
python 3.6
theano 1.0.2
lasagne 0.1

Did you solve the problem above?

Answer 14 · 2018-09-20T07:41:42.000Z

@raosudha89
Can you give me your environment configuration? Thank you.

Answer 15 · 2019-03-21T17:51:43.000Z

Hi @ynuwm and @uditsaxena

(Apologies for the super delayed response. Hope this is still relevant.)
Below is my environment config:

Python 2.7.5
Theano 0.9.0dev5
Lasagne 0.2.dev1
Cuda 8.0.44
Cudnn 5.1

Let me know if it works with this.
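
Since three different Theano/Lasagne combinations appear in this thread, it may help to compare environments the same way on each machine. The snippet below is a small sketch for that purpose. `report_versions` is a hypothetical helper, not part of the repository, and it assumes each package exposes a `__version__` attribute. It does not report CUDA or cuDNN versions.

```python
import importlib
import sys


def report_versions(module_names):
    """Return {name: version} for each module, noting any import failure."""
    versions = {"python": sys.version.split()[0]}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception as exc:  # an incompatible install can fail on import
            versions[name] = "import failed: %s" % type(exc).__name__
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


# Print one line per package so the output can be pasted into the issue.
for name, version in sorted(report_versions(["theano", "lasagne", "numpy"]).items()):
    print("%s %s" % (name, version))
```

An "import failed" entry is informative in itself: it usually means the package is missing or does not load against the other installed versions.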