ServerSideHannes/las

Help needed with understanding x_2.


x_2 should have shape (Batch-size, no_prev_tokens, No_tokens).
x_2 = np.random.random((1, 12, 16))
When you say "number of previous tokens", what exactly does it mean?

At training time I would know all the tokens, right?

Hi!

This mainly concerns inference. Since we don't have all the tokens during inference, you have to predict one token at a time. As you say, during training we have all the previous tokens and can thus feed the entire sequence to the model without any problem.
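
Roughly, the inference loop looks something like the sketch below (a minimal illustration, not code from this repo; model_predict is a stand-in for the trained model and START_TOKEN is a hypothetical start-of-sequence id):

import numpy as np

VOCAB_SIZE = 16   # matches the earlier (1, 12, 16) example
MAX_LEN = 12      # how many tokens to decode
START_TOKEN = 0   # hypothetical start-of-sequence id

def model_predict(x_1, x_2):
    # Stand-in for the trained model: returns one probability
    # distribution over the vocabulary per input token
    return np.random.random((x_2.shape[0], x_2.shape[1], VOCAB_SIZE))

x_1 = np.random.random((1, 100, 40))  # encoder input; shape is only an example

predicted = [START_TOKEN]
for _ in range(MAX_LEN):
    # x_2 holds the tokens predicted so far, one-hot encoded:
    # shape (1, len(predicted), VOCAB_SIZE)
    x_2 = np.eye(VOCAB_SIZE)[predicted][np.newaxis, ...]
    probs = model_predict(x_1, x_2)
    next_token = int(np.argmax(probs[0, -1]))  # keep only the last time step
    predicted.append(next_token)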

Let's say I am only training on a single example, and the number of tokens in the utterance is 12
and VOCAB_SIZE is 16.

I am guessing that I would need an embedding layer, and that you are suggesting keeping the dimensionality of the embedding layer equal to VOCAB_SIZE?

Or should I keep the embedding layer size tunable and reduce the embeddings to VOCAB_SIZE with a dense layer before generating x_2?

Also, Hi!

I am not sure that you need an embedding layer. From what you describe ("the number of tokens in the utterance is 12 and VOCAB_SIZE is 16"), the model should be plug and play.

Can you perhaps explain why you would need an embedding layer for feeding the one-hot encoded vectors?

Edit: The model always returns a sequence, so if you feed it 12 tokens it will return 12 tokens. Here is a simple illustration: https://www.tensorflow.org/tutorials/text/images/text_generation_sampling.png

During training we use the ground truth as input tokens, but during inference we don't know the ground truth, so we must use the previously predicted tokens as inputs to the model.
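
For training, something like this is what I mean (a rough sketch; the token ids are made up, and shifting right by one with a start token is just one common convention for building the decoder input):

import numpy as np

VOCAB_SIZE = 16
START_TOKEN = 0  # hypothetical start-of-sequence id

# Ground-truth token ids for one 12-token utterance (made-up values)
tokens = np.array([3, 7, 1, 9, 4, 2, 8, 5, 6, 10, 11, 12])

# Teacher forcing: the decoder input is the ground truth shifted right by one,
# and the training target is the original sequence
input_ids = np.concatenate(([START_TOKEN], tokens[:-1]))
x_2 = np.eye(VOCAB_SIZE)[input_ids][np.newaxis, ...]  # (1, 12, 16), one-hot
y = np.eye(VOCAB_SIZE)[tokens][np.newaxis, ...]       # (1, 12, 16), one-hot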

x_2 = np.random.random((1, 12, 16))

Since you sampled x_2 from a uniform random distribution, the entries are not one-hot encoded.

Just from printing x_2, I thought it was supposed to be generated by an embedding layer.

Edit: "number of tokens in the utterance is 12 and VOCAB_SIZE is 16." is just for illustration purpose, I intentionally took the dims to be same as your example.

Thank you, I explained this poorly. I will explain it better in the readme.

The random matrices are just shape examples, not value examples. Each token vector should be represented as a one-hot vector. Did I make myself clearer now, or did I miss something?
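
For example, a one-hot x_2 with the same shape as the random placeholder could be built like this (a quick sketch, assuming TensorFlow/Keras is installed; the token ids are made up):

import numpy as np
from tensorflow.keras.utils import to_categorical

VOCAB_SIZE = 16
token_ids = [3, 7, 1, 9, 4, 2, 8, 5, 6, 10, 11, 12]  # 12 made-up previous tokens

# Same shape as np.random.random((1, 12, 16)), but each row is a proper one-hot vector
x_2 = to_categorical(token_ids, num_classes=VOCAB_SIZE)[np.newaxis, ...]
print(x_2.shape)         # (1, 12, 16)
print(x_2.sum(axis=-1))  # every row sums to 1.0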

Yes! It's clear. I have another problem with the Attention Context, but I'll open another issue for that. Thanks, closing this.