chaitjo/personalized-dialog

Questions about split memory network

kazeto111 opened this issue · 6 comments

I read your paper and it's really nice!!!
I have 2 questions about it. (I'm a beginner in this field, so I'm sorry if they are absurd ones.)

  1. In the split memory network, you simply sum the vector from the conversation history and the vector from the profile attributes. I thought that weighting the conversation history more heavily might work better, because the conversation history generally carries more information than the profile attributes. Why do you simply sum them? Is it possible to learn the weight jointly with the embeddings?

  2. Are you using the same embedding for A, B, and C, or different ones? I took a look at the code, but I couldn't tell how the embeddings are distinguished if A, B, and C are different.

Hi @kazeto111, thank you for your interest in my work and for the kind words! (Also, please feel free to ask absurd/dumb questions; they often lead to many interesting directions in my experience!)

This work was done over 3 years ago, so take my responses with a grain of salt.

  1. Yes, intuitively, in a dialog, the conversation history and the content of the conversation are probably more important for designing a chatbot. I believe we chose to sum the two branches for simplicity of implementation. You are also correct that the model may implicitly learn to weigh the embeddings from the two branches differently, e.g. by scaling the embeddings from the profile branch to a lower magnitude than those from the conversation branch (see the first sketch after this list).

  2. I am not sure about this any more; sorry! The code should be enough to find this out. If memory serves me right, A, B, and C are all tied together, i.e. the same word embedding matrix is used to embed words regardless of whether they come from the dialog or the query, etc. (see the second sketch below).
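To make the first point concrete, here is a minimal PyTorch-style sketch, not the repo's actual code, of replacing the plain sum with a learned scalar gate (`h_conv`, `h_profile`, and `alpha` are hypothetical names):

```python
import torch
import torch.nn as nn

class BranchCombiner(nn.Module):
    """Hypothetical gate over the two memory branches (not the repo's code)."""

    def __init__(self):
        super().__init__()
        # Start at 0.5 so training begins close to a plain average of branches.
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, h_conv, h_profile):
        # Plain sum (what the paper does) would be: h_conv + h_profile.
        # A learned scalar weight lets the model favor one branch:
        return self.alpha * h_conv + (1.0 - self.alpha) * h_profile
```

That said, since each branch already has its own embedding parameters, the optimizer can fold such a scaling into the embeddings themselves, which is partly why the plain sum is a reasonable default.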
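And for the second point, a toy sketch of what "tied" means, assuming a single-hop bag-of-words memory network (again hypothetical; please verify against the actual code):

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 64  # toy numbers

# Tied: one embedding table plays the role of A (story), B (query),
# and C (output) -- three names, one set of parameters.
shared = nn.Embedding(vocab_size, dim)
A = B = C = shared
# Untied would instead allocate three independent tables, one per name.

story_ids = torch.randint(0, vocab_size, (5, 7))  # 5 memories, 7 words each
query_ids = torch.randint(0, vocab_size, (1, 4))

m = A(story_ids).sum(dim=1)                    # memory keys (bag of words)
u = B(query_ids).sum(dim=1)                    # query embedding
p = torch.softmax(m @ u.t(), dim=0)            # attention over memories
o = (p * C(story_ids).sum(dim=1)).sum(dim=0)   # attended output vector
```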

Thank you for the speedy and gracious reply!

I'd like to ask two follow-up questions.

  1. Actually, I'm thinking of using the split memory network for end-to-end dialogue generation, i.e. dialogue without handcrafted utterance candidates. Do you think it would be effective to simply replace candidate_dict with vocab_dict and predict the words of the utterance?

  2. In connection with my second question: do you think learning A, B, and C separately is worth the cost? A gut feeling is OK.

You would probably need to look at generative models for that purpose; e.g., I would suggest the Transformer, which is commonly used today and follows the encoder-decoder paradigm.
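To illustrate why swapping candidate_dict for vocab_dict is not quite the whole story, here is a hypothetical sketch (toy sizes and made-up variable names, not from this repo) contrasting retrieval with autoregressive generation:

```python
import torch
import torch.nn as nn

dim, vocab_size, num_candidates = 64, 1000, 20  # toy numbers
state = torch.randn(dim)  # pretend memory-network output for the dialog

# Retrieval (this repo): score every entry of candidate_dict, pick the best.
candidate_vecs = torch.randn(num_candidates, dim)  # embedded candidates
best_candidate = (candidate_vecs @ state).argmax()

# Generation: a decoder emits one softmax over vocab_dict per time step,
# conditioned on the words it has already produced -- a loop, not a
# single classification over a fixed set.
emb = nn.Embedding(vocab_size, dim)
decoder = nn.GRUCell(dim, dim)
out_proj = nn.Linear(dim, vocab_size)

h = state.unsqueeze(0)           # (1, dim) hidden state
inp = torch.zeros(1, dim)        # start-of-sequence input
tokens = []
for _ in range(10):              # generate at most 10 words
    h = decoder(inp, h)
    tok = out_proj(h).argmax()   # greedy pick from the vocabulary
    tokens.append(int(tok))
    inp = emb(tok).unsqueeze(0)  # feed the chosen word back in
```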

As far as I know, modern NLP systems and models no longer use word embedding matrices like A, B, and C here. They use something called Byte-Pair Encoding (BPE) to encode 'chunks', or common n-grams, of the language. I think it is more common to use the same BPE embedding matrix for both the encoder and the decoder in a dialog encoder-decoder model today (but again, please check).
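For a concrete feel, here is a tiny example using the GPT-2 tokenizer from the Hugging Face transformers library (one BPE implementation among many; the exact splits it produces are illustrative):

```python
# pip install transformers
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")

# BPE splits rare words into frequent sub-word pieces, so there is no
# word-level embedding matrix like A, B, or C -- just one table of pieces.
pieces = tok.tokenize("memory networks personalize dialogs")
ids = tok.convert_tokens_to_ids(pieces)
print(pieces)  # sub-word pieces; a leading 'Ġ' marks a preceding space
print(ids)     # integer ids into a single shared embedding table
```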

OTOH, if you are doing machine translation, where the language to be encoded differs from the language to be decoded, you would use separate embeddings for the encoder and the decoder and learn both.

Sorry, what I wanted to say is that I'm thinking of attaching a split memory network to a Transformer-based model, GPT-2 for example, and fine-tuning it. Do you have any insight about this?

Thank you so much for the information about embeddings! I wasn't familiar with such things!

No, I'm not hands-on with this particular scenario. Good luck!

Thank you so much for all your kindness!!!!!