churchlab/UniRep

Where are the sequences being embedded?

akmorrow13 opened this issue · 7 comments

Could you point out where in the code the amino acid sequences are being embedded to 10 dimensions? This was stated in Alley et al., but I am not able to find it in the code. Thanks!

The embedding happens in all three babbler classes (1900, 256, and 64) in unirep.py, at the following step:

embed_matrix = tf.get_variable(
    "embed_matrix", dtype=tf.float32,
    initializer=np.load(os.path.join(self._model_path, "embed_matrix:0.npy"))
)
embed_cell = tf.nn.embedding_lookup(embed_matrix, self._minibatch_x_placeholder)

I've always wondered, though, what this 10D embedding is based on. Clustering the embedding dimensions for each AA doesn't reveal any obvious structure at first glance.
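
For concreteness, this is roughly what I mean by clustering (a rough sketch only; it assumes embed_matrix:0.npy has been pulled from s3 into the working directory and that scipy is installed):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

embed_matrix = np.load("embed_matrix:0.npy")  # learned (26, 10) embedding matrix

# Hierarchically cluster the 26 token vectors and cut the tree into 5 groups.
Z = linkage(embed_matrix, method="average")
labels = fcluster(Z, t=5, criterion="maxclust")
for token_index, cluster_id in enumerate(labels):
    print(token_index, cluster_id)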

Thanks @ElArkk! Yes, it is still not clear to me from this code how the sequences are actually embedded. From the API docs, tf.nn.embedding_lookup seems to just look up pre-generated embeddings rather than generate them itself. This still leaves open the question of what the embedding actually is.

So for the embeddings themselves: they come from the embed_matrix:0.npy file (which you need to pull from s3; see "Obtaining weight files" in the readme). Inside the file is a numpy array of shape (26, 10), i.e. a 10-dimensional vector for each of the 26 AA variants/tokens the model considers (see the aa_to_int dictionary in data_utils.py). I think the embedding at index 0 actually never gets called.
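
As a quick sanity check, something like this shows what is in the file (a minimal sketch; it assumes the weight files sit in the working directory, that data_utils.py from the repo is importable, and that aa_to_int uses single-letter amino-acid codes as keys):

import numpy as np
from data_utils import aa_to_int  # integer token id for each amino-acid character

embed_matrix = np.load("embed_matrix:0.npy")
print(embed_matrix.shape)  # (26, 10): one 10-d vector per token

# Learned 10-d embedding for a single amino acid, e.g. alanine:
print(embed_matrix[aa_to_int["A"]])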

Thank you for the clarification @ElArkk . My main question is how is this 26x10 embedding generated? Is there existing code for this?

This point remains a mystery to me as well. Maybe the authors can comment on this?

Thank you, @ElArkk . I will let you know their response.

Hi,

If the question is how the 26x10 embedding matrix is generated: it's generated like any other tensor in the graph. The initial embedding matrix was randomly initialized; it was then learned with gradient descent, and what's provided in embed_matrix:0.npy is the learned tensor.
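
Concretely, the setup looks something like this (a minimal TF1-style sketch, not the actual UniRep training code; the variable name and shape come from this thread, the initializer is purely illustrative):

import tensorflow as tf

# At the start of training the embedding is just a randomly initialized,
# trainable variable; backprop updates it like any other weight in the graph.
embed_matrix = tf.get_variable(
    "embed_matrix", shape=(26, 10), dtype=tf.float32,
    initializer=tf.random_uniform_initializer(-0.1, 0.1),
    trainable=True,
)
# After training, the learned values are what ends up in embed_matrix:0.npy,
# e.g. np.save("embed_matrix:0.npy", sess.run(embed_matrix))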

tf.nn.embedding_lookup is an embedding op. It does a parallelized/optimized version of the following procedure (sketched in code below the list):

  1. Convert each integer-encoded sequence in the batch to a seq_len x 26 one-hot encoding.
  2. Multiply this seq_len x 26 one-hot matrix on the right by the 26 x 10 embedding tensor (this is a TF variable that is initially populated with the contents of embed_matrix:0.npy). If we were just beginning training of UniRep, this would be a randomly initialized tensor. This "embeds" each amino acid in the sequence into 10 dimensions.
  3. This operation is carried out for all sequences in the batch (again in a parallelized/optimized fashion).
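
Here is a tiny numpy illustration of the same idea (a sketch only, not the UniRep code itself; the real op runs inside the TF graph and never materializes the one-hot matrix):

import numpy as np

rng = np.random.default_rng(0)
embed_matrix = rng.normal(size=(26, 10))      # stands in for the learned 26 x 10 tensor
seq = np.array([1, 5, 3, 3, 20])              # one integer-encoded sequence, seq_len = 5

one_hot = np.eye(26)[seq]                     # seq_len x 26 one-hot encoding
embedded_via_matmul = one_hot @ embed_matrix  # seq_len x 10
embedded_via_lookup = embed_matrix[seq]       # what tf.nn.embedding_lookup effectively does

assert np.allclose(embedded_via_matmul, embedded_via_lookup)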

Note again that because the embedding matrix is a TF variable, it is backpropped over and learned. It is not "pre-generated" in any way that doesn't involve TF and gradient descent.

Let us know if this doesn't answer your question!