krasserm/perceiver-io

Genomic sequences

ajv012 opened this issue · 7 comments

Hello,

Thank you for your implementation of the PerceiverIO project. I am trying to use your work for genomic sequences of shape (10k, 1). I noticed that your model produces the SAME output for DIFFERENT inputs when the num_channels dimension is 1 (I am not using the Fourier feature encodings). When the outputs are not identical, they differ only negligibly. Can you please guide me in solving this issue? Thanks in advance!

Please let me know what additional information you would need to reproduce this bug.

Hi @ajv012, thanks for your interest in this project. Can you provide a minimal running code example that reproduces what you observe?

Running the attached file produces the following output, showing that the two sequences are different but their model outputs in a four-class setting are the same:

sequence_1 == sequence_2? False for sequence_1 tensor([[-0.0385, -0.2010, 0.1352, 0.1186]]) for sequence_2 tensor([[-0.0385, -0.2010, 0.1352, 0.1186]])
perceiver_debug.txt

The reason why you observe the same output for different inputs is the low number of input channels (= 1). Before multi-head attention, the input runs through layer norm, which normalizes each position over its channel dimension; with num_channels = 1 every position collapses to values close to zero (i.e. <= 1e-5), regardless of the input. These near-zero values are used to compute the logits of the attention matrix, so the logits are close to zero too. Computing the softmax over these logits gives identical (low) attention probabilities because of limited FP32 precision. This leads to identical values at each position of the resulting latent array, which finally causes the identical output you observe for different inputs.
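To illustrate the mechanism (a minimal PyTorch snippet, not code from this repo):

```python
import torch
import torch.nn as nn

# Layer norm computes (x - mean) / sqrt(var + eps) over the channel
# dimension. With a single channel, mean == x and var == 0, so every
# position collapses to ~0 regardless of its input value.
x = torch.randn(1, 10, 1)                 # (batch, seq_len, num_channels=1)
print(nn.LayerNorm(1)(x).abs().max())     # ~0 for any input

# With more channels, normalized positions remain distinguishable.
x = torch.randn(1, 10, 256)
print(nn.LayerNorm(256)(x).abs().max())   # O(1) values, inputs stay distinct
```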

A Perceiver encoder is designed to work on higher-dimensional inputs, i.e. inputs with a higher number of channels. When you increase the number of input channels from 1 to 256, for example, you'll actually see different outputs. So your input adapter should generate an input encoding (either Fourier or a learned embedding) with a reasonable number of channels, as sketched below. You'll want such an encoding anyway, since the position in a genomic sequence actually matters (at least this is my naive assumption).
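For illustration, a hypothetical minimal fix along these lines (the names are mine, not this repo's API):

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a learned linear projection that widens the
# 1-channel input to 256 channels before it enters the encoder.
proj = nn.Linear(1, 256)
x = torch.randn(2, 10000, 1)   # two different (10k, 1) sequences
print(proj(x).shape)           # torch.Size([2, 10000, 256])
```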

Thank you so much for the explanation!
The genomic data is one-dimensional in its raw format. Would you recommend having a feedforward network in the input adapter that projects the input to a higher-dimensional space (training this network, encoder, and decoder in an end-to-end fashion)?

You could even leave the input data as is and concatenate Fourier position encodings, as done in ImageInputAdapter, except that you'll want a 1D position encoding instead of a 2D one; this should be straightforward with the existing utilities. Let me know if you need assistance, I want to support Fourier position encodings for 1D text inputs soon anyway.
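Roughly in this spirit (a sketch of 1D Fourier features following the Perceiver paper, with illustrative names rather than this repo's actual utilities):

```python
import math
import torch

def fourier_position_encoding_1d(seq_len, num_bands, max_freq):
    """Hypothetical sketch: sin/cos features of linearly spaced
    frequencies applied to positions scaled to [-1, 1]."""
    pos = torch.linspace(-1.0, 1.0, seq_len).unsqueeze(-1)   # (seq_len, 1)
    freqs = torch.linspace(1.0, max_freq / 2.0, num_bands)   # (num_bands,)
    angles = pos * freqs * math.pi                           # (seq_len, num_bands)
    # sin, cos and the raw position -> 2 * num_bands + 1 channels
    return torch.cat([angles.sin(), angles.cos(), pos], dim=-1)

x = torch.randn(1, 10000, 1)   # raw values, left as is
enc = fourier_position_encoding_1d(10000, 64, 10000).unsqueeze(0)
x = torch.cat([x, enc], dim=-1)   # (1, 10000, 1 + 129) encoder input
```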

If you'd like to use learned position encodings with your input data, see TextInputAdapter for an example. I'm not familiar with genomic sequence processing and don't know if an embedding over a vocabulary of size 4 (A, T, C, G) makes sense, or if state-of-the-art approaches use something different. Alternatively, concatenating learned position encodings to your input data (instead of adding them) should be possible too.
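A hypothetical sketch of that idea (illustrative names, assuming a vocabulary of size 4; not code from this repo):

```python
import torch
import torch.nn as nn

class GenomicInputAdapter(nn.Module):
    """Embeds a small vocabulary and concatenates (rather than adds)
    a learned position encoding."""
    def __init__(self, seq_len, vocab_size=4, emb_dim=128, pos_dim=128):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, emb_dim)
        self.pos_emb = nn.Parameter(torch.empty(seq_len, pos_dim).normal_(std=0.02))

    def forward(self, tokens):                # tokens: (batch, seq_len) int64
        x = self.token_emb(tokens)            # (batch, seq_len, emb_dim)
        pos = self.pos_emb.expand(tokens.shape[0], -1, -1)
        return torch.cat([x, pos], dim=-1)    # (batch, seq_len, emb_dim + pos_dim)

adapter = GenomicInputAdapter(seq_len=10000)
tokens = torch.randint(0, 4, (2, 10000))
print(adapter(tokens).shape)                  # torch.Size([2, 10000, 256])
```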

Ah, I see the point of confusion. I incorrectly said genomic sequences when I really meant genomic profiles, which do not consist of A, T, C, G letters but of expression levels of different genes (continuous numbers) and their mutational status (categorical). In this case, position encodings do not make sense because permuting the values in a genomic profile should not change the output (i.e. position is irrelevant).

My question here pertains to a larger project where I am trying to use the same Perceiver model for histology images and genomic profiles. I would be happy to explain my work in detail and get your guidance on how to proceed with your repo as a starting point. Maybe it makes sense to take this conversation offline?

Thanks for the clarification, this makes sense. Feel free to contact me on another channel for a conversation on how to proceed. Closing this ticket for now. We can open other tickets later for discussing specific implementation options.