BIMSBbioinfo/janggu

input_attribution from a Bioseq object

Closed this issue · 4 comments

Hello again @wkopp ,

I'm trying to evaluate some predictions by computing the integrated gradients. I have both a sequence-based model and another one with multi input (sequence and bigwig coverage).

I have a set of FASTA sequences (or one-hot encoded numpy arrays) that I would like to test. Since I don't have a specific set of genomic coordinates, I would like to pass specific indices into my data to the method. As the documentation says: "The method can either be called, by specifying the region of interest directly by setting chrom, start and end. Alternatively, it is possible to specify the region index. For example, the n^th region of the dataset."

I'm not fully understanding how to set the index given the available arguments:

def input_attribution(model, inputs,  # pylint: disable=too-many-locals
                      chrom=None, start=None, end=None):

All I'm passing is the following, which fails with a TypeError:

input_attribution(model, _data)

File "/Users/pbarbosa/miniconda3/envs/ml_genomics/lib/python3.7/site-packages/janggu/model.py", line 1108, in input_attribution
    inp.shape[-2], inp.shape[-1])) for inp in inputs]
  File "/Users/pbarbosa/miniconda3/envs/ml_genomics/lib/python3.7/site-packages/janggu/model.py", line 1108, in <listcomp>
    inp.shape[-2], inp.shape[-1])) for inp in inputs]
TypeError: unsupported operand type(s) for -: 'NoneType' and 'NoneType'

My model is a Keras model, which I wrap in janggu with Janggu(model.inputs, model.outputs), and my data is a Bioseq object created from a list of Bio.SeqRecords: _data = Bioseq.create_from_seq('seqs', fastafile=_data)

Could you help me with this?
Best,
Pedro

wkopp commented

Hi @PedroBarbosa,

Thank you for your feedback!

I have added an idx argument to input_attribution which can be used to specify the n^th sequence of a Bioseq object. It's now possible to select a sequence using idx or alternatively using chrom, start and end.

Best,
Wolfgang

> I have added an idx argument to input_attribution which can be used to specify the n^th sequence of a Bioseq object. It's now possible to select a sequence using idx or alternatively using chrom, start and end.

Thanks for adding this feature.

However, now I'm struggling to get the Dataset ready. janggu produces 4D datasets, while my keras model expects a 3D input (N, 101, 4).

_seqs = Bioseq.create_from_seq('seqs', fastafile=_data, order=1, fixedlen=101)

print(_seqs.ndim)
4

print(_seqs.shape)
(2, 101, 1, 4)

Once I run the function, I get a tensorflow error:

input_attribution(model, _seqs, idx=0)

 File "/Users/pbarbosa/miniconda3/envs/ml_genomics/lib/python3.7/site-packages/janggu/model.py", line 1129, in input_attribution
    grad = model._influence([x*step/50 for x in x_in])
  File "/Users/pbarbosa/miniconda3/envs/ml_genomics/lib/python3.7/site-packages/janggu/model.py", line 163, in _influence
    pred = self.kerasmodel(tfinput)
  File "/Users/pbarbosa/miniconda3/envs/ml_genomics/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 985, in __call__
....
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: Computed output size would be negative: -1 [input_size: 1, effective_filter_size: 3, stride: 1] [Op:Conv2D]
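
For reference, the -1 in this message is consistent with the usual output-size formula for an unpadded ("valid") convolution: a filter of size 3 cannot fit along an axis of length 1. The sketch below only reproduces that arithmetic. The helper name is hypothetical, and the assumption that the filter is being applied along the singleton axis of the (2, 101, 1, 4) array is an inference from the error message, not something confirmed in this thread.

```python
def valid_conv_output_size(input_size: int, filter_size: int, stride: int = 1) -> int:
    """Output length of an unpadded ('valid') convolution along one axis."""
    return (input_size - filter_size) // stride + 1


# Values taken from the error message: input_size 1, filter size 3, stride 1.
print(valid_conv_output_size(1, 3, 1))    # -1, so the convolution is rejected

# The same filter along the sequence axis of length 101 fits without problems.
print(valid_conv_output_size(101, 3, 1))  # 99
```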

After squeezing the data to the expected dimensions, I get an IndexError instead, probably because the method expects 4D Bioseq or Cover objects rather than SqueezeDim or ReduceDim wrappers:

_s = SqueezeDim(_seqs, axis=(2, ))
print(_s.shape)
(2, 101, 4)

input_attribution(model, _s, idx=0)
Traceback (most recent call last):
  File "test_variantSeq.py", line 25, in <module>
    main()
  File "test_variantSeq.py", line 21, in main
    model_type=args.type)
    a = input_attribution(model, _s, idx=0)
  File "/Users/pbarbosa/miniconda3/envs/ml_genomics/lib/python3.7/site-packages/janggu/model.py", line 1166, in input_attribution
    m[0][:, lstart:lend, :, :] = output[iout][:, (ostart):(oend), :, :]
IndexError: too many indices for array: array is 3-dimensional, but 4 were indexed
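
The IndexError can be reproduced outside janggu with plain numpy, which may help confirm that the cause is the missing axis rather than the data itself. This is a minimal sketch: the arrays are zero-filled stand-ins with the shapes printed above, and the slice bounds are arbitrary.

```python
import numpy as np

full = np.zeros((2, 101, 1, 4))    # shape of the original Bioseq dataset
squeezed = np.zeros((2, 101, 4))   # shape reported after SqueezeDim

# Four indices work on the 4D array, as the attribution code expects.
window = full[:, 0:10, :, :]
print(window.shape)

# The same indexing on the 3D array raises IndexError.
error_message = None
try:
    squeezed[:, 0:10, :, :]
except IndexError as err:
    error_message = str(err)
print(error_message)
```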

Would it be possible to fix that?

Best,
Pedro

wkopp commented

Hi Pedro,

Yes, the datasets in janggu produce 4D array-like objects by design. Some functionality that integrates with the network (e.g. input_attribution) therefore requires this array format as well. I'll have a look at how to adapt this.

In the meantime, one simple way to fix this would be to include a keras.layers.Reshape layer at the beginning of your network, to reshape the 4D array to a 3D array. This way you can maintain the rest of your model architecture as it is.
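
A minimal sketch of this workaround, assuming TensorFlow's bundled Keras. Here base_model is a hypothetical stand-in for the existing network that expects (N, 101, 4), and the input name "seqs" is an assumption chosen to match the name given to the Bioseq dataset. Wrapping the existing model is a variation on inserting the Reshape layer directly as the first layer; both leave the rest of the architecture untouched.

```python
from tensorflow import keras

# Hypothetical stand-in for an existing model that expects (N, 101, 4) input.
seq_in = keras.Input(shape=(101, 4))
hidden = keras.layers.Conv1D(8, 3, activation="relu")(seq_in)
hidden = keras.layers.GlobalMaxPooling1D()(hidden)
base_model = keras.Model(seq_in, keras.layers.Dense(1, activation="sigmoid")(hidden))

# New 4D entry point matching the (N, 101, 1, 4) arrays that janggu yields.
inputs_4d = keras.Input(shape=(101, 1, 4), name="seqs")
# Drop the singleton axis so the unchanged model body sees (N, 101, 4).
reshaped = keras.layers.Reshape((101, 4))(inputs_4d)
wrapped = keras.Model(inputs_4d, base_model(reshaped))

print(wrapped.input_shape)
print(wrapped.output_shape)
```

The wrapped model can then be passed to Janggu(wrapped.inputs, wrapped.outputs) in the same way as the original one.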

Best,
Wolfgang

> Hi Pedro,
>
> Yes, the datasets in janggu produce 4D array-like objects by design. Some functionality that integrates with the network (e.g. input_attribution) therefore requires this array format as well. I'll have a look at how to adapt this.

Ok, I'm keeping an eye on upcoming updates.

> In the meantime, one simple way to fix this would be to include a keras.layers.Reshape layer at the beginning of your network, to reshape the 4D array to a 3D array. This way you can maintain the rest of your model architecture as it is.

That worked well, thanks.

Best,
Pedro