input_attribution from a Bioseq object
Closed this issue · 4 comments
Hello again @wkopp ,
I'm trying to evaluate some predictions by computing the integrated gradients. I have both a sequence-based model and another one with multi input (sequence and bigwig coverage).
I have a set of FASTA sequences (or one-hot encoded numpy arrays) that I would like to test, but since I don't have a set of genomic coordinates, I would like to pass specific indices of my data to the method. As the documentation says: "The method can either be called, by specifying the region of interest directly by setting chrom, start and end. Alternatively, it is possible to specify the region index. For example, the n^th region of the dataset."
I'm not fully understanding how to set the index given the available arguments:
def input_attribution(model, inputs, chrom=None, start=None, end=None):  # pylint: disable=too-many-locals
What I'm currently passing is:
input_attribution(model, _data)
File "/Users/pbarbosa/miniconda3/envs/ml_genomics/lib/python3.7/site-packages/janggu/model.py", line 1108, in input_attribution
inp.shape[-2], inp.shape[-1])) for inp in inputs]
File "/Users/pbarbosa/miniconda3/envs/ml_genomics/lib/python3.7/site-packages/janggu/model.py", line 1108, in <listcomp>
inp.shape[-2], inp.shape[-1])) for inp in inputs]
TypeError: unsupported operand type(s) for -: 'NoneType' and 'NoneType'
My model is a Keras model, so I wrap it in janggu with Janggu(model.inputs, model.outputs),
and my data is a Bioseq object created from a list of Bio.SeqRecord objects: _data = Bioseq.create_from_seq('seqs', fastafile=_data)
Could you help me with this?
Best,
Pedro
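For readers unfamiliar with the technique mentioned above, here is a minimal NumPy sketch of integrated gradients on a toy function with a known gradient. It is not janggu's implementation: the helper name `integrated_gradients`, the all-zeros baseline, the midpoint rule and the toy model are assumptions made for illustration only.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline=None, steps=50):
    """Approximate integrated gradients with a midpoint Riemann sum.

    grad_fn is a hypothetical callable returning df/dx at a given point.
    """
    x = np.asarray(x, dtype=float)
    if baseline is None:
        baseline = np.zeros_like(x)  # assumed all-zeros reference input
    # Midpoints of `steps` equal sub-intervals of [0, 1]
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x)
    for alpha in alphas:
        # Gradient at a point on the straight path from baseline to x
        total += grad_fn(baseline + alpha * (x - baseline))
    # Average gradient along the path, scaled by the input difference
    return (x - baseline) * total / steps

# Toy model f(x) = sum(w * x**2) with analytic gradient 2 * w * x
w = np.array([0.5, -1.0, 2.0])
f = lambda v: float(np.sum(w * v ** 2))
grad_f = lambda v: 2.0 * w * v

x = np.array([1.0, 2.0, 3.0])
attributions = integrated_gradients(grad_f, x)
print(attributions)
print(attributions.sum(), f(x) - f(np.zeros_like(x)))
```

The two numbers on the last line should agree, since integrated gradients attributions sum to the difference between the model output at the input and at the baseline.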
Hi @PedroBarbosa,
Thank you for your feedback!
I have added an idx argument to input_attribution, which can be used to specify the n^th sequence of a Bioseq object. It's now possible to select a sequence using idx or, alternatively, using chrom, start and end.
Best Wolfgang
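The two ways of selecting a region (by index or by coordinates) can be illustrated with a small, library-independent sketch. The helper `select_region` and the tuple-based region list below are hypothetical and are not part of janggu. The sketch also validates its arguments explicitly, so that a call with neither an index nor coordinates raises a clear error instead of a TypeError on None values.

```python
def select_region(regions, idx=None, chrom=None, start=None, end=None):
    """Pick one (chrom, start, end) tuple by index or by coordinates.

    Hypothetical helper for illustration; not janggu's implementation.
    """
    if idx is not None:
        # Index-based selection: the n-th region of the dataset
        return regions[idx]
    if None in (chrom, start, end):
        raise ValueError("Specify either idx or all of chrom, start and end.")
    for region in regions:
        r_chrom, r_start, r_end = region
        # Coordinate-based selection: first region containing the interval
        if r_chrom == chrom and r_start <= start and end <= r_end:
            return region
    raise LookupError("No region covers the requested interval.")

regions = [("chr1", 0, 101), ("chr1", 101, 202), ("chr2", 0, 101)]
by_index = select_region(regions, idx=1)
by_coords = select_region(regions, chrom="chr2", start=0, end=101)
print(by_index, by_coords)
```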
Thanks for adding this feature.
However, now I'm struggling to get the Dataset ready. janggu produces 4D datasets, while my keras model expects a 3D input (N, 101, 4).
_seqs = Bioseq.create_from_seq('seqs', fastafile=_data, order=1, fixedlen=101)
print(_seqs.ndim)
4
print(_seqs.shape)
(2, 101, 1, 4)
Once I run the function, I get a tensorflow error:
input_attribution(model, _seqs, idx=0)
File "/Users/pbarbosa/miniconda3/envs/ml_genomics/lib/python3.7/site-packages/janggu/model.py", line 1129, in input_attribution
grad = model._influence([x*step/50 for x in x_in])
File "/Users/pbarbosa/miniconda3/envs/ml_genomics/lib/python3.7/site-packages/janggu/model.py", line 163, in _influence
pred = self.kerasmodel(tfinput)
File "/Users/pbarbosa/miniconda3/envs/ml_genomics/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 985, in __call__
....
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: Computed output size would be negative: -1 [input_size: 1, effective_filter_size: 3, stride: 1] [Op:Conv2D]
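The numbers in this error message can be reproduced with the usual output-size formula for a convolution without padding. The sketch below assumes that the model's convolution (filter size 3) ended up sliding over the axis of length 1 of the (2, 101, 1, 4) array instead of the axis of length 101; the function name is hypothetical.

```python
def valid_conv_output_size(input_size, filter_size, stride=1, dilation=1):
    """Output length of a convolution with 'valid' (no) padding."""
    # A dilated filter spans more positions than it has taps
    effective_filter_size = (filter_size - 1) * dilation + 1
    return (input_size - effective_filter_size + stride) // stride

# Axis of length 1 (the extra janggu dimension) with a filter of size 3
print(valid_conv_output_size(1, 3))
# Sequence axis of length 101 with the same filter
print(valid_conv_output_size(101, 3))
```

With input_size 1 the result is negative, which matches the reported values; with input_size 101 the same filter fits without problems.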
Squeezing the data to the expected dimensions instead gives an IndexError, probably because this method expects 4D Bioseq or Cover objects rather than squeezed or dimension-reduced wrappers such as SqueezeDim:
_s = SqueezeDim(_seqs, axis=(2, ))
print(_s.shape)
(2, 101, 4)
input_attribution(model, _s, idx=0)
Traceback (most recent call last):
File "test_variantSeq.py", line 25, in <module>
main()
File "test_variantSeq.py", line 21, in main
model_type=args.type)
a = input_attribution(model, _s, idx=0)
File "/Users/pbarbosa/miniconda3/envs/ml_genomics/lib/python3.7/site-packages/janggu/model.py", line 1166, in input_attribution
m[0][:, lstart:lend, :, :] = output[iout][:, (ostart):(oend), :, :]
IndexError: too many indices for array: array is 3-dimensional, but 4 were indexed
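The same kind of IndexError can be reproduced with plain NumPy arrays, independent of janggu. The sketch below uses zero-filled arrays with the shapes from this thread; it only illustrates why four-index slicing fails on a squeezed array and how the singleton axis can be restored.

```python
import numpy as np

arr4d = np.zeros((2, 101, 1, 4))      # shape produced by the dataset
arr3d = np.squeeze(arr4d, axis=2)     # shape (2, 101, 4) after squeezing

try:
    arr3d[:, 0:10, :, :]              # four indices on a 3D array
    failed = False
except IndexError as err:
    failed = True
    print("IndexError:", err)

# Re-inserting the singleton axis makes four-index slicing valid again
restored = arr3d[:, :, np.newaxis, :]
print(restored[:, 0:10, :, :].shape)
```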
Would it be possible to fix that?
Best,
Pedro
Hi Pedro,
Yes, the datasets in janggu produce 4D array-like objects by design. Some functionality that integrates with the network (e.g. input_attribution) therefore requires this array format as well. I'll have a look at how to adapt this.
In the meantime, one simple way to fix this would be to include a keras.layers.Reshape layer at the beginning of your network to reshape the 4D array to a 3D array. This way you can keep the rest of your model architecture as it is.
Best,
Wolfgang
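A minimal sketch of this workaround, assuming TensorFlow's bundled Keras and a sequence length of 101 as in this thread. The small `base_model` below is only a hypothetical stand-in for an existing network that expects (N, 101, 4) input; the actual architecture is not shown in the thread.

```python
import numpy as np
from tensorflow import keras

# Hypothetical stand-in for an existing model expecting (N, 101, 4)
base_model = keras.Sequential([
    keras.Input(shape=(101, 4)),
    keras.layers.Conv1D(8, 3, activation="relu"),
    keras.layers.GlobalMaxPooling1D(),
    keras.layers.Dense(1, activation="sigmoid"),
])

# New 4D input matching the dataset layout (N, 101, 1, 4)
inputs = keras.Input(shape=(101, 1, 4))
# Drop the singleton axis so the original layers see (N, 101, 4)
x = keras.layers.Reshape((101, 4))(inputs)
outputs = base_model(x)
wrapped_model = keras.Model(inputs, outputs)

batch = np.zeros((2, 101, 1, 4), dtype="float32")
predictions = wrapped_model.predict(batch, verbose=0)
print(predictions.shape)
```

The wrapped model could then presumably be passed to janggu in the same way as before, i.e. Janggu(wrapped_model.inputs, wrapped_model.outputs), while the original layers and their weights stay unchanged.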
Ok, I'm keeping an eye on upcoming updates.
Adding a keras.layers.Reshape layer at the beginning of the network worked well, thanks.
Best,
Pedro