kundajelab/tfmodisco

AssertionError: Probabilities don't sum to 1 along axis 1

Closed this issue · 2 comments

I get an error from the utils/compute_per_position_ic(ppm, background, pseudocount) function. It says that the probabilities in ppm don't sum to 1 along axis 1. I think it is related to a warning I got earlier:

RuntimeWarning: invalid value encountered in true_divide
  vecs1/np.linalg.norm(vecs1, axis=1)[:,None]

I'm using gradient × input as the importance scores and the gradient alone as the hypothetical scores. Does this error imply that the importance scores are problematic?
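
For context, NumPy typically emits this "invalid value encountered" warning when a division evaluates 0/0, which is what happens when a row whose norm is zero is divided by that norm. A minimal sketch of the effect, assuming a made-up array in place of `vecs1` (its real contents are not shown in this issue):

```python
import numpy as np

# Hypothetical stand-in for vecs1: the second row is all zeros.
vecs = np.array([[3.0, 4.0],
                 [0.0, 0.0]])

norms = np.linalg.norm(vecs, axis=1)

# Same normalization pattern as in the warning. errstate only silences
# the RuntimeWarning here; the resulting values are unchanged.
with np.errstate(invalid="ignore"):
    normalized = vecs / norms[:, None]

# Rows with zero norm are the ones that turn into NaN.
zero_rows = np.flatnonzero(norms == 0)
print(zero_rows)
print(normalized)
```

If a check like this finds zero-norm rows in real data, it would point at inputs that are entirely zero over some region rather than at the scoring method itself, although that is only one possible cause.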

Details attached below:

MEMORY 2.821525504
On task task
Computing windowed sums on original
Generating null dist
peak(mu)= -2.2886458784347058e-05
Computing threshold
Thresholds from null dist were -2.4279579957947135 and 1.8365790802054107
Final raw thresholds are -2.4279579957947135 and 1.8365790802054107
Final transformed thresholds are -0.9624493367346939 and 0.9441734693877551

Got 66719 coords
After resolving overlaps, got 66719 seqlets
Across all tasks, the weakest transformed threshold used was: 0.9440734693877552
MEMORY 3.77667584
66719 identified in total
min_metacluster_size_frac * len(seqlets) = 667 is more than min_metacluster_size=100.
Using it as a new min_metacluster_size
2 activity patterns with support >= 667 out of 2 possible patterns
Metacluster sizes: [45352, 21367]
Idx to activities: {0: '1', 1: '-1'}
MEMORY 3.777286144
On metacluster 1
Metacluster size 21367 limited to 20000
Relevant tasks: ('task',)
Relevant signs: (-1,)
TfModiscoSeqletsToPatternsFactory: seed=1234
(Round 1) num seqlets: 20000
(Round 1) Computing coarse affmat
MEMORY 3.777286144
Beginning embedding computation
Computing embeddings
Using TensorFlow backend.
Finished embedding computation in 27.73 s
Starting affinity matrix computations
/home/ubuntu/anaconda3/envs/kipoi-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/modisco/affinitymat/core.py:184: RuntimeWarning: invalid value encountered in true_divide
vecs1/np.linalg.norm(vecs1, axis=1)[:,None],
/home/ubuntu/anaconda3/envs/kipoi-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/modisco/affinitymat/core.py:187: RuntimeWarning: invalid value encountered in true_divide
vecs2/np.linalg.norm(vecs2, axis=1)[:,None],
Normalization computed in 2.45 s
Cosine similarity mat computed in 13.23 s
Normalization computed in 2.54 s
Cosine similarity mat computed in 13.88 s
Finished affinity matrix computations in 35.55 s
(Round 1) Compute nearest neighbors from coarse affmat
MEMORY 7.015927808
Computed nearest neighbors in 24.5 s
MEMORY 7.261278208
(Round 1) Computing affinity matrix on nearest neighbors
MEMORY 7.261278208
Launching nearest neighbors affmat calculation job
MEMORY 7.39745792
Parallel runs completed
MEMORY 7.595122688
Job completed in: 396.55 s
MEMORY 10.60612096
Launching nearest neighbors affmat calculation job
MEMORY 10.60352
Parallel runs completed
MEMORY 10.724483072
Job completed in: 395.72 s
MEMORY 13.735211008
(Round 1) Computed affinity matrix on nearest neighbors in 800.06 s
MEMORY 10.774994944
/home/ubuntu/anaconda3/envs/kipoi-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/scipy/stats/stats.py:4196: SpearmanRConstantInputWarning: An input array is constant; the correlation coefficent is not defined.
warnings.warn(SpearmanRConstantInputWarning())
Filtered down to 422 of 20000
(Round 1) Retained 422 rows out of 20000 after filtering
MEMORY 10.775801856
(Round 1) Computing density adapted affmat
MEMORY 5.975793664
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 422 samples in 0.001s...
[t-SNE] Computed neighbors for 422 samples in 0.002s...
[t-SNE] Computed conditional probabilities for sample 422 / 422
[t-SNE] Mean sigma: 0.883924
(Round 1) Computing clustering
MEMORY 5.975793664
Beginning preprocessing + Leiden
0%| | 0/50 [00:00<?, ?it/s]
Quality: 0.45266841163373195
Quality: 0.45729671160348756
100%|██████████| 50/50 [00:02<00:00, 22.66it/s]
Got 10 clusters after round 1
Counts:
{4: 34, 3: 50, 8: 16, 1: 81, 2: 67, 5: 30, 0: 84, 9: 15, 6: 27, 7: 18}
MEMORY 5.977145344
(Round 1) Aggregating seqlets in each cluster
MEMORY 5.977145344
Aggregating for cluster 0 with 84 seqlets
MEMORY 5.977145344

Trimming eliminated 0 seqlets out of 84

AssertionError: Probabilities don't sum to 1 along axis 1 in [[0.28571429 0.16666667 0.22619048 0.20238095

My suspicion is that there is a problem with your one-hot encoding of the sequence itself. Have you verified that, at every position in your one-hot encoded sequence, exactly one of the four channels (A, C, G, T) is 1 and the others are 0? I think you may have some positions where all four are zero.
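
A check along these lines can be sketched as follows. The helper name `find_invalid_onehot` and the array layout `(num_sequences, length, 4)` are assumptions made for illustration, not part of tfmodisco. The sketch also averages the toy batch to show how an all-zero position makes a frequency row sum to less than 1; the averaging is only an illustration and is not tfmodisco's own aggregation code.

```python
import numpy as np

def find_invalid_onehot(onehot):
    """Return (sequence index, position index) pairs that are not one-hot.

    Assumes `onehot` has shape (num_sequences, length, 4).
    """
    onehot = np.asarray(onehot)
    is_binary = np.isin(onehot, (0, 1)).all(axis=-1)   # only 0s and 1s
    sums_to_one = onehot.sum(axis=-1) == 1             # exactly one base set
    return np.argwhere(~(is_binary & sums_to_one))

# Toy batch: 3 sequences of length 2. N marks an all-zero position,
# as might be produced when an ambiguous base is encoded.
A, C, N = [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0]
batch = np.array([[A, C],
                  [C, C],
                  [A, N]], dtype=float)

bad = find_invalid_onehot(batch)   # offending (sequence, position) pairs
ppm = batch.mean(axis=0)           # per-position base frequencies
row_sums = ppm.sum(axis=1)         # the position with the N sums to 2/3

print(bad)
print(row_sums)
```

The first row printed in the assertion message is consistent with this picture: 0.28571429, 0.16666667, 0.22619048 and 0.20238095 equal 24/84, 14/84, 19/84 and 17/84, which add up to 74/84 (about 0.88). That is what one would expect if 10 of the 84 seqlets in the cluster had no base set at that position.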

Right, that was my problem. It works after I removed the sequences that contain all-zero positions.
I used the DeepSEA dataset downloaded from their official website and assumed the sequences were valid without checking.
Just a heads-up for other people who are also trying modisco on DeepSEA.

Thanks for your amazing work.
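
For anyone applying the same fix, here is a minimal sketch of the filtering step. The function name `drop_sequences_with_empty_positions` and the array names are hypothetical, and the arrays are assumed to share the layout `(num_sequences, length, 4)`. The point of the sketch is that the same mask should be applied to the one-hot sequences and to both score arrays so that they stay aligned.

```python
import numpy as np

def drop_sequences_with_empty_positions(onehot, *score_arrays):
    """Keep only sequences in which every position has exactly one base set.

    Returns the boolean keep-mask and the filtered arrays, in the order
    (onehot, *score_arrays).
    """
    onehot = np.asarray(onehot)
    keep = (onehot.sum(axis=-1) == 1).all(axis=1)
    filtered = [onehot[keep]] + [np.asarray(a)[keep] for a in score_arrays]
    return keep, filtered

# Toy data: the middle sequence has an all-zero second position.
onehot = np.array([[[1, 0, 0, 0], [0, 1, 0, 0]],
                   [[0, 0, 1, 0], [0, 0, 0, 0]],
                   [[0, 0, 0, 1], [1, 0, 0, 0]]], dtype=float)
hyp_scores = np.arange(24, dtype=float).reshape(3, 2, 4)  # placeholder values
imp_scores = hyp_scores * onehot                          # gradient x input

keep, (onehot_f, imp_f, hyp_f) = drop_sequences_with_empty_positions(
    onehot, imp_scores, hyp_scores)
print(keep)
print(onehot_f.shape, imp_f.shape, hyp_f.shape)
```

Dropping whole sequences is the approach described in this thread; other ways of handling ambiguous bases are possible but are not covered here.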