All or some PPM columns do not add to 1
Mitmischer opened this issue · 1 comments
Describe the bug
The program will fail on imperfect input data.
To Reproduce
Steps to reproduce the behavior:
- Read the following probability matrices:
consensus_pwms_stripped.txt
For example, consider adapting the code here (the CLI should be easy to remove)
import cloup
import re
import seqlogo
import numpy as np
from Bio import motifs
import seqlogo
import re
def parse_meme(content):
mn1 = None
mn2 = None
matches = []
i = -1
weights = []
for line in content.split("\n"):
if i >= 0:
i = i+1
z = re.match("MOTIF (AC.*)\:(.+)\:(.+)\s", line)
if z:
i = 0
mn0 = z.groups()[0]
mn1 = z.groups()[1]
mn2 = z.groups()[2]
print(mn1)
print(mn2)
# read "number line"
if i >=3:
weights.append([float(s.strip()) for s in line.split("\t") if s])
# end of number block
if not line:
weights.pop()
weights = np.array(weights).transpose()
print(weights.sum())
print(weights)
ppm = seqlogo.Ppm(weights)
weights = []
i = -1
@cloup.command()
@cloup.option("--input-jaspar", "-i", type=cloup.File("r"), required=True, help="input (jaspar format)")
def main(input_jaspar):
parse_meme(input_jaspar.read())
if __name__ == "__main__":
main()
The data itself is real(!) data, obtained from https://resources.altius.org/~jvierstra/projects/motif-clustering-v2.1beta/.
Expected behavior
The tool should be more lenient.
The problem is in core.py:71
Why would you ever check for 1e-9 with floating point numbers and the resulting imprecisions?
1e-2 or something similar is way more reasonable.
As I understood, the check is just to make sure that the matrix is not in an completely wrong format (like transposed). This threshold still catches this case more than easily.
Desktop (please complete the following information):
- OS: Windows 11
- Python Version 3.11.5
- NumPy Version 1.26.3
- Pandas Version 2.1.4
- matplotlib Version 3.8.0
Hello @Mitmischer,
Thank you for working with seqlogo
.
I understand that some of the design choices can be frustrating for some users. Can you show me the workflow that you use that creates the error for you?
As an open-source project, I wholeheartedly welcome PRs so long as the code follows Google Python Style Guide. As stated on the README.md
, this project does not support MEME format, but we welcome any additions.
That said, seqlogo
was written to support BIOINF 529: Bioinformatics Concepts and Algorithms at the University of Michigan in the Department of Computational Medicine & Bioinformatics. It was intended as a purely educational tool for a very specific module of that course. That is why the rigid tolerance was set. We expected students to generate using the quickstart or similar means during their coursework. However, the tolerance could be set as an optional argument with a default value. That could allow unexpected end-users more flexibility.
If this does not necessarily fit your needs, I can recommend BioPython's Bio.motifs
package for parsing motif data and Logomaker for a more robust plotting platform of sequence motifs. Both of these are better suited for production-level analysis. The tradeoff is "ease of use" for the end-user.