`fasta-get-markov` implementation?
j-andrews7 opened this issue · 4 comments
Hi there. Was pretty excited to stumble upon this project.
Do you have any plans to implement the fasta-get-markov
functionality to generate a Markov model from sequences and pass to the corresponding FIMO call? I've got a hackathon project coming up in a few months for which this package would be incredibly useful.
I'd offer to implement it myself, but I am not familiar with cython at all. If this functionality is already possible, an example would be much appreciated!
Hi @j-andrews7,
To be fair the MEME suite code has been the most challenging for me to get working in the different projects I've worked on, so that's why I only made an alpha release to date. I'll try to have a look when I can, but I can't guarantee when (I'm on my final PhD year 😅 )!
I feel ya. I'll keep an eye on the repo, good luck in your final year!
Hi @j-andrews7,
In case the hackathon is not passed already (:yum:) I have improved the way background sequences are generated so that you can pass it to FIMO. This is not a complete implementation of fasta-get-markov
, but since FIMO only expects zero-order Markov frequencies that's all I added.
Now FIMO
methods expect a Background
object instead of a raw frequencies array to pass for background frequencies when building the PSSM and/or computing p-values. A Background
object can be created from various constructors:
from pymemesuite.common import Alphabet, Background, Sequence, Array
alphabet = Alphabet.dna()
# explicit constructor from an array of frequencies
bg = Background(dna, Array([0.27, 0.22, 0.22, 0.27]))
# uniform background
bg_uniform = Background.from_uniform(dna)
# frequencies from the nr database
bg_nrdb = Background.from_nrdb(dna)
# load from a background file, e.g. obtained with `fasta-get-markov`
bg_file = Background.from_file(dna, "background.txt")
# compute from given sequences
# (control reverse-complement and pseudocounts with the `rc` and `pseudo` keyword arguments)
bg_seq = Background.from_sequences(dna, Sequence("ATGC"), Sequence("ATTA"))
The last one is what you want, but you'll have to manage the loading from a FASTA file yourself (e.g. with Biopython). Afterwards you can use that background in Motif.build_pssm
or in FIMO.score_motif
.
Still a month off, plenty of time to spare! I will take a look. I also forgot that Biopython itself has motif scanning, which I may also give a try.