jmschrei/tfmodisco-lite

Matrix format of pfms in motif pfms directory

due23 opened this issue · 3 comments

due23 commented

Hi!
The pfm files in my motifs/pfm/ directory are in MEME format (example below):

MEME version 4

ALPHABET= ACGT

strands: + -

Background letter frequencies
A 0.25 C 0.25 G 0.25 T 0.25

MOTIF MA0002.2 RUNX1
letter-probability matrix: alength= 4 w= 11 nsites= 2000 E= 0
 0.143500  0.248000  0.348000  0.260500
 0.117000  0.242500  0.233500  0.407000
 0.061500  0.536000  0.074500  0.328000
 0.028500  0.000000  0.003500  0.968000
 0.000000  0.037500  0.936000  0.026500
 0.043500  0.063500  0.035000  0.858000
 0.000000  0.000000  0.993500  0.006500
 0.008500  0.021000  0.924000  0.046500
 0.005000  0.200000  0.125500  0.669500
 0.065500  0.231500  0.040500  0.662500
 0.250000  0.079000  0.144500  0.526500
URL http://jaspar.genereg.net/matrix/MA0002.2

However, I'm getting an error from line 146 in report.py when it tries to read these files: ValueError: could not convert string 'MEME version 4' to float64 at row 0, column 1. Do these pfm files need to contain just the matrix of numbers alone? I'm using the JASPAR Core (2022) vertebrates database, and the only download formats it offers for the pfms are JASPAR, MEME and TRANSFAC. Could you possibly provide an example pfm so I know how to format mine accordingly?

Cheers!

Yes, the /pfm/ directory has to have the extracted matrices in numpy array format. I'm currently travelling, but when I get back I plan to add a way to just take in the MEME file. It was originally programmed to take in both so that it would be fast to load each motif, but ultimately I don't think loading from the MEME file should take that much time relative to the TOMTOM time.

Vejni commented

Hi, @jmschrei thanks a lot for this rework, it is much more manageable than the original version. Providing only the batch meme file would be nice, as this is a bit redundant.

In the meantime, @due23 and anyone looking, through some trial and error I got it to work with files having only the numbers:

0.333401	0.598017	0.933644	0.881044	0.027918	0.017398	0.014364	0.021849	0.975926	0.572324	0.346753
0.218895	0.091240	0.025895	0.008092	0.952053	0.961966	0.964394	0.008901	0.006878	0.075258	0.217075
0.219907	0.137973	0.019421	0.089622	0.007081	0.007283	0.011127	0.011531	0.008901	0.263605	0.095893
0.227797	0.172770	0.021040	0.021242	0.012948	0.013352	0.010115	0.957718	0.008295	0.088812	0.340279

where the delimiter is \t. Note that the shape is 4 x m (4 rows, one column per motif position), not the m x 4 layout used inside the MEME file.
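
For anyone who wants to generate such files from a MEME download instead of editing by hand, here is a minimal sketch in plain Python. It is not part of tfmodisco-lite: `parse_meme` and `write_pfms` are hypothetical helper names, and the `<motif id>.pfm` file naming is an assumption, so check what your version of report.py actually looks for. It reads each letter-probability matrix (m x 4 in the MEME file), transposes it to 4 x m, and writes it tab-delimited.

```python
from pathlib import Path


def parse_meme(text):
    """Return {motif_id: rows}; rows holds one 4-value list per motif position."""
    motifs = {}
    current = None      # ID of the motif currently being read
    in_matrix = False   # True while we are inside a letter-probability matrix
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "MOTIF":
            current = fields[1]          # e.g. "MA0002.2"
            motifs[current] = []
            in_matrix = False
        elif fields[0] == "letter-probability" and current is not None:
            in_matrix = True
        elif in_matrix:
            try:
                motifs[current].append([float(x) for x in fields])
            except ValueError:
                in_matrix = False        # e.g. the trailing "URL ..." line
    return motifs


def write_pfms(motifs, out_dir):
    """Write each motif as a tab-delimited 4 x m text file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for motif_id, rows in motifs.items():
        columns = zip(*rows)             # transpose m x 4 -> 4 x m
        lines = ["\t".join(f"{v:.6f}" for v in col) for col in columns]
        # Assumed naming scheme: one file per motif, named after its ID.
        (out_dir / f"{motif_id}.pfm").write_text("\n".join(lines) + "\n")
```

Usage would look something like `write_pfms(parse_meme(Path("motifs.meme.txt").read_text()), "motifs/pfm")`, where both paths are placeholders for your own files.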

This should be fixed in v2.0.4. Now, you only need to pass in the MEME file. Your command will look something like:

 modisco report -o test/ -s test/ -i modisco_results.h5 -m motifs.meme.txt

You can get the latest with pip install --upgrade modisco-lite