lareaulab/iXnos

Dealing with unknown nucleotides (nnn.../NNN...) in the reference transcript sequence "transcripts.13cds10.transcripts.fa"


Hi again,

I've run into the following error while trying to train a model on my own data.
I ran:

python /home/marina/git-repos/iXnos/iXnos/reproduce_scripts/28mer_models.py \
        s28_cod_n5p4_nt_n15p14 \
        /home/marina/git-repos/iXnos/iXnos/expts/dmso /home/marina/git-repos/iXnos/iXnos/expts/dmso/process/dmso.transcript.mapped.wts.sam \
        /home/marina/git-repos/iXnos/iXnos/genome_data/crigri.transcripts.13cds10.lengths.txt /home/marina/git-repos/iXnos/iXnos/genome_data/crigri.transcripts.13cds10.fa \
        /home/marina/git-repos/iXnos/iXnos/expts/dmso/process/tr_set_bounds.size.28.28.trunc.20.20.min_cts.200.min_cod.100.top.300.txt /home/marina/git-repos/iXnos/iXnos/expts/dmso/process/te_set_bounds.size.28.28.trunc.20.20.min_cts.200.min_cod.100.top.300.txt \
        /home/marina/git-repos/iXnos/iXnos/expts/dmso/process/outputs.size.28.28.txt 35 \
        32

which outputs

s28_cod_n5p4_nt_n15p14
[-5, -4, -3, -2, -1, 0, 1, 2, 3, 4]
[-15, -14, -13, -12, -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
Traceback (most recent call last):
  File "/home/marina/git-repos/iXnos/iXnos/reproduce_scripts/28mer_models.py", line 34, in <module>
    nonlinearity="tanh", widths=[200], update_method="nesterov")
  File "/home/marina/git-repos/iXnos/iXnos/iXnos/interface.py", line 404, in make_lasagne_feedforward_nn
    filter_max=filter_max, filter_pct=filter_pct, filter_test=filter_test)
  File "/home/marina/git-repos/iXnos/iXnos/iXnos/process.py", line 1118, in load_lasagne_data
    max_struc_width=max_struc_width, aa_feats=aa_feats)
  File "/home/marina/git-repos/iXnos/iXnos/iXnos/process.py", line 1158, in get_data_matrices_lasagne
    max_struc_width=max_struc_width, aa_feats=aa_feats)
  File "/home/marina/git-repos/iXnos/iXnos/iXnos/process.py", line 672, in get_X
    for gene in sorted_genes for A_site in codon_set[gene]])
  File "/home/marina/git-repos/iXnos/iXnos/iXnos/process.py", line 1248, in get_rel_cod_feats
    features[i*64 + cod2id[cod]] = 1
KeyError: 'nnn'
makefile:600: recipe for target '/home/marina/git-repos/iXnos/iXnos/expts/dmso/lasagne_nn/s28_cod_n5p4_nt_n15p14/init_data/init_data.pkl' failed
make: *** [/home/marina/git-repos/iXnos/iXnos/expts/dmso/lasagne_nn/s28_cod_n5p4_nt_n15p14/init_data/init_data.pkl] Error 1

I understand that the KeyError: 'nnn' is raised because my reference transcriptome file contains unknown nucleotides (n), and codons such as 'nnn' are not keys in the Python codon dictionary (cod2id). I had thought of simply removing the transcripts that contain unknown nucleotides from the reference and re-mapping. However, I checked some of the reference transcriptome files provided in iXnos/genome_data and found that the one for the Iwasaki experiment (human.transcripts.13cds10.transcripts.fa) also contains runs of n. I was able to run the Iwasaki models successfully on my system, so I was wondering whether you dealt with the same issue and, if so, whether you could share any insight on how to solve it.
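For concreteness, here is a minimal sketch of the filtering I have in mind. It is not part of iXnos: the names read_fasta, split_by_ambiguity and VALID_BASES are my own, and I am assuming that any character other than A/C/G/T (upper or lower case) counts as "unknown". The example input is a made-up three-transcript FASTA held in memory, not my real reference.

```python
import io

# Assumption: anything outside A/C/G/T (either case) is treated as unknown.
VALID_BASES = set("ACGTacgt")


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_ambiguity(handle):
    """Split records into those with only A/C/G/T and those with other characters."""
    clean, ambiguous = [], []
    for header, seq in read_fasta(handle):
        if set(seq) <= VALID_BASES:
            clean.append((header, seq))
        else:
            ambiguous.append((header, seq))
    return clean, ambiguous


# Toy input (hypothetical transcripts, for illustration only).
example = io.StringIO(u">tx1\nATGGCC\nTAA\n>tx2\nATGnnnTAA\n>tx3\natggcctaa\n")
clean, ambiguous = split_by_ambiguity(example)
print([name for name, _ in clean])      # ['tx1', 'tx3']
print([name for name, _ in ambiguous])  # ['tx2']
```

If I went this route, I assume I would also have to regenerate the matching lengths file and re-map the reads against the filtered reference, which is why I wanted to ask first.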

Thank you very much for your kind help.

Best,

Marina