Could not open *.translation file for reading!
Describe the issue
When I use RepeatModeler for de novo repeat discovery, it reports that it could not open a *.translation file for reading. That file was generated in the BuildDatabase step.
I tried the Arabidopsis thaliana genome (TAIR10.1 from NCBI) and had no issues.
The genome of the species I am using is about 10 Gb, and I think this may be the problem.
Reproduction steps
The commands I used for the discovery are:
BuildDatabase -name lka sample.fa
nohup RepeatModeler --threads 30 -database lka &
The genome assembly I used is that of Larix kaempferi.
Log output
RepeatModeler Version 2.0.5
===========================
Using output directory = /mnt/annot/repeatm/RM_40.ThuJul41128262024
Search Engine = rmblast 2.14.1+
Threads = 40
Dependencies: TRF 4.09, RECON 1.08, RepeatScout 1.0.6, RepeatMasker 4.1.6
LTR Structural Analysis: Enabled ( GenomeTools 1.6.4, LTR_Retriever v2.9.0,
Ninja , MAFFT 7.471,
CD-HIT 4.8.1 )
Random Number Seed: 1720092502
Database = lka .
- Sequences = 4655
- Bases = 13492429495
- N50 = 15986365
- Contig Histogram:
Size(bp) Count
-----------------------------------------------------------------------
78119697-83699528 | [ 3 ]
72539866-78119696 | [ 1 ]
66960035-72539865 | [ 2 ]
61380204-66960034 | [ 1 ]
55800373-61380203 | [ 6 ]
50220542-55800372 | [ 6 ]
44640711-50220541 | [ 5 ]
39060881-44640711 | [ 14 ]
33481050-39060880 | [ 14 ]
27901219-33481049 | [ 28 ]
22321388-27901218 | [ 52 ]
16741557-22321387 |* [ 99 ]
11161726-16741556 |* [ 151 ]
5581895-11161725 |*** [ 304 ]
2065-5581895 |************************************************** [ 3969 ]
Storage Throughput = excellent ( 1483.92 MB/s )
Ready to start the sampling process.
INFO: The runtime of RepeatModeler heavily depends on the quality of the assembly
and the repetitive content of the sequences. It is not imperative
that RepeatModeler completes all rounds in order to obtain useful
results. At the completion of each round, the files ( consensi.fa, and
families.stk ) found in:
/mnt/annot/repeatm/RM_40.ThuJul41128262024/
will contain all results produced thus far. These files may be
manually copied and run through RepeatClassifier should the program
be terminated early.
RepeatModeler Round # 1
========================
Searching for Repeats
-- Sampling from the database...
- Gathering up to 40000000 bp
- Final Sample Size = 40007056 bp ( 40007056 non ambiguous )
- Num Contigs Represented = 595
- Sequence extraction : 00:00:03 (hh:mm:ss) Elapsed Time
-- Running RepeatScout on the sequences...
- RepeatScout: Running build_lmer_table ( l = 14 )..
- RepeatScout: Running RepeatScout.. : 2119 raw families identified
- RepeatScout: Running filtering stage.. 1982 families remaining
- RepeatScout: 00:03:40 (hh:mm:ss) Elapsed Time
- Large Satellite Filtering.. : 12 found in 00:00:08 (hh:mm:ss) Elapsed Time
- Collecting repeat instances...: 00:02:08 (hh:mm:ss) Elapsed Time
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Environment (please include as much of the following information as you can find out):
docker
- How did you install RepeatModeler? e.g. manual installation from repeatmasker.org, bioconda, the Dfam TE Tools container, or as part of another bioinformatics tool?
I used a Docker image of RepeatModeler called TEtools, which is maintained by Dfam-consortium. I downloaded the image with the docker pull command, using the latest tag.
- Which version of RepeatModeler do you have? The output of RepeatModeler without any options will be a help page with the version of the program displayed at the top.
No database indicated
/opt/RepeatModeler/RepeatModeler - 2.0.5
NAME
RepeatModeler - Model repetitive DNA
SYNOPSIS
RepeatModeler [-options] -database <XDF Database>
- Operating system and version. The output of uname -a and lsb_release -a can be used to find this.
Linux cell-lab 6.8.0-36-generic #36-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun 10 10:49:14 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
I found the cause.
The genome I used for de novo repeat discovery was too large (about 12 GB at scaffold level). When I split the .fa file into 3 parts of about 4 GB each, the issue did not show up again.
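For anyone who wants to try the same workaround, here is a minimal sketch of one way to split a FASTA file into several parts without cutting a record in half. It is not necessarily how the file above was split. The function name split_fasta and the output naming scheme are made up for illustration, and records are dealt out round-robin, so the parts end up with similar sequence counts rather than similar sizes in bases.

```shell
# Hypothetical helper: split a FASTA file into N parts on record boundaries.
# Usage: split_fasta <input.fa> <number-of-parts> <output-prefix>
# Output files are named <prefix>.1.fa, <prefix>.2.fa, ...
split_fasta() {
    fasta=$1
    parts=$2
    prefix=$3
    awk -v n="$parts" -v p="$prefix" '
        # On every header line, pick the next output file in round-robin order.
        /^>/ { out = sprintf("%s.%d.fa", p, (count++ % n) + 1) }
        # Write header and sequence lines to the currently selected file.
        out  { print > out }
    ' "$fasta"
}
```

For example, split_fasta sample.fa 3 sample_part would write sample_part.1.fa, sample_part.2.fa and sample_part.3.fa, each of which can then be passed to BuildDatabase separately.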
I am not sure why that would be an issue. I just built a database with 11GB without a problem:
% ../BuildDatabase -name foo seq.fa
Building database foo:
Reading seq.fa...
Number of sequences (bp) added to database: 299305 ( 11005398711 bp )
% ls -al
total 13688540
drwxr-xr-x 2 rhubley repeat 188 Jul 16 10:42 ./
drwxr-xr-x 28 rhubley repeat 8192 Jul 10 13:36 ../
-rw-r--r-- 1 rhubley repeat 11340696 Jul 16 10:42 foo.nhr
-rw-r--r-- 1 rhubley repeat 3591744 Jul 16 10:42 foo.nin
-rw-r--r-- 1 rhubley repeat 424 Jul 16 10:42 foo.njs
-rw-r--r-- 1 rhubley repeat 2394440 Jul 16 10:42 foo.nnd
-rw-r--r-- 1 rhubley repeat 9404 Jul 16 10:42 foo.nni
-rw-r--r-- 1 rhubley repeat 1197252 Jul 16 10:42 foo.nog
-rw-r--r-- 1 rhubley repeat 2760204411 Jul 16 10:42 foo.nsq
-rw-r--r-- 1 rhubley repeat 7161395 Jul 16 10:40 foo.translation
-rw-r--r-- 1 rhubley repeat 11231130381 Jul 16 10:39 seq.fa
% wc -l foo.translation
299305 foo.translation
% fgrep -c ">" seq.fa
299305
I wonder if you might try rebuilding your database (with the full 12GB) and check that you have the full set of *.n* files (as above) and *.translation. Also check that *.translation contains the same number of lines as there are sequences in the FASTA file.
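The two checks suggested here can be bundled into a small shell function. This is only a sketch: check_database is a made-up name, and the list of extensions is copied from the listing above, so a database that BuildDatabase lays out differently (for example, split into several volumes) would be flagged even if it is fine.

```shell
# Hypothetical helper: check that the output of BuildDatabase looks complete.
# Usage: check_database <database-name> <fasta-file>
# Returns 0 if all expected files exist and the counts match, 1 otherwise.
check_database() {
    db=$1
    fasta=$2
    status=0
    # Extensions taken from the example listing; assumed, not authoritative.
    for ext in nhr nin njs nnd nni nog nsq translation; do
        if [ ! -s "$db.$ext" ]; then
            echo "missing or empty: $db.$ext"
            status=1
        fi
    done
    # Compare the number of translation lines with the number of FASTA headers.
    if [ -r "$db.translation" ]; then
        n_lines=$(wc -l < "$db.translation" | tr -d ' ')
        n_seqs=$(grep -c '^>' "$fasta")
        echo "translation lines: $n_lines, FASTA sequences: $n_seqs"
        [ "$n_lines" = "$n_seqs" ] || status=1
    fi
    return $status
}
```

With the names used in this issue, that would be check_database lka sample.fa, run from the directory where BuildDatabase was run.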
Yes, the database build went smoothly, but the issue appeared when I ran RepeatModeler.
Can you show me the listing of files and counts as I did in my example? I just want to make sure none of the sizes are off.