Dfam-consortium/RepeatModeler

Could not open *.translation file for reading!

Describe the issue
When I use RepeatModeler for de novo repeat discovery, it reports that it could not open a *.translation file for reading. This file was generated in the BuildDatabase step.

I tried the Arabidopsis thaliana genome (TAIR10.1 from NCBI) and got no issues.

The genome of the species I used is about 10Gb, and I think this may be the problem.

Reproduction steps

The commands I used for the discovery are:

BuildDatabase -name lka sample.fa
nohup RepeatModeler --threads 30 -database lka &

The genome assembly I used for the program is Larix kaempferi

Log output

RepeatModeler Version 2.0.5
===========================
Using output directory = /mnt/annot/repeatm/RM_40.ThuJul41128262024
Search Engine = rmblast 2.14.1+
Threads = 40
Dependencies: TRF 4.09, RECON 1.08, RepeatScout 1.0.6, RepeatMasker 4.1.6
LTR Structural Analysis: Enabled ( GenomeTools 1.6.4, LTR_Retriever v2.9.0,
                                   Ninja , MAFFT 7.471,
                                   CD-HIT 4.8.1 )
Random Number Seed: 1720092502
Database = lka .
  - Sequences = 4655
  - Bases = 13492429495
  - N50 = 15986365
  - Contig Histogram:
  Size(bp)                                                        Count
  -----------------------------------------------------------------------
  78119697-83699528 |                                                   [ 3 ]
  72539866-78119696 |                                                   [ 1 ]
  66960035-72539865 |                                                   [ 2 ]
  61380204-66960034 |                                                   [ 1 ]
  55800373-61380203 |                                                   [ 6 ]
  50220542-55800372 |                                                   [ 6 ]
  44640711-50220541 |                                                   [ 5 ]
  39060881-44640711 |                                                   [ 14 ]
  33481050-39060880 |                                                   [ 14 ]
  27901219-33481049 |                                                   [ 28 ]
  22321388-27901218 |                                                   [ 52 ]
  16741557-22321387 |*                                                  [ 99 ]
  11161726-16741556 |*                                                  [ 151 ]
  5581895-11161725  |***                                                [ 304 ]
  2065-5581895      |************************************************** [ 3969 ]

Storage Throughput = excellent ( 1483.92 MB/s )

Ready to start the sampling process.
INFO: The runtime of RepeatModeler heavily depends on the quality of the assembly
      and the repetitive content of the sequences.  It is not imperative
      that RepeatModeler completes all rounds in order to obtain useful
      results.  At the completion of each round, the files ( consensi.fa, and
      families.stk ) found in:
      /mnt/annot/repeatm/RM_40.ThuJul41128262024/ 
      will contain all results produced thus far. These files may be 
      manually copied and run through RepeatClassifier should the program
      be terminated early.


RepeatModeler Round # 1
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 40000000 bp
   - Final Sample Size = 40007056 bp ( 40007056 non ambiguous )
   - Num Contigs Represented = 595
   - Sequence extraction : 00:00:03 (hh:mm:ss) Elapsed Time
 -- Running RepeatScout on the sequences...
   - RepeatScout: Running build_lmer_table ( l = 14 )..
   - RepeatScout: Running RepeatScout.. : 2119 raw families identified
   - RepeatScout: Running filtering stage.. 1982 families remaining
   - RepeatScout: 00:03:40 (hh:mm:ss) Elapsed Time
   - Large Satellite Filtering.. : 12 found in 00:00:08 (hh:mm:ss) Elapsed Time
   - Collecting repeat instances...: 00:02:08 (hh:mm:ss) Elapsed Time
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!

Environment (please include as much of the following information as you can find out):

docker

  • How did you install RepeatModeler? e.g. manual installation from repeatmasker.org, bioconda, the Dfam TE Tools container, or as part of another bioinformatics tool?

I used the Docker image of RepeatModeler called TEtools, which is maintained by Dfam-consortium. I downloaded the image with the docker pull command, using the latest tag.

  • Which version of RepeatModeler do you have? The output of RepeatModeler without any options will be a help page with the version of the program displayed at the top.
No database indicated

/opt/RepeatModeler/RepeatModeler - 2.0.5
NAME
    RepeatModeler - Model repetitive DNA

SYNOPSIS
      RepeatModeler [-options] -database <XDF Database>

  • Operating system and version. The output of uname -a and lsb_release -a can be used to find this.
Linux cell-lab 6.8.0-36-generic #36-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun 10 10:49:14 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

I found the cause.

The genome I used for de novo repeat sequence discovery was too large (about 12GB at scaffold level). When I split the .fa file into 3 parts of about 4GB each, the issue didn't show up again.
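
The comment does not say how the file was split. As a rough sketch only, one way to divide a FASTA file into a fixed number of parts without cutting any record in half is shown below. The function name `split_fasta` and the `PREFIX.partN.fa` output naming are made up for illustration and are not part of RepeatModeler.

```shell
#!/bin/sh
# Hypothetical helper: split FASTA into at most PARTS files of roughly equal
# base count. A new file is only started on a header line, so no record is
# ever broken across files.
split_fasta() {
    fasta=$1    # input FASTA file
    parts=$2    # desired number of output files
    prefix=$3   # output files are named PREFIX.part1.fa, PREFIX.part2.fa, ...

    # Pass 1: count sequence characters (header lines excluded).
    # printf "%.0f" avoids exponent notation for multi-gigabase totals.
    total=$(awk '!/^>/ { n += length($0) } END { printf "%.0f\n", n }' "$fasta")

    # Pass 2: move to the next output file once the running base count
    # reaches the cumulative target for the current part.
    awk -v total="$total" -v parts="$parts" -v prefix="$prefix" '
        BEGIN { target = total / parts; part = 1; out = prefix ".part1.fa" }
        /^>/ {
            if (seen > 0 && seen >= target * part && part < parts) {
                close(out)
                part++
                out = prefix ".part" part ".fa"
            }
        }
        { print > out }
        !/^>/ { seen += length($0) }
    ' "$fasta"
}
```

Calling `split_fasta sample.fa 3 lka` would write `lka.part1.fa` and so on, each of which could then be given to BuildDatabase under its own `-name`. Fewer files than requested can result when a few very long sequences dominate the assembly.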

I am not sure why that would be an issue. I just built a database with 11GB without a problem:

% ../BuildDatabase -name foo seq.fa
Building database foo:
  Reading seq.fa...
Number of sequences (bp) added to database: 299305 ( 11005398711 bp )

% ls -al
total 13688540
drwxr-xr-x  2 rhubley repeat         188 Jul 16 10:42 ./
drwxr-xr-x 28 rhubley repeat        8192 Jul 10 13:36 ../
-rw-r--r--  1 rhubley repeat    11340696 Jul 16 10:42 foo.nhr
-rw-r--r--  1 rhubley repeat     3591744 Jul 16 10:42 foo.nin
-rw-r--r--  1 rhubley repeat         424 Jul 16 10:42 foo.njs
-rw-r--r--  1 rhubley repeat     2394440 Jul 16 10:42 foo.nnd
-rw-r--r--  1 rhubley repeat        9404 Jul 16 10:42 foo.nni
-rw-r--r--  1 rhubley repeat     1197252 Jul 16 10:42 foo.nog
-rw-r--r--  1 rhubley repeat  2760204411 Jul 16 10:42 foo.nsq
-rw-r--r--  1 rhubley repeat     7161395 Jul 16 10:40 foo.translation
-rw-r--r--  1 rhubley repeat 11231130381 Jul 16 10:39 seq.fa

% wc -l foo.translation 
299305 foo.translation
% fgrep -c ">" seq.fa
299305

I wonder if you might try rebuilding your database (with the full 12GB) and check that you have the full set of *.n* files (as above) and *.translation. Also check that *.translation contains the same number of lines as there are sequences in the FASTA file.
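
The checks requested above can be bundled into a small script. This is only a sketch: `check_db` is a hypothetical helper name, and the list of extensions is copied from the `foo` listing above on the assumption that a healthy database for this setup has the same set. Other versions or larger inputs may lay the files out differently.

```shell
#!/bin/sh
# Hypothetical helper: verify that the database files written by BuildDatabase
# exist and are non-empty, and that the .translation file has one line per
# FASTA record.
check_db() {
    db=$1      # name given to BuildDatabase -name
    fasta=$2   # FASTA file the database was built from
    status=0

    # Extensions taken from the example listing; adjust if yours differ.
    for ext in nhr nin njs nnd nni nog nsq translation; do
        if [ ! -s "$db.$ext" ]; then
            echo "missing or empty: $db.$ext"
            status=1
        fi
    done

    if [ ! -r "$fasta" ]; then
        echo "cannot read FASTA file: $fasta"
        return 1
    fi

    if [ -r "$db.translation" ]; then
        n_trans=$(wc -l < "$db.translation" | tr -d ' ')
        n_seqs=$(grep -c '^>' "$fasta")
        echo "translation lines: $n_trans, FASTA sequences: $n_seqs"
        [ "$n_trans" -eq "$n_seqs" ] || status=1
    fi

    return $status
}
```

For the database in this issue, the call would be `check_db lka sample.fa`, run in the directory where BuildDatabase was run. A non-zero return status means a file is missing or the counts disagree.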

Yes, the database build went smoothly, but the issue appeared when I ran RepeatModeler.

Can you show me the listing of files and counts as I did in my example? I just want to make sure none of the sizes are off.