clemgoub/dnaPipeTE

Problem at RepeatMasker stage

PatrickCKennedy opened this issue · 2 comments

Dear Clément,

Thank you for creating dnaPipeTE.

I am currently running the programme on my local computer, using the following commands:

sudo docker run --platform linux/amd64 -it -v ~/Project:/mnt clemgoub/dnapipete:latest

python3 dnaPipeTE.py \
-input /mnt/data/SAMPLE_R1.fastq.gz \
-output /mnt/Patrick_28Nov2023a \
-sample_size 1000 \
-sample_number 2 \
-RM_t 0.25 \
-cpu 8

(I have set sample_size extremely low here as a practice run: I was running into issues and want to iron them out before running with the full sample size.)

The Trinity steps seem to run fine, but then it hits a snag when it comes to the RepeatMasker stages:


#######################################
### REPEATMASKER to anotate contigs ###
#######################################

RepeatMasker version 4.1.3
Search Engine: NCBI/RMBLAST [ 2.11.0+ ]

Using Master RepeatMasker Database: /opt/RepeatMasker/Libraries/RepeatMaskerLib.h5
  Title    : Dfam
  Version  : 3.6
  Date     : 2022-04-12
  Families : 19,025

Species/Taxa Search:
  Homo sapiens [NCBI Taxonomy ID: 9606]
  Lineage: root;cellular organisms;Eukaryota;Opisthokonta;Metazoa;
           Eumetazoa;Bilateria;Deuterostomia;Chordata;
           Craniata <chordates>;Vertebrata <vertebrates>;
           Gnathostomata <vertebrates>;Teleostomi;Euteleostomi;
           Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;
           Mammalia;Theria <mammals>;Eutheria;Boreoeutheria;
           Euarchontoglires;Primates;Haplorrhini;Simiiformes
  1339 families in ancestor taxa; 49 lineage-specific families

Building general libraries in: /opt/RepeatMasker/Libraries/CONS-Dfam_3.6/general
RepeatMasker::createLib(): Error invoking /opt/rmblast/bin/makeblastdb on file /opt/RepeatMasker/Libraries/CONS-Dfam_3.6/general.working/is.lib.
Traceback (most recent call last):
  File "dnaPipeTE.py", line 698, in <module>
    RepeatMasker(config['DEFAULT']['RepeatMasker'], args.RepeatMasker_library, args.RM_species, args.cpu, args.output_folder, args.RM_threshold)
  File "dnaPipeTE.py", line 381, in __init__
    self.repeatmasker_run()
  File "dnaPipeTE.py", line 400, in repeatmasker_run
    with open(self.output_folder+"/Trinity.fasta.out", 'r') as trinity_handle:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/Patrick_28Nov2023a/Trinity.fasta.out'

In the output folder, there is no file called 'Trinity.fasta.out', although there is one called 'Trinity.fasta'.

There seem to be two issues here: (1) makeblastdb fails, and (2) 'Trinity.fasta.out' is never created.
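
These two issues are most likely one: RepeatMasker writes 'Trinity.fasta.out' only when it finishes, so if makeblastdb fails, dnaPipeTE goes on to open a file that was never created. A minimal shell sketch for checking which stage stopped, assuming the output folder layout shown above (the function name check_rm_output is hypothetical and not part of dnaPipeTE):

```shell
# Hypothetical helper: report whether the Trinity assembly and the
# RepeatMasker annotation table exist in a dnaPipeTE output folder.
check_rm_output() {
    outdir="$1"
    if [ ! -f "$outdir/Trinity.fasta" ]; then
        echo "missing Trinity.fasta: the Trinity stage did not finish"
        return 2
    fi
    if [ ! -f "$outdir/Trinity.fasta.out" ]; then
        echo "missing Trinity.fasta.out: RepeatMasker did not finish"
        return 1
    fi
    echo "RepeatMasker output found"
    return 0
}
```

For the run above this would be called as: check_rm_output /mnt/Patrick_28Nov2023a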

I am not aware of a Hymenoptera-specific library (which would be relevant for my species), so I have kept the default library 'RepeatMaskerLib.h5'. I hope that is acceptable.

Many thanks if you can help solve this issue!

As a quick note, I've also tried downloading RepeatMasker.lib from an online source and pointing to it using -RM_lib:

python3 dnaPipeTE.py -input /mnt/data/SAMPLE_R1.fastq.gz -output /mnt/Patrick_27Nov2023 -sample_size 1000 -sample_number 2 -RM_t 0.25 -cpu 8 -RM_lib /mnt/data/RepeatMasker.lib

...but this leads to the same error:

RepeatMasker version 4.1.3
Search Engine: NCBI/RMBLAST [ 2.11.0+ ]
Using Custom Repeat Library: /mnt/data/RepeatMasker.lib

Building general libraries in: /opt/RepeatMasker/Libraries//general
RepeatMasker::createLib(): Error invoking /opt/rmblast/bin/makeblastdb on file /opt/RepeatMasker/Libraries//general.working/is.lib.
Traceback (most recent call last):
  File "dnaPipeTE.py", line 698, in <module>
    RepeatMasker(config['DEFAULT']['RepeatMasker'], args.RepeatMasker_library, args.RM_species, args.cpu, args.output_folder, args.RM_threshold)
  File "dnaPipeTE.py", line 381, in __init__
    self.repeatmasker_run()
  File "dnaPipeTE.py", line 400, in repeatmasker_run
    with open(self.output_folder+"/Trinity.fasta.out", 'r') as trinity_handle:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/Patrick_27Nov2023/Trinity.fasta.out'

Problem solved!

Turns out I was encountering the same issue as this thread:
Dfam-consortium/RepeatMasker#148

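The problem was simply insufficient memory allocation: makeblastdb could not allocate the memory it requests for its LMDB database.

Adding the following line in the shell before running dnaPipeTE solved the issue: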
export BLASTDB_LMDB_MAP_SIZE=100000000
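
A sketch of where that line goes, for anyone hitting the same error: the variable has to be exported in the same shell session that launches dnaPipeTE (here, inside the container) so that makeblastdb inherits it. The wrapper function run_dnapipete_test is hypothetical; the command inside it is the practice run from the first post.

```shell
# Set the LMDB map size for makeblastdb (value taken from the fix above).
export BLASTDB_LMDB_MAP_SIZE=100000000

# Hypothetical wrapper around the practice run from the first post.
# Call it from the directory that contains dnaPipeTE.py.
run_dnapipete_test() {
    python3 dnaPipeTE.py \
        -input /mnt/data/SAMPLE_R1.fastq.gz \
        -output /mnt/Patrick_28Nov2023a \
        -sample_size 1000 \
        -sample_number 2 \
        -RM_t 0.25 \
        -cpu 8
}

# Exported variables are inherited by child processes; print what a child sees.
sh -c 'echo "BLASTDB_LMDB_MAP_SIZE=$BLASTDB_LMDB_MAP_SIZE"'
```

Then call run_dnapipete_test in the same session. Passing -e BLASTDB_LMDB_MAP_SIZE=100000000 to docker run when starting the container should have the same effect, although I have only used the export line.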

All the best,

Patrick