DessimozLab/read2tree

AttributeError when running OGset.py

Opened this issue ยท 11 comments

Hi Sina,

I've been trying to run read2tree in multispecies mode using a custom set of marker genes (200) for 24 primate species I downloaded from OMA using the tutorial you've provided. The first step of building the 01-03 folders worked but when running with the reads of the first species, it abruptly stops after finishing the mapping step, just when it starts to run OGset.py.

Traceback (most recent call last): File "/scratch_isilon/groups/compgen/scuadros/r2t/bin/read2tree", line 4, in <module> __import__('pkg_resources').run_script('read2tree==0.1.5', 'read2tree') File "/scratch_isilon/groups/compgen/scuadros/r2t/lib/python3.10/site-packages/pkg_resources/__init__.py", line 706, in run_script self.require(requires)[0].run_script(script_name, ns) File "/scratch_isilon/groups/compgen/scuadros/r2t/lib/python3.10/site-packages/pkg_resources/__init__.py", line 1555, in run_script exec(script_code, namespace, namespace) File "/scratch_isilon/groups/compgen/scuadros/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/EGG-INFO/scripts/read2tree", line 16, in <module> File "/scratch_isilon/groups/compgen/scuadros/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/main.py", line 360, in main File "/scratch_isilon/groups/compgen/scuadros/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/OGSet.py", line 494, in add_mapped_seq AttributeError: 'Mapper' object has no attribute 'og_records'
I have installed read2tree from source, creating an environment with conda, installing all dependencies with conda and then downloading and installing read2tree. I tested the installation with the toydataset in read2tree/tests and it worked without problems.

I've attached the mplogfile as well.
mplog.log

Lastly I'm running this in an Slurm Scheduled HPC, asking for 70 GB of memory.

I don't know if this is related to the fastqs I am giving as input, or the installation of read2tree. I would appreciate if you have an idea on this problem.

Many thanks!

Sebastian

Hi Sebastian
Thanks for contacting us. I closely read the description and the log. I appreciate all the details you provided us.
However, I just ran read2tree with the same procedure for marker gene set and a read dataset and I couldn't reproduce the error.

It would be great if you could start over from an empty folder and run with read2tree --debug, so we will have the bam files. The new run will help us to know whether there was a rare memory/disk issue of the system.

I would be happy to try read2tree with your read dataset if it is possible to share it with us.

Best,
Sina

Ps. It would be also helpful to make sure that all the .phy files are still in the folder 03_align_aa and 03_align_dna.
You could check or send us a file with output of ls -alth output/*.

output/03_align_dna:
total 14M
drwxr-sr-x 12 usr -pr-g 4.0K Mar 11 13:21 ..
-rw-r--r--  1 usr -pr-g  39K Mar 11 11:50 OG1207049.phy
-rw-r--r--  1 usr -pr-g  77K Mar 11 11:50 OG1201791.phy
drwxr-sr-x  2 usr -pr-g 8.0K Mar 11 11:50 .
-rw-r--r--  1 usr -pr-g  38K Mar 11 11:50 OG1196579.phy
-rw-r--r--  1 usr -pr-g  40K Mar 11 11:50 OG948889.phy
-rw-r--r--  1 usr -pr-g 106K Mar 11 11:50 OG962117.phy
-rw-r--r--  1 usr -pr-g  73K Mar 11 11:50 OG971152.phy
...
output/04_mapping_SAMPLE:
total 8.8M
drwxr-sr-x 12 usr -pr-g 4.0K Mar 11 13:21 ..
-rw-r--r--  1 usr -pr-g 9.9K Mar 11 13:21 MACMU_OGs_sc.txt
drwxr-sr-x  2 usr -pr-g 8.0K Mar 11 13:21 .
-rw-r--r--  1 usr -pr-g 8.1K Mar 11 13:21 MACMU_OGs_cov.txt
-rw-r--r--  1 usr -pr-g 291K Mar 11 13:21 MACMU_OGs_consensus.fa
-rw-r--r--  1 usr -pr-g  78K Mar 11 13:21 MACMU_OGs.fa.bam
-rw-r--r--  1 usr -pr-g 9.4K Mar 11 13:21 PANPA_OGs_sc.txt
-rw-r--r--  1 usr -pr-g 8.0K Mar 11 13:21 PANPA_OGs_cov.txt
-rw-r--r--  1 usr -pr-g 264K Mar 11 13:21 PANPA_OGs_consensus.fa
...

Hi Sina,

Thanks for your reply! I have run the same command with teh --debug option, here's the associated mplog file
mplog_wdebug.log

The .phy files in the 03 directories appear to be there. Here's the output of the ls of the ouput directory
list_out.txt

I would really appreciate if you could try from your side with my input , but it appears the fastqs are a bit too big (376 MB) for github, is there a way I can share them with you via a drive or something else? Basically these are a 4% subset of the reads for samples from Kuderna et al.2023. I can also try and subset a smaller % of the reads. Let me know what suits you best.

Thanks again!

Sebastian

Thanks. I assume you got the same error (the error appearing in terminal is not reported in the log file). Could you please try google drive to upload the your read set and share the link with me?

Yes the error is the same from before. Here's the link for the folder, it contains the fastqs for one sample, the folder with the marker genes and the dna_ref fasta with all the OGs. Hope you can reproduce it from your side.

Thanks again!

Sebastian

Thanks for providing the data.

It seems that the sequencing coverage is very low to generate the consensus sequence for building the MSA and tree as all of the 04_mapping_PD_0001_paired_subset_1/{species}_OGs_cov.txt are nan, they should be some integer values in at least few of genes. I admit that the error report and logging should be improved.

I can see there are around 2.6m reads in the sample so the sequencing coverage would be 0.2x. We had some cases that the we were able to infer trees with such coverage, however this depends on the sampling strategy of reads and the gene markers. In your case, very few of the reads were mapped to the 200 genes.

Would it be possible for you to run with more reads? maybe a coverage of 5x, 5*3Gbp/300bp=50m reads.

Best,
Sina

Hi Sina,

Thanks for running it from your side!

Interesting, yes the coverage factor makes sense, specially given that there's only 200 markers.

The sampling setup was originally designed to give as input to MitoFinder to eventually build mitochondrial phylogenies. 4% of a single fastq seemed to do the trick for most cases (also because the program used A LOT of memory and given it a higher proportion of reads would have been a computational burden).

Also a small detail is that these reads have been trimmed with cutadapt. From what I've understood read2tree gets rid of the necessity of all these filtering procedures, so I don't know if that can affect something (or maybe not at all).

I will try with sampling a higher proportion of reads to get to the numbers you propose and run read2tree again. I'll let you know if it works then.

Thanks a lot again.

Sebastian

Hi Sina,

I've been trying to run with different subsampling levels and even giving it the whole fastq (from a given lane, the samples are multiplexed so around 65M reads) but the error stays the same. I've also tried other samples to no avail. The *_OGs_cov.txt files only have nan values, which I don't finish to understand why. The 24 species chosen for the OGs should be representative of the whole primate phylogeny (also because there are only primates available with OMA).

Also the first step to generate the 01-03 directories should only be run once unless you change the set of OGs/species you build your reference with right?

I can also get the full fastq for a given sample (no demultiplex) by converting alignments we already produced, to have full 30X coverage. But other than that I don't have much more ideas to explore.

Thanks.

Sebastian

Thanks for the follow-up and sorry for late reply. I'm downloading the ERR10941432 dataset and I will run read2tree and update you afterwards.

It looks working

$  ls -alth ERR10941432.fastq 
-rw-r--r-- 1 s r 25G Mar 31 13:58 ERR10941432.fastq

$ read2tree --debug  --tree     --standalone_path  marker_genes  --reads  ../data/ERR10941432.fastq   --dna_reference  dna_ref.fa   --threads 48 --output_path output1  


 $ head  04_mapping_ERR10941432//MACFA_OGs_cov.txt 
#species,og,gene_id,coverage,std
MACFA,OG960714,MACFA25325,1.4259043173862311,0.6619733979873893
MACFA,OG969202,MACFA24806,2.400735294117647,1.4314223861522688
MACFA,OG1134907,MACFA25459,1.637313432835821,0.7773540695983083
MACFA,OG1185253,MACFA12063,nan,nan
MACFA,OG1215693,MACFA25545,1.9624470018170805,1.0223640121290376
MACFA,OG975665,MACFA05452,2.206997084548105,0.7487661512532715

could you also send us the mplog.log file, the output log, and the command line?
I'm also wondering whether you started a new folder when re-running read2tree with the new dataset. or (or removing the mplog.log file and the output folder). Otherwise, read2tree uses the faulty output of previous unfinished run.

Ps. this is the tree that I got (after rooting, output of IQTree/read2tree is not rooted), The leaves names are mentioned in https://omabrowser.org/oma/export_markers after selecting primates. The sample is grouped with the rest of species Platyrrhini in the reference gene markers, which looks good. Also, all the Hominidae from the reference are grouped together as well.

ERR10941432: Cheracebus lugens
Platyrrhini; Pitheciidae; Callicebinae; Cheracebus

AOTNA : Aotus nancymaae
Platyrrhini; Aotidae; Aotus

CALJA: Callithrix jacchus
Platyrrhini; Cebidae; Callitrichinae; Callithrix; Callithrix

CEBIM: Cebus imitator
Platyrrhini; Cebidae; Cebinae; Cebus

SAIBB: Saimiri boliviensis boliviensis
Platyrrhini; Cebidae; Saimiriinae; Saimiri; Saimiri boliviensis

Screenshot 2024-03-31 at 20 00 50

Hi Sina,

Apologies for the late reply. I've run again read2tree in a couple of different samples and it worked this time around with 65-69M reads! When it was failing before with the same samples I did delete the output directory but not the mplog file so I don't know if that could be the reason why it failed previously.

My question now is, if read2tree is run in multispecies mode, and for one pair of sequences it fails. Should the mplogfile be deleted together with the respective subdirectories in the output directory before fixing the error and launching read2tree again?

Thanks a lot for the help. I think with this last question resolved the issue can be closed.

Sebastian

You're welcome.

Read2tree tries to parse the mplog.log file and output folder and estimate the progress and resume it. I'm not sure how it can handle such scenario. Sorry for that. @dvdylus who wrote the code might be aware of. Otherwise, I would find out by trial and error.