agshumate/Liftoff

Running time

Closed this issue · 22 comments

Hi,

I am using Liftoff for a genome ~50 Mb in size with ~12,000 annotated genes. The GFF file includes 132471 features of the following types: CDS, RNase_MRP_RNA, RNase_P_RNA, SRP_RNA, exon, five_prime_UTR, gene, mRNA, ncRNA, ncRNA_gene, pseudogene, pseudogenic_transcript, rRNA, snRNA, snoRNA, tRNA and three_prime_UTR.

It just finished running with these parameters: -p 20 -copies -a 0.4 -s 0.4 -sc 0.5. The intermediate sam file is 183 Mb in size. The whole process seemed to take about 15 hours. Is this running time something you would expect? How long did it take for you to run this on the chimpanzee genome and the wheat genome?

hi, that is much longer than expected. it takes a little over an hour for the chimpanzee genome. are there features such as "region" or similar in your GFF file in addition to those you listed?

For each chromosome, there is this line:

1 Broad Institute chromosome 1 7978604 . . . ID=chromosome:1;Alias=CM001231.1

Do you think this could be the cause of the slow running time?

yes i believe that is the problem. i am working on a fix for this now, but in the meantime, if you remove those lines from your gff file it should be much faster.

With only the gene, mRNA, exon and CDS features, it takes less than 2 minutes per genome!
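
In case it helps anyone else hitting this, here is a minimal sketch of that kind of pre-filtering (the filenames are placeholders, and the set of feature types to keep is just the one mentioned above):

keep = {"gene", "mRNA", "exon", "CDS"}  # feature types to retain
with open("annotation.gff3") as src, open("annotation.filtered.gff3", "w") as dst:  # placeholder names
    for line in src:
        if line.startswith("#"):  # keep header/comment lines
            dst.write(line)
        else:
            fields = line.split("\t")
            if len(fields) >= 3 and fields[2] in keep:  # column 3 is the feature type
                dst.write(line)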

Hey, I'm having a similar issue. Trying to lift over the UCSC refGene gtf file and it's been running for a few days.

Here's the command:

liftoff -g hg38.refGene.gtf.gz -f feature_types.txt chm13.draft_v1.0.fasta.bgz hg38.fa.bgz

The contents of feature_types.txt:

3UTR
5UTR
CDS
exon
start_codon
stop_codon
transcript

And the last few lines of output:

extracting features
2020-10-01 01:22:01,115 - INFO - Committing changes: 1893000 features
2020-10-01 01:22:01,736 - INFO - Populating features table and first-order relations: 1893859 features
2020-10-01 01:22:01,736 - INFO - Creating relations(parent) index
2020-10-01 01:22:03,880 - INFO - Creating relations(child) index
2020-10-01 01:22:06,128 - INFO - Creating features(featuretype) index
2020-10-01 01:22:07,328 - INFO - Creating features (seqid, start, end) index
2020-10-01 01:22:09,300 - INFO - Creating features (seqid, start, end, strand) index
2020-10-01 01:22:11,421 - INFO - Running ANALYZE features

Any suggestions?

Hi there,
First, i recommend updating to the latest version released friday (v1.5.0). Secondly, if you have an annotation file in gff3 format rather than gtf, i recommend using that. I've seen some cases with gtf files where the parent/child relationships are not read in correctly, causing the program to hang. That issue should be resolved in 1.5.0, but if you still experience problems, try a gff3 file. gffread can convert your file from gtf to gff3 if you don't have access to a gff3 already.

Hey, so I fiddled around with this some more and the long runtime actually comes from gffutils. Do you happen to have a link to some hg38 or hg19 GFF annotations that you've run liftoff on in a reasonable amount of time?

hmm, is it still running after a few days? I haven't seen that issue before with regard to gffutils, but i will definitely look into it with that file. I have used the GenBank annotation without issue: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13/GCA_000001405.28_GRCh38.p13_genomic.gff.gz

Unfortunately yes. I got it to work with the Gencode gff3 file but ran out of RAM (16Gb) and am now breaking up the file into individual chromosomes and re-running.

The slow runtime of gffutils is a longstanding issue: https://www.biostars.org/p/152517/

Just thinking out loud, but I wonder if it's possible to do away with it and parse the GFF file directly to get the relationship hierarchy.
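
For what it's worth, a rough sketch of what parsing the hierarchy directly could look like (the filename is a placeholder, and real GFF3 attribute parsing needs more care than this):

from collections import defaultdict
children = defaultdict(list)  # parent feature ID -> list of child feature IDs
with open("annotation.gff3") as gff:  # placeholder filename
    for line in gff:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        attrs = dict(kv.split("=", 1) for kv in fields[8].split(";") if "=" in kv)
        feature_id, parents = attrs.get("ID"), attrs.get("Parent")
        if feature_id and parents:
            for p in parents.split(","):  # a feature can list several parents
                children[p].append(feature_id)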

Anyway, thanks for the input and the link. I'll poke around some more.

Sorry, slight correction: it looks like it's get_gene_sequences that takes the bulk of the run time. I must be doing something silly.

Here's the problematic line for me:

parent_seq = chrom_seq[parent.start - 1: parent.end].seq

Reading each individual gene takes forever (some up to 16 seconds). Switching to reading the entire chromosome in the beginning and taking sequence subsets results in much faster performance. I'll try to get a PR ready some time today for you to try out. I'd be really curious to see if you see the same results.
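
Roughly, the idea is something like this sketch (not the actual PR code; the genome path and the parent feature object are placeholders, and it assumes sequences are read with pyfaidx as liftoff does):

from pyfaidx import Fasta
genome = Fasta("target_genome.fa")  # placeholder path
chrom_cache = {}  # chromosome name -> full sequence string, read once
def get_parent_seq(parent):
    # Instead of slicing genome[parent.seqid][parent.start - 1:parent.end].seq for
    # every gene, load each chromosome into memory once and take plain substrings.
    if parent.seqid not in chrom_cache:
        chrom_cache[parent.seqid] = str(genome[parent.seqid])
    return chrom_cache[parent.seqid][parent.start - 1:parent.end]

The trade-off is holding one full chromosome sequence in memory at a time, which should be negligible at these genome sizes.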

that line is subsetting the chromosome sequence, which is only read once. I'm not sure I understand how what you are suggesting is different, but perhaps i will see in the PR. I'm trying out a couple things on my end too so hopefully between the two of us we can resolve this soon!

Here it is: #51

If you have a moment, would you mind trying this to see if you get reasonable results?

Getting closer. Got reference_all_genes.fa in the intermediate_files directory but for some reason minimap2 doesn't generate any alignments.

Sorry for the spam but just to follow up on why minimap2 didn't generate any alignments: the indexing step had failed and the .mmi file was of size 0. So there was a file there, but it was empty, which led to no alignments being generated. Recreating the file fixed the problem.
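
If anyone wants to guard against this, a tiny sanity check along these lines would catch the empty-index case early (the path is just an example, not necessarily the exact file liftoff writes):

import os
mmi = "intermediate_files/reference_all_genes.mmi"  # example path, adjust to your run
if not os.path.isfile(mmi) or os.path.getsize(mmi) == 0:
    raise RuntimeError(f"minimap2 index {mmi} is missing or empty; rebuild it before aligning")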

Finally got some lifted-over features!!! 🎉 🎉 Thanks so much for your help.

hi there, i am glad you got it working! however, i would still like to investigate this a bit more. when i tested it on my end, lifting GRCh38 over to GRCh38 again, i used this gtf file: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.refGene.gtf.gz

is this the same one you are using? when i tested it without the pull request, it finished in about 6 hours for me, and the bulk of the time was spent in the alignment step. I did not experience it running for several days like you described. I also tested with your pull request and that worked just fine; however, the total run time was not really affected since, for me, the alignment step was the limiting factor.

@agshumate
Just in case it helps, I'm facing the same situation with the hang. I have the problem with a gff3 file downloaded from Ensembl (a non-human species):
ftp://ftp.ensembl.org/pub/release-102/gff3/bos_taurus/Bos_taurus.ARS-UCD1.2.102.gff3.gz

Liftoff (v1.5.1) hangs here:

2021-02-04 14:20:58,194 - INFO - Populating features table and first-order relations: 1292379 features
2021-02-04 14:20:58,195 - INFO - Updating relations
2021-02-04 14:21:20,789 - INFO - Creating relations(parent) index
2021-02-04 14:21:22,416 - INFO - Creating relations(child) index
2021-02-04 14:21:25,612 - INFO - Creating features(featuretype) index
2021-02-04 14:21:28,328 - INFO - Creating features (seqid, start, end) index
2021-02-04 14:21:30,197 - INFO - Creating features (seqid, start, end, strand) index
2021-02-04 14:21:31,966 - INFO - Running ANALYZE features

Removing the undesired lines from the gff3 file didn't help the process either:

zcat Bos_taurus.ARS-UCD1.2.102.gff3.gz | awk '($0~/^#*[A-z]|^#\!/)||($3~/^region|gene|CDS|exon|RNA|UTR/)' | pigz -p8 > Bos_taurus.ARS-UCD1.2.102_noRegion.gff3.gz

The diff -y of the two gff3 files is shown in the screenshot attached to the original issue.

Currently running with these options, and it seems to be stuck at the same line (ANALYZE):

liftoff -g Bos_taurus.ARS-UCD1.2.102_noRegion.gff3.gz \
-o ARS-UCD1.2_Btau5.0.1Y.gff3 \
-u unmapped_features.txt \
-exclude_partial \
-copies \
-chroms chroms.txt \
-p 8 \
Bos_taurus.ARS-UCD1.2.dna.toplevel.fa.gz ARS-UCD1.2_Btau5.0.1Y.fa.gz

Any help is appreciated

Hi there,
I was not able to reproduce this with Bos_taurus.ARS-UCD1.2.102.gff3.gz. For me, liftoff finished successfully in less than 1 hour. Have any files been created in the intermediate_files directory?

Yes, after 16 hrs there were only chr1 and chr10 fa files in intermediate_files. I'm still trying to get it to run, but it doesn't seem to be using parallel processes at all. The 16-hour mark was on an HPC (8 CPUs, shared memory, 64 GB). On a local machine I have the same problem of an extremely long runtime. Just in case it helps, this is the conda env running the tool:

channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - ca-certificates=2021.1.19=h06a4308_0
  - certifi=2020.12.5=py38h06a4308_0
  - ld_impl_linux-64=2.33.1=h53a641e_7
  - libedit=3.1.20191231=h14c3975_1
  - libffi=3.3=he6710b0_2
  - libgcc-ng=9.1.0=hdf63c60_0
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - ncurses=6.2=he6710b0_1
  - openssl=1.1.1i=h27cfd23_0
  - pip=20.3.3=py38h06a4308_0
  - python=3.8.5=h7579374_1
  - readline=8.1=h27cfd23_0
  - setuptools=52.0.0=py38h06a4308_0
  - sqlite=3.33.0=h62c20be_0
  - tk=8.6.10=hbc83047_0
  - wheel=0.36.2=pyhd3eb1b0_0
  - xz=5.2.5=h7b6447c_0
  - zlib=1.2.11=h7b6447c_3
  - pip:
    - argcomplete==1.12.2
    - argh==0.26.2
    - biopython==1.76
    - decorator==4.4.2
    - gffutils==0.10.1
    - interlap==0.2.6
    - liftoff==1.5.2
    - networkx==2.4
    - numpy==1.19.0
    - pyfaidx==0.5.8
    - pysam==0.16.0.1
    - simplejson==3.17.2
    - six==1.15.0
    - ujson==3.2.0

@MazdaX if you're feeling adventurous, you could try the code in #51

That's what sped up execution significantly for me.

After more investigation, I've realized that the issue is related to using bgzipped fasta files. The way it's currently implemented, extracting the gene sequences from bgzipped files is very slow. The easiest fix for now is just to decompress your fasta files before running liftoff. Peter's code in #51 also appears to fix this.
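
For reference, since bgzip output is an ordinary gzip-compatible stream, you can either run bgzip -d on the file or decompress it with a few lines of Python; a minimal sketch (the paths are placeholders for your own files):

import gzip, shutil
# bgzipped FASTA -> plain FASTA so the gene sequences can be extracted quickly
with gzip.open("genome.fa.bgz", "rb") as src, open("genome.fa", "wb") as dst:
    shutil.copyfileobj(src, dst)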

Indeed, the bgzip compression was the issue. Thanks a lot for both your solutions @pkerpedjiev @agshumate!