lh3/fermi

fermi hangs on a very small dataset


I've run fermi on a very small dataset containing 22 FASTA records using the following command:

run-fermi.pl -k 200 -p cdhitout_0.85 <reads.fa>  | make -f -

However, fermi hangs indefinitely. When I look at top, I can see that fermi ropebwt is constantly in the sleep state:

45288 uqcskenn  20   0 24188  740  584 S    3  0.0   1:08.84 fermi ropebwt -a bcr -v3 -btf cdhitout_0.85.ec.tmp -
45447 uqcskenn  20   0 24188  740  584 S    2  0.0   1:08.00 fermi ropebwt -a bcr -v3 -btf cdhitout_0.90.ec.tmp -

I've tried both the git HEAD and release 1.1.

<reads.fa> contains:

>M00920:10:000000000-A292A:1:1101:2305:13136:1
CTTCTGGTGAAACCCACTCCCATGGTGTGACGGGCGGTGTGTACAAGACCCGGGAACGTATTCACCGCGACATGCTGATCCGCGATTACTAGCGATTCCGACTTCACGCAGTCGAGTTGCAGACTGCGATCCGGACTACGATCGGCTTTGTGAGATTCGCTCCGCCTCGCGGCTTGGCAACCCTCTGTACCGACCATTGTATGACGTGTGAAGCCCTACCCATAAGGGCCATGAGGACTTGACGTCATCCCCACCTTCCTCCGGTTTGTCACCGGCAGTCTCGTTAAAGTGCCCAACCAAATGATGGCAATTAACGACAAGGGTTGCGCTCGTTGCGGGACTTAACCCAACAT
>M00920:10:000000000-A292A:1:1101:24216:16298:1
CCCTTATCCTTAGTTACCAGCACCTCGGGTGGGCACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAGGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCAGGGCTACACACGTGCTACAATGGTCGGTACAAAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCATAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGTGAATCAGAATGTCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCATCACACCATGGGAGTGGGTTGCTCCAGAAGTAGCTAGTCTAACCGCAAGGGGGACGGTTACCA
>M00920:10:000000000-A292A:1:1110:4340:7240:1
CAGATTGAACGCTGGCGGCATGCTTTACACATGCAAGTCGAACGGCAGCGGGGGCTTCGGCCCGCCGGCGAGTGGCGAACGGGTGAGTAATGCATCGGAACGTACCCATGTTGTGGGGGATAACGTAGCGAAAGCTACGCTAATACCGCATAAGCCCTGAGGGGGAAAGCGGGGGATTCTTCGGAACCTCGCGCAATTGGAGCGGCCGATGTCAGATTAGCTAGTTGGTAGGGTAAAGGCCTACCAAGGCGACGATCTGTAGCGGGTCTGAGAGGATGATCCGCCACACTGGGACTGAGACACGGCCCGGACTCCTCCGGGAGGCAGCAGTGGGGAATTTTGGACAATGGGCGCAAGGGTGATC
>M00920:10:000000000-A292A:1:1110:21042:16009:1
ACCCAGGGGGCTGCCTTCGCCATCGGTGTTCCTCCACATCTCTACGCATTTCACTGCTACACGTGGAATTCCACCCCCCTCTGCCACACTCGAGCCTTGCAGTCACAAACGCATTTCCCAGGTTAAGCCCGGGGATTTCACATCTGTCTTACAAAGCCGCCTGCGCACGCTTTACGCCCAGTAATTCCGATTAACGCTCGCACCCTACGTATTACCGCGGCTGCTGGCACGTAGTTAGCCGGTGCTTGTTCTTCAGTTCCCGTCATTGACAGTCTATGTTAGACCCCGCCGTTTCGTTCCTGCCGAAAGAGCTTTACAACCCGAAGGCCTTCTTCACTCACGCGGAATGGCTGGATCAGGGT
>M00920:10:000000000-A292A:1:1101:19922:4365:1
ATCTAATCCTGTTTGCTCCCCACGCTTTCGTGCATGAGCGACAGACCAGGTCCAGGGGGCTGCCTTCGCCTTCGATGTTCCTCCTGATATCTACGTATTTCACTGCTACACCCGGATTTCCACCCCCCTCTACCGCACTCTAGGCACACAGTCACAAACGCATTTCCCAGGTTAAGCCCGGGGGTTTCAAATCTGAATTATTTAACCGCCTGCGCACGCTTTACGCCCAGTAATTCCGATTAACGCTCGCACCCTCGGTATGACCGCGACTGCCAGCGGGTAGGAAGGCGGTACTTTTTATTCCGGTGCCGACATCCTCCCCGGATATTCACCGCGGCTATTTCTTTCCGTCCGACAGAGGTGTAAAACCCGAAGGCGAGCTTG
>M00920:10:000000000-A292A:1:1101:18095:13295:1
GGAGGCAGCAGTGGGGAATTTTGGACAATGGGCGGAAGCCTGATCCAGCCATGCCGCGTGAGTGAAGAAGGCCTTCGGGTTGTAAAGCTCTTTCGGTGGGGAAGAAATTGCACGGGTTAATACCCTGTGTAGATGACGGTACCCGACTAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTGGTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTTGAGACTGCCAAGCTGGAGTGTGGCAGAGGGGGGTGGAATTCCACGTGTAGCAGTGAAATGCGTAGAGATCAGGAG
>M00920:10:000000000-A292A:1:2102:3086:14182:1
GTAGTGACCCAGGGGGCTGCCTTCGCCATCGGTGTTCCTCCACATCTCTACGCATTTCACTGCTACACGTGGAATTCCACCCCCCTCTGCCACACTCCAGCCTGGCAGTCTCAAATGCAGTTCCCAGGTTGAGCCCGGGGCTTTCACATCTGACTTACCAAACCGCCTGCGCACGCTTTACGCCCAGTAATTCCGATTAACGCTCGCACCCTACGTATTAACGCGGCTGCTGGCACGTAGTTCGCCGGTGCTTCTTAGTCGGGTACCGTCATCTACACAGGATATTAGCCCGTGCAATTTCTTCCCCACCGAAAGAGCTTTACAACCCGAAGGCCTTCTTCACTCACGCGGCATGGCTGGATCAGGCTTCCGCCC
>M00920:10:000000000-A292A:1:2108:13711:22806:1
GATTAAACGCTGGCGGCATGCCTTACACATGCAAGTCGAACGGCAGCACGGGGGCAACCCTGGTGGCGAGTGGTGGACGGGTGAGTAAAGCATCGGAACGTATCCTGAAGTGGAGTATAACGTAGCGAAAGTTACGCTAATACCGCATAGTCTGTGAGCAGGAAAGCAGGGGATCGCAAGACCTTGCGCTCTGGGAGCGGCCGATGTCGGATTAGCTAGTTGGGGGGGTAAAGGCCTACCAAGGCGCGGCTCCGTAGCGGGGATTGGAGTATGAAACGCCACACTGTGACTGAGAAACGGCCCGGACTCCTACGTGAGGAAGCAGCGGTGAATTTTTTCCAATGGGTTCAAGCC
>M00920:10:000000000-A292A:1:2110:11377:9313:1
GCATCGGAACGTGCCCTGGAATGGGGGATAACGTAGCGAAAGTTACGCTAATACCGCATATTCTGTGAGCAGGAAAGCAGGGGATCGCAAGACCTTGCGTTCTGGGATCGGCCGATGTCGTATGAGCTAGTTGGTGGGGAAAAGGCCTACCACGGCGACGATCCGTAGCGGGTCTGAGAGGATGATCCGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCCGTGGGGAATTTTGGACAATGGGCGCAAGCCTGATCCAGCCATGCCGCGTGAGTGAAGAAGGCCTTCGGGTTGTAAAGCTCTTTCGGTGGGGAAGAAATTGCATGGGTTAATTCCC
>M00920:10:000000000-A292A:1:1105:17264:25408:1
GAATTACTGGGCGTAAAGCGTGCGCAGGCGGCGCCATAAGACAGACGTGAAATCCCCGGGCTTAACCTGGGAACTGCGTTTGTGACTGTGGTGCTCGAGTGTGGCAGAGGGGGGTGGAATTCCACGTGTAGCAGTGAAATGCGTAGAGATGTGGAGGAACACCGATGGCGAAGGCAGCCCCCTGGGTCAACACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCCTAAACGATGCGAACTAGGTGTTGGGGAAGGAGACGTTCTTAGTACCGCAGCTAACGCGTGAAGTTCGCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATGGACA
>M00920:10:000000000-A292A:1:2105:19316:26848:1
ATCCGTAGCTGGTCTGAGAGGACGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATTTTGGACAATGGGGGCAACCCTGATCCAGCCATTCCGCGTGAGTGAAGAAGGCCTTCGGGTTGTAAAGCTCTTTCAGCAGGAACGAAACGGCTCTCTCTAACATAGGGAGTTAATGACGGTACCTGAAGAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCACAGGCGGCGCCATAAGACAGATGTGAAATCCCCGGGCTTAACCTGGGAAC
>M00920:10:000000000-A292A:1:1111:13173:15398:1
TGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCACAGAACTTGCCAGAGATGGCTTGGTGCCTTCGGGAACTGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCACCGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTTCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGCTGAAGTCAAGTCATCATGGCCCTTATGGGTAGGGCGTCACACGTCATACAATGGTCGGAACAGAGGGTTGCCAAGCCGCGAGGTGGAGCCAATCCCAGAAAACCGATCGTAGTCCGGATCGC
>M00920:10:000000000-A292A:1:1102:8010:26367:1
GCCTTACACATGCAAGTCGAACGGCAGCGGAACTTCGGGTGCCGGCGAGTGGCGAACGGGTGAGTAATGCATCGGAACGTGCCATTGAGTGGGGGATAACGTAGCGAAAGTTGCGCTAATACCGCATATTCTGTGAGCAGGAAAGCAGGGGACCGCAAGGCCTTGCGCTCTTTGAGCGGCCGATGTCAGATTAGCTAGTTGGTGAGGTAAAGGCTTACCAAGGCGACGATCTGTAGCGGGTCTGAGAGGATGATCCGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATTTTGGACAATGGGGGCAACCCTGATCCAGCCATGCCGCGTGAGTGAAGAAGGCCTTCGGGT
>M00920:10:000000000-A292A:1:1106:8344:21464:1
GTTCCTACCATTGTAGCACGTGTGTAGCCCTGGGCATAAAGGCCATGATGACTTGACATCATCCCCTCCTTCCTCGCGTCTTACGACGGCAGTTTCTTTAGAGTTCCCAGCTTAACCTGTTGGCAACTAAAGATAGGGGTTGCGCTCGTTGCGGGACTTAACCCAACACCTCACGGCACGAGCTGACGACAGCCATGCAGCACCTGTGTGACGGCTCCCTTTCGGGCACCCTCAACTCTCATCGAGGTTCCGTCCATGTCAAGGGTAGGTAAGGTTTTTCGCGTTGCATCGAATTAATCCACATCATCCACCGCTTGTGCGGGTCCCCGTCAATTCCTTTGAGTTTTAATC
>M00920:10:000000000-A292A:1:1109:11262:3539:1
TTTACCCACCCAACACCTAGTTGACATAGTTTAGGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTACCCACGCTTTCGTGCATGAGCGTCAGTATCGGCCCAGGGGGCTGCCTTCGCCATAGGTGTTCCTCCCCATCTCTACGCTTTTCACTGCTACACGTGGAATTCCACCCCCCTCTGCCGTACTCTAGTGAGGCAGTCACAAACGCAGTTCCCAGGTTACGCCCGGGGATTTCACGCCTGTCTTACCAATCCGCCTGCGCACGCTTTACGCCCAGTAATTCCGATTAACGCTCGCACCCTACGTATTACCGCGGCTGCTGGCACGTAGTTAGCCGGTGCTTCTTATGCCGGTACCG
>M00920:10:000000000-A292A:1:1113:21063:11515:1
ACACAGGGTATTAACCCATGCGATTTCTTCCCGGCCGAAAGAGCTTTACAACCCGAAGGCCTTCTTCACTCACGCGGCATGGCTGGATCAGGGTTGCCCCCATTGTCCAAAATTCCCCACTGCTGCCTCCCGGAGGAGTCTGGCCCGTGTCTCAGTTCCAGTGTGGCGGATCATCCTCTCAGACCCGCTCCAGATCGTCGCCTTGGTAAGCCGTTACCTCACCAACTAGCTAATCTGACATAGGCCGCTCAAAGAGCGCAAGGCCTTGCGGTCCCCTGCTTTCCTGCTCACAGAATATGCGGTATTAGCGCAACTTTCGCTACGTTATCCCCCACTCAATGGCACGTTCCGATGCATTACTCACC
>M00920:10:000000000-A292A:1:2109:18065:11577:1
CCTTTGTATTGTCCATTGTAGCACGTGTGTAGCCCAAATCATAAGGGGCATGATGATTTGACGTCATCCCCACCTTCCTCCGGTTTGTCACCGGCAGTCAACTTAGAGTGCCCAACTTAATGATGGCAACTAAGCTTAAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAGCACCAGTGTGACGGCTCCCTTTCGGGCACCCTCAACTCTCATCGAGGTTCCGTCCATGTCAAGGGTAGGTAAGGTTTTTCGCGTTGCATCGAATTAATCCACATCATCCACCGCTTGTGCGGGTCCCCGTCAATTCCTTTGAGTTTTAATC
>M00920:10:000000000-A292A:1:2113:10809:18271:1
GTACGGTCGCAAGATTAAAACTCAAAGGAATTGACGGGGACCCGCACAAGCGGTGGATGATGTGGATTAATTCGATGCAACGCGAAAAACCTCACCTACCCTTGACATGGACGGAACCTCGATGAGAGTTGAGGGTGCCCGAAAGGGAGCCGTCACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTAAGCTTAGTTGCCATCATTAAGTTGGGCACTCTAAGTTGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGATTTGGGCTACACACGTGCTACAA
>M00920:10:000000000-A292A:1:2101:18998:6292:1
GTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCACAGAACTTAGCAGAGATGCTTTGGTGCCTTCGGGAACTGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAAGGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTATGGGTAGGGCTTCACACGTCATACAATGGTCGGAACAGAGGGTTGCCAAGCCGCGAGGTGGAGCCAATCCCAGAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGAC
>M00920:10:000000000-A292A:1:2108:17778:22051:1
ATCCACAGAACTTAGCAGAGATGCTTTGGTGCCTTCGGGAACTGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTGTCCTTAGTTACCAGCACCTCGGGTGGGCACTCTAAGGAGACTGCCGGTGACAAACCGGGGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCAGGGCTACACACGTGCTACAATGGTCGGTACAAAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCATAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGTGAATC
>M00920:10:000000000-A292A:1:1104:5131:15907:1
GTACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCCTAAACGATGTCGACTAGTCGTTCGGAGCAGCAATGCACTGAGTGACGCAGCTAACGCGTGAAGTCGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTCAAAGGAATTGACGGGGACCCGCACAAGCGGTGGATGATGTGGATTAATTCGATGCAACGCGAAAAACCTTACCTACCCTTGACATGTCTGGAGCCTTGGTGAGAGCCGAGGGTGCCTTCGGGAGCCAGAACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGT
>M00920:10:000000000-A292A:1:1113:7839:16644:1
CGTTTAGGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGTGCATGAGCGTCAGTACAGGCCCAGGGGGCTGCCTTCGCCATCGGTGTTCCTCCTGATCTCTACGCATTTCACTGCTACACCAGGAATTCCACACACTTCTGCCGTACTCTAGCCTTGCAGTCACAAACGCAGTTCCCAGGTTAAGCCCGGGGATTTCACATCTGTCTTACAAAAACGCCTCCGCACGCTTTACGCCCAGTAATTCCGATTAACGCTCGCACCCTACGTTTTACCGCGGCTGCTGGCACGTTTTTAGCCGGTGCTTCTTAGTCCGGTACCGTCATCCATGGCCTATGTTAGAGAC
lh3 commented

With your command line, fermi should not use ropebwt. Can you find the string ropebwt in your makefile?

Yes, I can. The full makefile is shown below:

FERMI=fermi
UNITIG_K=200
OVERLAP_K=240

all:cdhitout_0.85.p2.mag.gz

# Construct the FM-index for raw sequences
cdhitout_0.85.raw.fmd:../cdhitout_0.85.fa
    (cat ../cdhitout_0.85.fa) | $(FERMI) ropebwt -a bcr -v3 -btNf cdhitout_0.85.raw.tmp - > $@ 2> $@.log

# Error correction
cdhitout_0.85.ec.fq.gz:cdhitout_0.85.raw.fmd
    (cat ../cdhitout_0.85.fa) | $(FERMI) correct -t 2  $< - 2> $@.log | gzip -1 > $@

# Construct the FM-index for corrected sequences
cdhitout_0.85.ec.fmd:cdhitout_0.85.ec.fq.gz
    $(FERMI) fltuniq $< 2> cdhitout_0.85.fltuniq.log | $(FERMI) ropebwt -a bcr -v3 -btf cdhitout_0.85.ec.tmp - > $@ 2> $@.log

# Generate unitigs
cdhitout_0.85.p0.mag.gz:cdhitout_0.85.ec.fmd
    $(FERMI) unitig -t 2 -l $(UNITIG_K) $< 2> $@.log | gzip -1 > $@

cdhitout_0.85.p1.mag.gz:cdhitout_0.85.p0.mag.gz
    $(FERMI) clean $< 2> $@.log | gzip -1 > $@
cdhitout_0.85.p2.mag.gz:cdhitout_0.85.p1.mag.gz
    $(FERMI) clean -CAOFo $(OVERLAP_K) $< 2> $@.log | gzip -1 > $@
lh3 commented

I see. I was using an old version of run-fermi.pl; more recent versions use ropebwt by default. Anyway, I can see the problem now: fltuniq has filtered out all the reads, while ropebwt is expecting some input and thus hangs for some reason. For the time being, you can edit the makefile and change the line containing fltuniq to:

cat $< | $(FERMI) ropebwt -a bcr -v3 -btf cdhitout_0.85.ec.tmp - > $@ 2> $@.log

This skips fltuniq. I will look into the ropebwt issue later. In any case, you probably won't get a good assembly from these reads.
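One way to confirm this diagnosis is to check whether the fltuniq stage writes anything at all to its output. The helper below is only a minimal sketch and is not part of fermi: the name stream_is_empty is made up for illustration, and it assumes a shell where head -c, wc -c and tr are available.

```shell
# Hypothetical helper (not part of fermi): succeeds only if stdin is empty.
stream_is_empty() {
    # Read at most one byte from stdin and count how many bytes arrived.
    # tr strips the padding that some wc implementations print.
    n=$(head -c 1 | wc -c | tr -d '[:space:]')
    [ "$n" -eq 0 ]
}
```

For example, taking the fltuniq command from the makefile with $(FERMI) expanded to fermi, the check could be run as: fermi fltuniq cdhitout_0.85.ec.fq.gz 2> /dev/null | stream_is_empty && echo "fltuniq produced no output". If the message is printed, ropebwt is being handed an empty stream, which matches the hang described above.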

lh3 commented

For small files, we'd actually be better off not using fltuniq anyway. I should consider adding an option to skip fltuniq altogether.

Thanks, specifying -B in run-fermi.pl prevents the hang as well.