Error while recovering original scaffolding
Closed this issue · 14 comments
Hello again,
after last times succesful and quick bug handling (again, many thanks!), I've encountered a new error message while trying to recover the original scaffolding after shredding. My workflow is :
agouti.py shred -assembly genome.fasta -gff genome.gff -p scaffold
bwa index -p contigmapping -a bwtsw scaffold.ctg.fasta
bwa mem -t 80 -T 30 -P -k 20 contigmapping R1_reads.fastq R2_reads.fastq > mapping.ctg.sam
samtools view -Sb mapping.ctg.sam > mapping.ctg.bam
python agouti.py scaffold -assembly scaffold.ctg.fasta -bam mapping.ctg.bam -gff scaffold.ctg.gff -outdir ./run1 -shredpath scaffold.shred.info.txt
The error i get is
2016-08-04 06:36:27,146 - INFO - AGOUTI_PATH PROGRESS - Analyzing scaffolding paths
2016-08-04 06:36:27,148 - INFO - AGOUTI_PATH PROGRESS - [BEGIN] Reading file with shred info
2016-08-04 06:36:27,931 - ERROR - AGOUTI_PATH PROGRESS - Number of shredded contigs parsed: 242231
2016-08-04 06:36:27,931 - INFO - AGOUTI_PATH PROGRESS - [DONE]
2016-08-04 06:36:27,931 - INFO - AGOUTI_PATH PROGRESS - [BEGIN] Recovring original scaffolding
Traceback (most recent call last):
File "agouti.py", line 283, in
main()
File "agouti.py", line 280, in main
args.func(args)
File "agouti.py", line 219, in run_scaffolder
prefix)
File "src/agouti_path.py", line 38, in agouti_path_main
agPathDebug)
File "src/agouti_path.py", line 114, in recover_untouched_sequences
preIndex = int(preGeneID.split('_')[1])
ValueError: invalid literal for int() with base 10: 'shred'
I'm under the impression it tries to recover unshredded genes from the gff but somehow still gets the shredded ones and therefore has trouble with the ID. My workaround so far is:
`if curCtg in dGFFs:
curCtgGeneModel5 = dGFFs[curCtg][0]
if not curCtgGeneModel5.fake:
curGeneID = curCtgGeneModel5.geneID
if curGeneID.split('_')[1] != "shred":
curIndex = int(curGeneID.split('_')[1])`
starting line 116 in agouti_path.py, but the rerun has not reached that phase yet, because it is still graph building. Which brings me to the question, whether there is a way to recover the outcome of
INFO - AGOUTI_SCAFFOLDING PROGRESS - Building graph from joining reads pairs
from a previous run. Hope this is not an error thrown by some stupid mistake in the command line from my side.
Cheers,
Nico
oh and if it helps here is a line in the gff of an untouched gene
Scaffold22931 I5K gene 213 938 . - . ID=BGER028140;Name=Dmel-Tmp-reg-2;Dbxref=i5kP_homolog_AGAMB:AGAP004954-PA;Dbxref=i5kP_homolog_AMEL:GB46617-PA;Dbxref=i5kP_homolog_BMORI:BGIBMGA002078-PA;Dbxref=i5kP_homolog_CELEG:K01G5.10;Dbxref=i5kP_homolog_DMELA:FBpp0072460;Dbxref=i5kP_homolog_HSAPI:ENSP00000238379;Dbxref=i5kP_homolog_MMUSC:ENSMUSP00000036337;Dbxref=i5kP_homolog_TCAST:TC003205;Dbxref=i5kP_homolog_DMELA:FBpp0070645;
and this is of a shredded gene:
Scaffold1_0_0 I5K gene 7742 21351 . - . ID=agouti_shred_gene_1_0;
Hello Nico,
I don't think the error came from your command line. This error should not happen with a fix I made last month. Hmm....
This line below seems strange:
Scaffold1_0_0 I5K gene 7742 21351 . - . ID=agouti_shred_gene_1_0;
Is the scaffold ID in your scaffold Fasta like "Scaffold1", and the AGOUTI shred gave your "Scaffold1_0_0"? I just checked my shred results, which was generated from the same Fasta and GFF file I believe. I get "Scaffold1_0" in the shredded Fasta and GFF files, rather than "Scaffold1_0_0".
And is it possible for you to send me the DEBUG file from this module (the file should be under agouti_path folder)? I saw you did not specify --debug in the command line. Could you please use it and re-run it?
Simo
Thanks for taking care of it. If it helps, my original scaffold name was indeed scaffold1_0 because i had to manually cut out something in between and make the initial scaffold1 into scaffold1_0 and scaffold1_1. I could easily change the underscore to anything else if it messes with the program
I might know where it goes wrong. Let me see if I can make a quick fix.
Meanwhile, I just updated my comment upstairs. Could you give a look and provide me the file, just in case the quick fix I am thinking fails. You can send the debug file to simozhan@indiana.edu
I'll run the debug version and let you know when it finishes. Will take some time though, the genome is quite large and lots of reads
Can I also ask you what version of AGOUTI you are using? I think you might use an older version.
File "src/agouti_path.py", line 114, in recover_untouched_sequences
preIndex = int(preGeneID.split('_')[1])
ValueError: invalid literal for int() with base 10: 'shred'
In the current version I have locally I don't have this line in my recover_untouched_sequences
anymore.
And your quick workaround:
if curCtg in dGFFs:
curCtgGeneModel5 = dGFFs[curCtg][0]
if not curCtgGeneModel5.fake:
curGeneID = curCtgGeneModel5.geneID
if curGeneID.split('_')[1] != "shred":
curIndex = int(curGeneID.split('_')[1])
My current version is:
if curCtg in dGFFs:
curCtgGeneModel5 = dGFFs[curCtg][0]
if not curCtgGeneModel5.fake:
tmpName = curCtgGeneModel5.geneID.split('_')
curGeneID = tmpName[:-1]
Could you git pull the lastest repo?
yep. restartign job right now with the newest version. Will let you know how it goes
Please run it with --debug just in case. If you type
python agouti.py --version
it should give you something like AGOUTI v0.3.3-dirty
.
its v 0.3.3 for me. is currently runnign in debug mode
So i had it run with the newest version and without the initial scaffold_1 underscore names. everything ran just fine
That's good to hear.
But I think it should also work for the initial scaffold_1
underscore names. AGOUTI appends the index at the end of the names, and it always read index from the last element split by underscore.
Hello Nico,
I am wondering does AGOUTI work for your scaffold with prefix scaffold_1
?
Simo
Hi Simo,
tried it on a test dataset and seemed to have worked. Maybe the old
agouti version was the only issue here
Cheers,
Nico
On 12.08.2016 19:58, Simo V Zhang wrote:
Hello Nico,
I am wondering does AGOUTI work for your scaffold with prefix
|scaffold_1|?Simo
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#7 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AH9VMABUBuhNAnOKtJ0xPnnVJ8fNxz7uks5qfLRjgaJpZM4Jcf-v.
Nicolas Arning
MSc-Student
Evolutionary Bioinformatics Group
Institute for Evolution and Biodiversity
Westfälische Wilhelms-Universität
Room 100.19, Hüfferstrasse 1, 48149 Münster
Tel: +49-(0)251 - 83-21633
Group page: http://www.bornberglab.org
Hi Nico,
That's good to hear. Thanks for letting me know.
I am closing this thread now. Let me know if other issues pop out.
Simo