Error while recovering original scaffolding

Question

Error while recovering original scaffolding

Closed this issue 8 years ago · 14 comments

Hello again,
after last times succesful and quick bug handling (again, many thanks!), I've encountered a new error message while trying to recover the original scaffolding after shredding. My workflow is :

agouti.py shred -assembly genome.fasta -gff genome.gff -p scaffold

bwa index -p contigmapping -a bwtsw scaffold.ctg.fasta

bwa mem -t 80 -T 30 -P -k 20  contigmapping R1_reads.fastq R2_reads.fastq > mapping.ctg.sam

samtools view -Sb mapping.ctg.sam > mapping.ctg.bam

python agouti.py scaffold -assembly scaffold.ctg.fasta -bam mapping.ctg.bam -gff scaffold.ctg.gff -outdir ./run1 -shredpath scaffold.shred.info.txt

The error i get is

2016-08-04 06:36:27,146 - INFO - AGOUTI_PATH PROGRESS - Analyzing scaffolding paths
2016-08-04 06:36:27,148 - INFO - AGOUTI_PATH PROGRESS - [BEGIN] Reading file with shred info
2016-08-04 06:36:27,931 - ERROR - AGOUTI_PATH PROGRESS - Number of shredded contigs parsed: 242231
2016-08-04 06:36:27,931 - INFO - AGOUTI_PATH PROGRESS - [DONE]
2016-08-04 06:36:27,931 - INFO - AGOUTI_PATH PROGRESS - [BEGIN] Recovring original scaffolding
Traceback (most recent call last):
File "agouti.py", line 283, in
main()
File "agouti.py", line 280, in main
args.func(args)
File "agouti.py", line 219, in run_scaffolder
prefix)
File "src/agouti_path.py", line 38, in agouti_path_main
agPathDebug)
File "src/agouti_path.py", line 114, in recover_untouched_sequences
preIndex = int(preGeneID.split('_')[1])
ValueError: invalid literal for int() with base 10: 'shred'

I'm under the impression it tries to recover unshredded genes from the gff but somehow still gets the shredded ones and therefore has trouble with the ID. My workaround so far is:

`if curCtg in dGFFs:
    curCtgGeneModel5 = dGFFs[curCtg][0]
    if not curCtgGeneModel5.fake:
        curGeneID = curCtgGeneModel5.geneID
        if curGeneID.split('_')[1] != "shred":
            curIndex = int(curGeneID.split('_')[1])`

starting line 116 in agouti_path.py, but the rerun has not reached that phase yet, because it is still graph building. Which brings me to the question, whether there is a way to recover the outcome of

INFO - AGOUTI_SCAFFOLDING PROGRESS - Building graph from joining reads pairs

from a previous run. Hope this is not an error thrown by some stupid mistake in the command line from my side.

Cheers,
Nico

Answer 1 · 2016-08-04T09:21:44.000Z

oh and if it helps here is a line in the gff of an untouched gene

Scaffold22931 I5K gene 213 938 . - . ID=BGER028140;Name=Dmel-Tmp-reg-2;Dbxref=i5kP_homolog_AGAMB:AGAP004954-PA;Dbxref=i5kP_homolog_AMEL:GB46617-PA;Dbxref=i5kP_homolog_BMORI:BGIBMGA002078-PA;Dbxref=i5kP_homolog_CELEG:K01G5.10;Dbxref=i5kP_homolog_DMELA:FBpp0072460;Dbxref=i5kP_homolog_HSAPI:ENSP00000238379;Dbxref=i5kP_homolog_MMUSC:ENSMUSP00000036337;Dbxref=i5kP_homolog_TCAST:TC003205;Dbxref=i5kP_homolog_DMELA:FBpp0070645;

and this is of a shredded gene:

Scaffold1_0_0 I5K gene 7742 21351 . - . ID=agouti_shred_gene_1_0;

Answer 2 · 2016-08-04T14:05:36.000Z

Hello Nico,

I don't think the error came from your command line. This error should not happen with a fix I made last month. Hmm....

This line below seems strange:

Scaffold1_0_0 I5K gene 7742 21351 . - . ID=agouti_shred_gene_1_0;

Is the scaffold ID in your scaffold Fasta like "Scaffold1", and the AGOUTI shred gave your "Scaffold1_0_0"? I just checked my shred results, which was generated from the same Fasta and GFF file I believe. I get "Scaffold1_0" in the shredded Fasta and GFF files, rather than "Scaffold1_0_0".

And is it possible for you to send me the DEBUG file from this module (the file should be under agouti_path folder)? I saw you did not specify --debug in the command line. Could you please use it and re-run it?

Simo

Answer 3 · 2016-08-04T14:18:20.000Z

Thanks for taking care of it. If it helps, my original scaffold name was indeed scaffold1_0 because i had to manually cut out something in between and make the initial scaffold1 into scaffold1_0 and scaffold1_1. I could easily change the underscore to anything else if it messes with the program

Answer 4 · 2016-08-04T14:21:08.000Z

I might know where it goes wrong. Let me see if I can make a quick fix.

Meanwhile, I just updated my comment upstairs. Could you give a look and provide me the file, just in case the quick fix I am thinking fails. You can send the debug file to simozhan@indiana.edu

Answer 5 · 2016-08-04T14:23:39.000Z

I'll run the debug version and let you know when it finishes. Will take some time though, the genome is quite large and lots of reads

Answer 6 · 2016-08-04T14:35:46.000Z

Can I also ask you what version of AGOUTI you are using? I think you might use an older version.

File "src/agouti_path.py", line 114, in recover_untouched_sequences
preIndex = int(preGeneID.split('_')[1])
ValueError: invalid literal for int() with base 10: 'shred'

In the current version I have locally I don't have this line in my recover_untouched_sequences anymore.

And your quick workaround:

if curCtg in dGFFs:
    curCtgGeneModel5 = dGFFs[curCtg][0]
    if not curCtgGeneModel5.fake:
        curGeneID = curCtgGeneModel5.geneID
        if curGeneID.split('_')[1] != "shred":
            curIndex = int(curGeneID.split('_')[1])

My current version is:

if curCtg in dGFFs:
    curCtgGeneModel5 = dGFFs[curCtg][0]
    if not curCtgGeneModel5.fake:
        tmpName = curCtgGeneModel5.geneID.split('_')
        curGeneID = tmpName[:-1]

Could you git pull the lastest repo?

Answer 7 · 2016-08-04T14:42:58.000Z

yep. restartign job right now with the newest version. Will let you know how it goes

Answer 8 · 2016-08-04T14:47:56.000Z

Please run it with --debug just in case. If you type

python agouti.py --version

it should give you something like AGOUTI v0.3.3-dirty.

Answer 9 · 2016-08-04T15:02:56.000Z

its v 0.3.3 for me. is currently runnign in debug mode

Answer 10 · 2016-08-08T11:00:25.000Z

So i had it run with the newest version and without the initial scaffold_1 underscore names. everything ran just fine

Answer 11 · 2016-08-08T15:04:00.000Z

That's good to hear.

But I think it should also work for the initial scaffold_1 underscore names. AGOUTI appends the index at the end of the names, and it always read index from the last element split by underscore.

Answer 12 · 2016-08-12T17:58:58.000Z

Hello Nico,

I am wondering does AGOUTI work for your scaffold with prefix scaffold_1?

Simo

Answer 13 · 2016-08-15T14:14:10.000Z

Hi Simo,
tried it on a test dataset and seemed to have worked. Maybe the old
agouti version was the only issue here

Cheers,
Nico

On 12.08.2016 19:58, Simo V Zhang wrote:

Hello Nico,

I am wondering does AGOUTI work for your scaffold with prefix
|scaffold_1|?

Simo

—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#7 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AH9VMABUBuhNAnOKtJ0xPnnVJ8fNxz7uks5qfLRjgaJpZM4Jcf-v.

Nicolas Arning
MSc-Student
Evolutionary Bioinformatics Group
Institute for Evolution and Biodiversity
Westfälische Wilhelms-Universität
Room 100.19, Hüfferstrasse 1, 48149 Münster
Tel: +49-(0)251 - 83-21633

Group page: http://www.bornberglab.org

Answer 14 · 2016-08-15T14:31:35.000Z

Hi Nico,

That's good to hear. Thanks for letting me know.

I am closing this thread now. Let me know if other issues pop out.

Simo