mourisl/Rascaf

Rascaf Error - Unknown Genome Name


I am trying to improve the scaffolding on a genome assembled using Supernova and Chromium 10x. I've aligned my reads to the genome using hisat2 and used samtools to sort/convert to bam.

I ran rascaf with
rascaf -b my.aligned.bam -f my.scaffolds -k30 -o rascaff1

I got the error message:
Unknown genome name: scaffold4,190040,f7622z26696k7a0m100_f134738Z163344

grep "scaffold1,184907,f53044Z184907" my.scaffolds

scaffold1,184907,f53044Z184907

grep "scaffold4,190040,f7622z26696k7a0m100_f134738Z163344" my.aligned.bam

Nothing.

It looks like rascaf is failing because a scaffold exists in the genome but not in my BAM file, which makes sense, right? How do I get past this hurdle? I'd really like to use my RNA-seq data to improve my genome.

Thanks in advance for your help!

-genome size ~700k

Rascaf searches the chromosome names in the BAM file header. Can you use "samtools view -H align.bam" to check whether the scaffold exists?

Did HISAT align the reads to the my.scaffolds fasta?
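The header check suggested above can be sketched as a small shell function. This is a minimal sketch, not part of Rascaf: it assumes the BAM header was first saved to a text file with `samtools view -H my.aligned.bam > header.txt`, and the function name `missing_from_bam_header` and the file name `header.txt` are hypothetical. It prints every FASTA sequence name that has no matching `@SQ SN:` entry in the header.

```shell
# Hypothetical helper: list FASTA sequence names absent from a saved BAM header.
# $1: text file holding the output of `samtools view -H` (tab-separated fields)
# $2: FASTA file that was used to build the aligner index
missing_from_bam_header() {
    awk -F'\t' '
        # First file: collect the SN: value of every @SQ header line.
        FILENAME == ARGV[1] {
            if ($1 == "@SQ")
                for (i = 2; i <= NF; i++)
                    if ($i ~ /^SN:/) seen[substr($i, 4)] = 1
            next
        }
        # Second file: take the name up to the first whitespace on each ">" line
        # and print it if the header never mentioned it.
        /^>/ {
            name = substr($0, 2)
            sub(/[ \t].*/, "", name)
            if (!(name in seen)) print name
        }
    ' "$1" "$2"
}
```

For example, `missing_from_bam_header header.txt my.scaffolds` prints nothing when every scaffold name in the FASTA file appears in the header. Running grep directly on the BAM file is less reliable because BAM is a compressed binary format.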

I used hisat2 to align my raw reads to my.scaffolds to produce my.aligned.bam.

scaffold4,190040,f7622z26696k7a0m100_f134738Z163344 is not present in my.aligned.bam. My best guess is I had no transcripts from that region?

Could you please show me the first few lines of "samtools view -H"?

head my.bam.samtoolsview.txt
@HD VN:1.0 SO:coordinate
@SQ SN:scaffold1,184907,f53044Z184907 LN:184907
@SQ SN:scaffold2,183271,f116433Z183271 LN:183271
@SQ SN:scaffold3,168908,f16799Z168908 LN:168908
@SQ SN:scaffold4,190040,f7622z26696k4a0m100_f134738Z163344 LN:190140
@SQ SN:scaffold5,176065,r80173z25589k5a0m100_f144949Z150476 LN:176165
@SQ SN:scaffold6,187668,f107299Z144999k4a0m100_f16915z42669 LN:187768
@SQ SN:scaffold7,141766,f45237Z141766 LN:141766
@SQ SN:scaffold8,189576,f6z51182k16a0m100_f108676Z138394 LN:189676
@SQ SN:scaffold9,125692,f7651Z125692 LN:125692

It's strange that the scaffold name is complete in the BAM file and truncated in my.scaffolds. Has the my.scaffolds file been processed by other tools after the alignment? Thank you.

I used samtools view to convert to a bam file and then used samtools sort to sort my bam file.

I tried renaming the scaffolds by numbers (1-x) and had the same issue with the scaffold being in the assembly file, but not the bam.

What was your HISAT command?

hisat2 -x snail -p 10 --very-sensitive -1 M9-S3_R1_001.fastq.gz -2 M9-S3_R2_001.fastq.gz \
-S ./align/my.aligned.sam --summary-file ./align/summary/M9-S3.txt --new-summary

samtools view -b -@ 8 my.aligned.sam | samtools sort -@ 8 > my.aligned.sort.bam

I just noticed you grep'ed scaffold1 in the my.scaffolds file. Could you please grep "scaffold4,190040,f7622z26696k7a0m100_f134738Z163344" in the my.scaffolds file? Thanks.

grep "scaffold4,190040,f7622z26696k7a0m100_f134738Z163344" my.scaffolds

scaffold4,190040,f7622z26696k7a0m100_f134738Z163344

It's there.

It looks like a bug, and I need some time to look into it. I suspect it is related to the length of the scaffold names, because scaffold1, scaffold2, and scaffold3 are not affected. If you don't want to wait for me, you can rename the scaffolds to scaffold1, scaffold2, and so on, dropping everything after the first comma. Then rebuild the HISAT index and rerun HISAT and rascaf. Thank you.
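The renaming workaround could look roughly like the sketch below. It assumes my.scaffolds is a plain FASTA file whose header lines start with ">"; the function names and the output file name `my.renamed.fa` are hypothetical, and the index name `snail` is taken from the HISAT command earlier in the thread. Sequence lines are left untouched.

```shell
# Hypothetical helper: keep only the part of each FASTA header before the
# first comma (or whitespace), e.g. ">scaffold4,190040,f76..." -> ">scaffold4".
# $1: input FASTA, $2: output FASTA
simplify_fasta_names() {
    sed 's/^\(>[^,[:space:]]*\).*/\1/' "$1" > "$2"
}

# Hypothetical helper: print any header that occurs more than once, since
# shortened names must stay unique for the aligner and for rascaf.
duplicate_fasta_names() {
    grep '^>' "$1" | sort | uniq -d
}

# Intended follow-up, left commented out because it needs the real data:
#   simplify_fasta_names my.scaffolds my.renamed.fa
#   duplicate_fasta_names my.renamed.fa      # should print nothing
#   hisat2-build my.renamed.fa snail         # rebuild the index
#   ...then rerun hisat2, samtools sort, and rascaf with -f my.renamed.fa
```

The duplicate check matters because two scaffolds that differ only after the comma would collapse to the same name once shortened.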

I really appreciate your help. I'll give that a try and see how it works.