bcgsc/arks

no changes after scaffolding with arks-tigmint-links pipeline

Closed this issue · 9 comments

Hi,

I used Canu to produce a draft genome and then ran the arks-tigmint-links pipeline with the following commands.

TOOLDIR=/home/niuyw/software

PPN=20

# links
export PATH=/home/niuyw/software/links_v1.8.6:$PATH

# run2, canu
ln -s /home/zhangll/Tasks/Gouqi/10Xgenomic/longranger/CHROMIUM_interleaved.fq.gz
myReads=CHROMIUM_interleaved

ln -s /parastor300/niuyw/Project/Goqi_genome_180207/canu/run1/goqi.contigs.fasta canu.fa
myDraft=canu

$TOOLDIR/arks.1.0.3/Examples/arks-make arks-tigmint draft=${myDraft} reads=${myReads} threads=$PPN

I did not see any error messages in the logs, but the final genome was the same as the contigs input.

$ python assembly_stats.py canu.fa
Size_includeN	2830901440
Size_withoutN	2830901440
Seq_Num	16206
Mean_Size	174682
Median_Size	63595
Longest_Seq	21037189
Shortest_Seq	1001
GC_Content (%)	37.97
N50	599780
L50	871
N90	57129
Gap (%)	0.0

$ python assembly_stats.py canu.tigmint.renamed.fa 
Size_includeN	2830901440
Size_withoutN	2830901440
Seq_Num	16206
Mean_Size	174682
Median_Size	63595
Longest_Seq	21037189
Shortest_Seq	1001
GC_Content (%)	37.97
N50	599780
L50	871
N90	57129
Gap (%)	0.0

$ python assembly_stats.py canu.tigmint_c5_m50-10000_k30_r0.05_e30000_z500_l5_a0.3.scaffolds.fa
Size_includeN	2830901440
Size_withoutN	2830901440
Seq_Num	16206
Mean_Size	174682
Median_Size	63595
Longest_Seq	21037189
Shortest_Seq	1001
GC_Content (%)	37.97
N50	599780
L50	871
N90	57129
Gap (%)	0.0

Here is the log: arks_run2.txt

Do you have any ideas about this?

Bests,
Yiwei Niu

Hi @YiweiNiu,

=>Preprocessing: Gathering barcode multiplicity information...Thu May 16 10:13:51 2019
Saw 0 barcodes and keeping 0 read pairs out of 0

Is the barcode in the form BX:Z:<barcode> in the fastq read headers? Was tigmint able to successfully make some cuts in your assembly? This log makes me wonder if there is something off with where the chromium barcodes are in your reads file.
Is the output graph file (*original.gv) empty or does it have edges?

Lauren
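
For readers who hit the same "Saw 0 barcodes" log line, the header check described above can be sketched in Python. This is only an illustration and not part of ARKS: the function name `barcode_style` is hypothetical, and the code assumes that a reformatted barcode is a run of A/C/G/T/N after the last underscore of the read name.

```python
import re

# Assumption: a BX:Z: tag sits in the header comment and runs until whitespace.
BX_TAG = re.compile(r"\bBX:Z:(\S+)")
# Assumption: reformatted read names end in _<barcode>, barcode made of ACGTN.
UNDERSCORE_SUFFIX = re.compile(r"_([ACGTN]+)$")


def barcode_style(header):
    """Classify one FASTQ header line; return (style, barcode or None)."""
    header = header.rstrip("\n")
    match = BX_TAG.search(header)
    if match:
        return "bx_tag", match.group(1)
    fields = header.split()
    if fields:
        # Only the read name (first field) can carry an underscore suffix.
        match = UNDERSCORE_SUFFIX.search(fields[0])
        if match:
            return "underscore_suffix", match.group(1)
    return "none", None
```

A result of `"underscore_suffix"` or `"none"` for every header in a reads file would be consistent with ARKS reporting zero barcodes.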

Hi Lauren,

Thank you for your quick reply.

That seems to be the problem. I do not see BX:Z:<barcode> in my fastq.

$ zcat CHROMIUM_interleaved.fq.gz |head -3
@ST-E00575:142:H7J7FCCXY:4:1101:18761:66584_AAACACCAGACAATAC
CGACTTTGTCCTATATCACAAGTGTGCTTTGAGTTGAATTTTGTGATTTCCCACAGATCATCAGGCTTAACAATTCCCCGAAGAAACCATCGACAGCCTTGAGTCTGTCGCTTACAAACCAGCCTCC
+
niuyw@admin:~/Project/Goqi_genome_180207/arks/run2$ zcat CHROMIUM_interleaved.fq.gz |head -4
@ST-E00575:142:H7J7FCCXY:4:1101:18761:66584_AAACACCAGACAATAC
CGACTTTGTCCTATATCACAAGTGTGCTTTGAGTTGAATTTTGTGATTTCCCACAGATCATCAGGCTTAACAATTCCCCGAAGAAACCATCGACAGCCTTGAGTCTGTCGCTTACAAACCAGCCTCC
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ

And the *_original.gv is empty.

$ cat canu.tigmint_c5_m50-10000_k30_r0.05_e30000_z500_original.gv
graph G {
}

I got this CHROMIUM_interleaved.fq.gz by running the following commands.

/home/zhangll/software/longranger-2.2.0/longranger basic --id=fastq_Convert0319 --fastqs=/home/zhangll/Tasks/Gouqi/data/10X/Cleandata/NDHX00134-AK1521,/home/zhangll/Tasks/Gouqi/data/10X/Cleandata/NDHX00134-AK2007,/home/zhangll/Tasks/Gouqi/data/10X/Cleandata/NDHX00134-AK2008,/home/zhangll/Tasks/Gouqi/data/10X/Cleandata/NDHX00134-AK2009

gzip -c -k CHROMIUM_interleaved.fastq > CHROMIUM_interleaved.fq.gz

Do you know why the final CHROMIUM_interleaved.fq.gz does not have any barcode information in the headers?

Bests,
Yiwei Niu

Hi @YiweiNiu,

Have you (or anyone else in your group) used these reads previously for ARCS? It looks to me that the output file from longranger basic has been reformatted so that the barcode has been appended to the read header in the form @read-header_<barcode>. For earlier versions of ARCS, the barcode was required in this format, but now it can see a barcode in the BX:Z: tag of the BAM file. ARKS only looks for a barcode in the BX:Z: comment of the fastq read header. That longranger command looks fine otherwise.
Since ARKS couldn't find the barcodes, that is why you got no scaffolding and the graph file (*_original.gv) is empty.
To solve this, you could run longranger basic again, or manually reformat your reads to put the barcode back in the BX:Z: tag.

Hope that helps!
Lauren
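
The manual reformatting option mentioned above might look roughly like the following Python sketch. It is not an ARKS or longranger utility: `move_barcode_to_bx` and `reformat_fastq` are hypothetical names, the barcode is assumed to be the A/C/G/T/N run after the last underscore of the read name, and each record is assumed to span exactly four lines.

```python
import re

# Assumption: headers look like '@<read-name>_<barcode>' with no comment field,
# as in the CHROMIUM_interleaved header shown earlier in this thread.
SUFFIXED = re.compile(r"^(@\S+)_([ACGTN]+)$")


def move_barcode_to_bx(header):
    """Rewrite '@name_BARCODE' as '@name BX:Z:BARCODE'; leave other headers alone."""
    line = header.rstrip("\n")
    match = SUFFIXED.match(line)
    if match is None:
        return line
    name, barcode = match.groups()
    return f"{name} BX:Z:{barcode}"


def reformat_fastq(lines):
    """Yield FASTQ lines with every header (lines 0, 4, 8, ...) rewritten."""
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        # Headers are found by position, because quality lines may also start with '@'.
        yield move_barcode_to_bx(line) if index % 4 == 0 else line
```

Selecting headers by line position rather than by a leading `@` avoids touching quality strings, which can legitimately begin with that character.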

Yes. Someone else in the lab used this CHROMIUM_interleaved.fastq to run ARCS. I did not know that the file had been reformatted. I will run longranger basic again.

Thank you! Your input is really helpful.

Excellent - Glad I could help!

Hi Lauren,

Sorry to bother you again.

I ran longranger basic again. But I still did not see BX:Z:<barcode> in fastq headers.

The command I used was the same as before.

/home/zhangll/software/longranger-2.2.0/longranger basic --id=fastq_Convert0628 --fastqs=/home/zhangll/Tasks/Gouqi/data/10X/Cleandata/NDHX00134-AK1521,/home/zhangll/Tasks/Gouqi/data/10X/Cleandata/NDHX00134-AK2007,/home/zhangll/Tasks/Gouqi/data/10X/Cleandata/NDHX00134-AK2008,/home/zhangll/Tasks/Gouqi/data/10X/Cleandata/NDHX00134-AK2009

The data is from one sample, and the fastqs I got are organized like this. The company gave us one Rawdata and one Cleandata.

├── Cleandata
│   ├── NDHX00134-AK1521
│   │   ├── NDHX00134-AK_S1_L004_R1_001.fastq.gz
│   │   ├── NDHX00134-AK_S1_L004_R2_001.fastq.gz
│   │   ├── NDHX00134-AK_S1_L005_R1_001.fastq.gz
│   │   └── NDHX00134-AK_S1_L005_R2_001.fastq.gz
│   ├── NDHX00134-AK2007
│   │   ├── NDHX00134-AK_S1_L004_R1_001.fastq.gz
│   │   ├── NDHX00134-AK_S1_L004_R2_001.fastq.gz
│   │   ├── NDHX00134-AK_S1_L005_R1_001.fastq.gz
│   │   └── NDHX00134-AK_S1_L005_R2_001.fastq.gz
│   ├── NDHX00134-AK2008
│   │   ├── NDHX00134-AK_S1_L004_R1_001.fastq.gz
│   │   ├── NDHX00134-AK_S1_L004_R2_001.fastq.gz
│   │   ├── NDHX00134-AK_S1_L005_R1_001.fastq.gz
│   │   └── NDHX00134-AK_S1_L005_R2_001.fastq.gz
│   └── NDHX00134-AK2009
│       ├── NDHX00134-AK_S1_L004_R1_001.fastq.gz
│       ├── NDHX00134-AK_S1_L004_R2_001.fastq.gz
│       ├── NDHX00134-AK_S1_L005_R1_001.fastq.gz
│       └── NDHX00134-AK_S1_L005_R2_001.fastq.gz
└── Rawdata
    ├── NDHX00134-AK1521
    │   ├── NDHX00134-AK1521_L4_1.fq.gz
    │   ├── NDHX00134-AK1521_L4_2.fq.gz
    │   ├── NDHX00134-AK1521_L5_1.fq.gz
    │   └── NDHX00134-AK1521_L5_2.fq.gz
    ├── NDHX00134-AK2007
    │   ├── NDHX00134-AK2007_L4_1.fq.gz
    │   ├── NDHX00134-AK2007_L4_2.fq.gz
    │   ├── NDHX00134-AK2007_L5_1.fq.gz
    │   └── NDHX00134-AK2007_L5_2.fq.gz
    ├── NDHX00134-AK2008
    │   ├── NDHX00134-AK2008_L4_1.fq.gz
    │   ├── NDHX00134-AK2008_L4_2.fq.gz
    │   ├── NDHX00134-AK2008_L5_1.fq.gz
    │   └── NDHX00134-AK2008_L5_2.fq.gz
    └── NDHX00134-AK2009
        ├── NDHX00134-AK2009_L4_1.fq.gz
        ├── NDHX00134-AK2009_L4_2.fq.gz
        ├── NDHX00134-AK2009_L5_1.fq.gz
        └── NDHX00134-AK2009_L5_2.fq.gz

The headers of the "Rawdata" and "Cleandata" fastqs look like this:

$ zcat Rawdata/NDHX00134-AK1521/NDHX00134-AK1521_L4_1.fq.gz | head -4
@ST-E00575:142:H7J7FCCXY:4:1101:7202:1872 1:N:0:CGCGATAC
CNATCTAGTGTTTAGCTTAGGGTTCCCCCATCTTCCCTTTATTTACATCCTGTTTTTAATAATATATCCTCCATGCACTAAGGAGTAGGGATGGAAACTTGATATAAGAAAATAAAAAATAAAAAAAAATTCCTAAACCACGTTTATATG
+
<#AFAJJJJJJJJJJJ7A-7AJAJ<<-A7<--<<7<7FJJJA-JFFJJJJJFFJ-<FFJJJJJJJJF--7<-7<<A--7FAJ7AA-AJ<J-7AJJ-A-A<F-7AFJJJ-77AA<FFJF-7-7F7--<A-7AAFFJFA--A-<)7-7-A-F

$ zcat Cleandata/NDHX00134-AK1521/NDHX00134-AK_S1_L004_R1_001.fastq.gz | head -4
@ST-E00575:142:H7J7FCCXY:4:1101:7202:1872 1:N:0:CGCGATAC
CNATCTAGTGTTTAGCTTAGGGTTCCCCCATCTTCCCTTTATTTACATCCTGTTTTTAATAATATATCCTCCATGCACTAAGGAGTAGGGATGGAAACTTGATATAAGAAAATAAAAAATAAAAAAAAATTCCTAAACCACGTTTATATG
+
<#AFAJJJJJJJJJJJ7A-7AJAJ<<-A7<--<<7<7FJJJA-JFFJJJJJFFJ-<FFJJJJJJJJF--7<-7<<A--7FAJ7AA-AJ<J-7AJJ-A-A<F-7AFJJJ-77AA<FFJF-7-7F7--<A-7AAFFJFA--A-<)7-7-A-F

After browsing this page, I guess there is something wrong with the names of the fastqs, but I do not know how to set them properly.

Do you have any ideas about this? Or do I have to ask the company how they preprocessed the fastqs?

Bests,
Yiwei Niu

Hi @YiweiNiu,

What does your output from longranger basic look like? Did you check more than just the head for the BX:Z: tag? I just ask because longranger basic sorts the output file by barcode by default, so you will have to look further in the file to find the barcodes.
Is each of those folders a different library, or are they all from the same Chromium library prep?

Lauren
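
Because the output is sorted by barcode, the first few records may carry no tag at all, so a check needs to scan the whole file rather than its head. One way to do that is sketched below in Python. The function names are hypothetical, records are assumed to span exactly four lines, and the gzip reader is streamed so a very large file is never held in memory.

```python
import gzip


def count_bx_records(lines):
    """Return (total_records, records_with_bx) for 4-line FASTQ records."""
    total = tagged = 0
    for index, line in enumerate(lines):
        if index % 4 == 0:  # header line of each record
            total += 1
            if "BX:Z:" in line:
                tagged += 1
    return total, tagged


def count_bx_in_file(path):
    """Stream a gzip-compressed FASTQ and count tagged records."""
    with gzip.open(path, "rt") as handle:
        return count_bx_records(handle)
```

Calling `count_bx_in_file` on the `barcoded.fastq.gz` file in the longranger `outs` directory would return both counts in a single pass.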

Dear Lauren,

I checked further into the output file barcoded.fastq.gz and found BX:Z tags. So, I guess this file will be fine for ARKS.

$ zcat fastq_Convert0319/outs/barcoded.fastq.gz | wc -l
5997361952

$ zcat fastq_Convert0319/outs/barcoded.fastq.gz | grep -c 'BX:Z'
1437330904

Previously I thought each record in the fastq would have a BX:Z tag. Sorry for my silly question.

Thank you very much for your kindness and patience.

Best wishes,
Yiwei Niu
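
The two counts above can be turned into a tagged fraction with a little arithmetic, shown here as a short Python sketch. It assumes four lines per FASTQ record and that the `grep` pattern only ever matched header lines; neither assumption is verified in the thread.

```python
total_lines = 5_997_361_952   # from `zcat ... | wc -l`
tagged_lines = 1_437_330_904  # from `zcat ... | grep -c 'BX:Z'`

# Assumption: every FASTQ record occupies exactly four lines.
records = total_lines // 4
# Assumption: 'BX:Z' matched header lines only, so tagged_lines counts records.
untagged = records - tagged_lines
tagged_pct = 100 * tagged_lines / records

print(f"{records} records, {untagged} without BX:Z, {tagged_pct:.2f}% tagged")
```

Under those assumptions, roughly 96% of the records carry a BX:Z tag and about 62 million do not.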

@YiweiNiu - Yes that should be fine for ARKS!

No worries at all -- that is a common question that comes up in our group quite a lot as well.

Thank you again for your interest in ARKS!
Lauren