bcgsc/arks

Uncertain about how to get BX tag from longranger

Closed this issue · 3 comments

Hi,

Thanks in advance for any help. I'm running arks 1.0.2, but my question is about the documentation on generating the interleaved input with longranger. I ran longranger (v2.2.2) with this command:

longranger basic --id=ID --fastqs=/path/to/10x/fastqs

Where my fastq files are:

sample_S1_L003_I1_001.fastq.gz
sample_S1_L003_R1_001.fastq.gz
sample_S1_L003_R2_001.fastq.gz

The output (barcoded.fastq.gz) is missing the BX tag, e.g.:

zcat barcoded.fastq.gz | head

@A00127:62:H5YY7DSXX:3:1369:9200:5462
AAGAAAGAAAGAAAGGGGATTGGTTACCAGGAAGAATAGAGGAAAGAGGTGGAAAGAATGGGGAAAAGGCAGGAGGGAGGAAAGGAGGGGTGCGACACTTCTCAGAATACACATTTCTGCATAACTC
+
FF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

The sequencing center ran this command to generate the fastq files

module load bcl2fastq2
module load longranger
longranger mkfastq --run=$RP \
        --csv=samplesheet.csv \
        --use-bases-mask=Y150n,I8nn,n*,Y150n \
        --ignore-dual-index > longranger_fq.out 2>&1

Perhaps the BX tag is missing because I used the processed reads, but as far as I can tell longranger basic is meant to be run on the demultiplexed reads from mkfastq.

I tried running calcBarcodeMultiplicities.pl on barcoded.fastq.gz, which produces an output that looks like this:

head read_multiplicities.csv 

CGCTTCAGTACAGCAG-1,99144
CGGACTGCACCTCGTT-1,91144
CAGAATCCACGCATCG-1,81924

but when arks is run like this:

arks -p full -f draft_assembly.fasta read_multiplicities.csv barcoded.fastq.gz

it dies with this error:

Reading user inputs...
Finished reading user inputs...entering runArks()...
Entered runArks()...
Running: arks 1.0.2
 pid 14326
 -p full
 -f draft_assembly.fasta
 -a 
 -q 
 -w 
 -i 
 -o 0
 -c 5
 -k 30
 -g 1
 -j 0.55
 -l 0
 -z 500
 -b draft_assembly.fasta.scaffk-method_c5_k30_g1_j0.55_l0_d0_e30000_r0.05
 Min index multiplicity: 50
 Max index multiplicity: 10000
 -d 0
 -e 30000
 -r 0.05
 -t 1
 -v 0

---We are using KMER method.---


=>Preprocessing: Gathering barcode multiplicity information...Tue Oct 16 09:55:25 2018
Could not open . --fatal.

I'm guessing that this error is simply caused by the barcoded.fastq.gz file not being generated correctly. Do you have any suggestions? Thanks! Zack

Hi Zack,

Have you looked further into your barcoded.fastq.gz? Because the output from longranger basic is sorted by barcode, I often find that the first few reads in that file don't have an associated barcode (likely because sequencing errors leave the program unable to assign a barcode from the whitelist).

For example, if I look at the first lines of the reads file used for the NA12878 human tests in the ARKS paper:

[lcoombe@lcoombe01 outs]$ gunzip -c barcoded.fastq.gz |head -n 8
@E00247:267:HMVT3CCXX:1:2202:19329:3823
AAAGGAGGGAGGAAGGAAGGAGGGAAAGAAAGAGAAAGAAAGAAAAAGAAAGAGAGAGAGAGAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAAAGAAAAAGAGAGAAAG
+
KKKKKKKKKFKFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKFKKKFKKKKKKKKKKKKKKKKKKFKKKKKKK7FKKKKAFKKKFKKKKKKKKAKAA7F,<F<<<7<A
@E00247:267:HMVT3CCXX:1:2202:19329:3823
CTGTGATCAATTAAGCAGCTGACCAGTCGTTACCCGCTCCTCCCTGCTCTTGCTACCCAATAAATACGAAGGGCTGTAGAAACTCAGGGTGGCTGCTGCCTTTGCTCACTAGAAGCAGGGAGCCCTTTTCTTCTTCCCCTGGCCCCTTCCT
+
AAFFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKAKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKKKFFAKFKKKKKKKKKKAKKKFFFF7FFKKFK

But, there are associated barcodes when I look further into the file:

[lcoombe@lcoombe01 outs]$ gunzip -c barcoded.fastq.gz |grep -n "BX" |head -n 4
8641:@E00247:267:HMVT3CCXX:2:1207:19522:61714 BX:Z:AAACACCAGACAATAC-1
8645:@E00247:267:HMVT3CCXX:2:1207:19522:61714 BX:Z:AAACACCAGACAATAC-1
8649:@E00247:267:HMVT3CCXX:5:2110:13210:41638 BX:Z:AAACACCAGACAATAC-1
8653:@E00247:267:HMVT3CCXX:5:2110:13210:41638 BX:Z:AAACACCAGACAATAC-1
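
To get a fuller picture than head or grep gives, the whole file can be scanned and the records with and without a BX tag counted. The following is only a minimal sketch: the function name summarize_bx_tags is made up for illustration, and it assumes standard four-line FASTQ records with the tag written as BX:Z:<barcode> in the header, as in the output above. The per-barcode counts are per record and are not necessarily how calcBarcodeMultiplicities.pl counts.

```python
import gzip
from collections import Counter


def summarize_bx_tags(fastq_path):
    """Count FASTQ records with and without a BX:Z: tag in the header line.

    Hypothetical helper for illustration; assumes four-line FASTQ records.
    Returns (tagged, untagged, per_barcode), where per_barcode maps each
    barcode to the number of records carrying it.
    """
    tagged = 0
    untagged = 0
    per_barcode = Counter()
    # Transparently handle gzipped and plain-text FASTQ files.
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 != 0:
                continue  # only the header (every 4th line) can carry the tag
            barcode = None
            # Skip the read name itself and look through the comment fields.
            for field in line.split()[1:]:
                if field.startswith("BX:Z:"):
                    barcode = field[len("BX:Z:"):]
                    break
            if barcode is None:
                untagged += 1
            else:
                tagged += 1
                per_barcode[barcode] += 1
    return tagged, untagged, per_barcode


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        tagged, untagged, per_barcode = summarize_bx_tags(sys.argv[1])
        print(f"records with BX tag:    {tagged}")
        print(f"records without BX tag: {untagged}")
        print(f"distinct barcodes:      {len(per_barcode)}")
```

Saved as, say, count_bx.py (a hypothetical file name), it could be run as python count_bx.py barcoded.fastq.gz. A non-zero count of tagged records would indicate that longranger basic did attach barcodes, even if the first reads in the file have none.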

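You are getting that error because the read_multiplicities.csv file you generated has to be passed with the -a flag. In your log the "-a" line is empty, which would explain why arks reports "Could not open ." with no file name: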

arks -p full -f draft_assembly.fasta -a read_multiplicities.csv barcoded.fastq.gz
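
Before passing the file with -a, it can also be useful to check how many barcodes in the multiplicity file fall inside the "Min index multiplicity" and "Max index multiplicity" values printed in the log (50 and 10000). This is only a rough sketch: barcodes_in_range is a hypothetical helper, it assumes the two-column barcode,count layout shown earlier in the thread, and treating the bounds as inclusive is an assumption, not something confirmed here about how arks applies them.

```python
import csv


def barcodes_in_range(csv_path, min_mult=50, max_mult=10000):
    """Return {barcode: count} for rows whose count lies in [min_mult, max_mult].

    Hypothetical helper for illustration. The defaults mirror the values
    printed in the arks log; inclusive bounds are an assumption.
    """
    kept = {}
    with open(csv_path, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) != 2:
                continue  # skip blank or malformed lines
            barcode, count = row[0], int(row[1])
            if min_mult <= count <= max_mult:
                kept[barcode] = count
    return kept


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        kept = barcodes_in_range(sys.argv[1])
        print(f"barcodes within range: {len(kept)}")
```

If very few barcodes fall inside the range, that would be worth looking into before scaffolding, since it may point to a problem upstream of arks.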

Hope that helps!
Lauren

@lcoombe Thanks for your prompt response. You are correct. The BX tags are there for some reads. I wasn't aware that they would only be there in a subset. The -a flag seems to have done the trick! Thanks for spotting it! Okay to close. I'll open another issue if something else comes up.

@zrlewis Sounds good - I'm glad that fixed the error!