We need to demultiplex, but want groups of barcodes to be joined into a single FASTQ file. And we want it to be easy.
This will demultiplex FASTQs using fastq-multx (`conda install -c bioconda fastq-multx`), then `cat` them into grouped FASTQs rather than individual samples. The grouped FASTQs are validated by comparing each group's total read count against the sum of its individual sample counts.
This effectively subsets 'Undetermined' into smaller groups of 'Undetermined' files.
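The join-then-validate step described above can be sketched in a few lines of Python. This is an illustration only, not gdemux's actual implementation: the function names `count_reads` and `join_group` are hypothetical, and the sketch assumes uncompressed FASTQ files with exactly four lines per record.

```python
import shutil


def count_reads(fastq_path):
    """Count records in an uncompressed FASTQ file (assumes 4 lines per record)."""
    with open(fastq_path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError(f"{fastq_path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


def join_group(sample_paths, group_path):
    """Concatenate per-sample FASTQs into one group FASTQ, then validate counts."""
    # Equivalent to `cat sample1.fastq sample2.fastq > group.fastq`.
    with open(group_path, "wb") as out:
        for path in sample_paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)

    # The group file must hold exactly as many reads as its samples combined.
    expected = sum(count_reads(p) for p in sample_paths)
    observed = count_reads(group_path)
    if observed != expected:
        raise RuntimeError(f"{group_path}: {observed} reads, expected {expected}")
    return observed
```

Comparing counts is a cheap check: it catches truncated or partially written group files without having to compare read contents.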
Setting `--output-action` to "groupid" and running:
$ gdemux -a groupid -o out test_R1.fastq test_barcodes.txt
[2016-08-03 17:00 INFO] Found 10 samples across 3 groups within test_barcodes.txt
[2016-08-03 17:00 INFO] Demultiplexing (mismatches=1, distance=2, quality=0)
[2016-08-03 17:00 INFO] Joining reads across groups
[2016-08-03 17:00 INFO] Validating group read counts with sample counts
[2016-08-03 17:00 INFO] Processing complete
$ tree out
out
├── group1_I1.fastq
├── group1_R1.fastq
├── group1_R2.fastq
├── group2_I1.fastq
├── group2_R1.fastq
├── group2_R2.fastq
├── group3_I1.fastq
├── group3_R1.fastq
└── group3_R2.fastq
Setting `--output-action` to "undetermined" and running:
$ gdemux -a undetermined -o out test_R1.fastq test_barcodes.txt
[2016-08-03 17:01 INFO] Found 10 samples across 3 groups within test_barcodes.txt
[2016-08-03 17:01 INFO] Demultiplexing (mismatches=1, distance=2, quality=0)
[2016-08-03 17:01 INFO] Joining reads across groups
[2016-08-03 17:01 INFO] Validating group read counts with sample counts
[2016-08-03 17:01 INFO] Processing complete
$ tree out
out
├── group1
│   ├── Undetermined_I1.fastq
│   ├── Undetermined_R1.fastq
│   └── Undetermined_R2.fastq
├── group2
│   ├── Undetermined_I1.fastq
│   ├── Undetermined_R1.fastq
│   └── Undetermined_R2.fastq
└── group3
    ├── Undetermined_I1.fastq
    ├── Undetermined_R1.fastq
    └── Undetermined_R2.fastq
groupid | barcode |
---|---|
group1 | AAGGCGCTCCTT |
group1 | GATCTAATCGAG |
group1 | CTGATGTACACG |
group2 | ACGTATTCGAAG |
group2 | GACGTTAAGAAT |
group2 | TGGTGGAGTTTC |
group3 | TTAACAAGGCAA |
group3 | AACCGCATAAGT |
group3 | CCACAACGATCA |
group3 | AGTTCTCATTAA |
Extra columns can exist, and different column names can be used, though they will need to be specified on the command line with `--group-id` and `--barcode`.
A header isn't necessary either, though you'll need to specify more options: pass `--no-header` and give the columns as 0-based integer indices, e.g. `--no-header --group-id 0 --barcode 1`.
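To make the column options concrete, here is a rough sketch of how a barcode sheet might be read into groups, with and without a header. It is not gdemux's code: the function `read_groups`, its parameters, and the tab delimiter are assumptions made for illustration. Only the `groupid` and `barcode` column names and the 0-based index convention come from the text above.

```python
import csv
from collections import defaultdict


def read_groups(lines, group_id="groupid", barcode="barcode",
                header=True, delimiter="\t"):
    """Map each group ID to its list of barcodes.

    With a header, `group_id` and `barcode` are column names; without one,
    they are 0-based column indices. The delimiter is an assumption.
    """
    rows = csv.reader(lines, delimiter=delimiter)
    if header:
        names = next(rows)
        group_col, barcode_col = names.index(group_id), names.index(barcode)
    else:
        group_col, barcode_col = int(group_id), int(barcode)

    groups = defaultdict(list)
    for row in rows:
        if row:  # skip blank lines
            groups[row[group_col]].append(row[barcode_col])
    return dict(groups)


# Hypothetical sheet using barcodes from the table above.
sheet = [
    "groupid\tbarcode",
    "group1\tAAGGCGCTCCTT",
    "group1\tGATCTAATCGAG",
    "group2\tACGTATTCGAAG",
]
with_header = read_groups(sheet)
without_header = read_groups(sheet[1:], group_id=0, barcode=1, header=False)
```

Both calls yield the same mapping, which mirrors the idea that the header only changes how the columns are located, not what is read from them.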